xfs_binval aka xfs_flush_buftarg is the first thing done in
xfs_free_buftarg, so there is no need to have duplicated calls just before
xfs_free_buftarg in the mount failure path.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31197a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mount_init is inlined into xfs_fs_fill_super and allocation switched
to kzalloc. Plug a leak of the mount structure for most early mount
failures. Move xfs_icsb_init_counters to as late as possible in the mount
path and make sure to undo it so that no stale hotplug cpu notifiers are
left around on mount failures.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31196a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Split setting the block and sector size out of xfs_fs_fill_super into a
small helper to make xfs_fs_fill_super more readable.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31194a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Currently closing the rt/log block device is done in the wrong spot, and
far too early. So revampt it:
- xfs_blkdev_put moved out of xfs_free_buftarg into the caller so that
it is done after tearing down the buftarg completely.
- call to xfs_unmountfs_close moved from xfs_mountfs into caller so
that it's done after tearing down the filesystem completely.
- xfs_unmountfs_close is renamed to xfs_close_devices and made static
in xfs_super.c
- opening of the block devices is split into a helper xfs_open_devices
that is symetric in use to xfs_close_devices
- xfs_unmountfs can now lose struct cred
- error handling around device opening sanitized in xfs_fs_fill_super
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31193a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Freeing of the superblock is already handled in the caller, and that is
more symmetric with the mount path, too.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31192a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mount is already pretty linux-specific so merge it into
xfs_fs_fill_super to allow for a more structured mount code in the next
patches. xfs_start_flags and xfs_finish_flags also move to xfs_super.c.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31189a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_unmount is small and already pretty Linux specific, so merge it into
the callers. The real unmount path is simplified a little by doing a
WARN_ON on the xfs_unmount_flush retval directly instead of propagating
the error back to the caller, and the mout failure case in simplified
significantly by removing the forced shutdown case and all the dmapi
events that shouldn't be sent because the dmapi mount event hasn't been
sent by that time either.
SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31188a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_igrow_start just expands to xfs_zero_eof with two asserts that are
useless in the context of the only caller and some rather confusing
comments.
xfs_igrow_finish is just a few lines of code decorated again with useless
asserts and confusing comments.
Just kill those two and merge them into xfs_setattr.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31186a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_mntupdate already is completely Linux specific due to the VFS flags
passed in, so it might aswell be merged into xfs_fs_remount.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31185a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Quite useless wrapper that doesn't help making the code more readable.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31184a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Recent changes to update the version number during mount (attr2 stuff)
failed to change the assert that checked for calid flags being changed on
mount. Clearly this path hasn't been exercised by the test code....
SGI-PV: 981950
SGI-Modid: xfs-linux-melb:xfs-kern:31183a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
No need for addition permission checks in the xattr handler,
fs/xattr.c:xattr_permission() already does them, and in fact slightly more
strict then what was in the attr_capable handlers.
SGI-PV: 981809
SGI-Modid: xfs-linux-melb:xfs-kern:31164a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The l_flushsema doesn't exactly have completion semantics, nor mutex
semantics. It's used as a list of tasks which are waiting to be notified
that a flush has completed. It was also being used in a way that was
potentially racy, depending on the semaphore implementation.
By using a sv_t instead of a semaphore we avoid the need for a separate
counter, since we know we just need to wake everything on the queue.
Original waitqueue implementation from Matthew Wilcox. Cleanup and
conversion to sv_t by Christoph Hellwig.
SGI-PV: 981507
SGI-Modid: xfs-linux-melb:xfs-kern:31059a
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We found this while experimenting with 2GiB xfs logs. The previous code
never assumed that xfs logs would ever get so large.
SGI-PV: 981502
SGI-Modid: xfs-linux-melb:xfs-kern:31058a
Signed-off-by: Michael Nishimoto <miken@agami.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
kmem_free() function takes (ptr, size) arguments but doesn't actually use
second one.
This patch removes size argument from all callsites.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31050a
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
features2 fields.
Previously, mounting with noattr2 failed to achieve anything because
although it cleared the attr2 mount flag, it would set it again as soon as
it processed the superblock fields. The fix now has an explicit noattr2
flag and uses it later to fix up the versionnum and features2 fields.
SGI-PV: 980021
SGI-Modid: xfs-linux-melb:xfs-kern:31003a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
For small file block allocations, mballoc uses per cpu prealloc
space. Use goal block when searching for the right prealloc
space. Also make sure ext4_da_writepages tries to write
all the pages for small files in single attempt
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The write_cache_pages() function uses the mapping->writeback_index as
the starting index to write out when range_cyclic is set. Properly
initialize writeback_index so that we start the writeout at index 0.
This was found when debugging the small file fragmentation on ext4.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix ext4_has_free_blocks() to return 0 when we don't have enough space.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Previous delalloc writepages implementation started a new transaction
outside of a loop which called get_block() to do the block allocation.
Since we didn't know exactly how many blocks would need to be allocated,
the estimated journal credits required was very conservative and caused
many issues.
With the reworked delayed allocation, a new transaction is created for
each get_block(), thus we don't need to guess how many credits for the
multiple chunk of allocation. We start every transaction with enough
credits for inserting a single exent. When estimate the credits for
indirect blocks to allocate a chunk of blocks, we need to know the
number of data blocks to allocate. We use the total number of reserved
delalloc datablocks; if that is too big, for non-extent files, we need
to limit the number of blocks to EXT4_MAX_TRANS_BLOCKS.
Code cleanup from Aneesh.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With the below changes we reserve credit needed to insert only one
extent resulting from a call to single get_block. This makes sure we
don't take too much journal credits during writeout. We also don't
limit the pages to write. That means we loop through the dirty pages
building largest possible contiguous block request. Then we issue a
single get_block request. We may get less block that we requested. If
so we would end up not mapping some of the buffer_heads. That means
those buffer_heads are still marked delay. Later in the writepage
callback via __mpage_writepage we redirty those pages.
We should also not limit/throttle wbc->nr_to_write in the filesystem
writepages callback. That cause wrong behaviour in
generic_sync_sb_inodes caused by wbc->nr_to_write being <= 0
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
DIO and fallocate credit calculation is different than writepage, as
they do start a new journal right for each call to ext4_get_blocks_wrap().
This patch uses the helper function in DIO and fallocate case, passing
a flag indicating that the modified data are contigous thus could account
less indirect/index blocks.
This patch also fixed the journal credit reservation for direct I/O
(DIO). Previously the estimated credits for DIO only was calculated for
non-extent files, which was not enough if the file is extent-based.
Also fixed was fallocate double-counting credits for modifying the the
superblock.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch modified the writepage/write_begin credit calculation for
extent files, to use the credits caculation helper function.
The current calculation of how many index/leaf blocks should be
accounted is too conservetive, it always considered the worse case,
where the tree level is 5, and in the case of multiple chunk
allocations, it always assumed no blocks were dirtied in common across
the allocations. This path uses the accurate depth of the inode with
some extras to calculate the index blocks, and also less conservative in
the case of multiple allocation accounting.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When considering how many journal credits are needed for modifying a
chunk of data, we need to account for the super block, inode block,
quota blocks and xattr block, indirect/index blocks, also, group bitmap
and group descriptor blocks for new allocation (including data and
indirect/index blocks). There are many places in ext4 do the calculation
on their own and often missed one or two meta blocks, and often they
assume single block allocation, and did not considering the multile
chunk of allocation case.
This patch is trying to cleanup current journal credit code, provides
some common helper funtion to calculate the journal credits, to be used
for writepage, writepages, DIO, fallocate, migration, defrag, and for
both nonextent and extent files.
This patch modified the writepage/write_begin credit caculation for
nonextent files, to use the new helper function. It also fixed the
problem that writepage on nonextent files did not consider the case
blocksize <pagesize, thus could possibelly need multiple block
allocation in a single transaction.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The find_group_flex() function starts with best_flex as the
parent_fbg_group, which happens to have 0 inodes free. Some of the
flex groups searched have free blocks and free inodes, but the
flex_freeb_ratio is < 10, so they're skipped. Then when a group is
compared to the current "best" flex group, it does not have more free
blocks than "best", so it is skipped as well.
This continues until no flex group with free inodes is found which has
a proper ratio or which has more free blocks than the "best" group,
and we're left with a "best" group that has 0 inodes free, and we
return -ENOSPC.
We fix this by changing the logic so that if the current "best" flex
group has no inodes free, and the current one does have room, it is
promoted to the next "best."
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When trying to resize an ext4 fs and you run out of reserved gdt blocks,
you get an error that doesn't actually tell you what went wrong, it just
says that the gdb it picked is not correct, which is the case since you
don't have any reserved gdt blocks left. This patch adds a check to make
sure you have reserved gdt blocks to use, and if not prints out a more
relevant error.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Dilger <adilger@sun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_ext_truncate(), we should use the more generic
ext4_discard_reservations() call so we do the right thing when the
filesystem is mounted with the nomballoc option.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This fixes a bug where readdir() would return a directory entry twice
if there was a hash collision in an hash tree indexed directory.
Signed-off-by: Eugene Dashevsky <eugene@ibrix.com>
Signed-off-by: Mike Snitzer <msnitzer@ibrix.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Ext4 will release the reserved blocks for delayed allocations when
inode is truncated/unlinked. If there is no reserved block at all, we
shouldn't need to do so. But current code still tries to release the
reserved blocks regardless whether the counters's value is 0.
Continue to do that causes the later calculation to go wrong and a
kernel BUG_ON() caught that. This doesn't happen for extent-based
files, as the calculation for 0 reserved blocks was right for extent
based file.
This patch fixed the kernel BUG() due to above reason. It adds checks
for 0 to avoid unnecessary release and fix calculation for non-extent
files.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We need to call ext4_discard_reservation() earlier in ext4_truncate(),
to avoid a BUG() in ext4_mb_return_to_preallocation(), which is called
(ultimately) by ext4_free_blocks(). So we must ditch the blocks on
i_prealloc_list before we start freeing the data blocks.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When using fallocate the buffer_heads are marked unwritten and unmapped.
We need to map them in the writepages after a get_block. Otherwise we
split the uninit extents, but never write the content to disk.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Ensure we call nfs_sb_deactive() after releasing the directory inode
nfs_remount oops when rebooting + possible fix
Simplify the code of include/linux/task_io_accounting.h.
It is also more reasonable to have all the task i/o-related statistics in a
single struct (task_io_accounting).
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to avoid the "Busy inodes after unmount" error message, we need to
ensure that nfs_async_unlink_release() releases the super block after the
call to nfs_free_unlinkdata().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Put all i/o statistics in struct proc_io_accounting and use inline functions to
initialize and increment statistics, removing a lot of single variable
assignments.
This also reduces the kernel size as following (with CONFIG_TASK_XACCT=y and
CONFIG_TASK_IO_ACCOUNTING=y).
text data bss dec hex filename
11651 0 0 11651 2d83 kernel/exit.o.before
11619 0 0 11619 2d63 kernel/exit.o.after
10886 132 136 11154 2b92 kernel/fork.o.before
10758 132 136 11026 2b12 kernel/fork.o.after
3082029 807968 4818600 8708597 84e1f5 vmlinux.o.before
3081869 807968 4818600 8708437 84e155 vmlinux.o.after
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.infradead.org/mtd-2.6: (57 commits)
[MTD] [NAND] subpage read feature as a way to increase performance.
CPUFREQ: S3C24XX NAND driver frequency scaling support.
[MTD][NAND] au1550nd: remove unused variable
[MTD] jedec_probe: Fix SST 16-bit chip detection
[MTD][MTDPART] Fix a division by zero bug
[MTD][MTDPART] Cleanup and document the erase region handling
[MTD][MTDPART] Handle most checkpatch findings
[MTD][MTDPART] Seperate main loop from per-partition code in add_mtd_partition
[MTD] physmap: resume already suspended chips on failure to suspend
[MTD] physmap: Fix suspend/resume/shutdown bugs.
[MTD] [NOR] Fix -ETIMEO errors in CFI driver
[MTD] [NAND] fsl_elbc_nand: fix section mismatch with CONFIG_MTD_OF_PARTS=y
[JFFS2] Use .unlocked_ioctl
[MTD] Fix const assignment in the MTD command line partitioning driver
[MTD] [NOR] gen_probe: No debug message when debugging is disabled
[MTD] [NAND] remove __PPC__ hardcoded address from DiskOnChip drivers
[MTD] [MAPS] Remove the bast-flash driver.
[MTD] [NAND] fsl_elbc_nand: ecclayout cleanups
[MTD] [NAND] fsl_elbc_nand: implement support for flash-based BBT
[MTD] [NAND] fsl_elbc_nand: fix OOB workability for large page NAND chips
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
[PATCH] fix RLIM_NOFILE handling
[PATCH] get rid of corner case in dup3() entirely
[PATCH] remove remaining namei_{32,64}.h crap
[PATCH] get rid of indirect users of namei.h
[PATCH] get rid of __user_path_lookup_open
[PATCH] f_count may wrap around
[PATCH] dup3 fix
[PATCH] don't pass nameidata to __ncp_lookup_validate()
[PATCH] don't pass nameidata to gfs2_lookupi()
[PATCH] new (local) helper: user_path_parent()
[PATCH] sanitize __user_walk_fd() et.al.
[PATCH] preparation to __user_walk_fd cleanup
[PATCH] kill nameidata passing to permission(), rename to inode_permission()
[PATCH] take noexec checks to very few callers that care
Re: [PATCH 3/6] vfs: open_exec cleanup
[patch 4/4] vfs: immutable inode checking cleanup
[patch 3/4] fat: dont call notify_change
[patch 2/4] vfs: utimes cleanup
[patch 1/4] vfs: utimes: move owner check into inode_change_ok()
[PATCH] vfs: use kstrdup() and check failing allocation
...
Oleg Nesterov points out that we should check that the task is still alive
before we iterate over the threads. This patch includes a fixup for this.
Also simplify do_io_accounting() implementation.
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dup2() should return -EBADF on exceeded sysctl_nr_open
* dup() should *not* return -EINVAL even if you have rlimit set to 0;
it should get -EMFILE instead.
Check for orig_start exceeding rlimit taken to sys_fcntl().
Failing expand_files() in dup{2,3}() now gets -EMFILE remapped to -EBADF.
Consequently, remaining checks for rlimit are taken to expand_files().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since Ulrich is OK with getting rid of dup3(fd, fd, flags) completely,
to hell the damn thing goes. Corner case for dup2() is handled in
sys_dup2() (complete with -EBADF if dup2(fd, fd) is called with fd
that is not open), the rest is done in dup3().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs.h needs path.h, not namei.h; nfs_fs.h doesn't need it at all.
Several places in the tree needed direct include.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro notice one cornercase that the new dup3() code. The dup2()
function, as a special case, handles dup-ing to the same file
descriptor. In this case the current dup3() code does nothing at
all. I.e., it ingnores the flags parameter. This shouldn't happen,
the close-on-exec flag should be set if requested.
In case the O_CLOEXEC bit in the flags parameter is not set the
dup3() function should behave in this respect identical to dup2().
This means dup3(fd, fd, 0) should not actively reset the c-o-e
flag.
The patch below implements this minor change.
[AV: credits to Artur Grabowski for bringing that up as potential subtle point
in dup2() behaviour]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* do not pass nameidata; struct path is all the callers want.
* switch to new helpers:
user_path_at(dfd, pathname, flags, &path)
user_path(pathname, &path)
user_lpath(pathname, &path)
user_path_dir(pathname, &path) (fail if not a directory)
The last 3 are trivial macro wrappers for the first one.
* remove nameidata in callers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Almost all users __user_walk_fd() and friends care only about struct path.
Get rid of the few that do not.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On Mon, May 19, 2008 at 12:01:49AM +0200, Marcin Slusarz wrote:
> open_exec is needlessly indented, calls ERR_PTR with 0 argument
> (which is not valid errno) and jumps into middle of function
> just to return value.
> So clean it up a bit.
Still looks rather messy. See below for a better version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move the immutable and append-only checks from chmod, chown and utimes
into notify_change(). Checks for immutable and append-only files are
always performed by the VFS and not by the filesystem (see
permission() and may_...() in namei.c), so these belong in
notify_change(), and not in inode_change_ok().
This should be completely equivalent.
CC: Ulrich Drepper <drepper@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The FAT_IOCTL_SET_ATTRIBUTES ioctl() calls notify_change() to change
the file mode before changing the inode attributes. Replace with
explicit calls to security_inode_setattr(), fat_setattr() and
fsnotify_change().
This is equivalent to the original. The reason it is needed, is that
later in the series we move the immutable check into notify_change().
That would break the FAT_IOCTL_SET_ATTRIBUTES ioctl, as it needs to
perform the mode change regardless of the immutability of the file.
[Fix error if fat is built as a module. Thanks to OGAWA Hirofumi for
noticing.]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Untange the mess that is do_utimes(). Add kerneldoc comment to
do_utimes().
CC: Ulrich Drepper <drepper@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new ia_valid flag: ATTR_TIMES_SET, to handle the
UTIMES_OMIT/UTIMES_NOW and UTIMES_NOW/UTIMES_OMIT cases. In these
cases neither ATTR_MTIME_SET nor ATTR_ATIME_SET is in the flags, yet
the POSIX draft specifies that permission checking is performed the
same way as if one or both of the times was explicitly set to a
timestamp.
See the path "vfs: utimensat(): fix error checking for
{UTIME_NOW,UTIME_OMIT} case" by Michael Kerrisk for the patch
introducing this behavior.
This is a cleanup, as well as allowing filesystems (NFS/fuse/...) to
perform their own permission checking instead of the default.
CC: Ulrich Drepper <drepper@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- use kstrdup() instead of kmalloc() + memcpy()
- return NULL if allocating ->mnt_devname failed
- mnt_devname should be const
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* MAY_CHDIR is redundant - it's an equivalent of MAY_ACCESS
* MAY_ACCESS on fuse should affect only the last step of pathname resolution
* fchdir() and chroot() should pass MAY_ACCESS, for the same reason why
chdir() needs that.
* now that we pass MAY_ACCESS explicitly in all cases, LOOKUP_ACCESS can be
removed; it has no business being in nameidata.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... so we ought to pass MAY_CHDIR to vfs_permission() instead of having
it triggered on every step of preceding pathname resolution. LOOKUP_CHDIR
is killed by that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the unused mode parameter from vfs_symlink and callers.
Thanks to Tetsuo Handa for noticing.
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Why not reuse "inode" which is assigned as
struct inode *inode = old_dentry->d_inode;
in the beginning of vfs_link() ?
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.
Clean up callers by passing in a file instead of a dentry.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
vfs_permission(MAY_WRITE) already checked for the inode being
immutable, so no need to repeat it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.
The obvious next target in that direction is permission(9)
folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hpfs_unlink() calls permission() prior to truncating the file. HPFS
doesn't define a .permission method, so replace with explicit call to
generic_permission().
This is equivalent, except that devcgroup_inode_permission() and
security_inode_permission() are not called.
The truncation is just an implementation detail of the unlink, so
these security checks are unnecessary.
I suspect that even calling generic_permission() is unnecessary, since
we shouldn't mind if the file isn't writable. But I leave that to the
maintainer to decide.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
* keep references to ctl_table_head and ctl_table in /proc/sys inodes
* grab the former during operations, use the latter for access to
entry if that succeeds
* have ->d_compare() check if table should be seen for one who does lookup;
that allows us to avoid flipping inodes - if we have the same name resolve
to different things, we'll just keep several dentries and ->d_compare()
will reject the wrong ones.
* have ->lookup() and ->readdir() scan the table of our inode first, then
walk all ctl_table_header and scan ->attached_by for those that are
attached to our directory.
* implement ->getattr().
* get rid of insane amounts of tree-walking
* get rid of the need to know dentry in ->permission() and of the contortions
induced by that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hppfs_permission() is equivalent to the '.permission == NULL' case.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Lookup can install a child dentry for a deleted directory. This keeps
the directory dentry alive, and the inode pinned in the cache and on
disk, even after all external references have gone away.
This isn't a big problem normally, since memory pressure or umount
will clear out the directory dentry and its children, releasing the
inode. But for UBIFS this causes problems because its orphan area can
overflow.
Fix this by returning ENOENT for all lookups on a S_DEAD directory
before creating a child dentry.
Thanks to Zoltan Sogor for noticing this while testing UBIFS, and
Artem for the excellent analysis of the problem and testing.
Reported-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Tested-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
When verifying the decoded header before decoding the object identifier
[CIFS] Fix warnings from checkpatch
[CIFS] Fix improper endian conversion of ACL subauth field
[CIFS] Fix possible double free if search immediately after search rewind fails
[CIFS] remove checkpatch warning
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
cifs: assorted endian annotations
[CIFS] break ATTR_SIZE changes out into their own function
lockdep: annotate cifs in-kernel sockets
[CIFS] Fix compiler warning on 64-bit
This adds /proc/PID/syscall and /proc/PID/task/TID/syscall magic files.
These use task_current_syscall() to show the task's current system call
number and argument registers, stack pointer and PC. For a task blocked
but not in a syscall, the file shows "-1" in place of the syscall number,
followed by only the SP and PC. For a task that's not blocked, it shows
"running".
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the tracehook_tracer_task() hook to consolidate all forms of
"Who is using ptrace on me?" logic. This is used for "TracerPid:" in
/proc and for permission checks. We also clean up the selinux code the
called an identical accessor.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves all the ptrace hooks related to exec into tracehook.h inlines.
This also lifts the calls for tracing out of the binfmt load_binary hooks
into search_binary_handler() after it calls into the binfmt module. This
change has no effect, since all the binfmt modules' load_binary functions
did the call at the end on success, and now search_binary_handler() does
it immediately after return if successful. We consolidate the repeated
code, and binfmt modules no longer need to import ptrace_notify().
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
This way, the entire if() {} section can collapse into the WARN() as well.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes
part of the warning section for better reporting/collection. Also, with this,
one fo the if() sections collapses entirely into the WARN().
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mapping->tree_lock has no read lockers. convert the lock from an rwlock
to a spinlock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use get_user_pages_fast in splice. This reverts some mmap_sem batching
there, however the biggest problem with mmap_sem tends to be hold times
blocking out other threads rather than cacheline bouncing. Further: on
architectures that implement get_user_pages_fast without locks, mmap_sem
can be avoided completely anyway.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use get_user_pages_fast in the common/generic block and fs direct IO paths.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds OMFS to the fs Kconfig and Makefile
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add functions for reading and manipulating the storage of file data in
the extent-based OMFS.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add lookup and directory management routines for OMFS. The filesystem uses
hashing based on the filename and stores collisions, unordered, in siblings
of files' inode structures. To support telldir, the current position in
the hash table is encoded in fpos.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the BKL-based locking scheme used in the bfs driver by a private
filesystem-wide mutex.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.fi>
Cc: Tigran Aivazian <tigran_aivazian@symantec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the following cleanups:
o removing an unused variable from bfs_fill_super();
o removing unneeded blank spaces from pointer
definitions.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.fi>
Cc: Tigran Aivazian <tigran_aivazian@symantec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The semaphore s_bmlock is used as a mutex. Convert it to the mutex API.
Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits)
powerpc: Wireup new syscalls
Move update_mmu_cache() declaration from tlbflush.h to pgtable.h
powerpc/pseries: Remove kmalloc call in handling writes to lparcfg
powerpc/pseries: Update arch vector to indicate support for CMO
ibmvfc: Add support for collaborative memory overcommit
ibmvscsi: driver enablement for CMO
ibmveth: enable driver for CMO
ibmveth: Automatically enable larger rx buffer pools for larger mtu
powerpc/pseries: Verify CMO memory entitlement updates with virtual I/O
powerpc/pseries: vio bus support for CMO
powerpc/pseries: iommu enablement for CMO
powerpc/pseries: Add CMO paging statistics
powerpc/pseries: Add collaborative memory manager
powerpc/pseries: Utilities to set firmware page state
powerpc/pseries: Enable CMO feature during platform setup
powerpc/pseries: Split retrieval of processor entitlement data into a helper routine
powerpc/pseries: Add memory entitlement capabilities to /proc/ppc64/lparcfg
powerpc/pseries: Split processor entitlement retrieval and gathering to helper routines
powerpc/pseries: Remove extraneous error reporting for hcall failures in lparcfg
powerpc: Fix compile error with binutils 2.15
...
Fixed up conflict in arch/powerpc/platforms/52xx/Kconfig manually.
If fuse filesystem doesn't define it's own lock operations, then allow the
lock manager to work with fuse.
Adding lockd support for remote locking is also possible, but more rarely
used, so leave it till later.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the get_parent export operation by sending a LOOKUP request with
".." as the name.
Implement looking up an inode by node ID after it has been evicted from
the cache. This is done by seding a LOOKUP request with "." as the name
(for all file types, not just directories).
The filesystem can set the FUSE_EXPORT_SUPPORT flag in the INIT reply, to
indicate that it supports these special lookups.
Thanks to John Muir for the original implementation of this feature.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new helper function which sends a LOOKUP request with the supplied
name. This will be used by the next patch to send special LOOKUP requests
with "." and ".." as the name.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement export_operations, to allow fuse filesystems to be exported to
NFS. This feature has been in the out-of-tree fuse module, and is widely
used and tested.
It has not been originally merged into mainline, because doing the NFS
export in userspace was thought to be a cleaner and more efficient way of
doing it, than through the kernel.
While that is true, it would also have involved a lot of duplicated effort
at reimplementing NFS exporting (all the different versions of the
protocol). This effort was unfortunately not undertaken by anyone, so we
are left with doing it the easy but less efficient way.
If this feature goes in, the out-of-tree fuse module can go away,
which would have several advantages:
- not having to maintain two versions
- less confusion for users
- no bugs due to kernel API changes
Comment from hch:
- Use the same fh_type values as XFS, since we use the same fh encoding.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use d_splice_alias() instead of d_add() in fuse lookup code, to allow NFS
exporting.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow filesystem's ->lock() method to call posix_lock_file() instead of
posix_lock_file_wait(), and return FILE_LOCK_DEFERRED. This makes it
possible to implement a such a ->lock() function, that works with the lock
manager, which needs the call to be asynchronous.
Now the vfs_lock_file() helper can be used, so this is a cleanup as well.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extract common code into a function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a special error value FILE_LOCK_DEFERRED to mean that a locking
operation returned asynchronously. This is returned by
posix_lock_file() for sleeping locks to mean that the lock has been
queued on the block list, and will be woken up when it might become
available and needs to be retried (either fl_lmops->fl_notify() is
called or fl_wait is woken up).
f_op->lock() to mean either the above, or that the filesystem will
call back with fl_lmops->fl_grant() when the result of the locking
operation is known. The filesystem can do this for sleeping as well
as non-sleeping locks.
This is to make sure, that return values of -EAGAIN and -EINPROGRESS by
filesystems are not mistaken to mean an asynchronous locking.
This also makes error handling in fs/locks.c and lockd/svclock.c slightly
cleaner.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix nlm_fopen() to return NLM_FAILED (or NLM_LCK_DENIED_NOLOCKS) instead
of NLM_LCK_DENIED. The latter means the lock request failed because of a
conflicting lock (i.e. a temporary error), which is wrong in this case.
Also fix the client to return ENOLCK instead of EAGAIN if a blocking lock
request returns with NLM_LOCK_DENIED.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Report per-thread I/O statistics in /proc/pid/task/tid/io and aggregate
parent I/O statistics in /proc/pid/io. This approach follows the same
model used to account per-process and per-thread CPU times.
As a practial application, this allows for example to quickly find the top
I/O consumer when a process spawns many child threads that perform the
actual I/O work, because the aggregated I/O statistics can always be found
in /proc/pid/io.
[ Oleg Nesterov points out that we should check that the task is still
alive before we iterate over the threads, but also says that we can do
that fixup on top of this later. - Linus ]
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Cc: Matt Heaton <matt@hostmonster.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Acked-by-with-comments: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current two-stage scheme of removing PDE emphasizes one bug in proc:
open
rmmod
remove_proc_entry
close
->release won't be called because ->proc_fops were cleared. In simple
cases it's small memory leak.
For every ->open, ->release has to be done. List of openers is introduced
which is traversed at remove_proc_entry() if neeeded.
Discussions with Al long ago (sigh).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the extern of struct proc_kmsg_operations to
fs/proc/internal.h and adds an #include "internal.h" to fs/proc/kmsg.c
so that the latter sees the former.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
convert the local Dprintk() compile time debug printk wrappers to the
generic pr_debug() wrapper.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ELF_CORE_EFLAGS is already used by the binfmt_elf coredumper to set correct
arch specific ELF header flags on coredumps. Use it for kcore dumps as well.
At the moment, this affects the CRIS and the H8300 arch.
Signed-off-by: Edgar E. Iglesias <edgar@axis.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I don't understand why the multi-thread coredump implies the core_uses_pid
behaviour, but we shouldn't use mm->mm_users for that. This counter can
be incremented by get_task_mm(). Use the valued returned by
coredump_wait() instead.
Also, remove the "const char *pattern" argument, format_corename() can use
core_pattern directly.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have core_state->dumper list we can use it to wake up the
sub-threads waiting for the coredump completion.
This uglifies the code and .text grows by 47 bytes, but otoh mm_struct
lessens by sizeof(struct completion). Also, with this change we can
decouple exit_mm() from the coredumping code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.
This patch allows futher cleanups in binfmt_elf_fdpic.c.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.
This patch allows futher cleanups in binfmt_elf.c, in particular we can
kill the parallel info->threads list.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
binfmt->core_dump() has to iterate over the all threads in system in order
to find the coredumping threads and construct the list using the
GFP_ATOMIC allocations.
With this patch each thread allocates the list node on exit_mm()'s stack and
adds itself to the list.
This allows us to do further changes:
- simplify ->core_dump()
- change exit_mm() to clear ->mm first, then wait for ->core_done.
this makes the coredumping process visible to oom_kill
- kill mm->core_done
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the "struct core_state core_state" from coredump_wait() to
do_coredump(), this makes mm->core_state visible to binfmt->core_dump().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Turn core_state->nr_threads into atomic_t and kill now unneeded
down_write(&mm->mmap_sem) in exit_mm().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zap_process() to return int instead of incrementing
mm->core_state->nr_threads directly. Change zap_threads() to set
mm->core_state only on success.
This patch restores the original size of .text, and more importantly now
->nr_threads is used in two places only.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move mm->core_waiters into "struct core_state" allocated on stack. This
shrinks mm_struct a little bit and allows further changes.
This patch mostly does s/core_waiters/core_state. The only essential
change is that coredump_wait() must clear mm->core_state before return.
The coredump_wait()'s path is uglified and .text grows by 30 bytes, this
is fixed by the next patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm->core_startup_done points to "struct completion startup_done" allocated
on the coredump_wait()'s stack. Introduce the new structure, core_state,
which holds this "struct completion". This way we can add more info
visible to the threads participating in coredump without enlarging
mm_struct.
No changes in affected .o files.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
linux_binfmt->core_dump() runs before the process does exit_aio(), this
means that we can hit the kernel thread which shares the same ->mm.
Afaics, nothing really bad can happen, but perhaps it makes sense to fix
this minor bug.
It is sad we have to iterate over all threads in system and use
GFP_ATOMIC. Hopefully we can kill theses ugly do_each_thread()s, but this
needs some nontrivial changes in mm_struct and do_coredump.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main loop in zap_threads() must skip kthreads which may use the same
mm. Otherwise we "kill" this thread erroneously (for example, it can not
fork or exec after that), and the coredumping task stucks in the
TASK_UNINTERRUPTIBLE state forever because of the wrong ->core_waiters
count.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
No functional changes yet. But this allows us to do further
fixes/cleanups.
oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
kthreads, this is wrong because of use_mm(). The problem with
PF_BORROWED_MM is that we need task_lock() to avoid races. With this
patch we can check PF_KTHREAD directly, or use a simple lockless helper:
/* The result must not be dereferenced !!! */
struct mm_struct *__get_task_mm(struct task_struct *tsk)
{
if (tsk->flags & PF_KTHREAD)
return NULL;
return tsk->mm;
}
Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel
thread without PF_BORROWED_MM.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set
by INIT_TASK() and copied to the forked childs (we could set it in
kthreadd() along with PF_NOFREEZE instead).
daemonize() was changed as well. In that case testing of PF_KTHREAD is
racy, but daemonize() is hopeless anyway.
This flag is cleared in do_execve(), before search_binary_handler().
Probably not the best place, we can do this in exec_mmap() or in
start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
doesn't matter in practice, and if do_execve() fails kthread should die
soon.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No changes in fs/exec.o
The for_each_process() loop in zap_threads() is very subtle, it is not
clear why we don't race with fork/exit/exec. Add the fat comment.
Also, change the code to use while_each_thread().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes it may be useful for userspace to know (e.g. for some hosting
guys) that some user stopped exceeding his hardlimit or softlimit in
quotas. Implement sending of such events to userspace via quota netlink
protocol so that they don't have to poll for such events. Based on idea
and initial implementation by Vladislav Bogdanov.
Cc: Vladislav Bogdanov <slava@nsys.by>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move declarations of some macros, which should be in fact functions to
quotaops.h. This way they can be later converted to inline functions
because we can now use declarations from quota.h. Also add necessary
includes of quotaops.h to a few files.
[akpm@linux-foundation.org: fix JFS build]
[akpm@linux-foundation.org: fix UFS build]
[vegard.nossum@gmail.com: fix QUOTA=n build]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjen Pool <arjenpool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make loop in sync_dquots() checking whether there's something to write
more readable, remove useless variable and macro info_any_dirty() which
is used only in this place.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup quotaops.h: Rename functions from uppercase to lowercase (and
define backward compatibility macros), move larger functions to dquot.c
and make them non-inline.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When quota structure is going to be dropped and it is dirty, quota code tries
to write it. If the write fails for some reason (e. g. transaction cannot
be started because the journal is aborted), we try writing again and again and
again... Fix the problem by clearing the dirty bit even if the write failed.
(akpm: for 2.6.27, 2.6.26.x and 2.6.25.x)
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: dingdinghua <dingdinghua85@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a new mount option ("tz=UTC") for DOS (vfat/msdos) filesystems,
allowing timestamps to be in coordinated universal time (UTC) rather than
local time in applications where doing this is advantageous.
In particular, portable devices that use fat/vfat (such as digital
cameras) can benefit from using UTC in their internal clocks, thus
avoiding daylight saving time errors and general time ambiguity issues.
The user of the device does not have to worry about changing the time when
moving from place or when daylight saving changes.
The new mount option, when set, disables the counter-adjustment that Linux
currently makes to FAT timestamp info in anticipation of the normal
userspace time zone correction. When used in this new mode, all daylight
saving time and time zone handling is done in userspace as is normal for
many other filesystems (like ext3). The default mode, which remains
unchanged, is still appropriate when mounting volumes written in Windows
(because of its use of local time).
I originally based this patch on one submitted last year by Paul Collins,
but I updated it to work with current source and changed variable/option
naming. Ogawa Hirofumi (who maintains these filesystems) and I discussed
this patch at length on lkml, and he suggested using the option name in
the attached version of the patch. Barry Bouwsma pointed out a good
addition to the patch as well.
Signed-off-by: Joe Peterson <joe@skyrush.com>
Signed-off-by: Paul Collins <paul@ondioline.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Barry Bouwsma <free_beer_for_all@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It has been impossible to set the option 'atari' of the MSDOS filesystem
for several years. Since nobody seems to have missed it, let's remove its
remains.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes unnecessary parsing for directory entries.
If short_only, we don't need to parse longname. And if !both and it found
the longname, we don't need shortname.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This uses uses stack for shortname, and uses __getname() for longname in
fat_search_long() and __fat_readdir(). By this, it removes unneeded
__getname() for shortname.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is no logic changes, just cleans fs/fat/dir.c up.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct __fat_dirent is what was formerly the kernel struct dirent (that
was different from the userspace struct dirent).
Converting all fat users to struct __fat_dirent will allow us to get rid
of the conflicting struct dirent definition.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current parse_options() exits too early. We need to run the code of
bottom in this function even if users doesn't specify options.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove the definitions of macros:
XATTR_SECURITY_PREFIX
XATTR_TRUSTED_PREFIX
XATTR_USER_PREFIX
since they are defined in linux/xattr.h
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
j_commit_lock is a semaphore but uses it as if it were a mutex. This patch
converts it to a mutex.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
j_flush_sem is a semaphore but uses it as if it were a mutex. This patch
converts it to a mutex.
[akpm@linux-foundation.org: fix mutex_trylock retval treatment]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
j_lock is a semaphore but uses it as if it were a mutex. This patch converts
it to a mutex.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not allow user to change quota mount options when quota is just
suspended. It would make mount options and internal quota state inconsistent.
Also we should not allow user to change quota format when quota is turned on.
On the other hand we can just silently ignore when some option is set to the
value it already has (some mount versions do this on remount). Finally, we
should not discard current quota options if parsing of mount options fails.
Cc: <reiserfs-devel@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <reiserfs-devel@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In journal=data mode, it is not enough to do write_inode_now() as done in
vfs_quota_on() to write all data to their final location (which is needed for
quota_read to work correctly). Calling journal_end_sync() before calling
vfs_quota_on() does it's job because transactions are committed to the journal
and data marked as dirty in memory so write_inode_now() writes them to their
final locations.
Cc: <reiserfs-devel@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Apple Extended HFS file system: The semaphore extents lock is used as a
mutex. Convert it to the mutex API.
Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Apple Macintosh file system: The semaphore extens_lock is used as a mutex.
Convert it to the mutex API
Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Apple Macintosh file system: The semaphore bitmap_lock is used as a mutex.
Convert it to the mutex API
Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While fixing CONFIG_ leakages to the userspace kernel headers I ran into
CODA_FS_OLD_API.
After five years, are there still people using the old API left?
Especially considering that you have to choose at compile time which API
to support in the kernel (and distributions tend to offer the new API for
some time).
Jan: "The old API can definitely go. Around the time the new
interface went in there were some non-Coda userspace file system
implementations that took a while longer to convert to the new API,
but by now they all switched to the new interface or in some cases
to a FUSE-based solution."
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some iso9660 images contain files with rockridge data that is either
incorrect or incompletely parsed. Prior to commit
f2966632a1 ("[PATCH] rock: handle directory
overflows") (included with kernel 2.6.13) the kernel ignored the rockridge
data for these files, while still allowing the files to be accessed under
their non-rockridge names. That commit inadvertently changed things so
that files with invalid rockridge data could not be accessed at all. (I
ran across the problem when comparing some old CDs with hard disk copies I
had made long ago under kernel 2.4: a few of the files on the hard disk
copies were no longer visible on the CDs.)
This change reverts to the pre-2.6.13 behavior.
Signed-off-by: Adam Greenblatt <adam.greenblatt@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext3_dx_find_entry uses ext3_next_entry without verifying that the entry
is valid. If its rec_len == 0 this causes an infinite loop. Refactor the
loop to check the validity of entries before checking whether they match
and moving onto the next one.
There are other uses of ext3_next_entry in this file which also look
problematic. They should be reviewed and fixed if/when we have a
test-case that triggers them.
This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ordered mode, the current jbd aborts the journal if a file data buffer
has an error. But this behavior is unintended, and we found that it has
been adopted accidentally.
This patch undoes it and just calls printk() instead of aborting the
journal. Additionally, set AS_EIO into the address_space object of the
failed buffer which is submitted by journal_do_submit_data() so that
fsync() can get -EIO.
Missing error checkings are also added to inform errors on file data
buffers to the user. The following buffers are targeted.
(a) the buffer which has already been written out by pdflush
(b) the buffer which has been unlocked before scanned in the
t_locked_list loop
[akpm@linux-foundation.org: improve grammar in a printk]
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dx_root_limit() will never return 20, and I can't figure out what 20
stands for. This function has never changed since htree directory
indexing was merged.
Similar for dx_node_limit() and the magic 22.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After ext3-ordered files are truncated, there is a possibility that the
pages which cannot be estimated still remain. Remaining pages can be
released when the system has really few memory. So, it is not memory
leakage. But the resource management software etc. may not work
correctly.
It is possible that journal_unmap_buffer() cannot release the buffers, and
the pages to which they belong because they are attached to a commiting
transaction and journal_unmap_buffer() cannot release them. To release
such the buffers and the pages later, journal_unmap_buffer() leaves it to
journal_commit_transaction(). (journal_unmap_buffer() puts the mark
'BH_Freed' to the buffers so that journal_commit_transaction() can
identify whether they can be released or not.)
In the journalled mode and the writeback mode, jbd does with only metadata
buffers. But in the ordered mode, jbd does with metadata buffers and also
data buffers.
Actually, journal_commit_transaction() releases only the metadata buffers
of which release is demanded by journal_unmap_buffer(), and also releases
the pages to which they belong if possible.
As a result, the data buffers of which release is demanded by
journal_unmap_buffer() remain after a transaction commits. And also the
pages to which they belong remain.
Such the remained pages don't have mapping any longer. Due to this fact,
there is a possibility that the pages which cannot be estimated remain.
The metadata buffers marked 'BH_Freed' and the pages to which
they belong can be released at 'JBD: commit phase 7'.
Therefore, by applying the same code into 'JBD: commit phase 2' (where the
data buffers are done with), journal_commit_transaction() can also release
the data buffers marked 'BH_Freed' and the pages to which they belong.
As a result, all the buffers marked 'BH_Freed' can be released, and also
all the pages to which these buffers belong can be released at
journal_commit_transaction(). So, the page which cannot be estimated is
lost.
<<Excerpt of code at 'JBD: commit phase 7'>>
> spin_lock(&journal->j_list_lock);
> while (commit_transaction->t_forget) {
> transaction_t *cp_transaction;
> struct buffer_head *bh;
>
> jh = commit_transaction->t_forget;
>...
> if (buffer_freed(bh)) {
> ^^^^^^^^^^^^^^^^^^^^^^^^
> clear_buffer_freed(bh);
> ^^^^^^^^^^^^^^^^^^^^^^^^
> clear_buffer_jbddirty(bh);
> }
>
> if (buffer_jbddirty(bh)) {
> JBUFFER_TRACE(jh, "add to new checkpointing trans");
> __journal_insert_checkpoint(jh, commit_transaction);
> JBUFFER_TRACE(jh, "refile for checkpoint writeback");
> __journal_refile_buffer(jh);
> jbd_unlock_bh_state(bh);
> } else {
> J_ASSERT_BH(bh, !buffer_dirty(bh));
> ...
> JBUFFER_TRACE(jh, "refile or unfile freed buffer");
> __journal_refile_buffer(jh);
> if (!jh->b_transaction) {
> jbd_unlock_bh_state(bh);
> /* needs a brelse */
> journal_remove_journal_head(bh);
> release_buffer_page(bh);
> ^^^^^^^^^^^^^^^^^^^^^^^^
> } else
> }
****************************************************************
* Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' *
****************************************************************
At journal_commit_transaction() code, there is one extra message in the
series of jbd debug messages. ("JBD: commit phase 2") This patch fixes
it, too.
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While freeing indirect blocks we attach a journal head to the parent
buffer head, free the blocks, then journal the parent. If the indirect
block list is corrupted and points to the parent the journal head will be
detached when the block is cleared, causing an OOPS.
Check for that explicitly and handle it gracefully.
This patch fixes the third case (image hdb.20000057.nullderef.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
Immediately above the change, in the ext3_free_data function, we call
ext3_clear_blocks to clear the indirect blocks in this parent block. If
one of those blocks happens to actually be the parent block it will clear
b_private / BH_JBD.
I did the check at the end rather than earlier as it seemed more elegant.
I don't think there should be much practical difference, although it is
possible the FS may not be quite so badly corrupted if we did it the other
way (and didn't clear the block at all). To be honest, I'm not convinced
there aren't other similar failure modes lurking in this code, although I
couldn't find any with a quick review.
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A transient I/O error can corrupt inode data. Here is the scenario:
(1) update inode_A at the block_B
(2) pdflush writes out new inode_A to the filesystem, but it results
in write I/O error, at this point, BH_Uptodate flag of the buffer
for block_B is cleared and BH_Write_EIO is set
(3) create new inode_C which located at block_B, and
__ext3_get_inode_loc() tries to read on-disk block_B because the
buffer is not uptodate
(4) if it can read on-disk block_B successfully, inode_A is
overwritten by old data
This patch makes __ext3_get_inode_loc() not read the inode block if the
buffer has BH_Write_EIO flag. In this case, the buffer should have the
latest information, so setting the uptodate flag to the buffer (this
avoids WARN_ON_ONCE() in mark_buffer_dirty().)
According to this change, we would need to test BH_Write_EIO flag for the
error checking. Currently nobody checks write I/O errors on metadata
buffers, but it will be done in other patches I'm working on.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the orphan node list includes valid, untruncatable nodes with nlink > 0
the ext3_orphan_cleanup loop which attempts to delete them will not do so,
causing it to loop forever. Fix by checking for such nodes in the
ext3_orphan_get function.
This patch fixes the second case (image hdb.20000009.softlockup.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: printk warning fix]
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove the definitions of macros:
XATTR_TRUSTED_PREFIX
XATTR_USER_PREFIX
since they are defined in linux/xattr.h
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
journal_try_to_free_buffers() could race with jbd commit transaction when
the later is holding the buffer reference while waiting for the data
buffer to flush to disk. If the caller of journal_try_to_free_buffers()
request tries hard to release the buffers, it will treat the failure as
error and return back to the caller. We have seen the directo IO failed
due to this race. Some of the caller of releasepage() also expecting the
buffer to be dropped when passed with GFP_KERNEL mask to the
releasepage()->journal_try_to_free_buffers().
With this patch, if the caller is passing the __GFP_WAIT and __GFP_FS to
indicating this call could wait, in case of try_to_free_buffers() failed,
let's waiting for journal_commit_transaction() to finish commit the
current committing transaction, then try to free those buffers again.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- remove unnecessary code in free_rb_tree_fname
- rename free_rb_tree_fname to ext3_htree_create_dir_info
since it and ext3_htree_free_dir_info are a pair
- replace kmalloc with kzalloc in ext3_htree_free_dir_info
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make revocation cache destruction safe to call if initialisation fails
partially or entirely. This allows it to be used to cleanup in the case
of initialisation failure, simplifying that code slightly.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The revocation table initialisation/destruction code is repeated for each
of the two revocation tables stored in the journal. Refactoring the
duplicated code into functions is tidier, simplifies the logic in
initialisation in particular, and slightly reduces the code size.
There should not be any functional change.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.
Note that even with this fix things will break badly if jbd is statically
compiled in and cache initialisation fails.
Signed-off-by: Duane Griffin <duaneg@dghda.com
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not allow user to change quota mount options when quota is just
suspended. I would make mount options and internal quota state inconsistent.
Also we should not allow user to change quota format when quota is turned on.
On the other hand we can just silently ignore when some option is set to the
value it already has (mount does this on remount).
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In journal=data mode, it is not enough to do write_inode_now as done in
vfs_quota_on() to write all data to their final location (which is needed for
quota_read to work correctly). Calling journal_flush() does its job.
Reported-by: Nick <gentuu@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove the definitions of macros:
XATTR_TRUSTED_PREFIX
XATTR_USER_PREFIX
since they are defined in linux/xattr.h
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the !NO_TRUNCATE code that anyway required a manual
editing of the code for being used.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/exec.c used to need mman.h pagemap.h swap.h and rmap.h when it did
mm-ish stuff in install_arg_page(); but no need for them after 2.6.22.
[akpm@linux-foundation.org: unbreak arm]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the private BE16/BE32/BE64 macros with direct calls to
get_unaligned_be16/32/64.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some IBM POWER-based platforms have the ability to run in a
mode which mostly appears to the OS as a different processor from the
actual hardware. For example, a Power6 system may appear to be a
Power5+, which makes the AT_PLATFORM value "power5+". This means that
programs are restricted to the ISA supported by Power5+;
Power6-specific instructions are treated as illegal.
However, some applications (virtual machines, optimized libraries) can
benefit from knowledge of the underlying CPU model. A new aux vector
entry, AT_BASE_PLATFORM, will denote the actual hardware. For
example, on a Power6 system in Power5+ compatibility mode, AT_PLATFORM
will be "power5+" and AT_BASE_PLATFORM will be "power6". The idea is
that AT_PLATFORM indicates the instruction set supported, while
AT_BASE_PLATFORM indicates the underlying microarchitecture.
If the architecture has defined ELF_BASE_PLATFORM, copy that value to
the user stack in the same manner as ELF_PLATFORM.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This fixes the following compile error caused by commit
f9247273cb ("UFS: add const to parser
token table"):
CC fs/nfs/nfsroot.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict
make[3]: *** [fs/nfs/nfsroot.o] Error 1
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(expecting a SPNEGO pseudo-mechanism oid), the test to verify it is a
primitive encoding is compared against the asn1 class. Primitive is not a
class. This brings check in line with similar check for krb/ntlmssp oid.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch adds a "const" to the parser token table. I've done an
allmodconfig build to see if this produces any warnings/failures and the
patch includes a fix for the only warning that was produced.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Alexander Viro <aviro@redhat.com>
Acked-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ioctls AUTOFS_IOC_TOGGLEREGHOST and AUTOFS_IOC_ASKREGHOST were added
several years ago but what they were intended for has never been
implemented (as far as I'm aware noone uses them) so remove them.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch re-orgnirzes the checking for and waiting on active expires and
elininates redundant checks.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Appologies, somehow I seem to have sent an out dated version of this
patch. Here is an additional patch that brings the patch up to date.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For direct and offset type mounts that are covered by another mount we
cannot check the AUTOFS_INF_EXPIRING flag during a path walk which leads
to lookups walking into an expiring mount while it is being expired.
For example, for the direct multi-mount map entry with a couple of
offsets:
/race/mm1 / <server1>:/<path1>
/om1 <server2>:/<path2>
/om2 <server1>:/<path3>
an autofs trigger mount is mounted on /race/mm1 and when accessed it is
over mounted and trigger mounts made for /race/mm1/om1 and /race/mm1/om2.
So it isn't possible for path walks to see the expiring flag at all and
they happily walk into the file system while it is expiring.
When expiring these mounts follow_down() must stop at the autofs mount and
all processes must block in the ->follow_link() method (except the daemon)
until the expire is complete. This is done by decrementing the d_mounted
field of the autofs trigger mount root dentry until the expire is
completed. In ->follow_link() all processes wait on the expire and the
mount following is completed for the daemon until the expire is complete.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The selection of a dentry for expiration and the setting of the
AUTOFS_INF_EXPIRING flag isn't done atomically which can lead to lookups
walking into an expiring mount.
What happens is that an expire is initiated by the daemon and a dentry is
selected for expire but, since there is no lock held between the selection
and setting of the expiring flag, a process may find the flag clear and
continue walking into the mount tree at the same time the daemon attempts
the expire it.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two cases for which a dentry that has a pending mount request
does not wait for completion. One is via autofs4_revalidate() and the
other via autofs4_follow_link().
In revalidate, after the mount point directory is created, but before the
mount is done, the check in try_to_fill_dentry() can can fail to send the
dentry to the wait queue since the dentry is positive and the lookup flags
may contain only LOOKUP_FOLLOW. Although we don't trigger a mount for the
LOOKUP_FOLLOW flag, if ther's one pending we might as well wait and use
the mounted dentry for the lookup.
In autofs4_follow_link() the dentry is not checked to see if it is pending
so it may fail to call try_to_fill_dentry() and not wait for mount
completion.
A dentry that is pending must always be sent to the wait queue.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mount triggering functionality of readdir and related functions is no
longer used (and is quite broken as well). The unused portions have been
removed.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have been seeing mount requests comming to the automount daemon for
keys of the form "<map key>/<non key directory>" which are lookups for
invalid map keys. But we can check for this in the kernel module and
return a fail immediately, without having to send a request to the daemon.
It is possible to recognise these requests are invalid based on whether
the request dentry is negative and its relation to the autofs file system
root.
For example, given the indirect multi-mount map entry:
idm1 \
/mm1 <server>:/<path1>
/mm2 <server>:/<path2>
For a request to mount idm1, IS_ROOT((idm1)->d_parent) will be always be
true and the dentry may be negative. But directories idm1/mm1 and
idm1/mm2 will always be created as part of the mount request for idm1. So
any mount request within idm1 itself must have a positive dentry otherwise
the map key is invalid.
In version 4 these multi-mount entries are all mounted and umounted as a
single request and in version 5 the directories idm1/mm1 and idm1/mm2 are
created and an autofs fs mounted on them to act as a mount trigger so the
above is also true.
This also holds true for the autofs version 4 pseudo direct mount feature.
When this feature is used without the "--ghost" option automount(8) will
create internal submounts as we go down the map key paths which are
essentially normal indirect mounts for which the above holds. If the
"--ghost" option is given the directories for map keys are created at
daemon startup so valid map entries correspond to postive dentries in the
autofs fs.
autofs version 5 direct mount maps are similar except that the IS_ROOT
check is not needed. This has been addressed in a previous patch tittled
"autofs4 - detect invalid direct mount requests".
For example, given the direct multi-mount map entry:
/test/dm1 \
/mm1 <server>:/<path1>
/mm2 <server>:/<path2>
An autofs fs is mounted on /test/dm1 as a trigger mount and when a mount
is triggered for /test/dm1, the multi-mount offset directories
/test/dm1/mm1 and /test/dm1/mm2 are created and an autofs fs is mounted on
them to act as mount triggers. So valid direct mount requests must always
have a positive dentry if they correspond to a valid map entry.
Signed-off-by: Ian Kent <raven@themaw.net>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs v5 direct and offset mounts within an autofs filesystem are
triggered by existing autofs triger mounts so the mount point dentry must
be positive. If the mount point dentry is negative then the trigger
doesn't exist so we can return fail immediately.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an autofs mount becomes catatonic before autofs4_wait_release() is
called the wait queue counter will not be decremented down to zero and the
entry will never be freed. There are also races decrementing the wait
counter in the wait release function. To deal with this the counter needs
to be updated while holding the wait queue mutex and waiters need to be
woken up unconditionally when the wait is removed from the queue to ensure
we eventually free the wait.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible for an autofs mount to become catatonic (and for the daemon
communication pipe to become NULL) after a wait has been initiallized but
before the request has been sent to the daemon. We need to check for this
before sending the request packet.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It see that the patch tittled "autofs4 - fix pending mount race" is
missing a change that I had recently made.
It's missing a kfree for the case mutex_lock_interruptible() fails
to aquire the wait queue mutex.
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Close a race between a pending mount that is about to finish and a new
lookup for the same directory.
Process P1 triggers a mount of directory foo. It sets
DCACHE_AUTOFS_PENDING in the ->lookup routine, creates a waitq entry for
'foo', and calls out to the daemon to perform the mount. The autofs
daemon will then create the directory 'foo', using a new dentry that will
be hashed in the dcache.
Before the mount completes, another process, P2, tries to walk into the
'foo' directory. The vfs path walking code finds an entry for 'foo' and
calls the revalidate method. Revalidate finds that the entry is not
PENDING (because PENDING was never set on the dentry created by the
mkdir), but it does find the directory is empty. Revalidate calls
try_to_fill_dentry, which sets the PENDING flag and then calls into the
autofs4 wait code to trigger or wait for a mount of 'foo'. The wait code
finds the entry for 'foo' and goes to sleep waiting for the completion of
the mount.
Yet another process, P3, tries to walk into the 'foo' directory. This
process again finds a dentry in the dcache for 'foo', and calls into the
autofs revalidate code.
The revalidate code finds that the PENDING flag is set, and so calls
try_to_fill_dentry.
a) try_to_fill_dentry sets the PENDING flag redundantly for this
dentry, then calls into the autofs4 wait code.
b) the autofs4 wait code takes the waitq mutex and searches for an
entry for 'foo'
Between a and b, P1 is woken up because the mount completed. P1 takes the
wait queue mutex, clears the PENDING flag from the dentry, and removes the
waitqueue entry for 'foo' from the list.
When it releases the waitq mutex, P3 (eventually) acquires it. At this
time, it looks for an existing waitq for 'foo', finds none, and so creates
a new one and calls out to the daemon to mount the 'foo' directory.
Now, the reason that three processes are required to trigger this race is
that, because the PENDING flag is not set on the dentry created by mkdir,
the window for the race would be way to slim for it to ever occur.
Basically, between the testing of d_mountpoint(dentry) and the taking of
the waitq mutex, the mount would have to complete and the daemon would
have to be woken up, and that in turn would have to wake up P1. This is
simply impossible. Add the third process, though, and it becomes slightly
more likely.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The autofs4_catatonic_mode() function accesses the wait queue without any
locking but can be called at any time. This could lead to a possible
double free of the name field of the wait and a double fput of the daemon
communication pipe or an fput of a NULL file pointer.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The autofs_wait_queue already contains all of the fields of the
struct qstr, so change it into a qstr.
This patch, from Jeff Moyer, has been modified a liitle by myself.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an open(2) call is made on an autofs mount point directory that
already exists and the O_DIRECTORY flag is not used the needed mount
callback to the daemon is not done. This leads to the path walk
continuing resulting in a callback to the daemon with an incorrect
key. open(2) is called without O_DIRECTORY by the "find" utility but
this should be handled properly anyway.
This happens because autofs needs to use the lookup flags to decide
when to callback to the daemon to perform a mount to prevent mount
storms. For example, an autofs indirect mount map that has the "browse"
option will have the mount point directories are pre-created and the
stat(2) call made by a color ls against each directory will cause all
these directories to be mounted. It is unfortunate we need to resort
to this but mount maps can be quite large. Additionally, if a user
manually umounts an autofs indirect mount the directory isn't removed
which also leads to this situation.
To resolve this autofs needs to use the lookup intent flags to enable
it to make this decision. This patch adds this check and triggers a
call back if any of the lookup intent flags are set as all these calls
warrant a mount attempt be requested.
I know that external VFS code which uses the lookup flags is something
that the VFS would like to eliminate but I have no choice as I can't
see any other way to do this. A VFS dentry or inode operation callback
which returns the lookup "type" (requires a definition) would be
sufficient. But this change is needed now and I'm not aware of the form
that coming VFS changes will take so I'm not willing to propose anything
along these lines.
If anyone can provide an alternate method I would be happy to use it.
[akpm@linux-foundation.org: fix build for concurrent VFS changes]
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we now delay hashing of dentrys until the ->mkdir() call, droping
and re-taking the directory mutex within the ->lookup() function when we
are being called by user space is not needed. This can lead to a race
when other processes are attempting to access the same directory during
mount point directory creation.
In this case we need to hang onto the mutex to ensure we don't get user
processes trying to create a mount request for a newly created dentry
after the mount point entry has already been created. This ensures that
when we need to check a dentry passed to autofs4_wait(), if it is hashed,
it is always the mount point dentry and not a new dentry created by
another lookup during directory creation.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The length of the symlink name has been moved but it needs to be set
before allocating space for it in the dentry info struct. This corrects a
mistake in a recent patch.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A while ago a patch to resolve a deadlock during directory creation was
merged. This delayed the hashing of lookup dentrys until the ->mkdir()
(or ->symlink()) operation completed to ensure we always went through
->lookup() instead of also having processes go through ->revalidate() so
our VFS locking remained consistent.
Now we are seeing a couple of side affects of that change in situations
with heavy mount activity.
Two cases have been identified:
1) When a mount request is triggered, due to the delayed hashing, the
directory created by user space for the mount point doesn't have the
DCACHE_AUTOFS_PENDING flag set. In the case of an autofs multi-mount
where a tree of mount point directories are created this can lead to
the path walk continuing rather than the dentry being sent to the wait
queue to wait for request completion. This is because, if the pending
flag isn't set, the criteria for deciding this is a mount in progress
fails to hold, namely that the dentry is not a mount point and has no
subdirectories.
2) A mount request dentry is initially created negative and unhashed.
It remains this way until the ->mkdir() callback completes. Since it
is unhashed a fresh dentry is used when the user space mount request
creates the mount point directory. This leaves the original dentry
negative and unhashed. But revalidate has no way to tell the VFS that
the dentry has changed, other than to force another ->lookup() by
returning false, which is at best wastefull and at worst not possible.
This results in an -ENOENT return from the original path walk when in
fact the mount succeeded.
To resolve this we need to ensure that the same dentry is used in all
calls to ->lookup() during the course of a mount request. This patch
achieves that by adding the initial dentry to a look aside list and
removes it at ->mkdir() or ->symlink() completion (or when the dentry is
released), since these are the only create operations autofs4 supports.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch series enables the use of a single dentry for lookups prior to
the dentry being hashed and so we no longer need to redo the lookup. This
patch reverts the patch of commit
033790449b.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct the error of making a positive dentry negative after it has been
instantiated.
The code that makes this error attempts to re-use the dentry from a
concurrent expire and mount to resolve a race and the dentry used for the
lookup must be negative for mounts to trigger in the required cases. The
fact is that the dentry doesn't need to be re-used because all that is
needed is to preserve the flag that indicates an expire is still
incomplete at the time of the mount request.
This change uses the the dentry to check the flag and wait for the expire
to complete then discards it instead of attempting to re-use it.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no good reason to immediately open the lower file, and that can
cause problems with files that the user does not intend to immediately
open, such as device nodes.
This patch removes the persistent file open from the interpose step and
pushes that to the locations where eCryptfs really does need the lower
persistent file, such as just before reading or writing the metadata
stored in the lower file header.
Two functions are jumping to out_dput when they should just be jumping to
out on error paths. This patch also fixes these.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When creating device nodes, eCryptfs needs to delay actually opening the lower
persistent file until an application tries to open. Device handles may not be
backed by anything when they first come into existence.
[Valdis.Kletnieks@vt.edu: build fix]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <Valdis.Kletnieks@vt.edu}
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes the following sparse warnings:
fs/ecryptfs/crypto.c:1036:8: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1038:8: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1077:10: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1103:6: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1105:6: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1124:8: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1241:21: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1244:30: warning: incorrect type in assignment (different base types)
fs/ecryptfs/crypto.c:1414:23: warning: cast to restricted __be32
fs/ecryptfs/crypto.c:1417:32: warning: cast to restricted __be16
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up overcomplicated string copy, which also gets rid of this
bogus warning:
fs/ecryptfs/main.c: In function 'ecryptfs_parse_options':
include/asm/arch/string_32.h:75: warning: array subscript is above array bounds
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mounting with invalid key signatures should probably fail, if they were
specifically requested but not available.
Also fix case checks in process_request_key_err() for the right sign of
the errnos, as spotted by Jan Tluka.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Tluka <jtluka@redhat.com>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userspace eCryptfs daemon sends HELO and QUIT messages to the kernel
for per-user daemon (un)registration. These messages are required when
netlink is used as the transport, but (un)registration is handled by
opening and closing the device file when miscdev is the transport. These
messages should be discarded in the miscdev transport so that a daemon
isn't registered twice.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eCryptfs would really like to have read-write access to all files in the
lower filesystem. Right now, the persistent lower file may be opened
read-only if the attempt to open it read-write fails. One way to keep
from having to do that is to have a privileged kthread that can open the
lower persistent file on behalf of the user opening the eCryptfs file;
this patch implements this functionality.
This patch will properly allow a less-privileged user to open the eCryptfs
file, followed by a more-privileged user opening the eCryptfs file, with
the first user only being able to read and the second user being able to
both read and write. eCryptfs currently does this wrong; it will wind up
calling vfs_write() on a file that was opened read-only. This is fixed in
this patch.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds test that ensure the boundary conditions for the various
constants introduced in the previous patches is met. No code is generated.
[akpm@linux-foundation.org: fix alpha]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds O_NONBLOCK support to pipe2. It is minimally more involved
than the patches for eventfd et.al but still trivial. The interfaces of the
create_write_pipe and create_read_pipe helper functions were changed and the
one other caller as well.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_pipe2
# ifdef __x86_64__
# define __NR_pipe2 293
# elif defined __i386__
# define __NR_pipe2 331
# else
# error "need __NR_pipe2"
# endif
#endif
int
main (void)
{
int fds[2];
if (syscall (__NR_pipe2, fds, 0) == -1)
{
puts ("pipe2(0) failed");
return 1;
}
for (int i = 0; i < 2; ++i)
{
int fl = fcntl (fds[i], F_GETFL);
if (fl == -1)
{
puts ("fcntl failed");
return 1;
}
if (fl & O_NONBLOCK)
{
printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i);
return 1;
}
close (fds[i]);
}
if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1)
{
puts ("pipe2(O_NONBLOCK) failed");
return 1;
}
for (int i = 0; i < 2; ++i)
{
int fl = fcntl (fds[i], F_GETFL);
if (fl == -1)
{
puts ("fcntl failed");
return 1;
}
if ((fl & O_NONBLOCK) == 0)
{
printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
return 1;
}
close (fds[i]);
}
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Building on the previous change to anon_inode_getfd, this patch introduces
support for handling of O_NONBLOCK in addition to the already supported
O_CLOEXEC. Following patches will take advantage of this support. As can be
seen, the additional support for supporting this functionality is minimal.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces the new syscall inotify_init1 (note: the 1 stands for
the one parameter the syscall takes, as opposed to no parameter before). The
values accepted for this parameter are function-specific and defined in the
inotify.h header. Here the values must match the O_* flags, though. In this
patch CLOEXEC support is introduced.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_inotify_init1
# ifdef __x86_64__
# define __NR_inotify_init1 294
# elif defined __i386__
# define __NR_inotify_init1 332
# else
# error "need __NR_inotify_init1"
# endif
#endif
#define IN_CLOEXEC O_CLOEXEC
int
main (void)
{
int fd;
fd = syscall (__NR_inotify_init1, 0);
if (fd == -1)
{
puts ("inotify_init1(0) failed");
return 1;
}
int coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
puts ("inotify_init1(0) set close-on-exit");
return 1;
}
close (fd);
fd = syscall (__NR_inotify_init1, IN_CLOEXEC);
if (fd == -1)
{
puts ("inotify_init1(IN_CLOEXEC) failed");
return 1;
}
coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit");
return 1;
}
close (fd);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces the new syscall pipe2 which is like pipe but it also
takes an additional parameter which takes a flag value. This patch implements
the handling of O_CLOEXEC for the flag. I did not add support for the new
syscall for the architectures which have a special sys_pipe implementation. I
think the maintainers of those archs have the chance to go with the unified
implementation but that's up to them.
The implementation introduces do_pipe_flags. I did that instead of changing
all callers of do_pipe because some of the callers are written in assembler.
I would probably screw up changing the assembly code. To avoid breaking code
do_pipe is now a small wrapper around do_pipe_flags. Once all callers are
changed over to do_pipe_flags the old do_pipe function can be removed.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_pipe2
# ifdef __x86_64__
# define __NR_pipe2 293
# elif defined __i386__
# define __NR_pipe2 331
# else
# error "need __NR_pipe2"
# endif
#endif
int
main (void)
{
int fd[2];
if (syscall (__NR_pipe2, fd, 0) != 0)
{
puts ("pipe2(0) failed");
return 1;
}
for (int i = 0; i < 2; ++i)
{
int coe = fcntl (fd[i], F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
printf ("pipe2(0) set close-on-exit for fd[%d]\n", i);
return 1;
}
}
close (fd[0]);
close (fd[1]);
if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0)
{
puts ("pipe2(O_CLOEXEC) failed");
return 1;
}
for (int i = 0; i < 2; ++i)
{
int coe = fcntl (fd[i], F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
printf ("pipe2(O_CLOEXEC) does not set close-on-exit for fd[%d]\n", i);
return 1;
}
}
close (fd[0]);
close (fd[1]);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the new dup3 syscall. It extends the old dup2 syscall by one
parameter which is meant to hold a flag value. Support for the O_CLOEXEC flag
is added in this patch.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_dup3
# ifdef __x86_64__
# define __NR_dup3 292
# elif defined __i386__
# define __NR_dup3 330
# else
# error "need __NR_dup3"
# endif
#endif
int
main (void)
{
int fd = syscall (__NR_dup3, 1, 4, 0);
if (fd == -1)
{
puts ("dup3(0) failed");
return 1;
}
int coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
puts ("dup3(0) set close-on-exec flag");
return 1;
}
close (fd);
fd = syscall (__NR_dup3, 1, 4, O_CLOEXEC);
if (fd == -1)
{
puts ("dup3(O_CLOEXEC) failed");
return 1;
}
coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
puts ("dup3(O_CLOEXEC) set close-on-exec flag");
return 1;
}
close (fd);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the new epoll_create2 syscall. It extends the old epoll_create
syscall by one parameter which is meant to hold a flag value. In this
patch the only flag support is EPOLL_CLOEXEC which causes the close-on-exec
flag for the returned file descriptor to be set.
A new name EPOLL_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_epoll_create2
# ifdef __x86_64__
# define __NR_epoll_create2 291
# elif defined __i386__
# define __NR_epoll_create2 329
# else
# error "need __NR_epoll_create2"
# endif
#endif
#define EPOLL_CLOEXEC O_CLOEXEC
int
main (void)
{
int fd = syscall (__NR_epoll_create2, 1, 0);
if (fd == -1)
{
puts ("epoll_create2(0) failed");
return 1;
}
int coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
puts ("epoll_create2(0) set close-on-exec flag");
return 1;
}
close (fd);
fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
if (fd == -1)
{
puts ("epoll_create2(EPOLL_CLOEXEC) failed");
return 1;
}
coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag");
return 1;
}
close (fd);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The timerfd_create syscall already has a flags parameter. It just is
unused so far. This patch changes this by introducing the TFD_CLOEXEC
flag to set the close-on-exec flag for the returned file descriptor.
A new name TFD_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_timerfd_create
# ifdef __x86_64__
# define __NR_timerfd_create 283
# elif defined __i386__
# define __NR_timerfd_create 322
# else
# error "need __NR_timerfd_create"
# endif
#endif
#define TFD_CLOEXEC O_CLOEXEC
int
main (void)
{
int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0);
if (fd == -1)
{
puts ("timerfd_create(0) failed");
return 1;
}
int coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
puts ("timerfd_create(0) set close-on-exec flag");
return 1;
}
close (fd);
fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_CLOEXEC);
if (fd == -1)
{
puts ("timerfd_create(TFD_CLOEXEC) failed");
return 1;
}
coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
puts ("timerfd_create(TFD_CLOEXEC) set close-on-exec flag");
return 1;
}
close (fd);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the new eventfd2 syscall. It extends the old eventfd
syscall by one parameter which is meant to hold a flag value. In this
patch the only flag support is EFD_CLOEXEC which causes the close-on-exec
flag for the returned file descriptor to be set.
A new name EFD_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_eventfd2
# ifdef __x86_64__
# define __NR_eventfd2 290
# elif defined __i386__
# define __NR_eventfd2 328
# else
# error "need __NR_eventfd2"
# endif
#endif
#define EFD_CLOEXEC O_CLOEXEC
int
main (void)
{
int fd = syscall (__NR_eventfd2, 1, 0);
if (fd == -1)
{
puts ("eventfd2(0) failed");
return 1;
}
int coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
puts ("eventfd2(0) sets close-on-exec flag");
return 1;
}
close (fd);
fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC);
if (fd == -1)
{
puts ("eventfd2(EFD_CLOEXEC) failed");
return 1;
}
coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag");
return 1;
}
close (fd);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the new signalfd4 syscall. It extends the old signalfd
syscall by one parameter which is meant to hold a flag value. In this
patch the only flag support is SFD_CLOEXEC which causes the close-on-exec
flag for the returned file descriptor to be set.
A new name SFD_CLOEXEC is introduced which in this implementation must
have the same value as O_CLOEXEC.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_signalfd4
# ifdef __x86_64__
# define __NR_signalfd4 289
# elif defined __i386__
# define __NR_signalfd4 327
# else
# error "need __NR_signalfd4"
# endif
#endif
#define SFD_CLOEXEC O_CLOEXEC
int
main (void)
{
sigset_t ss;
sigemptyset (&ss);
sigaddset (&ss, SIGUSR1);
int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0);
if (fd == -1)
{
puts ("signalfd4(0) failed");
return 1;
}
int coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if (coe & FD_CLOEXEC)
{
puts ("signalfd4(0) set close-on-exec flag");
return 1;
}
close (fd);
fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC);
if (fd == -1)
{
puts ("signalfd4(SFD_CLOEXEC) failed");
return 1;
}
coe = fcntl (fd, F_GETFD);
if (coe == -1)
{
puts ("fcntl failed");
return 1;
}
if ((coe & FD_CLOEXEC) == 0)
{
puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag");
return 1;
}
close (fd);
puts ("OK");
return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch just extends the anon_inode_getfd interface to take an additional
parameter with a flag value. The flag value is passed on to
get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag.
No actual semantic changes here, the changed callers all pass 0 for now.
[akpm@linux-foundation.org: KVM fix]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a check for an overflow in the filesystem size so if someone is
checking with statfs() on a 16G blocksize hugetlbfs in a 32bit binary that
it will report back EOVERFLOW instead of a size of 0.
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the ability to configure the hugetlb hstate used on a per mount basis.
- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- This option causes the mount code to find the hstate corresponding to the
specified size, and sets up a pointer to the hstate in the mount's
superblock.
- Change the hstate accessors to use this information rather than the
global_hstate they were using (requires a slight change in mm/memory.c
so we don't NULL deref in the error-unmap path -- see comments).
[np: take hstate out of hugetlbfs inode and vma->vm_private_data]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goal of this patchset is to support multiple hugetlb page sizes. This
is achieved by introducing a new struct hstate structure, which
encapsulates the important hugetlb state and constants (eg. huge page
size, number of huge pages currently allocated, etc).
The hstate structure is then passed around the code which requires these
fields, they will do the right thing regardless of the exact hstate they
are operating on.
This patch adds the hstate structure, with a single global instance of it
(default_hstate), and does the basic work of converting hugetlb to use the
hstate.
Future patches will add more hstate structures to allow for different
hugetlbfs mounts to have different page sizes.
[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork(). At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.
We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork(). Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediatly
after. A failure here would be very undesirable.
Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad. The following situation is allowed to occur today.
1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
had taken care to ensure the pages existed
This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap(). When the parent performs COW, it
will try to satisfy the allocation without using reserves. If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure. If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.
To summarise the new behaviour:
1. If the original mapper performs COW on a private mapping with multiple
references, it will attempt to allocate a hugepage from the pool or
the buddy allocator without using the existing reserves. On fail, VMAs
mapping the same area are traversed and the page being COW'd is unmapped
where found. It will then steal the original page as the last mapper in
the normal way.
2. The VMAs the pages were unmapped from are flagged to note that pages
with data no longer exist. Future no-page faults on those VMAs will
terminate the process as otherwise it would appear that data was corrupted.
A warning is printed to the console that this situation occured.
2. If the child performs COW first, it will attempt to satisfy the COW
from the pool if there are enough pages or via the buddy allocator if
overcommit is allowed and the buddy allocator can satisfy the request. If
it fails, the child will be killed.
If the pool is large enough, existing applications will not notice that
the reserves were a factor. Existing applications depending on the
no-reserves been set are unlikely to exist as for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().
[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
a similar manner to the reservations taken for MAP_SHARED mappings. The
reserve count is accounted both globally and on a per-VMA basis for
private mappings. This guarantees that a process that successfully calls
mmap() will successfully fault all pages in the future unless fork() is
called.
The characteristics of private mappings of hugetlbfs files behaviour after
this patch are;
1. The process calling mmap() is guaranteed to succeed all future faults until
it forks().
2. On fork(), the parent may die due to SIGKILL on writes to the private
mapping if enough pages are not available for the COW. For reasonably
reliable behaviour in the face of a small huge page pool, children of
hugepage-aware processes should not reference the mappings; such as
might occur when fork()ing to exec().
3. On fork(), the child VMAs inherit no reserves. Reads on pages already
faulted by the parent will succeed. Successful writes will depend on enough
huge pages being free in the pool.
4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
and at fault time otherwise.
Before this patch, all reads or writes in the child potentially needs page
allocations that can later lead to the death of the parent. This applies
to reads and writes of uninstantiated pages as well as COW. After the
patch it is only a write to an instantiated page that causes problems.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Summary]
Split LRU-list of unused dentries to one per superblock to avoid soft
lock up during NFS mounts and remounting of any filesystem.
Previously I posted here:
http://lkml.org/lkml/2008/3/5/590
[Descriptions]
- background
dentry_unused is a list of dentries which are not referenced.
dentry_unused grows up when references on directories or files are
released. This list can be very long if there is huge free memory.
- the problem
When shrink_dcache_sb() is called, it scans all dentry_unused linearly
under spin_lock(), and if dentry->d_sb is differnt from given
superblock, scan next dentry. This scan costs very much if there are
many entries, and very ineffective if there are many superblocks.
IOW, When we need to shrink unused dentries on one dentry, but scans
unused dentries on all superblocks in the system. For example, we scan
500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
dentries on other superblocks.
In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
unused dentries on NFS, but scans 100,000,000 unused dentries on
superblocks in the system such as local ext3 filesystems. I hear NFS
mounting took 1 min on some system in use.
* : NFS uses virtual filesystem in rpc layer, so NFS is affected by
this problem.
100,000,000 is possible number on large systems.
Per-superblock LRU of unused dentried can reduce the cost in
reasonable manner.
- How to fix
I found this problem is solved by David Chinner's "Per-superblock
unused dentry LRU lists V3"(1), so I rebase it and add some fix to
reclaim with fairness, which is in Andrew Morton's comments(2).
1) http://lkml.org/lkml/2006/5/25/318
2) http://lkml.org/lkml/2006/5/25/320
Split LRU-list of unused dentries to each superblocks. Then, NFS
mounting will check dentries under a superblock instead of all. But
this spliting will break LRU of dentry-unused. So, I've attempted to
make reclaim unused dentrins with fairness by calculate number of
dentries to scan on this sb based on following way
number of dentries to scan on this sb =
count * (number of dentries on this sb / number of dentries in the machine)
- ToDo
- I have to measuring performance number and do stress tests.
- When unmount occurs during prune_dcache(), scanning on same
superblock, It is unable to reach next superblock because it is gone
away. We restart scannig superblock from first one, it causes
unfairness of reclaim unused dentries on first superblock. But I think
this happens very rarely.
- Test Results
Result on 6GB boxes with excessive unused dentries.
Without patch:
$ cat /proc/sys/fs/dentry-state
10181835 10180203 45 0 0 0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real 0m1.830s
user 0m0.001s
sys 0m1.653s
With this patch:
$ cat /proc/sys/fs/dentry-state
10236610 10234751 45 0 0 0
# mount -t nfs 10.124.60.70:/work/kernel-src nfs
real 0m0.106s
user 0m0.002s
sys 0m0.032s
[akpm@linux-foundation.org: fix comments]
Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: David Chinner <dgc@sgi.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The double indirection here is not needed anywhere and hence (at least)
confusing.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds proper extern declarations for five variables in
include/linux/vmstat.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- use kzalloc() instead of kmalloc() + memset()
- improve a printk info
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In mode_to_acl when converting a Unix mode to a Windows ACL
the subauth fields of the SID in the ACL were translated
incorrectly on bigendian architectures
Signed-off-by: Steve French <sfrench@us.ibm.com>
fs/cifs/cifssmb.c:3917:13: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:3917:13: expected bool [unsigned] [usertype] is_unicode
fs/cifs/cifssmb.c:3917:13: got restricted __le16
The comment explains why __force is used here.
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
fs/cifs/connect.c:458:16: warning: cast to restricted __be32
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Move the code that handles ATTR_SIZE changes to its own function. This
makes for a smaller function and reduces the level of indentation.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
fs/lockd/svcproc.c:115:11: warning: incorrect type in initializer (different base types)
fs/lockd/svcproc.c:115:11: expected int [signed] rc
fs/lockd/svcproc.c:115:11: got restricted __be32 [usertype] <noident>
... and so on...
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
ipw2200: Call netif_*_queue() interfaces properly.
netxen: Needs to include linux/vmalloc.h
[netdrvr] atl1d: fix !CONFIG_PM build
r6040: rework init_one error handling
r6040: bump release number to 0.18
r6040: handle RX fifo full and no descriptor interrupts
r6040: change the default waiting time
r6040: use definitions for magic values in descriptor status
r6040: completely rework the RX path
r6040: call napi_disable when puting down the interface and set lp->dev accordingly.
mv643xx_eth: fix NETPOLL build
r6040: rework the RX buffers allocation routine
r6040: fix scheduling while atomic in r6040_tx_timeout
r6040: fix null pointer access and tx timeouts
r6040: prefix all functions with r6040
rndis_host: support WM6 devices as modems
at91_ether: use netstats in net_device structure
sfc: Create one RX queue and interrupt per CPU package by default
sfc: Use a separate workqueue for resets
sfc: I2C adapter initialisation fixes
...
get_proc_net() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits)
arm: bus_id -> dev_name() and dev_set_name() conversions
sparc64: fix up bus_id changes in sparc core code
3c59x: handle pci_name() being const
MTD: handle pci_name() being const
HP iLO driver
sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute
sysdev: Add utility functions for simple int/ulong variable sysdev attributes
sysdev: Pass the attribute to the low level sysdev show/store function
driver core: Suppress sysfs warnings for device_rename().
kobject: Transmit return value of call_usermodehelper() to caller
sysfs-rules.txt: reword API stability statement
debugfs: Implement debugfs_remove_recursive()
HOWTO: change email addresses of James in HOWTO
always enable FW_LOADER unless EMBEDDED=y
uio-howto.tmpl: use unique output names
uio-howto.tmpl: use standard copyright/legal markings
sysfs: don't call notify_change
sysdev: fix debugging statements in registration code.
kobject: should use kobject_put() in kset-example
kobject: reorder kobject to save space on 64 bit builds
...
struct pagemap_walk was placed on stack, some hooks are initialized, the
rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Linux kernel puts the filename argument of execve() into the new
address space. Many developers are surprised to learn this. Those who
know and could use it, object "But it's not documented."
Those who want to use it dislike the expression
(char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env])
because it requires locating the last original environment variable,
and assumes that the filename follows the characters.
This patch documents the insertion of the filename, and makes it easier
to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>.
In many cases readlink("/proc/self/exe",) gives the same answer. But if
all the original pages get unmapped, then the kernel erases the symlink
for /proc/self/exe. This can happen when a program decompressor does a
good job of cleaning up after uncompressing directly to memory, so that
the address space of the target program looks the same as if compression
had never happened. One example is http://upx.sourceforge.net .
One notable use of the underlying concept (what path containED the
executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for
the near term, it may be a good idea for user-mode code to use both
/proc/self/exe and AT_EXECFN as fall-back methods for each other.
/proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it
won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val
also can get overwritten, although in nearly all cases this would be the
result of a bug.
The runtime cost is one NEW_AUX_ENT using two words of stack space. The
underlying value is maintained already as bprm->exec; setup_arg_pages()
in fs/exec.c slides it for stack_shift, etc.
Signed-off-by: John Reiser <jreiser@BitWagon.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
driver core: Suppress sysfs warnings for device_rename().
Renaming network devices to an already existing name is not
something we want sysfs to print a scary warning for, since the
callers can deal with this correctly. So let's introduce
sysfs_create_link_nowarn() which gets rid of the common warning.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
debugfs_remove_recursive() will remove a dentry and all its children.
Drivers can use this to zap their whole debugfs tree so that they don't
need to keep track of every single debugfs dentry they created.
It may fail to remove the whole tree in certain cases:
sh-3.2# rmmod atmel-mci < /sys/kernel/debug/mmc0/ios/clock
mmc0: card b368 removed
atmel_mci atmel_mci.0: Lost dma0chan1, falling back to PIO
sh-3.2# ls /sys/kernel/debug/mmc0/
ios
But I'm not sure if that case can be handled in any sane manner.
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>
Cc: Pierre Ossman <drzeus-list@drzeus.cx>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_chmod_file() calls notify_change() to change the permission bits
on a sysfs file. Replace with explicit call to sysfs_setattr() and
fsnotify_change().
This is equivalent, except that security_inode_setattr() is not
called. This function is called by drivers, so the security checks do
not make any sense.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Kobjects do not have a limit in name size since a while, so stop
pretending that they do.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
device_create() is race-prone, so use the race-free
device_create_drvdata() instead as device_create() is going away.
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits)
nfsd: nfs4xdr.c do-while is not a compound statement
nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c
lockd: Pass "struct sockaddr *" to new failover-by-IP function
lockd: get host reference in nlmsvc_create_block() instead of callers
lockd: minor svclock.c style fixes
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock
lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock
lockd: nlm_release_host() checks for NULL, caller needn't
file lock: reorder struct file_lock to save space on 64 bit builds
nfsd: take file and mnt write in nfs4_upgrade_open
nfsd: document open share bit tracking
nfsd: tabulate nfs4 xdr encoding functions
nfsd: dprint operation names
svcrdma: Change WR context get/put to use the kmem cache
svcrdma: Create a kmem cache for the WR contexts
svcrdma: Add flush_scheduled_work to module exit function
svcrdma: Limit ORD based on client's advertised IRD
svcrdma: Remove unused wait q from svcrdma_xprt structure
svcrdma: Remove unneeded spin locks from __svc_rdma_free
svcrdma: Add dma map count and WARN_ON
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits)
iucv: Fix bad merging.
net_sched: Add size table for qdiscs
net_sched: Add accessor function for packet length for qdiscs
net_sched: Add qdisc_enqueue wrapper
highmem: Export totalhigh_pages.
ipv6 mcast: Omit redundant address family checks in ip6_mc_source().
net: Use standard structures for generic socket address structures.
ipv6 netns: Make several "global" sysctl variables namespace aware.
netns: Use net_eq() to compare net-namespaces for optimization.
ipv6: remove unused macros from net/ipv6.h
ipv6: remove unused parameter from ip6_ra_control
tcp: fix kernel panic with listening_get_next
tcp: Remove redundant checks when setting eff_sacks
tcp: options clean up
tcp: Fix MD5 signatures for non-linear skbs
sctp: Update sctp global memory limit allocations.
sctp: remove unnecessary byteshifting, calculate directly in big-endian
sctp: Allow only 1 listening socket with SO_REUSEADDR
sctp: Do not leak memory on multiple listen() calls
sctp: Support ipv6only AF_INET6 sockets.
...
* 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6:
configfs: Allow ->make_item() and ->make_group() to return detailed errors.
Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors."
Move the line disciplines towards a conventional ->ops arrangement. For
the moment the actual 'tty_ldisc' struct in the tty is kept as part of
the tty struct but this can then be changed if it turns out that when it
all settles down we want to refcount ldiscs separately to the tty.
Pull the ldisc code out of /proc and put it with our ldisc code.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The WRITEMEM macro produces sparse warnings of the form:
fs/nfsd/nfs4xdr.c:2668:2: warning: do-while statement is not a compound statement
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Thanks to problem report and original patch from Harvey Harrison.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Benny Halevy <bhalevy@panasas.com>
There are already 7 of them - time to kill some duplicate code.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return ERR_PTR() values. These errors are
bubbled up appropriately. NULL returns are changed to -ENOMEM for
compatibility.
Also updated are the in-kernel users of configfs.
This is a rework of reverted commit 11c3b79218.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The truncate patch should not use the i_allocated_meta_blocks
value. So add seperate functions to be used in the truncate
and alloc path. We also need to release the meta-data block
that we reserved for the blocks that we are truncating.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With the FLEX_BG layout, there is no reason why extents can't cross
block groups, so make the truncate code reserve enough credits so we
don't BUG if we come across such an extent.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The extents codepath for ext4_truncate() requests journal transaction
credits in very small chunks, requesting only what is needed. This
means there may not be enough credits left on the transaction handle
after ext4_truncate() returns and then when ext4_delete_inode() tries
finish up its work, it may not have enough transaction credits,
causing a BUG() oops in the jbd2 core.
Also, reserve an extra 2 blocks when starting an ext4_delete_inode()
since we need to update the inode bitmap, as well as update the
orphaned inode linked list.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The ext4_ext_journal_restart() is a convenience function which checks
to see if the requested number of credits is present, and if so it
closes the current transaction and attaches the current handle to the
new transaction. Unfortunately, it wasn't proprely checking the
return value from ext4_journal_extend(), so it was starting a new
transaction when one was not necessary, and returning an error when
all that was necessary was to restart the handle with a new
transaction.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_da_write_begin needs to call journal_stop before returning,
if the page allocation fails.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ordered mode, the current jbd2 aborts the journal if a file data buffer
has an error. But this behavior is unintended, and we found that it has
been adopted accidentally.
This patch undoes it and just calls printk() instead of aborting the
journal. Unlike a similar patch for ext3/jbd, file data buffers are
written via generic_writepages(). But we also need to set AS_EIO
into their mappings because wait_on_page_writeback_range() clears
AS_EIO before a user process sees it.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A transient I/O error can corrupt inode data. Here is the scenario:
(1) update inode_A at the block_B
(2) pdflush writes out new inode_A to the filesystem, but it results
in write I/O error, at this point, BH_Uptodate flag of the buffer
for block_B is cleared and BH_Write_EIO is set
(3) create new inode_C which located at block_B, and
__ext4_get_inode_loc() tries to read on-disk block_B because the
buffer is not uptodate
(4) if it can read on-disk block_B successfully, inode_A is
overwritten by old data
This patch makes __ext4_get_inode_loc() not read the inode block if the
buffer has BH_Write_EIO flag. In this case, the buffer should have the
latest information, so setting the uptodate flag to the buffer (this
avoids WARN_ON_ONCE() in mark_buffer_dirty().)
According to this change, we would need to test BH_Write_EIO flag for the
error checking. Currently nobody checks write I/O errors on metadata
buffers, but it will be done in other patches I'm working on.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Jan Kara <jack@ucw.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, the locality group prealloc list is freed only when there
is a block allocation failure. This can result in large number of
entries in the preallocation list making ext4_mb_use_preallocated()
expensive.
To fix this, we convert the locality group prealloc list to a hash
list. The hash index is the order of number of blocks in the prealloc
space with a max order of 9. When adding prealloc space to the list we
make sure total entries for each order does not exceed 8. If it is
more than 8 we discard few entries and make sure the we have only <= 5
entries.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
NR_CPUS can be really large. We should be using nr_cpu_ids instead.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't call BUG_ON on file system failures. Instead use ext4_error and
also handle the continue case properly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
I noticed when filling a 1T filesystem with 4 threads using the
fs_mark benchmark:
fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
that I occasionally got checksum mismatch errors:
EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935
etc. I'd reliably get 4-5 of them during the run.
It appears that the problem is likely a race to init the bg's
when the uninit_bg feature is enabled.
With the patch below, which adds sb_bgl_locking around initialization,
I was able to complete several runs with no errors or warnings.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_read_block_bitmap and read_inode_bitmap do essentially
the same thing, and yet they are structured quite differently.
I came across this difference while looking at doing bg locking
during bg initialization.
This patch:
* removes unnecessary casts in the error messages
* renames read_inode_bitmap to ext4_read_inode_bitmap
* and more substantially, restructures the inode bitmap
reading function to be more like the block bitmap counterpart.
The change to the inode bitmap reader simplifies the locking
to be applied in the next patch.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the block group checksums are corrupted, still allow the mount to
succeed, so e2fsck can have a chance to try to fix things up. Add
code in the remount r/w path to make sure the block group checksums
are valid before allowing the filesystem to be remounted read/write.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Inserting an extent can cause a new entry in the already existing index
block. That doesn't increase the depth of the instead. Instead it adds a
new leaf block. Now with the new leaf block the path information
corresponding to the logical block should be fetched from the new block.
The old path will be pointing to the old leaf block.
We need to recalucate the path information on extent insert
even if depth doesn't change. Without this change, the extent merge
after converting an unwritten extent to initialized extent takes the wrong
extent and cause data corruption.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
[PATCH] ocfs2: fix oops in mmap_truncate testing
configfs: call drop_link() to cleanup after create_link() failure
configfs: Allow ->make_item() and ->make_group() to return detailed errors.
configfs: Fix failing mkdir() making racing rmdir() fail
configfs: Fix deadlock with racing rmdir() and rename()
configfs: Make configfs_new_dirent() return error code instead of NULL
configfs: Protect configfs_dirent s_links list mutations
configfs: Introduce configfs_dirent_lock
ocfs2: Don't snprintf() without a format.
ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs
ocfs2/net: Silence build warnings on sparc64
ocfs2: Handle error during journal load
ocfs2: Silence an error message in ocfs2_file_aio_read()
ocfs2: use simple_read_from_buffer()
ocfs2: fix printk format warnings with OCFS2_FS_STATS=n
[PATCH 2/2] ocfs2: Instrument fs cluster locks
[PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option
This patch fixes a mmap_truncate bug which was found by ocfs2 test suite.
In an ocfs2 cluster more than 1 node, run program mmap_truncate, which races
mmap writes and truncates from multiple processes. While the test is
running, a stat from another node forces writeout, causing an oops in
ocfs2_get_block() because it sees a buffer to write which isn't allocated.
This patch fixed the bug by clear dirty and uptodate bits in buffer, leave
the buffer unmapped and return.
Fix is suggested by Mark Fasheh, and I code up the patch.
Signed-off-by: Coly Li <coyli@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6:
UBIFS: include to compilation
UBIFS: add new flash file system
UBIFS: add brief documentation
MAINTAINERS: add UBIFS section
do_mounts: allow UBI root device name
VFS: export sync_sb_inodes
VFS: move inode_lock into sync_sb_inodes
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (82 commits)
NFSv4: Remove BKL from the nfsv4 state recovery
SUNRPC: Remove the BKL from the callback functions
NFS: Remove BKL from the readdir code
NFS: Remove BKL from the symlink code
NFS: Remove BKL from the sillydelete operations
NFS: Remove the BKL from the rename, rmdir and unlink operations
NFS: Remove BKL from NFS lookup code
NFS: Remove the BKL from nfs_link()
NFS: Remove the BKL from the inode creation operations
NFS: Remove BKL usage from open()
NFS: Remove BKL usage from the write path
NFS: Remove the BKL from the permission checking code
NFS: Remove attribute update related BKL references
NFS: Remove BKL requirement from attribute updates
NFS: Protect inode->i_nlink updates using inode->i_lock
nfs: set correct fl_len in nlmclnt_test()
SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon
SUNRPC: Refactor rpcb_register to make rpcbindv4 support easier
SUNRPC: None of rpcb_create's callers wants a privileged source port
SUNRPC: Introduce a specific rpcb_create for contacting localhost
...
Fix fs/compat_ioctl.c to handle CONFIG_BLOCK=n, CONFIG_SCSI=n to avoid
build errors:
In file included from include/scsi/scsi.h:12,
from fs/compat_ioctl.c:71:
include/scsi/scsi_cmnd.h:27:25: warning: "BLK_MAX_CDB" is not defined
include/scsi/scsi_cmnd.h:28:3: error: #error MAX_COMMAND_SIZE can not be bigger than BLK_MAX_CDB
In file included from include/scsi/scsi.h:12,
from fs/compat_ioctl.c:71:
include/scsi/scsi_cmnd.h: In function 'scsi_bidi_cmnd':
include/scsi/scsi_cmnd.h:182: error: implicit declaration of function 'blk_bidi_rq'
include/scsi/scsi_cmnd.h:183: error: dereferencing pointer to incomplete type
include/scsi/scsi_cmnd.h: In function 'scsi_in':
include/scsi/scsi_cmnd.h:189: error: dereferencing pointer to incomplete type
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Push it into those callback functions that actually need it.
Note that all the NFS operations use their own locking, so don't need the
BKL. Ditto for the rpcbind client.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Page accesses are serialised using the page locks, whereas all attribute
updates are serialised using the inode->i_lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Page cache accesses are serialised using page locks, whereas attribute
updates are serialised using inode->i_lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attribute updates are safe, and dentry operations are protected using VFS
level locks. Defer removing the BKL from sillyrename until a separate
patch.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All dentry-related operations are already BKL-safe, since they are
protected by the VFS locking. No extra locks should be needed in the NFS
code.
In the case of nfs_revalidate_inode(), we're only doing an attribute
update (protected by the inode->i_lock).
In the case of nfs_lookup(), we're instantiating a new dentry, so there
should be no contention possible until after we call d_materialise_unique.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_instantiate() does not require the BKL, neither do the attribute
updates or the RPC code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All the NFSv4 stateful operations are already protected by other locks (in
particular by the rpc_sequence locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The main problem is dealing with inode->i_size: we need to set the
inode->i_lock on all attribute updates, and so vmtruncate won't cut it.
Make an NFS-private version of vmtruncate that has the necessary locking
semantics.
The result should be that the following inode attribute updates are
protected by inode->i_lock
nfsi->cache_validity
nfsi->read_cache_jiffies
nfsi->attrtimeo
nfsi->attrtimeo_timestamp
nfsi->change_attr
nfsi->last_updated
nfsi->cache_change_attribute
nfsi->access_cache
nfsi->access_cache_entry_lru
nfsi->access_cache_inode_lru
nfsi->acl_access
nfsi->acl_default
nfsi->nfs_page_tree
nfsi->ncommit
nfsi->npages
nfsi->open_files
nfsi->silly_list
nfsi->acl
nfsi->open_states
inode->i_size
inode->i_atime
inode->i_mtime
inode->i_ctime
inode->i_nlink
inode->i_uid
inode->i_gid
The following is protected by dir->i_mutex
nfsi->cookieverf
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fcntl(F_GETLK) on an nfs client incorrectly returns
the values for the conflicting lock. fl_len value is
always 1.
If the conflicting lock is (0, 4095) the F_GETLK
request for (1024, 10) returns (0, 1), which doesn't
even cover the requested range, and is quite confusing.
The fix is trivial, set fl_end from the fl_end value
recieved from the nfs server.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass a more generic socket address type to nlmsvc_unlock_all_by_ip() to
allow for future support of IPv6. Also provide additional sanity
checking in failover_unlock_ip() when constructing the server's IP
address.
As an added bonus, provide clean kerneldoc comments on related NLM
interfaces which were recently added.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
It may not be obvious (till you look at the definition of
nlm_alloc_call()) that a function like nlmsvc_create_block() should
consume a reference on success or failure, so I find it clearer if it
takes the reference it needs itself.
And both callers already do this immediately before the call anyway.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
nlmsvc_lock calls nlmsvc_lookup_host to find a nlm_host struct. The
callers of this function, however, call nlmsvc_retrieve_args or
nlm4svc_retrieve_args, which also return a nlm_host struct.
Change nlmsvc_lock to take a host arg instead of calling
nlmsvc_lookup_host itself and change the callers to pass a pointer to
the nlm_host they've already found.
Since nlmsvc_testlock() now just uses the caller's reference, we no
longer need to get or release it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
nlmsvc_testlock calls nlmsvc_lookup_host to find a nlm_host struct. The
callers of this functions, however, call nlmsvc_retrieve_args or
nlm4svc_retrieve_args, which also return a nlm_host struct.
Change nlmsvc_testlock to take a host arg instead of calling
nlmsvc_lookup_host itself and change the callers to pass a pointer to
the nlm_host they've already found.
We take a reference to host in the place where nlmsvc_testlock()
previous did a new lookup, so the reference counting is unchanged from
before.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
jfs: remove DIRENTSIZ
JFS: diAlloc() should return -EIO rather than EIO
JFS: skip bad iput() call in error path
JFS: switch to seq_files
JFS: 0 is not valid errno value so return NULL from jfs_lookup
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
ext4: Documention update for new ordered mode and delayed allocation
ext4: do not set extents feature from the kernel
ext4: Don't allow nonextenst mount option for large filesystem
ext4: Enable delalloc by default.
ext4: delayed allocation i_blocks fix for stat
ext4: fix delalloc i_disksize early update issue
ext4: Handle page without buffers in ext4_*_writepage()
ext4: Add ordered mode support for delalloc
ext4: Invert lock ordering of page_lock and transaction start in delalloc
mm: Add range_cont mode for writeback
ext4: delayed allocation ENOSPC handling
percpu_counter: new function percpu_counter_sum_and_set
ext4: Add delayed allocation support in data=writeback mode
vfs: add hooks for ext4's delayed allocation support
jbd2: Remove data=ordered mode support using jbd buffer heads
ext4: Use new framework for data=ordered mode in JBD2
jbd2: Implement data=ordered mode handling via inodes
vfs: export filemap_fdatawrite_range()
ext4: Fix lock inversion in ext4_ext_truncate()
ext4: Invert the locking order of page_lock and transaction start
...
Add UBIFS to Makefile and Kbuild.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
This is a new flash file system. See
http://www.linux-mtd.infradead.org/doc/ubifs.html
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
When allow_link() succeeds but create_link() fails, the subsystem is not
informed of the failure.
This patch fixes this by calling drop_link() on create_link() failures.
Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return an int.
Also updated are the in-kernel users of configfs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When fixing the rename() vs rmdir() deadlock, we stopped locking default groups'
inodes in configfs_detach_prep(), letting racing mkdir() in default groups
proceed concurrently. This enables races like below happen, which leads to a
failing mkdir() making rmdir() fail, despite the group to remove having no
user-created directory under it in the end.
process A: process B:
/* PWD=A/B */
mkdir("C")
make_item("C")
attach_group("C")
rmdir("A")
detach_prep("A")
detach_prep("B")
error because of "C"
return -ENOTEMPTY
attach_group("C/D")
error (eg -ENOMEM)
return -ENOMEM
This patch prevents such scenarii by making rmdir() wait as long as
detach_prep() fails because a racing mkdir() is in the middle of attach_group().
To achieve this, mkdir() sets a flag CONFIGFS_USET_IN_MKDIR in parent's
configfs_dirent before calling attach_group(), and clears the flag once
attach_group() is done. detach_prep() fails with -EAGAIN whenever the flag is
hit and returns the guilty inode's mutex so that rmdir() can wait on it.
Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch fixes the deadlock between racing sys_rename() and configfs_rmdir().
The idea is to avoid locking i_mutexes of default groups in
configfs_detach_prep(), and rely instead on the new configfs_dirent_lock to
protect against configfs_dirent's linkage mutations. To ensure that an mkdir()
racing with rmdir() will not create new items in a to-be-removed default group,
we make configfs_new_dirent() check for the CONFIGFS_USET_DROPPING flag right
before linking the new dirent, and return error if the flag is set. This makes
racing mkdir()/symlink()/dir_open() fail in places where errors could already
happen, resp. in (attach_item()|attach_group())/create_link()/new_dirent().
configfs_depend() remains safe since it locks all the path from configfs root,
and is thus mutually exclusive with rmdir().
An advantage of this is that now detach_groups() unconditionnaly takes the
default groups i_mutex, which makes it more consistent with populate_groups().
Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch makes configfs_new_dirent return negative error code instead of NULL,
which will be useful in the next patch to differentiate ENOMEM from ENOENT.
Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Symlinks to a config_item are listed under its configfs_dirent s_links, but the
list mutations are not protected by any common lock.
This patch uses the configfs_dirent_lock spinlock to add the necessary
protection.
Note: we should also protect the list_empty() test in configfs_detach_prep() but
1/ the lock should not be released immediately because nothing would prevent the
list from being filled after a successful list_empty() test, making the problem
tricky,
2/ this will be solved by the rmdir() vs rename() deadlock bugfix.
Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch introduces configfs_dirent_lock spinlock to protect configfs_dirent
traversals against linkage mutations (add/del/move). This will allow
configfs_detach_prep() to avoid locking i_mutexes.
Locking rules for configfs_dirent linkage mutations are the same plus the
requirement of taking configfs_dirent_lock. For configfs_dirent walking, one can
either take appropriate i_mutex as before, or take configfs_dirent_lock.
The spinlock could actually be a mutex, but the critical sections are either
O(1) or should not be too long (default groups walking in last patch).
ChangeLog:
- Clarify the comment on configfs_dirent_lock usage
- Move sd->s_element init before linking the new dirent
- In lseek(), do not release configfs_dirent_lock before the dirent is
relinked.
Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Some system files are per-slot. Their names include the slot number.
ocfs2_sprintf_system_inode_name() uses the system inode definitions to
fill in the slot number with snprintf().
For global system files, there is no node number, and the name was
printed as a format with no arguments. -Wformat-nonliteral and
-Wformat-security don't like this. Instead, use a static "%s" format
and the name as the argument.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
A couple places use OCFS2_DEBUG_FS where they really mean
CONFIG_OCFS2_DEBUG_FS.
Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
suseconds_t is type long on most arches except sparc64 where it is type int.
This patch silences the following warnings that are generated when building
on it.
netdebug.c: In function 'nst_seq_show':
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 13 has type 'suseconds_t'
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 15 has type 'suseconds_t'
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 17 has type 'suseconds_t'
netdebug.c: In function 'sc_seq_show':
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 19 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 21 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 23 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 25 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 27 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 29 has type 'suseconds_t'
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch ensures the mount fails if the fs is unable to load the journal.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch silences an EINVAL error message in ocfs2_file_aio_read()
that is always due to a user error.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Fix printk format warnings when OCFS2_FS_STATS=n:
linux-next-20080528/fs/ocfs2/dlmglue.c: In function 'ocfs2_dlm_seq_show':
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'int'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds code to track the number of times the fs takes
various cluster locks as well as the times associated with it.
The information is made available to users via debugfs.
This patch was originally written by Jan Kara <jack@suse.cz>.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds config option CONFIG_OCFS2_FS_STATS to allow building
the fs with instrumentation enabled. An upcoming patch will provide
support to instrument cluster locking, which is a crucial overhead in
a cluster file system. This config option allows users to avoid the cpu
and memory overhead that is involved in gathering such statistics.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits)
security: remove register_security hook
security: remove dummy module fix
security: remove dummy module
security: remove unused sb_get_mnt_opts hook
LSM/SELinux: show LSM mount options in /proc/mounts
SELinux: allow fstype unknown to policy to use xattrs if present
security: fix return of void-valued expressions
SELinux: use do_each_thread as a proper do/while block
SELinux: remove unused and shadowed addrlen variable
SELinux: more user friendly unknown handling printk
selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine)
SELinux: drop load_mutex in security_load_policy
SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av
SELinux: open code sidtab lock
SELinux: open code load_mutex
SELinux: open code policy_rwlock
selinux: fix endianness bug in network node address handling
selinux: simplify ioctl checking
SELinux: enable processes with mac_admin to get the raw inode contexts
Security: split proc ptrace checking into read vs. attach
...
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
splice: fix generic_file_splice_read() race with page invalidation
ramfs: enable splice write
drivers/block/pktcdvd.c: avoid useless memset
cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
scsi: sr avoids useless buffer allocation
block: blk_rq_map_kern uses the bounce buffers for stack buffers
block: add blk_queue_update_dma_pad
DAC960: push down BKL
pktcdvd: push BKL down into driver
paride: push ioctl down into driver
block: use get_unaligned_* helpers
block: extend queue_flag bitops
block: request_module(): use format string
Add bvec_merge_data to handle stacked devices and ->merge_bvec()
block: integrity flags can't use bit ops on unsigned short
cmdfilter: extend default read filter
sg: fix odd style (extra parenthesis) introduced by cmd filter patch
block: add bounce support to blk_rq_map_user_iov
cfq-iosched: get rid of enable_idle being unused warning
allow userspace to modify scsi command filter on per device basis
...
gcc 4.3.0 correctly emits the following warning.
search_rsb_list does not *r_ret if no dlm_rsb is found
and _search_rsb may pass the uninitialized value upstream
on the error path when both calls to search_rsb_list
return non-zero error.
The fix sets *r_ret to NULL on search_rsb_list's not-found path.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: David Teigland <teigland@redhat.com>
It seems that `sock' allocated by sock_create_kern in
tcp_connect_to_sock() of dlm/fs/lowcomms.c is not released if
dlm_nodeid_to_addr an error.
Acked-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The fix in commit 3650925893 was addressing
the case of a granted PR lock with waiting PR and CW locks. It's a
special case that requires forcing a CW bast. However, that forced CW
bast was incorrectly applying to a second condition where the granted
lock was CW. So, the holder of a CW lock could receive an extraneous CW
bast instead of a PR bast. This fix narrows the original special case to
what was intended.
Signed-off-by: David Teigland <teigland@redhat.com>
If `device_write' method is called via "dlm-control",
file->private_data is NULL. (See ctl_device_open() in
user.c. ) Through proc->flags is read.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
With the Simple Pairing support, the authentication requirements are
an explicit setting during the bonding process. Track and enforce the
requirements and allow higher layers like L2CAP and RFCOMM to increase
them if needed.
This patch introduces a new IOCTL that allows to query the current
authentication requirements. It is also possible to detect Simple
Pairing support in the kernel this way.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch exports the 'sync_sb_inodes()' which is needed for
UBIFS because it has to force write-back from time to time.
Namely, the UBIFS budgeting subsystem forces write-back when
its pessimistic callculations show that there is no free
space on the media.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>