Fix some issues in pagemap_read noted by Alexey:
- initialize pagemap_walk.mm to "mm" , so the code starts working as
advertised
- initialize ->private to "&pm" so it wouldn't immediately oops in
pagemap_pte_hole()
- unstatic struct pagemap_walk, so two threads won't fsckup each other
(including those started by root, including flipping ->mm when you don't
have permissions)
- pagemap_read() contains two calls to ptrace_may_attach(), second one
looks unneeded.
- avoid possible kmalloc(0) and integer wraparound.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Personally, I'd just remove the functionality entirely - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't use a static entry, so as to prevent races during concurrent use
of this function.
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for minorversion 1
All encoders now return an nfserr status (typically their
nfserr argument). Unsupported ops go through nfsd4_encode_operation
too, so use nfsd4_encode_noop to encode nothing for their reply body.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This commit includes a bugfix for the fragile setuid fixup code in the
case that filesystem capabilities are supported (in access()). The effect
of this fix is gated on filesystem capability support because changing
securebits is only supported when filesystem capabilities support is
configured.)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The url in the help text for ntfs should be updated.
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The misc_mtx should provide all the protection required to keep the daemon
hash table sane during miscdev registration. Since this mutex is causing
gratuitous lockdep warnings, this patch removes it.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When write in reiserfs_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When write in ext4_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When write in ext3_quota_write() fails, we have to properly release
i_mutex. One error path has been missing the unlock...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a page was invalidated during splicing from file to a pipe, then
generic_file_splice_read() could return a short or zero count.
This manifested itself in rare I/O errors seen on nfs exported fuse
filesystems. This is because nfsd uses splice_direct_to_actor() to read
files, and fuse uses invalidate_inode_pages2() to invalidate stale data on
open.
Fix by redoing the page find/create if it was found to be truncated
(invalidated).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The legacy protocol's open operation doesn't handle an append operation
(it is expected that the client take care of it). We were incorrectly
passing the extended protocol's flag through even in legacy mode. This
was reported in bugzilla report #10689. This patch fixes the problem
by disallowing extended protocol open modes from being passed in legacy
mode and implemented append functionality on the client side by adding
a seek after the open.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.
The following bio fields are used:
bio->bi_sector
bio->bi_bdev
bio->bi_size
bio->bi_rw using bio_data_dir()
This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up. (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Some block devices support verifying the integrity of requests by way
of checksums or other protection information that is submitted along
with the I/O.
This patch implements support for generating and verifying integrity
metadata, as well as correctly merging, splitting and cloning bios and
requests that have this extra information attached.
See Documentation/block/data-integrity.txt for more information.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move struct bio_set and biovec_slab definitions to bio.h so they can
be used outside of bio.c.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
GFS2 calls permission() to verify permissions after locks on the files
have been taken.
For this it's sufficient to call gfs2_permission() instead. This
results in the following changes:
- IS_RDONLY() check is not performed
- IS_IMMUTABLE() check is not performed
- devcgroup_inode_permission() is not called
- security_inode_permission() is not called
IS_RDONLY() should be unnecessary anyway, as the per-mount read-only
flag should provide protection against read-only remounts during
operations. do_gfs2_set_flags() has been fixed to perform
mnt_want_write()/mnt_drop_write() to protect against remounting
read-only.
IS_IMMUTABLE has been added to gfs2_permission()
Repeating the security checks seems to be pointless, as they don't
normally change, and if they do, it's independent of the filesystem
state.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
lock_kernel() calls have been pushed down into code which needs it, so
there is no need to take the BKL at this level anymore.
This work inspired and aided by Andi Kleen's unlocked_fasync() patches.
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
- Replace remote_llseek with generic_file_llseek_unlocked (to force compilation
failures in all users)
- Change all users to either use generic_file_llseek_unlocked directly or
take the BKL around. I changed the file systems who don't use the BKL
for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS
take the BKL, but explicitely in their own source now.
I moved them all over in a single patch to avoid unbisectable sections.
Open problem: 32bit kernels can corrupt fpos because its modification
is not atomic, but they can do that anyways because there's other paths who
modify it without BKL.
Do we need a special lock for the pos/f_version = 0 checks?
Trond says the NFS BKL is likely not needed, but keep it for now
until his full audit.
v2: Use generic_file_llseek_unlocked instead of remote_llseek_unlocked
and factor duplicated code (suggested by hch)
Cc: Trond.Myklebust@netapp.com
Cc: swhiteho@redhat.com
Cc: sfrench@samba.org
Cc: vandrove@vc.cvut.cz
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
The FAT BKL removal patch can cause deadlocks. It turns out that the new
lock_super() calls are unneeded, remove them (as directed by Linus).
Reported-by: "Tony Luck" <tony.luck@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Have separate vectors of operation decoders for each minorversion.
Obsolete ops in newer minorversions have default implementation returning
nfserr_opnotsupp.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Check minorversion once before decoding any operation and reject with
nfserr_minor_vers_mismatch if != 0 (this still happens in nfsd4_proc_compound).
In this case return a zero length resultdata array as required by RFC3530.
minorversion 1 processing will have its own vector of decoders.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Multiple mnt_want_write() calls in the switch statement looks really
ugly.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
then immediately wait on them. Conceptually, that makes them sync writes
and we should treat them as such so that the IO schedulers can handle
them appropriately.
This patch fixes a write starvation issue that Lin Ming reported, where
xx is stuck for more than 2 minutes because of a large number of
synchronous IO in the system:
INFO: task kjournald:20558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kjournald D ffff810010820978 6712 20558 2
ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
Call Trace:
[<ffffffff803ba6f2>] kobject_get+0x12/0x17
[<ffffffff80247537>] getnstimeofday+0x2f/0x83
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d195>] io_schedule+0x5d/0x9f
[<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
[<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243909>] wake_bit_function+0x0/0x23
[<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
[<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
[<ffffffff8023a676>] lock_timer_base+0x26/0x4b
[<ffffffff8030300a>] kjournald+0xc1/0x1fb
[<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
[<ffffffff80302f49>] kjournald+0x0/0x1fb
[<ffffffff802437bb>] kthread+0x47/0x74
[<ffffffff8022de51>] schedule_tail+0x28/0x5d
[<ffffffff8020cac8>] child_rip+0xa/0x12
[<ffffffff80243774>] kthread+0x0/0x74
[<ffffffff8020cabe>] child_rip+0x0/0x12
Lin Ming confirms that this patch fixes the issue. I've run tests with
it for the past week and no ill effects have been observed, so I'm
proposing it for inclusion into 2.6.26.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
knfsd currently uses 2 signal masks when processing requests. A "loose"
mask (SHUTDOWN_SIGS) that it uses when receiving network requests, and
then a more "strict" mask (ALLOWED_SIGS, which is just SIGKILL) that it
allows when doing the actual operation on the local storage.
This is apparently unnecessarily complicated. The underlying filesystem
should be able to sanely handle a signal in the middle of an operation.
This patch removes the signal mask handling from knfsd altogether. When
knfsd is started as a kthread, all signals are ignored. It then allows
all of the signals in SHUTDOWN_SIGS. There's no need to set the mask
as well.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Thanks to Frank Van Maarseveen for the original problem report: "A
privileged process on an NFS client which drops privileges after using
them to change the current working directory, will experience incorrect
EACCES after an NFS server reboot. This problem can also occur after
memory pressure on the server, particularly when the client side is
quiet for some time."
This occurs because the filehandle points to a directory whose parents
are no longer in the dentry cache, and we're attempting to reconnect the
directory to its parents without adequate permissions to perform lookups
in the parent directories.
We can therefore fix the problem by acquiring the necessary capabilities
before attempting the reconnection. We do this only in the
no_subtree_check case, since the documented behavior of the
subtree_check export option requires the server to check that the user
has lookup permissions on all parents.
The subtree_check case still has a problem, since reconnect_path()
unnecessarily requires both read and lookup permissions on all parent
directories. However, a fix in that case would be more delicate, and
use of subtree_check is already discouraged for other reasons.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Frank van Maarseveen <frankvm@frankvm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
I discovered that we had a list onto which every lock_dlm
lock was being put. Its only function was to discover whether
we'd got any locks left after umount. Since there was already
a counter for that purpose as well, I removed the list. The
saving is sizeof(struct list_head) per glock - well worth
having.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are several reasons why this is undesirable:
1. It never happens during normal operation anyway
2. If it does happen it causes performance to be very, very poor
3. It isn't likely to solve the original problem (memory shortage
on remote DLM node) it was supposed to solve
4. It uses a bunch of arbitrary constants which are unlikely to be
correct for any particular situation and for which the tuning seems
to be a black art.
5. In an N node cluster, only 1/N of the dropped locked will actually
contribute to solving the problem on average.
So all in all we are better off without it. This also makes merging
the lock_dlm module into GFS2 a bit easier.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes Red Hat bugzilla bug 450156.
This started with a not-too-improbable mount failure because the
locking protocol was never set back to its proper "lock_dlm" after the
system was rebooted in the middle of a gfs2_fsck. That left a
(purposely) invalid locking protocol in the superblock, which caused an
error when the file system was mounted the next time.
When there's an error mounting, vfs calls DQUOT_OFF, which calls
vfs_quota_off which calls gfs2_sync_fs. Next, gfs2_sync_fs calls
gfs2_log_flush passing s_fs_info. But due to the error, s_fs_info
had been previously set to NULL, and so we have the kernel oops.
My solution in this patch is to test for the NULL value before passing
it. I tested this patch and it fixes the problem.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The previous attempt to fix the locking in readpage failed due
to the use of a "try lock" which resulted in occasional high
cpu usage during testing (due to repeated tries) and also it
did not resolve all the ordering problems wrt the transaction
lock (although it did solve all the inode lock ordering problems).
This patch avoids the problem by unlocking the page and getting the
locks in the correct order. This means that we have to retest the
page to ensure that it hasn't changed when we relock the page.
This now passes the tests which were previously failing.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The patch to remove lock_nolock managed to get the arguments
of this list_add backwards. This fixes it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch merges the lock_nolock module into GFS2 itself. As well as removing
some of the overhead of the module, it also means that its now impossible to
build GFS2 without a lock module (which would be a pointless thing to do
anyway).
We also plan to merge lock_dlm into GFS2 in the future, but that is a more
tricky task, and will therefore be a separate patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
This looks like a lot of change, but in fact its not. Mostly its
things moving from one file to another. The change is just that
instead of queuing lock completions and callbacks from the DLM
we now pass them directly to GFS2.
This gives us a net loss of two list heads per glock (a fair
saving in memory) plus a reduction in the latency of delivering
the messages to GFS2, plus we now have one thread fewer as well.
There was a bug where callbacks and completions could be delivered
in the wrong order due to this unnecessary queuing which is fixed
by this patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
This patch implements a number of cleanups to the core of the
GFS2 glock code. As a result a lot of code is removed. It looks
like a really big change, but actually a large part of this patch
is either removing or moving existing code.
There are some new bits too though, such as the new run_queue()
function which is considerably streamlined. Highlights of this
patch include:
o Fixes a cluster coherency bug during SH -> EX lock conversions
o Removes the "glmutex" code in favour of a single bit lock
o Removes the ->go_xmote_bh() for inodes since it was duplicating
->go_lock()
o We now only use the ->lm_lock() function for both locks and
unlocks (i.e. unlock is a lock with target mode LM_ST_UNLOCKED)
o The fast path is considerably shortly, giving performance gains
especially with lock_nolock
o The glock_workqueue is now used for all the callbacks from the DLM
which allows us to simplify the lock_dlm module (see following patch)
o The way is now open to make further changes such as eliminating the two
threads (gfs2_glockd and gfs2_scand) in favour of a more efficient
scheme.
This patch has undergone extensive testing with various test suites
so it should be pretty stable by now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes bz 450641.
This patch changes the computation for zero_metapath_length(), which it
renames to metapath_branch_start(). When you are extending the metadata
tree, The indirect blocks that point to the new data block must either
diverge from the existing tree either at the inode, or at the first
indirect block. They can diverge at the first indirect block because the
inode has room for 483 pointers while the indirect blocks have room for
509 pointers, so when the tree is grown, there is some free space in the
first indirect block. What metapath_branch_start() now computes is the
height where the first indirect block for the new data block is located.
It can either be 1 (if the indirect block diverges from the inode) or 2
(if it diverges from the first indirect block).
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes bugzilla bug bz448866: gfs2: BUG: unable to
handle kernel paging request at ffff81002690e000.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>