Move the O_EXCL open handling into _nfs4_do_open() where it belongs. Doing
so also allows us to reuse the struct fattr from the opendata.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All we really want is the ability to retrieve the root file handle. We no
longer need the ability to walk down the path, since that is now done in
nfs_follow_remote_path().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS Filehandles and struct fattr are really too large to be allocated on
the stack. This patch adds in a couple of helper functions to allocate them
dynamically instead.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a number of RCU issues in the NFSv4 delegation code.
(1) delegation->cred doesn't need to be RCU protected as it's essentially an
invariant refcounted structure.
By the time we get to nfs_free_delegation(), the delegation is being
released, so no one else should be attempting to use the saved
credentials, and they can be cleared.
However, since the list of delegations could still be under traversal at
this point by such as nfs_client_return_marked_delegations(), the cred
should be released in nfs_do_free_delegation() rather than in
nfs_free_delegation(). Simply using rcu_assign_pointer() to clear it is
insufficient as that doesn't stop the cred from being destroyed, and nor
does calling put_rpccred() after call_rcu(), given that the latter is
asynchronous.
(2) nfs_detach_delegation_locked() and nfs_inode_set_delegation() should use
rcu_derefence_protected() because they can only be called if
nfs_client::cl_lock is held, and that guards against anyone changing
nfsi->delegation under it. Furthermore, the barrier imposed by
rcu_dereference() is superfluous, given that the spin_lock() is also a
barrier.
(3) nfs_detach_delegation_locked() is now passed a pointer to the nfs_client
struct so that it can issue lockdep advice based on clp->cl_lock for (2).
(4) nfs_inode_return_delegation_noreclaim() and nfs_inode_return_delegation()
should use rcu_access_pointer() outside the spinlocked region as they
merely examine the pointer and don't follow it, thus rendering unnecessary
the need to impose a partial ordering over the one item of interest.
These result in an RCU warning like the following:
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
fs/nfs/delegation.c:332 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by mount.nfs4/2281:
#0: (&type->s_umount_key#34){+.+...}, at: [<ffffffff810b25b4>] deactivate_super+0x60/0x80
#1: (iprune_sem){+.+...}, at: [<ffffffff810c332a>] invalidate_inodes+0x39/0x13a
stack backtrace:
Pid: 2281, comm: mount.nfs4 Not tainted 2.6.34-rc1-cachefs #110
Call Trace:
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b4591>] nfs_inode_return_delegation_noreclaim+0x5b/0xa0 [nfs]
[<ffffffffa0095d63>] nfs4_clear_inode+0x11/0x1e [nfs]
[<ffffffff810c2d92>] clear_inode+0x9e/0xf8
[<ffffffff810c3028>] dispose_list+0x67/0x10e
[<ffffffff810c340d>] invalidate_inodes+0x11c/0x13a
[<ffffffff810b1dc1>] generic_shutdown_super+0x42/0xf4
[<ffffffff810b1ebe>] kill_anon_super+0x11/0x4f
[<ffffffffa009893c>] nfs4_kill_super+0x3f/0x72 [nfs]
[<ffffffff810b25bc>] deactivate_super+0x68/0x80
[<ffffffff810c6744>] mntput_no_expire+0xbb/0xf8
[<ffffffff810c681b>] release_mounts+0x9a/0xb0
[<ffffffff810c689b>] put_mnt_ns+0x6a/0x79
[<ffffffffa00983a1>] nfs_follow_remote_path+0x5a/0x146 [nfs]
[<ffffffffa0098334>] ? nfs_do_root_mount+0x82/0x95 [nfs]
[<ffffffffa00985a9>] nfs4_try_mount+0x75/0xaf [nfs]
[<ffffffffa0098874>] nfs4_get_sb+0x291/0x31a [nfs]
[<ffffffff810b2059>] vfs_kern_mount+0xb8/0x177
[<ffffffff810b2176>] do_kern_mount+0x48/0xe8
[<ffffffff810c810b>] do_mount+0x782/0x7f9
[<ffffffff810c8205>] sys_mount+0x83/0xbe
[<ffffffff81001eeb>] system_call_fastpath+0x16/0x1b
Also on:
fs/nfs/delegation.c:215 invoked rcu_dereference_check() without protection!
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b4223>] nfs_inode_set_delegation+0xfe/0x219 [nfs]
[<ffffffffa00a9c6f>] nfs4_opendata_to_nfs4_state+0x2c2/0x30d [nfs]
[<ffffffffa00aa15d>] nfs4_do_open+0x2a6/0x3a6 [nfs]
...
And:
fs/nfs/delegation.c:40 invoked rcu_dereference_check() without protection!
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b3bef>] nfs_free_delegation+0x3d/0x6e [nfs]
[<ffffffffa00b3e71>] nfs_do_return_delegation+0x26/0x30 [nfs]
[<ffffffffa00b406a>] __nfs_inode_return_delegation+0x1ef/0x1fe [nfs]
[<ffffffffa00b448a>] nfs_client_return_marked_delegations+0xc9/0x124 [nfs]
...
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we correctly rcu-dereference the delegation itself, and that we
protect against removal while we're changing the contents.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4
nfs: fix some issues in nfs41_proc_reclaim_complete()
NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear
NFS: Fix an unstable write data integrity race
nfs: testing for null instead of ERR_PTR()
NFS: rsize and wsize settings ignored on v4 mounts
NFSv4: Don't attempt an atomic open if the file is a mountpoint
SUNRPC: Fix a bug in rpcauth_prune_expired
If dentry found stale happens to be a root of disconnected tree, we
can't d_drop() it; its d_hash is actually part of s_anon and d_drop()
would simply hide it from shrink_dcache_for_umount(), leading to
all sorts of fun, including busy inodes on umount and oopsen after
that.
Bug had been there since at least 2006 (commit c636eb already has it),
so it's definitely -stable fodder.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original code passed an ERR_PTR() to rpc_put_task() and instead of
returning zero on success it returned -ENOMEM.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Neil Brown reports that he is seeing the BUG_ON(ret == 0) trigger in
nfs_page_async_flush. According to the trace in
https://bugzilla.novell.com/show_bug.cgi?id=599628
the problem appears to be due to nfs_wb_page() not waiting for the
PG_writeback flag to clear.
There is a ditto problem in nfs_wb_page_cancel()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 2c61be0a94 (NFS: Ensure that the WRITE
and COMMIT RPC calls are always uninterruptible) exposed a race on file
close. In order to ensure correct close-to-open behaviour, we want to wait
for all outstanding background commit operations to complete.
This patch adds an inode flag that indicates if a commit operation is under
way, and provides a mechanism to allow ->write_inode() to wait for its
completion if this is a data integrity flush.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 mounts ignore the rsize and wsize mount options, and always use
the default transfer size for both. This seems to be because all
NFSv4 mounts are now cloned, and the cloning logic doesn't copy the
rsize and wsize settings from the parent nfs_server.
I tested Fedora's 2.6.32.11-99 and it seems to have this problem as
well, so I'm guessing that .33, .32, and perhaps older kernels have
this issue as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Stable <stable@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Arnaud Giersch reports that NFSv4 locking is broken when we hold a
delegation since commit 8e469ebd6d (NFSv4:
Don't allow posix locking against servers that don't support it).
According to Arnaud, the lock succeeds the first time he opens the file
(since we cannot do a delegated open) but then fails after we start using
delegated opens.
The following patch fixes it by ensuring that locking behaviour is
governed by a per-filesystem capability flag that is initially set, but
gets cleared if the server ever returns an OPEN without the
NFS4_OPEN_RESULT_LOCKTYPE_POSIX flag being set.
Reported-by: Arnaud Giersch <arnaud.giersch@iut-bm.univ-fcomte.fr>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
We always want to ensure that WRITE and COMMIT completes, whether or not
the user presses ^C. Do this by making the call asynchronous, and allowing
the user to do an interruptible wait for rpc_task completion.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch fixes a race which occurs due to the fact that we release the
PG_writeback flag while still holding the nfs_page locked.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since writeback_single_inode() checks the inode->i_state flags _before_ it
flushes out the data, we need to ensure that the I_DIRTY_DATASYNC flag is
already set. Otherwise we risk not seeing a call to write_inode(), which
again means that we break fsync() et al...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If nfs atomic open implementation ends up doing open request from
->d_revalidate() codepath and gets an error from server, return that error
to caller explicitly and don't bother with lookup_instantiate_filp() at all.
->d_revalidate() can return an error itself just fine...
See
http://bugzilla.kernel.org/show_bug.cgi?id=15674http://marc.info/?l=linux-kernel&m=126988782722711&w=2
for original report.
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
The reply parsing code attempts to decode the GETATTR response even if
the DELEGRETURN portion of the compound returned an error. The GETATTR
response won't actually exist if that's the case and we're asking the
parser to read past the end of the response.
This bug is fairly benign. The parser catches this without reading past
the end of the response and decode_getfattr returns -EIO. Earlier
kernels however had decode_op_hdr using the READ_BUF macro, and this
bug would make this printk pop any time the client got an error from
a delegreturn:
kernel: decode_op_hdr: reply buffer overflowed in line XXXX
More recent kernels seem to have replaced this printk with a dprintk.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
bdi_unregister is called by nfs_put_super which is only called by
generic_shutdown_super if ->s_root is not NULL. So if we error out
in a circumstance where we called nfs_bdi_register (i.e. server !=
NULL) but have not set s_root, then we need to call bdi_unregister
explicitly in nfs_get_sb and various other *_get_sb() functions.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the NFS_INO_REVAL_FORCED flag is set, that means that we don't yet have
an up to date attribute cache. Even if we hold a delegation, we must
put a GETATTR on the wire.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
To avoid hangs in the svc_unregister(), on version 4 mounts
(and unmounts), when rpcbind is not running, make the nfs4 callback
program an 'hidden' service by setting the 'vs_hidden' flag in the
nfs4_callback_version structure.
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I'll admit that it's unlikely for the first allocation to fail and
the second one to succeed. I won't be offended if you ignore this
patch.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd4: fix minor memory leak
svcrpc: treat uid's as unsigned
nfsd: ensure sockets are closed on error
Revert "sunrpc: move the close processing after do recvfrom method"
Revert "sunrpc: fix peername failed on closed listener"
sunrpc: remove unnecessary svc_xprt_put
NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN
xfs_export_operations.commit_metadata
commit_metadata export operation replacing nfsd_sync_dir
lockd: don't clear sm_monitored on nsm_reboot_lookup
lockd: release reference to nsm_handle in nlm_host_rebooted
nfsd: Use vfs_fsync_range() in nfsd_commit
NFSD: Create PF_INET6 listener in write_ports
SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found"
SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt()
NFSD: Support AF_INET6 in svc_addsock() function
SUNRPC: Use rpc_pton() in ip_map_parse()
nfsd: 4.1 has an rfc number
nfsd41: Create the recovery entry for the NFSv4.1 client
nfsd: use vfs_fsync for non-directories
...
Now that we have correct COMMIT semantics in writeback_single_inode, we can
reduce and simplify nfs_wb_all(). Also replace nfs_wb_nocommit() with a
call to filemap_write_and_wait(), which doesn't need to hold the
inode->i_mutex.
With that done, we can eliminate nfs_write_mapping() altogether.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since nfs_scan_list() doesn't wait for locked pages, we have a race in
which it is possible to end up with an inode that needs to send a COMMIT,
but which does not have the I_DIRTY_DATASYNC flag set.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the caller is doing a non-blocking flush, and there are still writebacks
pending on the wire, we can usually defer the COMMIT call until those
writes are done.
Also ensure that we honour the wbc->nonblocking flag.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In order to know when we should do opportunistic commits of the unstable
writes, when the VM is doing a background flush, we add a field to count
the number of unstable writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sole purpose of nfs_write_inode is to commit unstable writes, so
move it into fs/nfs/write.c, and make nfs_commit_inode static.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This gives the filesystem more information about the writeback that
is happening. Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Similar to the fsync issue fixed a while ago in commit
2daea67e96 we need to write for data to
actually hit the disk before writing out the metadata to guarantee
data integrity for filesystems that modify the inode in the data I/O
completion path. Currently XFS and NFS handle this manually, and AFS
has a write_inode method that does nothing but waiting for data, while
others are possibly missing out on this.
Fortunately this change has a lot less impact than the fsync change
as none of the write_inode methods starts data writeout of any form
by itself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
init: Open /dev/console from rootfs
mqueue: fix typo "failues" -> "failures"
mqueue: only set error codes if they are really necessary
mqueue: simplify do_open() error handling
mqueue: apply mathematics distributivity on mq_bytes calculation
mqueue: remove unneeded info->messages initialization
mqueue: fix mq_open() file descriptor leak on user-space processes
fix race in d_splice_alias()
set S_DEAD on unlink() and non-directory rename() victims
vfs: add NOFOLLOW flag to umount(2)
get rid of ->mnt_parent in tomoyo/realpath
hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Mirror MS_KERNMOUNT in ->mnt_flags
get rid of useless vfsmount_lock use in put_mnt_ns()
Take vfsmount_lock to fs/internal.h
get rid of insanity with namespace roots in tomoyo
take check for new events in namespace (guts of mounts_poll()) to namespace.c
Don't mess with generic_permission() under ->d_lock in hpfs
sanitize const/signedness for udf
nilfs: sanitize const/signedness in dealing with ->d_name.name
...
Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
sunrpc_cache_update() will always call detail->update() from inside the
detail->hash_lock, so it cannot allocate memory.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Ensure that we change the EXCHANGE_ID verifier (i.e. clp->cl_boot_time)
when we want to reset all state. This is mainly needed when the server
tells us that it is revoking our open or lock stateids.
Handle revoking of recallable state by expiring the delegations.
Handle callback path issues by expiring the delegations and then resetting
the session.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
renewd sends RENEW requests to the NFS server in order to renew state.
As the request is asynchronous, renewd should take a reference to the
nfs_client to prevent concurrent umounts from freeing the client
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
renewd sends SEQUENCE requests to the NFS server in order to renew state.
As the request is asynchronous, renewd should take a reference to the
nfs_client to prevent concurrent umounts from freeing the session/client
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the renewd send queue gets backlogged (e.g., if the server goes down),
we will keep filling the queue with periodic RENEW/SEQUENCE requests.
This patch schedules a new renewd request if and only if the previous one
returns (either success or failure)
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[Trond.Myklebust@netapp.com: moved nfs4_schedule_state_renewal() into
separate nfs4_renew_release() and nfs41_sequence_release() callbacks
to ensure correct behaviour on call setup failure]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
renewd should be synchronously killed before we destroy the session in
nfs4_clear_minor_version
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[Trond.Myklebust@netapp.com: clean up to remove 'unused function
warning when !CONFIG_NFS_V4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There's currently an open Ubuntu bug[0], with the intent to compile NFS_FSCACHE
(and possibly AFS_FSCACHE, 9P_FSCACHE) into the standard Ubuntu kernel.
However, since *_FSCACHE still depends on EXPERIMENTAL, this won't happen.
As Arjan van de Ven pointed out[1], the EXPERIMENTAL flag doesn't mean that
much any more, I propose the following patch to fs/nfs/Kconfig. I'd do the
same for fs/9p/Kconfig and fs/afs/Kconfig, but as I did not test 9p or AFS, I
feel it would not be appropriate for me to remove the flag.
[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/440522/comments/5
[1] http://lkml.org/lkml/2010/1/23/145
Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __percpu sparse annotations to fs.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
The cached read and write paths initialize fattr->time_start in their
setup procedures. The value of fattr->time_start is propagated to
read_cache_jiffies by nfs_update_inode(). Subsequent calls to
nfs_attribute_timeout() will then use a good time stamp when
computing the attribute cache timeout, and squelch unneeded GETATTR
calls.
Since the direct I/O paths erroneously leave the inode's
fattr->time_start field set to zero, read_cache_jiffies for that inode
is set to zero after any direct read or write operation. This
triggers an otw GETATTR or ACCESS call to update the file's attribute
and access caches properly, even when the NFS READ or WRITE replies
have usable post-op attributes.
Make sure the direct read and write setup code performs the same fattr
initialization as the cached I/O paths to prevent unnecessary GETATTR
calls.
This was likely introduced by commit 0e574af1 in 2.6.15, which appears
to add new nfs_fattr_init() call sites in the cached read and write
paths, but not in the equivalent places in fs/nfs/direct.c. A
subsequent commit in the same series, 33801147, introduces the
fattr->time_start field.
Interestingly, the direct write reschedule path already has a call to
nfs_fattr_init() in the right place.
Reported-by: Quentin Barnes <qbarnes@yahoo-inc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For NFSv2 and v3:
O_DIRECT writes are always synchronous, and aren't cached, so nothing
should be flushed when closing an NFS O_DIRECT file descriptor. Thus
there are no write errors to report on close(2).
In addition, there's no cached data to verify on the next open(2),
so we don't need clean GETATTR results at close time to compare with.
Thus, there's no need for the nfs_revalidate_inode() call when closing
an NFS O_DIRECT file. This reduces the number of synchronous
on-the-wire requests for a simple open-write-close of an NFS O_DIRECT
file by roughly 20%.
For NFSv4:
Call nfs4_do_close() with wait set to zero when closing an NFS
O_DIRECT file. The CLOSE will go on the wire, but the application
won't wait for it to complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The bytes counted by the performance counters for NFS writes should
reflect write and sync errors. If the write(2) system call reports
an error, the bytes should not be counted. And, if the write is
short, the actual number of bytes that was written should be counted,
not the number of bytes that was requested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Bytes read via the splice API should be accounted for in the NFS
performance statistics.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the NFS I/O counters count the number of bytes requested
by applications, rather than the number of bytes actually read by the
system calls.
The number of bytes requested for reads is actually not that useful,
because the value is usually a buffer size for reads. That is, that
requested number is usually a maximum, and frequently doesn't reflect
the actual number of bytes read.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nit: The VFSOPEN and VFSFLUSH counters are function call counters.
Count every call to these routines.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When session is reset, client can renegotiate slot table size.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Drain the fore channel and reset the max_slots to the new value.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For now the back channel ca_maxresponsesize_cached is 0 and there is no
backchannel DRC. Return NFS4ERR_REP_TOO_BIG_TO_CACHE when a cb_sequence
cachethis is true. When it is false, return NFS4ERR_RETRY_UNCACHED_REP as the
next operation error.
Remember the replay error accross compound operation processing.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make all cb_sequence arguments available to verify_seqid which will make
replay decisions.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All callback operations have arguments to decode and require processing.
The preprocess_nfs4X_op functions catch unsupported or illegal ops so
decode_args and process_op pointers are always non NULL.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Skip all other processing when error is encountered.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set NFS4ERR_RESOURCE as CB_COMPOUND status and do not return an op on
decode_op_hdr or encode_op_hdr buffer overflow.
NFS4ERR_RESOURCE is correct for v4.0. Will fix the return for v4.1 along with
all the other NFS4ERR_RESOURCE errors in a later patch.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a CB_SEQUENCE referring call triple matches a slot table entry, the
client is still waiting for a response to the original request. In this
case, return NFS4ERR_DELAY as the response to the callback.
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Traverse a list of referring calls and look for a session/slot/seq number
match.
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For the CREATE_SESSION attribute ca_maxresponsesize_cached, calculate
the value based on the rpc reply header size plus the maximum nfs compound
reply size.
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a wrapper around rpc_call_sync that handles -EKEYEXPIRED errors from
the RPC layer as it would an -EJUKEBOX error if NFSv2 had such a thing.
Also, add a handler for that error for async calls that makes it
resubmit the RPC on -EKEYEXPIRED.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're using -EKEYEXPIRED to indicate that a krb5 credcache contains an
expired ticket and that we should have the NFS layer retry the RPC call
instead of returning an error back to the caller. Handle this as we
would an -EJUKEBOX error return.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a KRB5 TGT ticket expires, we don't want to return an error
immediatel. If someone has a long running job and just forgets to run
"kinit" in time then this will make it fail.
Instead, we want to treat this situation as we would NFS4ERR_DELAY and
retry the upcall after delaying a bit with an exponential backoff.
This patch just makes any place that would handle NFS4ERR_DELAY also
handle -EKEYEXPIRED the same way. In the future, we may want to be more
sophisticated however and handle hard vs. soft mounts differently, or
specify some upper limit on how long we'll wait for a new TGT to be
acquired.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It was recently pointed out that the NFSERR_SERVERFAULT error, which is
designed to inform the user of a serious internal error on the server, was
being mapped to an error value that is internal to the kernel.
This patch maps it to the error EREMOTEIO, which is exported to userland
through errno.h.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Not having an fscache cookie is perfectly valid if the user didn't mount
with the fscache option.
This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=15234
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
If the NFS_ATTR_FATTR_TYPE field isn't set in fattr->valid, then we should
not set the S_IFMT part of inode->i_mode.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we unregister the bdi before kill_anon_super() calls
ida_remove() on our device name.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
The VM/VFS does not allow mapping->a_ops->invalidatepage() to fail.
Unfortunately, nfs_wb_page_cancel() may fail if a fatal signal occurs.
Since the NFS code assumes that the page stays mapped for as long as the
writeback is active, we can end up Oopsing (among other things).
The only safe fix here is to convert nfs_wait_on_request(), so as to make
it uninterruptible (as is already the case with wait_on_page_writeback()).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Clean up: Bruce observed we have more or less common logic in each of
svc_create_xprt()'s callers: the check to create an IPv6 RPC listener
socket only if CONFIG_IPV6 is set. I'm about to add another case
that does just the same.
If we move the ifdefs into __svc_xpo_create(), then svc_create_xprt()
call sites can get rid of the "#ifdef" ugliness, and can use the same
logic with or without IPv6 support available in the kernel.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Even if the server is crazy, we should be able to mark the stateid as being
bad, to ensure it gets recovered.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Currently, nfs4_handle_exception() will call it twice if called with an
error of -NFS4ERR_STALE_CLIENTID, -NFS4ERR_STALE_STATEID or
-NFS4ERR_EXPIRED.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
In most cases, we just want to mark the lock_stateid sequence id as being
uninitialised.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Avoid the following warnings when CONFIG_NFS_V4=n:
fs/nfs/sysctl.c:19: warning: unused variable `nfs_set_port_max'
fs/nfs/sysctl.c:18: warning: unused variable `nfs_set_port_min'
by making those variables contingent on NFSv4 being configured.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
The symbol nfs_commitdata_release is only used locally
in this file. Make it static to prevent the following sparse warning:
warning: symbol 'nfs_commitdata_release' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
If someone calls nfs_release_page(), we presumably already know that the
page is clean, however it may be holding an unstable write.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Recent change is missing to update "rehash". With that change, it will
become the cause of adding dentry to hash twice.
This explains the reason of Oops (dereference the freed dentry in
__d_lookup()) on my machine.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Marvin <marvin24@gmx.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit e9496ff46a. Quoth Al:
"it's dependent on a lot of other stuff not currently in mainline
and badly broken with current fs/namespace.c. Sorry, badly
out-of-order cherry-pick from old queue.
PS: there's a large pending series reworking the refcounting and
lifetime rules for vfsmounts that will, among other things, allow to
rip a subtree away _without_ dissolving connections in it, to be
garbage-collected when all active references are gone. It's
considerably saner wrt "is the subtree busy" logics, but it's nowhere
near being ready for merge at the moment; this changeset is one of the
things becoming possible with that sucker, but it certainly shouldn't
have been picked during this cycle. My apologies..."
Noticed-by: Eric Paris <eparis@redhat.com>
Requested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
direct I/O fallback sync simplification
ocfs: stop using do_sync_mapping_range
cleanup blockdev_direct_IO locking
make generic_acl slightly more generic
sanitize xattr handler prototypes
libfs: move EXPORT_SYMBOL for d_alloc_name
vfs: force reval of target when following LAST_BIND symlinks (try #7)
ima: limit imbalance msg
Untangling ima mess, part 3: kill dead code in ima
Untangling ima mess, part 2: deal with counters
Untangling ima mess, part 1: alloc_file()
O_TRUNC open shouldn't fail after file truncation
ima: call ima_inode_free ima_inode_free
IMA: clean up the IMA counts updating code
ima: only insert at inode creation time
ima: valid return code from ima_inode_alloc
fs: move get_empty_filp() deffinition to internal.h
Sanitize exec_permission_lite()
Kill cached_lookup() and real_lookup()
Kill path_lookup_open()
...
Trivial conflicts in fs/direct-io.c
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFSv4: Fix a regression in the NFSv4 state manager
NFSv4: Release the sequence id before restarting a CLOSE rpc call
nfs41: fix session fore channel negotiation
nfs41: do not zero seqid portion of stateid on close
nfs: run state manager in privileged mode
nfs: make recovery state manager operations privileged
nfs: enforce FIFO ordering of operations trying to acquire slot
rpc: add a new priority in RPC task
nfs: remove rpc_task argument from nfs4_find_slot
rpc: add rpc_queue_empty function
nfs: change nfs4_do_setlk params to identify recovery type
nfs: do not do a LOOKUP after open
nfs: minor cleanup of session draining
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: (42 commits)
nfsd: remove pointless paths in file headers
nfsd: move most of nfsfh.h to fs/nfsd
nfsd: remove unused field rq_reffh
nfsd: enable V4ROOT exports
nfsd: make V4ROOT exports read-only
nfsd: restrict filehandles accepted in V4ROOT case
nfsd: allow exports of symlinks
nfsd: filter readdir results in V4ROOT case
nfsd: filter lookup results in V4ROOT case
nfsd4: don't continue "under" mounts in V4ROOT case
nfsd: introduce export flag for v4 pseudoroot
nfsd: let "insecure" flag vary by pseudoflavor
nfsd: new interface to advertise export features
nfsd: Move private headers to source directory
vfs: nfsctl.c un-used nfsd #includes
lockd: Remove un-used nfsd headers #includes
s390: remove un-used nfsd #includes
sparc: remove un-used nfsd #includes
parsic: remove un-used nfsd #includes
compat.c: Remove dependence on nfsd private headers
...
Commit 5601a00d67 (nfs: run state manager
in privileged mode) introduces a regression in the NFSv4 code when
compiled with CONFIG_NFS_V4_1. The calls to nfs4_end_drain_session()
from the main loop in nfs4_state_manager() Oops due to the lack of an
NFSv4.1 session when running NFSv4.0.
The fix is to move those two calls back into nfs41_init_clientid() and
nfs4_reset_session().
The calls to nfs4_end_drain_session() that remain inside
nfs4_state_manager() are safe, since the NFSv4.0 code will never set the
NFS4CLNT_SESSION_DRAINING bit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the CLOSE or OPEN_DOWNGRADE call triggers a state recovery, and has
to be resent, then we must release the seqid. Otherwise the open
recovery will wait for the close to finish, which causes a deadlock.
This is mainly a NFSv4.1 problem, although it can theoretically happen
with NFSv4.0 too, in a OPEN_DOWNGRADE situation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the rsize or wsize is not set on the mount command, negotiate the highest
supported rsize and wsize in session creation.
Fixes a bug where the client negotiated nfs41_maxwrite_overhead as
ca_maxrequestsize and nfs41_maxread_overhead as ca_maxresponsesize resulting
in NFS4ERR_REQ_TOO_BIG errors on writes.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove code left over from a previous minorversion draft.
which specified zeroing seqid portions of stateid's.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (75 commits)
NFS: Fix nfs_migrate_page()
rpc: remove unneeded function parameter in gss_add_msg()
nfs41: Invoke RECLAIM_COMPLETE on all new client ids
SUNRPC: IS_ERR/PTR_ERR confusion
NFSv41: Fix a potential state leakage when restarting nfs4_close_prepare
nfs41: Handle NFSv4.1 session errors in the delegation recall code
nfs41: Retry delegation return if it failed with session error
nfs41: Handle session errors during delegation return
nfs41: Mark stateids in need of reclaim if state manager gets stale clientid
NFS: Fix up the declaration of nfs4_restart_rpc when NFSv4 not configured
nfs41: Don't clear DRAINING flag on NFS4ERR_STALE_CLIENTID
nfs41: nfs41_setup_state_renewal
NFSv41: More cleanups
NFSv41: Fix up some bugs in the NFS4CLNT_SESSION_DRAINING code
NFSv41: Clean up slot table management
NFSv41: Fix nfs4_proc_create_session
nfs41: Invoke RECLAIM_COMPLETE
nfs41: RECLAIM_COMPLETE functionality
nfs41: RECLAIM_COMPLETE XDR functionality
Cleanup some NFSv4 XDR decode comments
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
m68k: rename global variable vmalloc_end to m68k_vmalloc_end
percpu: add missing per_cpu_ptr_to_phys() definition for UP
percpu: Fix kdump failure if booted with percpu_alloc=page
percpu: make misc percpu symbols unique
percpu: make percpu symbols in ia64 unique
percpu: make percpu symbols in powerpc unique
percpu: make percpu symbols in x86 unique
percpu: make percpu symbols in xen unique
percpu: make percpu symbols in cpufreq unique
percpu: make percpu symbols in oprofile unique
percpu: make percpu symbols in tracer unique
percpu: make percpu symbols under kernel/ and mm/ unique
percpu: remove some sparse warnings
percpu: make alloc_percpu() handle array types
vmalloc: fix use of non-existent percpu variable in put_cpu_var()
this_cpu: Use this_cpu_xx in trace_functions_graph.c
this_cpu: Use this_cpu_xx for ftrace
this_cpu: Use this_cpu_xx in nmi handling
this_cpu: Use this_cpu operations in RCU
this_cpu: Use this_cpu ops for VM statistics
...
Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c
The call to migrate_page() will cause the page->private field to be
cleared.
Also fix up the locking around the page->private transfer, so that we ensure
that calls to nfs_page_find_request() don't end up racing.
Finally, fix up a double free bug: nfs_unlock_request() already calls
nfs_release_request() for us...
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Andi Kleen <andi@firstfloor.org>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
While Linux provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
since that day we had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment. This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
patches which are required before this patch it's actually surprisingly
simple, we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.
This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag. To guarantee backwards compatiblity it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.
This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actuall care need to check __O_SYNC in addition. Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set. The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.
We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.
Note that parisc really screwed up their headers as they already define a
O_DSYNC that has always been a no-op. We try to repair it by using it for
the new O_DSYNC and redefinining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
The NFSv4.1 spec indicates RECLAIM_COMPLETE is to be issued
whenever a client establishes a new client id, not only after
detecting the server has rebooted.
Set the NFS4CLNT_RECLAIM_REBOOT bit after every new client id has
been established. This enables us to issue RECLAIM_COMPLETE
during the wrap up of the NFS4CLNT_RECLAIM_REBOOT state.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
cfq-iosched: Do not access cfqq after freeing it
block: include linux/err.h to use ERR_PTR
cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
blkio: Allow CFQ group IO scheduling even when CFQ is a module
blkio: Implement dynamic io controlling policy registration
blkio: Export some symbols from blkio as its user CFQ can be a module
block: Fix io_context leak after failure of clone with CLONE_IO
block: Fix io_context leak after clone with CLONE_IO
cfq-iosched: make nonrot check logic consistent
io controller: quick fix for blk-cgroup and modular CFQ
cfq-iosched: move IO controller declerations to a header file
cfq-iosched: fix compile problem with !CONFIG_CGROUP
blkio: Documentation
blkio: Wait on sync-noidle queue even if rq_noidle = 1
blkio: Implement group_isolation tunable
blkio: Determine async workload length based on total number of queues
blkio: Wait for cfq queue to get backlogged if group is empty
blkio: Propagate cgroup weight updation to cfq groups
blkio: Drop the reference to queue once the task changes cgroup
blkio: Provide some isolation between groups
...
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...
Currently, if the call to nfs4_setup_sequence() in nfs4_close_prepare
fails, any later retries will fail to launch an RPC call, due to the fact
that the &state->flags will have been cleared.
Ditto if nfs4_close_done() triggers a call to the NFSv4.1 version of
nfs_restart_rpc().
We therefore move the actual clearing of the state->flags to
nfs4_close_done(), when we know that the RPC call was successful.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Update nfs4_delegreturn_done() to retry the operation after setting the
NFS4CLNT_SESSION_SETUP bit to indicate the need to reset the session.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The state manager was not marking the stateids as needing to be reclaimed
after reestablishing the clientid.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also rename it: it is used in generic code, and so should not have a 'nfs4'
prefix.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, don't clear the
NFS4CLNT_SESSION_DRAINING flag and don't wake RPCs waiting for the
session to be reestablished. We don't have a session yet, so there
is no reason to wake other RPCs.
This avoids sending spurious compounds with bogus sequenceID during
session and state recovery.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[Trond.Myklebust@netapp.com: cleaned up patch by adding the
nfs41_begin/end_drain_session() helpers]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move call to get the lease time and the setup of the state
renewal out of nfs4_create_session so that it can be called
after clearing the DRAINING flag. We use the getattr RPC
to obtain the lease time, which requires a sequence slot.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We no longer need to maintain a distinction between nfs41_sequence_done and
nfs41_sequence_free_slot.
This fixes a number of slot table leakages in the NFSv4.1 code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should not assume that nfs41_init_clientid() will always want to
initialise the session. If it is being called due to a server reboot, then
we just want to reset the session after re-establishing the clientid.
Fix this by getting rid of the 'reset' parameter in
nfs4_proc_create_session(), and instead relying on whether or not the
session slot table pointer is non-NULL.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch invokes RECLAIM_COMPLETE after the client is done
reclaiming state.
There are interpretations of the spec that suggest that
RECLAIM_COMPLETE should also be issued after a new clientid
has been obtained from the server and even if there is no
state to reclaim. This tells the server that the client
has no state to reclaim even if the client isn't aware the
server may have rebooted.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implements RECLAIM_COMPLETE as an asynchronous RPC.
NFS4ERR_DELAY is retried, NFS4ERR_DEADSESSION invokes the error handling
but does not result in a retry, since we don't want to have a lingering
RECLAIM_COMPLETE call sent in the middle of a possible new state recovery
cycle. If a session reset occurs, a new wave of reclaim operations will
follow, containing their own RECLAIM_COMPLETE call. We don't want a
retry to get on the way of recovery by incorrectly indicating to the
server that we're done reclaiming state.
A subsequent patch invokes the functionality.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
XDR encoding and decoding for RECLAIM_COMPLETE. Implements the necessary
encoding to indicate reclaim complete for the entire client. In the future,
it can be extended to provide reclaim complete functionality for a single
file system after migration.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Otherwise we have no guarantees that other processes won't start another
RPC call while we're resetting the session.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
in NFSv4.1 the seqid part of a stateid in CB_RECALL must be 0
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
the server can indicate a number of error conditions by setting the
appropriate bits in the SEQUENCE operation. The client re-establishes
state with the server when it receives one of those, with the action
depending on the specific case.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The v4.1 client should take into account the desired rsize, wsize when
negotiating the max size in CREATE_SESSION. Accordingly, it should use
rsize, wsize that are smaller than the session negotiated values.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In v4.1 the client MUST/SHOULD use the EXCLUSIVE4_1 flag instead of
EXCLUSIVE4, and GUARDED when the server supports persistent sessions.
For now (and until we support suppattr_exclcreat), we don't send any
attributes with EXCLUSIVE4_1 relying in the subsequent SETATTR as in v4.0
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For now the clients returns _all_ the delegations of the specificed type
it holds
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4.1 spec-29 (18.36.3) says that the server MUST use an ONC RPC
(program) version number equal to 4 in callbacks sent to the client.
For now we allow both versions 1 and 4.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace sync and async handlers setting of the NFS4CLNT_SESSION_SETUP bit with
setting NFS4CLNT_CHECK_LEASE, and let the state manager decide to reset the session.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not wake up the next slot_tbl_waitq task in nfs4_free_slot because we
may be draining the slot. Either signal the state manager that the session
is drained (the state manager wakes up tasks) OR wake up the next task.
In nfs41_sequence_done, the slot dereference is only needed in the sequence
operation success case.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the session is reset during state recovery, the state manager thread can
sleep on the slot_tbl_waitq causing a deadlock.
Add a completion framework to the session. Have the state manager thread set
a new session state (NFS4CLNT_SESSION_DRAINING) and wait for the session slot
table to drain.
Signal the state manager thread in nfs41_sequence_free_slot when the
NFS4CLNT_SESSION_DRAINING bit is set and the session is drained.
Reported-by: Trond Myklebust <trond@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_recover_session can put rpciod to sleep. Just use nfs4_schedule_recovery.
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not fall through and set NFS4CLNT_SESSION_RESET bit on NFS4ERR_EXPIRED
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not fall through and call nfs4_delay on session error handling.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_read_done returns zero on unhandled errors. nfs_readpage_result will
return on a negative tk_status without freeing the slot.
Call nfs4_sequence_free_slot on unhandled errors in nfs4_read_done.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs41_sequence_free_slot can be called multiple times on SEQUENCE operation
errors.
No reason to inline nfs4_restart_rpc
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
nfs_writeback_done and nfs_readpage_retry call nfs4_restart_rpc outside the
error handler, and the slot is not freed prior to restarting in the rpc_prepare
state during session reset.
Fix this by moving the call to nfs41_sequence_free_slot from the error
path of nfs41_sequence_done into nfs4_restart_rpc, and by removing the test
for NFS4CLNT_SESSION_SETUP.
Always free slot and goto the rpc prepare state on async errors.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make this clear by calling rpc_restart-call.
Prepare for nfs4_restart_rpc() to free slots.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The bit is no longer used for session setup, only for session reset.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
Resetting the clientid from the state manager could result in not confirming
the clientid due to create session not being called.
Move the create session call from the NFS4CLNT_SESSION_SETUP state manager
initialize session case into the NFS4CLNT_LEASE_EXPIRED case establish_clid
call.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS4ERR_FILE_OPEN is return by the server when an operation cannot be
performed because the file is currently open and local (to the server)
semantics prohibit the operation while the file is open.
A typical case is a RENAME operation on an MS-Windows platform, which
prevents rename while the file is open.
While it is possible that such a condition is transitory, it is also
very possible that the file will be held open for an extended period
of time thus preventing the operation.
The current behaviour of Linux/NFS is to retry the operation
indefinitely. This is not appropriate - we do not expect a rename to
take an arbitrary amount of time to complete.
Rather, and error should be returned. The most obvious error code
would be EBUSY, which is a legal at least for 'rename' and 'unlink',
and accurately captures the reason for the error.
This patch allows a few retries until about 2 seconds have elapsed,
then returns EBUSY.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The d_instantiate(new_dentry, NULL) is superfluous, the dentry is
already negative. Rehashing this dummy dentry isn't needed either,
d_move() works fine on an unhashed target.
The re-checking for busy after a failed nfs_sillyrename() is bogus
too: new_dentry->d_count < 2 would be a bug here.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move unhashing the target to after the check for existence and being a
non-directory.
If renaming a directory then the VFS already unhashes the target if it
is not busy. If it's busy then acquiring more references during the
rename makes no difference.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Comments are wrong or out of date. In particular d_drop() doesn't
free the inode it just unhashes the dentry. And if target is a
directory then it is not checked for being busy.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
VFS already checks if both source and target are directories.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When the "rsize=" or "wsize=" mount options are not specified,
text-based mounts have slightly different behavior than legacy binary
mounts. Text-based mounts use the smaller of the server's maximum
and the client's maximum, but binary mounts use the smaller of the
server's _preferred_ size and the client's maximum.
This difference is actually pretty subtle. Most servers advertise
the same value as their maximum and their preferred transfer size, so
the end result is the same in most cases.
The reason for this difference is that for text-based mounts, if
r/wsize are not specified, they are set to the largest value supported
by the client. For legacy mounts, the values are set to zero if these
options are not specified.
nfs_server_set_fsinfo() can negotiate the transfer size defaults
correctly in any case. There's no need to specify any particular
value as default in the text-based option parsing logic.
Note that nfs4 doesn't use nfs_server_set_fsinfo(), but the mount.nfs4
command does set rsize and wsize to 0 if the user didn't specify these
options. So, make the same change for text-based NFSv4 mounts.
Thanks to James Pearson <james-p@moving-picture.com> for reporting and
diagnosing the problem.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Recent changes to snprintf() introduced the %pI6c formatter, which can
display an IPv6 address with standard shorthanding. Use this new
formatter when displaying IPv6 server addresses in /proc/mounts.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Solaris uses netids as values for the proto= option, so that when
someone specifies "tcp6" they get traffic over TCP + IPv6. Until
recently, this has never really been an issue for Linux since it didn't
support NFS over IPv6. The netid and the protocol name were generally
always the same (modulo any strange configuration in /etc/netconfig).
The solaris manpage documents their proto= option as:
proto= _netid_ | rdma
This patch is intended to bring Linux closer to how the Solaris proto=
option works, by declaring a static netid mapping in the kernel and
converting the proto= and mountproto= options to follow it and display
the proper values in /proc/mounts.
Much of this functionality will need to be provided by a userspace
mount.nfs patch. Chuck Lever has a patch to change mount.nfs in
the same way. In principle, we could do *all* of this in userspace but
that would mean that the options in /proc/mounts may not match the
options used by userspace.
The alternative to the static mapping here is to add a mechanism to
upcall to userspace for netid's. I'm not opposed to that option, but
it'll probably mean more overhead (and quite a bit more code). Rather
than shoot for that at first, I figured it was probably better to
start simply.
Comments welcome.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs4_state_manager should not be looking at the error values when
deciding whether or not to loop round in order to handle a higher priority
state recovery task. It should rather be looking at the clp->cl_state.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If our lease expires, and the server reboots while we're recovering, we
need to be able to wait until the grace period is over.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_recovery_handle_error() will correctly handle errors such as
NFS4ERR_CB_PATH_DOWN, however because they are still passed back to the
main loop in nfs4_state_manager(), they can cause the latter to exit
prematurely.
Fix this by letting nfs4_recovery_handle_error() change the error value in
cases where there is no action required by the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In practice, we need to ensure that we call nfs4_state_end_reclaim_reboot
in 2 cases:
- If we lose the lease while we were reclaiming state
OR
- After we're done with reboot recovery
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfsv4 state manager could potentially deadlock inside
__nfs_inode_return_delegation() if the server reboots, so that the calls to
nfs_msync_inode() end up waiting on state recovery to complete.
Also ensure that if a server reboot or network partition causes us to have
to stop returning delegations, that NFS4CLNT_DELEGRETURN is set so that
the state manager can resume any outstanding delegation returns after it
has dealt with the state recovery situation.
Finally, ensure that the state manager doesn't wait for the DELEGRETURN
call to complete. It doesn't need to, and that too can cause a deadlock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: [PATCH] nfs: fix acl decoding
Commit 28f566942c "NFS: use dynamically
computed compound_hdr.replen for xdr_inline_pages offset" accidentally
changed the amount of space to allow for the acl reply, resulting in an
IO error on attempts to get an acl.
Reported-by: Paul Rudin <paul@rudin.co.uk>
Cc: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It will lower the flush priority for NFS, and maybe more in future.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (31 commits)
FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n
SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE
SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()
CacheFiles: Don't log lookup/create failing with ENOBUFS
CacheFiles: Catch an overly long wait for an old active object
CacheFiles: Better showing of debugging information in active object problems
CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy
CacheFiles: Handle truncate unlocking the page we're reading
CacheFiles: Don't write a full page if there's only a partial page to cache
FS-Cache: Actually requeue an object when requested
FS-Cache: Start processing an object's operations on that object's death
FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
FS-Cache: Add a retirement stat counter
FS-Cache: Handle pages pending storage that get evicted under OOM conditions
FS-Cache: Handle read request vs lookup, creation or other cache failure
FS-Cache: Don't delete pending pages from the page-store tracking tree
FS-Cache: Fix lock misorder in fscache_write_op()
FS-Cache: The object-available state can't rely on the cookie to be available
FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase
FS-Cache: Use radix tree preload correctly in tracking of pages to be stored
...
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache. Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.
The problem is typified by the following trace of a stuck process:
kslowd005 D 0000000000000000 0 4253 2 0x00000080
ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
Call Trace:
[<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
[<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
[<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff810885d3>] try_to_release_page+0x32/0x3b
[<ffffffff81093203>] shrink_page_list+0x316/0x4ac
[<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
[<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
[<ffffffff81093aa2>] shrink_list+0x8d/0x8f
[<ffffffff81093d1c>] shrink_zone+0x278/0x33c
[<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
[<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
[<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
[<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
[<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
[<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
[<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
[<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
[<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
[<ffffffff810b2e82>] do_sync_write+0xe3/0x120
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
[<ffffffff810b1a76>] ? dentry_open+0x82/0x89
[<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
[<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
[<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
In the above backtrace, the following is happening:
(1) A page storage operation is being executed by a slow-work thread
(fscache_write_op()).
(2) FS-Cache farms the operation out to the cache to perform
(cachefiles_write_page()).
(3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
standard write (do_sync_write()) under KERNEL_DS directly from the netfs
page.
(4) However, for Ext3 to perform the write, it must allocate some memory, in
particular, it must allocate at least one page cache page into which it
can copy the data from the netfs page.
(5) Under OOM conditions, the memory allocator can't immediately come up with
a page, so it uses vmscan to find something to discard
(try_to_free_pages()).
(6) vmscan finds a clean netfs page it might be able to discard (possibly the
one it's trying to write out).
(7) The netfs is called to throw the page away (nfs_release_page()) - but it's
called with __GFP_WAIT, so the netfs decides to wait for the store to
complete (__fscache_wait_on_page_write()).
(8) This blocks a slow-work processing thread - possibly against itself.
The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.
To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed. This means that some data won't make it into the
cache this time. To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.
The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages. If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.
Signed-off-by: David Howells <dhowells@redhat.com>
For consistency drop & in front of every proc_handler. Explicity
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name
and .strategy members of sysctl tables are dead code. Remove them.
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Changeset a65318bf3a (NFSv4: Simplify some
cache consistency post-op GETATTRs) incorrectly changed the getattr
bitmap for readdir().
This causes the readdir() function to fail to return a
fileid/inode number, which again exposed a bug in the NFS readdir code that
causes spurious ENOENT errors to appear in applications (see
http://bugzilla.kernel.org/show_bug.cgi?id=14541).
The immediate band aid is to revert the incorrect bitmap change, but more
long term, we should change the NFS readdir code to cope with the
fact that NFSv4 servers are not required to support fileids/inode numbers.
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're adding enough nfs documentation that it may as well have its own
subdirectory.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Commits 29fba38b (nfs41: lease renewal) and fc01cea9 (nfs41: sequence
operation) introduce a couple of put_rpccred() calls on credentials for
which there is no corresponding get_rpccred().
See http://bugzilla.kernel.org/show_bug.cgi?id=14249
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC 3530 states that when we recieve the error NFS4ERR_RESOURCE, we are not
supposed to bump the sequence number on OPEN, LOCK, LOCKU, CLOSE, etc
operations. The problem is that we map that error into EREMOTEIO in the XDR
layer, and so the NFSv4 middle-layer routines like seqid_mutating_err(),
and nfs_increment_seqid() don't recognise it.
The fix is to defer the mapping until after the middle layers have
processed the error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Actually pass the NFS_FILE_SYNC option to the server to avoid a
Panic in nfs_direct_write_complete() when a commit fails.
At the end of an nfs write, if the nfs commit fails, all the writes
will be rescheduled. They are supposed to be rescheduled as NFS_FILE_SYNC
writes, but the rpc_task structure is not completely intialized and so
the option is not passed. When the rescheduled writes complete, the
return indicates that they are NFS_UNSTABLE and we try to do another
commit. This leads to a Panic because the commit data structure pointer
was set to null in the initial (failed) commit attempt.
Signed-off-by: Terry Loftin <terry.loftin@hp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a (small) memory leak in one of the error paths of the NFS mount
options parsing code.
Regression introduced in 2.6.30 by commit a67d18f (NFS: load the
rpc/rdma transport module automatically).
Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct sockaddr_storage * can safely be used as struct sockaddr *.
Suppress an "incompatible pointer type" warning.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv4 renew daemon is shared between all active super blocks that refer
to a particular NFS server, so it is wrong to be shutting it down in
nfs4_kill_super every time a super block is destroyed.
This patch therefore kills nfs4_renewd_prepare_shutdown altogether, and
leaves it up to nfs4_shutdown_client() to also shut down the renew daemon
by means of the existing call to nfs4_kill_renewd().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a typo which causes try_location() to use the wrong length argument
when calling nfs_parse_server_name(). This again, causes the initialisation
of the mount's sockaddr structure to fail.
Also ensure that if nfs4_pathname_string() returns an error, then we pass
that error back up the stack instead of ENOENT.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As seen in <http://bugs.debian.org/549002>, nfs4_init_client() can
overrun the source string when copying the client IP address from
nfs_parsed_mount_data::client_address to nfs_client::cl_ipaddr. Since
these are both treated as null-terminated strings elsewhere, the copy
should be done with strlcpy() not memcpy().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The recent changeset 53a0b9c4c9 (NFS: Replace
nfs_parse_ip_address() with rpc_pton()) broke nfs_remount, since the call
to rpc_pton() will zero out the port number in data->nfs_server.address.
This is actually due to a bug in nfs_remount: it should be looking at the
port number in nfs_server.port instead...
This fixes bug
http://bugzilla.kernel.org/show_bug.cgi?id=14276
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the port and mount port will both display as 65535 if you do not
specify a port number. That would be wrong...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
With the recent spate of changes, the nfs protocol version will now default
to 2 instead of 3, while the mount protocol version defaults to 3.
The following patch should ensure the defaults are consistent with the
previous defaults of vers=3,proto=tcp,mountvers=3,mountproto=tcp.
This fixes the bug
http://bugzilla.kernel.org/show_bug.cgi?id=14259
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We forget to set nfs_server.protocol in tcp case when old-style binary
options are passed to mount. The thing remains zero and never validated
afterwards. As the result, we hit BUG in fs/nfs/client.c:588.
Breakage has been introduced in NFS: Add nfs_alloc_parsed_mount_data
merged yesterday...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
truncate: use new helpers
truncate: new helpers
fs: fix overflow in sys_mount() for in-kernel calls
fs: Make unload_nls() NULL pointer safe
freeze_bdev: grab active reference to frozen superblocks
freeze_bdev: kill bd_mount_sem
exofs: remove BKL from super operations
fs/romfs: correct error-handling code
vfs: seq_file: add helpers for data filling
vfs: remove redundant position check in do_sendfile
vfs: change sb->s_maxbytes to a loff_t
vfs: explicitly cast s_maxbytes in fiemap_check_ranges
libfs: return error code on failed attr set
seq_file: return a negative error code when seq_path_root() fails.
vfs: optimize touch_time() too
vfs: optimization for touch_atime()
vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
libfs: make simple_read_from_buffer conventional
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Update some fs code to make use of new helper functions introduced
in the previous patch. Should be no significant change in behaviour
(except CIFS now calls send_sig under i_lock, via inode_newsize_ok).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-nfs@vger.kernel.org
Cc: Trond.Myklebust@netapp.com
Cc: linux-cifs-client@lists.samba.org
Cc: sfrench@samba.org
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it
NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Propagate the NFS 'fsc' mount option through NFS automounts of various types.
This is now required as commit:
commit c02d7adf8c
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Mon Jun 22 15:09:14 2009 -0400
NFSv4: Replace nfs4_path_walk() with VFS path lookup in a private namespace
uses VFS-driven automounting to reach all submounts barring the root, thus
preventing fscaching from being enabled on any submount other than the root.
This patch gets around that by propagating the NFS_OPTION_FSCACHE flag across
automounts. If a uniquifier is supplied to a mount then this is propagated to
all automounts of that mount too.
Signed-off-by: David Howells <dhowells@redhat.com>
[Trond: Fixed up the definition of nfs_fscache_get_super_cookie for the
case of #undef CONFIG_NFS_FSCACHE]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allocating nfs_parsed_mount_data and setting up the defaults is nearly
the same for both nfs and nfs4 mounts.
Both paths seem to use nfs_validate_transport_protocol(), so setting a
default value for nfs_server.protocol ought to be unnecessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Keep it in the case of the legacy binary mount interface, but purge it from
the nfs_server structure.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.
This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
trivial: fix typo in aic7xxx comment
trivial: fix comment typo in drivers/ata/pata_hpt37x.c
trivial: typo in kernel-parameters.txt
trivial: fix typo in tracing documentation
trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
trivial: remove unnecessary semicolons
trivial: Fix duplicated word "options" in comment
trivial: kbuild: remove extraneous blank line after declaration of usage()
trivial: improve help text for mm debug config options
trivial: doc: hpfall: accept disk device to unload as argument
trivial: doc: hpfall: reduce risk that hpfall can do harm
trivial: SubmittingPatches: Fix reference to renumbered step
trivial: fix typos "man[ae]g?ment" -> "management"
trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
trivial: fix missing printk space in amd_k7_smp_check
trivial: fix typo s/ketymap/keymap/ in comment
trivial: fix typo "to to" in multiple files
trivial: fix typos in comments s/DGBU/DBGU/
...
NFS may free the server structure without ever having used the
bdi, so we either need to flag the bdi as being uninitialized or
initialize it up front. This does the latter.
This fixes a crash with mounting more than one NFS file system,
should people ever need that kind of obscure NFS functionality.
Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Otherwise we could be attempting to flush data for a writeback
thread and bdi that have already disappeared.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().
Note that ->s_bdi assignment is required for proper writeback!
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Enable hardware memory error handling for NFS
Truncation of data pages at runtime should be safe in NFS,
even when it doesn't support migration so far.
Trond tells me migration is also queued up for 2.6.32.
Acked-by: Trond.Myklebust@netapp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When mounting an "nfs" type file system, recognize "v4," "vers=4," or
"nfsvers=4" mount options, and convert the file system to "nfs4" under
the covers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[trondmy: fixed up binary mount code so it sets the 'version' field too]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Refactor nfs4_get_sb() to allow its guts to be invoked by
nfs_get_sb().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Refactor the part of nfs4_validate_mount_options() that
handles text-based options, so we can call it from the NFSv2/v3
option validation function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The meaning of not specifying the "port=" mount option is different
for "-t nfs" and "-t nfs4" mounts. The default port value for
NFSv2/v3 mounts is 0, but the default for NFSv4 mounts is 2049.
To support "-t nfs -o vers=4", the mount option parser must detect
when "port=" is missing so that the correct default port value can be
set depending on which NFS version is requested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi Trond,
Recently we were observing the behaviour difference between a 2.4.x and
2.6.x kernel with respect to O_EXCL. A comment from 2.4.x era, "For now,
we don't implement O_EXCL." seems inaccurate in TOT.
If so, here's a patch to remove the comment.
This patch is against:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 76db6d9500 (nfs41: add session setup
to the state manager) introduces an infinite loop possibility in the NFSv4
state manager. By first checking nfs4_has_session() before clearing the
NFS4CLNT_SESSION_SETUP flag, it allows for a situation where someone sets
that flag, but it never gets cleared, and so the state manager loops.
In fact commit c3fad1b1aa (nfs41: add session
reset to state manager) causes this to happen every time we get a network
partition error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some releases of Linux rpc.mountd (nfs-utils 1.1.4 and later) return an
empty auth flavor list if no sec= was specified for the export. This is
notably broken server behavior.
The new auth flavor list checking added in a recent commit rejects this
case. The OpenSolaris client does too.
The broken mountd implementation is already widely deployed. To avoid
a behavioral regression, the kernel's mount client skips flavor checking
(ie reverts to the pre-2.6.32 behavior) if mountd returns an empty
flavor list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
generic_file_direct_write() no longer calls generic_osync_inode() so remove the
comment.
CC: linux-nfs@vger.kernel.org
CC: Neil Brown <neilb@suse.de>
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the referral code, use it to look up the new server's ip address if the
fs_locations attribute contains a hostname.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 and NFSv4.1 protocols both allow for the redirection of a client
from one server to another in order to support filesystem migration and
replication. For full protocol support, we need to add the ability to
convert a DNS host name into an IP address that we can feed to the RPC
client.
We'll reuse the sunrpc cache, now that it has been converted to work with
rpc_pipefs.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>