Commit Graph

495400 Commits

Author SHA1 Message Date
Peng Tao
6b7f3cf963 nfs41: pull decode_ds_addr from file layout to generic pnfs
It can be reused by flexfile layout.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Peng Tao
875ae0694b nfs41: pull data server cache from file layout to generic pnfs
Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs.

They can all be reused by flexfile layout as well.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03 11:06:32 -08:00
Tom Haynes
085d1e33a6 pnfs: Do not grab the commit_info lock twice when rescheduling writes
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:31 -08:00
Tom Haynes
f54bcf2ece pnfs: Prepare for flexfiles by pulling out common code
The flexfilelayout driver will share some common code
with the filelayout driver. This set of changes refactors
that common code out to avoid any module depenencies.

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03 11:06:31 -08:00
Trond Myklebust
cc3ea893cb NFS: Client side changes for RDMA
These patches improve the scalability of the NFSoRDMA client and take large
 variables off of the stack.  Additionally, the GFP_* flags are updated to
 match what TCP uses.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUy/21AAoJENfLVL+wpUDrDR4QAJ+aZ1LqO5muJG5LDIReh0p+
 IyvAO3ekWKaaXZd5e/wOviHRRSBCs3oRBm1sNyjOOZYf7nl67RrndjGq2Ccp6tGw
 6V6ligIlhu3QAApFiQmLfpDXQeDBysMfhEQMXDj8kVhAvn3SWwbT6q1Bb8tMZvnu
 lwSPYCuLAdtPzDs3j3Pt3jSrOD7gS06Vtumco1yS82YiLK86E7JiOLb9w5Ltk0Wo
 DT9hZ7JXtflBy3trBsNYQ7GTjoYEp792Ca9DT/nie8/Tuuy3DhooLJKdwVErq/0/
 ycI033mw//RneY+/eF2wLeS4PlzaiSq7DmXKpE5MnkpVuNAdPbHbCuVzuXa4/LLD
 NhKyw/NTAE9ucFJ8e3apj3m42O00bMV0rx0rOUvMMdbNPBKzXk3rrb4uRYfT4k+D
 yss7aEZg0EcDbe9KqfuDdwXVJIZnvAvXD+V1iL4X3QmVM0JyJZouNZPoG7fCwA1g
 uoqGJK9zu7osr59YMomf2okWQFUQYCea5ombXYOo3ZtANw+OYaHDcMKTXaiCtMdA
 CZUEOND8mB/EFtgTMA2TZmDzch8iOOuD7wmtifb8h1DqN/3GxgecBcT5XMCx5zDR
 bMpt9vbCfe4YP6idpl5XEnVbcPG9CZF2W2Hg2gI7axX8R+ZMvwQQlWJjOqxoR6+F
 bpD/jjWFYaS6U7NGyylM
 =YCZe
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: Client side changes for RDMA

These patches improve the scalability of the NFSoRDMA client and take large
variables off of the stack.  Additionally, the GFP_* flags are updated to
match what TCP uses.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-for-3.20' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (21 commits)
  xprtrdma: Update the GFP flags used in xprt_rdma_allocate()
  xprtrdma: Clean up after adding regbuf management
  xprtrdma: Allocate zero pad separately from rpcrdma_buffer
  xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
  xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
  xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
  xprtrdma: Add struct rpcrdma_regbuf and helpers
  xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
  xprtrdma: Simplify synopsis of rpcrdma_buffer_create()
  xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
  xprtrdma: Take struct ib_device_attr off the stack
  xprtrdma: Free the pd if ib_query_qp() fails
  xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
  xprtrdma: Move credit update to RPC reply handler
  xprtrdma: Remove rl_mr field, and the mr_chunk union
  xprtrdma: Remove rpcrdma_ep::rep_ia
  xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
  xprtrdma: Clean up hdrlen
  xprtrdma: Display XIDs in host byte order
  xprtrdma: Modernize htonl and ntohl
  ...
2015-02-03 11:54:58 -05:00
Dan Carpenter
c7c545d4a3 NFS: a couple off by ones
These tests are off by one because if len == sizeof(nfs_export_path)
then we have truncated the name.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30 20:43:30 -05:00
Omar Sandoval
3a7ed3fff3 nfs: prevent truncate on active swapfile
Most filesystems prevent truncation of an active swapfile by way of
inode_newsize_ok, called from inode_change_ok. NFS doesn't call either
from nfs_setattr, presumably because most of these checks are expected
to be done server-side. However, the IS_SWAPFILE check can only be done
client-side, and truncating a swapfile can't possibly be good.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30 20:43:29 -05:00
Jeff Layton
6ffa30d3f7 nfs: don't call blocking operations while !TASK_RUNNING
Bruce reported seeing this warning pop when mounting using v4.1:

     ------------[ cut here ]------------
     WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi
    CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
     0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78
     0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da
     ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000
    Call Trace:
     [<ffffffff8186ac78>] dump_stack+0x4c/0x65
     [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0
     [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
     [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0
     [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430
     [<ffffffff810d941e>] ? groups_alloc+0x3e/0x130
     [<ffffffff810d941e>] groups_alloc+0x3e/0x130
     [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc]
     [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc]
     [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc]
     [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc]
     [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4]
     [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0
     [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4]
     [<ffffffff810d45bf>] kthread+0x11f/0x140
     [<ffffffff810ea815>] ? local_clock+0x15/0x30
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
     [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0
     [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
    ---[ end trace 675220a11e30f4f2 ]---

nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE,
which is just wrong. Fix that by finishing the wait immediately if we've
found that the list has something on it.

Also, we don't expect this kthread to accept signals, so we should be
using a TASK_UNINTERRUPTIBLE sleep instead. That however, opens us up
hung task warnings from the watchdog, so have the schedule_timeout
wake up every 60s if there's no callback activity.

Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-30 20:39:50 -05:00
Chuck Lever
a0a1d50cd1 xprtrdma: Update the GFP flags used in xprt_rdma_allocate()
Reflect the more conservative approach used in the socket transport's
version of this transport method. An RPC buffer allocation should
avoid forcing not just FS activity, but any I/O.

In particular, two recent changes missed updating xprtrdma:

 - Commit c6c8fe79a8 ("net, sunrpc: suppress allocation warning ...")
 - Commit a564b8f039 ("nfs: enable swap on NFS")

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 12:18:48 -05:00
Chuck Lever
df515ca7b3 xprtrdma: Clean up after adding regbuf management
rpcrdma_{de}register_internal() are used only in verbs.c now.

MAX_RPCRDMAHDR is no longer used and can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
c05fbb5a59 xprtrdma: Allocate zero pad separately from rpcrdma_buffer
Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
contiguous memory needed for a buffer pool by moving the zero
pad buffer into a regbuf.

This is for consistency with the other uses of internally
registered memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
6b1184cd4f xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
The rr_base field is currently the buffer where RPC replies land.

An RPC/RDMA reply header lands in this buffer. In some cases an RPC
reply header also lands in this buffer, just after the RPC/RDMA
header.

The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass from server and client. The sum of the
RPC/RDMA reply header size and the RPC reply header size must be
less than this threshold.

The largest RDMA RECV that the client should have to handle is the
size of the inline threshold. The receive buffer should thus be the
size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.

RPC replies received via RDMA WRITE (long replies) are caught in
rq_rcv_buf, which is the second half of the RPC send buffer. Ie,
such replies are not involved in any way with rr_base.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
85275c874e xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
The rl_base field is currently the buffer where each RPC/RDMA call
header is built.

The inline threshold is an agreed-on size limit to for RDMA SEND
operations that pass between client and server. The sum of the
RPC/RDMA header size and the RPC header size must be less than or
equal to this threshold.

Increasing the r/wsize maximum will require MAX_SEGS to grow
significantly, but the inline threshold size won't change (both
sides agree on it). The server's inline threshold doesn't change.

Since an RPC/RDMA header can never be larger than the inline
threshold, make all RPC/RDMA header buffers the size of the
inline threshold.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
0ca77dc372 xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
Because internal memory registration is an expensive and synchronous
operation, xprtrdma pre-registers send and receive buffers at mount
time, and then re-uses them for each RPC.

A "hardway" allocation is a memory allocation and registration that
replaces a send buffer during the processing of an RPC. Hardway must
be done if the RPC send buffer is too small to accommodate an RPC's
call and reply headers.

For xprtrdma, each RPC send buffer is currently part of struct
rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
the address of an RPC send buffer, can find its matching struct
rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.

That means that hardway currently has to replace a whole rpcrmda_req
when it replaces an RPC send buffer. This is often a fairly hefty
chunk of contiguous memory due to the size of the rl_segments array
and the fact that both the send and receive buffers are part of
struct rpcrdma_req.

Some obscure re-use of fields in rpcrdma_req is done so that
xprt_rdma_free() can detect replaced rpcrdma_req structs, and
restore the original.

This commit breaks apart the RPC send buffer and struct rpcrdma_req
so that increasing the size of the rl_segments array does not change
the alignment of each RPC send buffer. (Increasing rl_segments is
needed to bump up the maximum r/wsize for NFS/RDMA).

This change opens up some interesting possibilities for improving
the design of xprt_rdma_allocate().

xprt_rdma_allocate() is now the one place where RPC send buffers
are allocated or re-allocated, and they are now always left in place
by xprt_rdma_free().

A large re-allocation that includes both the rl_segments array and
the RPC send buffer is no longer needed. Send buffer re-allocation
becomes quite rare. Good send buffer alignment is guaranteed no
matter what the size of the rl_segments array is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
9128c3e794 xprtrdma: Add struct rpcrdma_regbuf and helpers
There are several spots that allocate a buffer via kmalloc (usually
contiguously with another data structure) and then register that
buffer internally. I'd like to split the buffers out of these data
structures to allow the data structures to scale.

Start by adding functions that can kmalloc and register a buffer,
and can manage/preserve the buffer's associated ib_sge and ib_mr
fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
1392402c40 xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
Move the details of how to create and destroy rpcrdma_req and
rpcrdma_rep structures into helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
ac920d04a7 xprtrdma: Simplify synopsis of rpcrdma_buffer_create()
Clean up: There is one call site for rpcrdma_buffer_create(). All of
the arguments there are fields of an rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
ce1ab9ab47 xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
Reduce stack footprint of the connection upcall handler function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
7bc7972cdd xprtrdma: Take struct ib_device_attr off the stack
Device attributes are large, and are used in more than one place.
Stash a copy in dynamically allocated memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
5ae711a246 xprtrdma: Free the pd if ib_query_qp() fails
If ib_query_qp() fails or the memory registration mode isn't
supported, don't leak the PD. An orphaned IB/core resource will
cause IB module removal to hang.

Fixes: bd7ed1d133 ("RPC/RDMA: check selected memory registration ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
afadc468eb xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
Clean up: The rep_func field always refers to rpcrdma_conn_func().
rep_func should have been removed by commit b45ccfd25d ("xprtrdma:
Remove MEMWINDOWS registration modes").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
eba8ff660b xprtrdma: Move credit update to RPC reply handler
Reduce work in the receive CQ handler, which can be run at hardware
interrupt level, by moving the RPC/RDMA credit update logic to the
RPC reply handler.

This has some additional benefits: More header sanity checking is
done before trusting the incoming credit value, and the receive CQ
handler no longer touches the RPC/RDMA header (the CPU stalls while
waiting for the header contents to be brought into the cache).

This further extends work begun by commit e7ce710a88 ("xprtrdma:
Avoid deadlock when credit window is reset").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
3eb3581066 xprtrdma: Remove rl_mr field, and the mr_chunk union
Clean up: Since commit 0ac531c183 ("xprtrdma: Remove REGISTER
memory registration mode"), the rl_mr pointer is no longer used
anywhere.

After removal, there's only a single member of the mr_chunk union,
so mr_chunk can be removed as well, in favor of a single pointer
field.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
5d410ba061 xprtrdma: Remove rpcrdma_ep::rep_ia
Clean up: This field is not used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
5abefb861f xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
Clean up: Use consistent field names in struct rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
f2846481b4 xprtrdma: Clean up hdrlen
Clean up: Replace naked integers with a documenting macro.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
052151a979 xprtrdma: Display XIDs in host byte order
xprtsock.c and the backchannel code display XIDs in host byte order.
Follow suit in xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
284f4902a6 xprtrdma: Modernize htonl and ntohl
Clean up: Replace htonl and ntohl with the be32 equivalents.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
8502427ccd xprtrdma: human-readable completion status
Make it easier to grep the system log for specific error conditions.

The wc.opcode field is not included because opcode numbers are
sparse, and because wc.opcode is not necessarily valid when
completion reports an error.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:47 -05:00
Trond Myklebust
cf6726e2ee NFSv4: Deal with atomic upgrades of an existing delegation
Ensure that we deal correctly with the case where the server sends us a
newer instance of the same delegation. If the stateids match, but the
sequence numbers differ, then treat the new delegation as if it were
an atomic upgrade.

Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>
2015-01-24 18:46:51 -05:00
Trond Myklebust
89f0ff386c NFSv4.1: Replace usage of nfs_client->cl_addr in encode_create_session
Replace the current code with something that is a little closer to what
net/sunrpc/auth_unix.c uses.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:50 -05:00
Trond Myklebust
c4a7ca7749 SUNRPC: Allow waiting on memory allocation
We should be safe now, as long as we don't do GFP_IO or higher allocations

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:50 -05:00
Trond Myklebust
127b21b89f SUNRPC: Adjust rpciod workqueue parameters
Increase the concurrency level for rpciod threads to allow for allocations
etc that happen in the RPCSEC_GSS layer. Also note that the NFSv4 byte range
locks may now need to allocate memory from inside rpciod.

Add the WQ_HIGHPRI flag to improve latency guarantees while we're at it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:49 -05:00
Trond Myklebust
40dd4b7aee NFSv4.1: Optimise layout return-on-close
Optimise the layout return on close code by ensuring that

1) Add a check for whether we hold a layout before taking any spinlocks
2) Only take the spin lock once
3) Use nfs_state->state to speed up open file checks

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:48 -05:00
Trond Myklebust
b4019c0e21 NFSv4.1: Allow parallel LOCK/LOCKU calls
Note, however, that we still serialise on the open stateid if the lock
stateid is unconfirmed. Hopefully that will not prove too much of a
burden for first time locks; it should leave the ability to parallelise
OPENs unchanged, since they no longer call the serialisation primitives.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:48 -05:00
Trond Myklebust
c69899a17c NFSv4: Update of VFS byte range lock must be atomic with the stateid update
Ensure that we test the lock stateid remained unchanged while we were
updating the VFS tracking of the byte range lock. Have the process
replay the lock to the server if we detect that was not the case.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:47 -05:00
Trond Myklebust
425c1d4e5b NFSv4: Fix lock on-wire reordering issues
This patch ensures that the server cannot reorder our LOCK/LOCKU
requests if they are sent in parallel on the wire.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:47 -05:00
Trond Myklebust
6b447539aa NFSv4: Always do open_to_lock_owner if the lock stateid is uninitialised
The original text in RFC3530 was terribly confusing since it conflated
lockowners and lock stateids. RFC3530bis clarifies that you must use
open_to_lock_owner when there is no lock state for that file+lockowner
combination.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 18:46:46 -05:00
Trond Myklebust
39071e6fff NFSv4: Fix atomicity problems with lock stateid updates
When we update the lock stateid, we really do need to ensure that this is
done under the state->state_lock, and that we are indeed only updating
confirmed locks with a newer version of the same stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-24 17:27:41 -05:00
Trond Myklebust
63f5f796af NFSv4.1: Allow parallel OPEN/OPEN_DOWNGRADE/CLOSE
Remove the serialisation of OPEN/OPEN_DOWNGRADE and CLOSE calls for the
case of NFSv4.1 and newer.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23 23:06:45 -05:00
Trond Myklebust
a67964197c NFSv4: Check for NULL argument in nfs_*_seqid() functions
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23 23:06:45 -05:00
Trond Myklebust
badc76dd0d NFSv4: Convert nfs_alloc_seqid() to return an ERR_PTR() if allocation fails
When we relax the sequencing on the NFSv4.1 OPEN/CLOSE code, we will want
to use the value NULL to indicate that no sequencing is needed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23 23:06:44 -05:00
Trond Myklebust
f95549cf24 NFSv4: More CLOSE/OPEN races
If an OPEN RPC call races with a CLOSE or OPEN_DOWNGRADE so that it
updates the nfs_state structure before the CLOSE/OPEN_DOWNGRADE has
a chance to do so, then we know that the state->flags need to be
recalculated from scratch.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23 23:06:44 -05:00
Trond Myklebust
566fcec60b NFSv4: Fix an atomicity problem in CLOSE
If we are to remove the serialisation of OPEN/CLOSE, then we need to
ensure that the stateid sent as part of a CLOSE operation does not
change after we test the state in nfs4_close_prepare.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-23 19:22:39 -05:00
Anna Schumaker
2ef47eb1ae NFS: Fix use of nfs_attr_use_mounted_on_fileid()
This function call was being optimized out during nfs_fhget(), leading
to situations where we have a valid fileid but still want to use the
mounted_on_fileid.  For example, imagine we have our server configured
like this:

server % df
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.1G  6.5G  1.9G  78% /
/dev/vdb1       487M  2.3M  456M   1% /exports
/dev/vdc1       487M  2.3M  456M   1% /exports/vol1
/dev/vdd1       487M  2.3M  456M   1% /exports/vol2

If our client mounts /exports and tries to do a "chown -R" across the
entire mountpoint, we will get a nasty message warning us about a circular
directory structure.  Running chown with strace tells me that each directory
has the same device and inode number:

newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0

With this patch the mounted_on_fileid values are used for st_ino, so the
directory loop warning isn't reported.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:15:41 -05:00
Trond Myklebust
3175e1dcec NFSv4.1: Fix an Oops in nfs41_walk_client_list
If we start state recovery on a client that failed to initialise correctly,
then we are very likely to Oops.

Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:10:09 -05:00
Peng Tao
ee8a1a8b16 nfs: fix dio deadlock when O_DIRECT flag is flipped
We only support swap file calling nfs_direct_IO. However, application
might be able to get to nfs_direct_IO if it toggles O_DIRECT flag
during IO and it can deadlock because we grab inode->i_mutex in
nfs_file_direct_write(). So return 0 for such case. Then the generic
layer will fall back to buffer IO.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:10:08 -05:00
Linus Torvalds
ec6f34e5b5 Linux 3.19-rc5 2015-01-18 18:02:20 +12:00
Linus Torvalds
d0ac5d8e67 ARM: SoC fixes
We've been sitting on our fixes branch for a while, so this batch is
 unfortunately on the large side.
 
 A lot of these are tweaks and fixes to device trees, fixing various bugs
 around clocks, reg ranges, etc. There's also a few defconfig updates
 (which are on the late side, no more of those).
 
 All in all the diffstat is bigger than ideal at this time, but the nothing
 in here seems particularly risky.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUudSXAAoJEIwa5zzehBx3WmkP/RDPvMHGllPxZ7jDTBK2scGY
 U0zg3zeIKbJANke8BZNaYNnYmLtvOcwiqA80CsOE23+l1zv2tSf8v/je1dENFmzr
 rHahs1ZAQ2fv5k1NCazOxkeTcu5frcDujoHkDVo2b4ofLzhlTYP5UEkapLBdihrB
 KLGCXynjmMLXDViLw7mhaM0gZOxyyO3CTaBUJeLPWqTpy26LliFdJfDxe+oa+dx9
 CX3YbfKKHJ9ENFwHB6oLk0cQy1eLieWTcYJk06wUsCdcsoZmWySiaWpLFb9MIyoT
 eLqT4k8cNMNdB49GNvwZz7NxbG9RetzNd5Ixglr9NodB3mNxpW3PyU3lxrRUSc4X
 6Ij9rgFWwfRKlmCFZnHF5mxSx7z4NoBQJWsVBB4EFjfyX8eVkZ+Gu82gK6V/2HNa
 vpMAqmNCM99VXx4nsoiNBpYVShAgXxC0r8D5MKNaITZ/Z7tarJe/M2JDnxyR+r5L
 DCyjj3swQ21hKMv8FFXkOSfXir9v9bQg5KMeA7HNPCsKjvcWxpHGQdVZVkGQ3D8J
 umFsForMr3AY0G+HtmP+ntVEEB8g8AiTQgiC7gyfAKhJhjMd/vYmJdsVvsXk2SL/
 yh1y08f46FFasbVR2TTYPt6njj4FdcbDDsB5ks2gBpkb4qjutoMlNRDOYbfoN7eX
 VTacVVRJy4ftSLeNnN70
 =lJPi
 -----END PGP SIGNATURE-----

Merge tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Olof Johansson:
 "We've been sitting on our fixes branch for a while, so this batch is
  unfortunately on the large side.

  A lot of these are tweaks and fixes to device trees, fixing various
  bugs around clocks, reg ranges, etc.  There's also a few defconfig
  updates (which are on the late side, no more of those).

  All in all the diffstat is bigger than ideal at this time, but nothing
  in here seems particularly risky"

* tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (31 commits)
  reset: sunxi: fix spinlock initialization
  ARM: dts: disable CCI on exynos5420 based arndale-octa
  drivers: bus: check cci device tree node status
  ARM: rockchip: disable jtag/sdmmc autoswitching on rk3288
  ARM: nomadik: fix up leftover device tree pins
  ARM: at91: board-dt-sama5: add phy_fixup to override NAND_Tree
  ARM: at91/dt: sam9263: Add missing clocks to lcdc node
  ARM: at91: sama5d3: dt: correct the sound route
  ARM: at91/dt: sama5d4: fix the timer reg length
  ARM: exynos_defconfig: Enable LM90 driver
  ARM: exynos_defconfig: Enable options for display panel support
  arm: dts: Use pmu_system_controller phandle for dp phy
  ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances
  ARM: dts: berlin: correct BG2Q's SM GPIO location.
  ARM: dts: berlin: add broken-cd and set bus width for eMMC in Marvell DMP DT
  ARM: dts: berlin: fix io clk and add missing core clk for BG2Q sdhci2 host
  ARM: dts: Revert disabling of smc91x for n900
  ARM: dts: imx51-babbage: Fix ULPI PHY reset modelling
  ARM: dts: dra7-evm: fix qspi device tree partition size
  ARM: omap2plus_defconfig: use CONFIG_CPUFREQ_DT
  ...
2015-01-18 18:00:40 +12:00
Linus Torvalds
12ba8571ab Small number of fixes for clock drivers and a single null pointer
dereference fix in the framework core code. The driver fixes vary from
 fixing section mismatch warnings to preventing machines from hanging
 (and preventing developers from crying).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUuuT0AAoJEDqPOy9afJhJTCIP/iZ2AtnG/5mbYR8i7FzfSR8y
 gm+vpTvKAhIkWxj1DNUMtSTRBvfxO8xpnsXJ4BibIhmtO8yJbYI8LIEycIJ4TcWC
 4s0MDQsaMGVEfSI8K+OoFsXI+WzU1j28le2yYE6oHVuLe7gdLnpx6sheNdnL0XxX
 sv8HoI/pTFpw0jI20EZUcX/pEELGWlAZN9NCpW74cbVl/wusvV20CYG5n879Sg8n
 Zl26wXusys83+0mFgs6+Kvpeuxo78XXveTSvB+aJ5VEWDfm10kE5bqyo6iOL0rpI
 luGIMf6Uufq6+1Hzp8whgE59FOvugNjay3OR+pz7P+gWk1Ea5c9qXpBtg3gEtjF9
 JoMpjPSXAnGgjhJsuZhO4+z23OhpB+FcuC1x6EcL0i6iqpzbNpJTYa8eNMOOt8FR
 h3YCzr32IHZ6a2YutCuEdof8d9GZ5I2r8G9p8ezv7CJEBHIrLVTyu3xELwN9Ijuj
 p83716w0NU2avN2N6nF2sAF26UJhG/GbmQWkOSnj2cmeDI5xxnClJD/3etgtIaIj
 RA/WLVfUscszR52IZ2V56KKTrRJkNz04Zsx803yNZKXkNIrJ+I04xBAvQETKk24f
 fImY65mkJWC8iAErEKHYZi8WxdHAu5xRYwL34HvIfpDAsHvqHNZBltYTee6HuM2k
 wbD42D8XsOoBfZwg07RF
 =B+t3
 -----END PGP SIGNATURE-----

Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux

Pull clock driver fixes from Mike Turquette:
 "Small number of fixes for clock drivers and a single null pointer
  dereference fix in the framework core code.

  The driver fixes vary from fixing section mismatch warnings to
  preventing machines from hanging (and preventing developers from
  crying)"

* tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux:
  clk: fix possible null pointer dereference
  Revert "clk: ppc-corenet: Fix Section mismatch warning"
  clk: rockchip: fix deadlock possibility in cpuclk
  clk: berlin: bg2q: remove non-exist "smemc" gate clock
  clk: at91: keep slow clk enabled to prevent system hang
  clk: rockchip: fix rk3288 cpuclk core dividers
  clk: rockchip: fix rk3066 pll lock bit location
  clk: rockchip: Fix clock gate for rk3188 hclk_emem_peri
  clk: rockchip: add CLK_IGNORE_UNUSED flag to fix rk3066/rk3188 USB Host
2015-01-18 15:29:11 +12:00