Commit Graph

5224 Commits

Author SHA1 Message Date
Trond Myklebust 7b0df92ac1 pNFS/flexfiles: Process writeback resends from nfsiod context as well
Although the writeback resends are more robust than the reads, since they
are not immediately rescheduled by the same thread, we are better off
processing them in the same place as the reads.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 09:25:27 -04:00
Trond Myklebust 42f86b44a4 pNFS/flexfiles: Don't tie up all the rpciod threads in resends
We do not want to have rpciod threads perform recursive calls into the
RPC layer since that can deadlock. In particular, having to wait for
a layoutget can be nasty... We want rather to defer scheduling those
retries until we're in the rpc_release() callback, since that is
called from the nfsiod workqueue.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 09:25:27 -04:00
Trond Myklebust c8bf707353 pNFS: Don't send layoutreturn if the layout is already invalid
If the layout was invalidated due to a reboot, then don't try to send
a layoutreturn for it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 08:52:27 -04:00
Trond Myklebust 2dbf8dffbf pNFS: Always free the session slot on error in nfs4_layoutget_handle_exception
Right now, we can call nfs_commit_inode() while holding the session slot,
which could lead to NFSv4 deadlocks. Ensure we only keep the slot if
the server returned a layout that we have to process.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 08:52:27 -04:00
Linus Torvalds 7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Anna Schumaker d5681f59ee NFS: Fix an rcu deadlock in nfs_delegation_find_inode()
I was able to reproduce this pretty regularily using xfstests
generic/013 on NFS v4.0.

Reported-by: Ross Zwisler <Ross.Zwisler@linux.intel.com>
Fixes: 6c34265502 (NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab())
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-14 14:05:38 -04:00
Arnd Bergmann 15eefe2a99 Merge branch 'vfs_timespec64' of https://github.com/deepa-hub/vfs into vfs-timespec64
Pull the timespec64 conversion from Deepa Dinamani:
 "The series aims to switch vfs timestamps to use
  struct timespec64. Currently vfs uses struct timespec,
  which is not y2038 safe.

  The flag patch applies cleanly. I've not seen the timestamps
  update logic change often. The series applies cleanly on 4.17-rc6
  and linux-next tip (top commit: next-20180517).

  I'm not sure how to merge this kind of a series with a flag patch.
  We are targeting 4.18 for this.
  Let me know if you have other suggestions.

  The series involves the following:
  1. Add vfs helper functions for supporting struct timepec64 timestamps.
  2. Cast prints of vfs timestamps to avoid warnings after the switch.
  3. Simplify code using vfs timestamps so that the actual
     replacement becomes easy.
  4. Convert vfs timestamps to use struct timespec64 using a script.
     This is a flag day patch.

  I've tried to keep the conversions with the script simple, to
  aid in the reviews. I've kept all the internal filesystem data
  structures and function signatures the same.

  Next steps:
  1. Convert APIs that can handle timespec64, instead of converting
     timestamps at the boundaries.
  2. Update internal data structures to avoid timestamp conversions."

I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-14 14:54:00 +02:00
Linus Torvalds b08fc5277a - Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
 - Explicitly reported overflow fixes (Silvio, Kees)
 - Add missing kvcalloc() function (Kees)
 - Treewide conversions of allocators to use either 2-factor argument
   variant when available, or array_size() and array3_size() as needed (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsgVtMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJhsJEACLYe2EbwLFJz7emOT1KUGK5R1b
 oVxJog0893WyMqgk9XBlA2lvTBRBYzR3tzsadfYo87L3VOBzazUv0YZaweJb65sF
 bAvxW3nY06brhKKwTRed1PrMa1iG9R63WISnNAuZAq7+79mN6YgW4G6YSAEF9lW7
 oPJoPw93YxcI8JcG+dA8BC9w7pJFKooZH4gvLUSUNl5XKr8Ru5YnWcV8F+8M4vZI
 EJtXFmdlmxAledUPxTSCIojO8m/tNOjYTreBJt9K1DXKY6UcgAdhk75TRLEsp38P
 fPvMigYQpBDnYz2pi9ourTgvZLkffK1OBZ46PPt8BgUZVf70D6CBg10vK47KO6N2
 zreloxkMTrz5XohyjfNjYFRkyyuwV2sSVrRJqF4dpyJ4NJQRjvyywxIP4Myifwlb
 ONipCM1EjvQjaEUbdcqKgvlooMdhcyxfshqJWjHzXB6BL22uPzq5jHXXugz8/ol8
 tOSM2FuJ2sBLQso+szhisxtMd11PihzIZK9BfxEG3du+/hlI+2XgN7hnmlXuA2k3
 BUW6BSDhab41HNd6pp50bDJnL0uKPWyFC6hqSNZw+GOIb46jfFcQqnCB3VZGCwj3
 LH53Be1XlUrttc/NrtkvVhm4bdxtfsp4F7nsPFNDuHvYNkalAVoC3An0BzOibtkh
 AtfvEeaPHaOyD8/h2Q==
 =zUUp
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull more overflow updates from Kees Cook:
 "The rest of the overflow changes for v4.18-rc1.

  This includes the explicit overflow fixes from Silvio, further
  struct_size() conversions from Matthew, and a bug fix from Dan.

  But the bulk of it is the treewide conversions to use either the
  2-factor argument allocators (e.g. kmalloc(a * b, ...) into
  kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
  b) into vmalloc(array_size(a, b)).

  Coccinelle was fighting me on several fronts, so I've done a bunch of
  manual whitespace updates in the patches as well.

  Summary:

   - Error path bug fix for overflow tests (Dan)

   - Additional struct_size() conversions (Matthew, Kees)

   - Explicitly reported overflow fixes (Silvio, Kees)

   - Add missing kvcalloc() function (Kees)

   - Treewide conversions of allocators to use either 2-factor argument
     variant when available, or array_size() and array3_size() as needed
     (Kees)"

* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
  treewide: Use array_size in f2fs_kvzalloc()
  treewide: Use array_size() in f2fs_kzalloc()
  treewide: Use array_size() in f2fs_kmalloc()
  treewide: Use array_size() in sock_kmalloc()
  treewide: Use array_size() in kvzalloc_node()
  treewide: Use array_size() in vzalloc_node()
  treewide: Use array_size() in vzalloc()
  treewide: Use array_size() in vmalloc()
  treewide: devm_kzalloc() -> devm_kcalloc()
  treewide: devm_kmalloc() -> devm_kmalloc_array()
  treewide: kvzalloc() -> kvcalloc()
  treewide: kvmalloc() -> kvmalloc_array()
  treewide: kzalloc_node() -> kcalloc_node()
  treewide: kzalloc() -> kcalloc()
  treewide: kmalloc() -> kmalloc_array()
  mm: Introduce kvcalloc()
  video: uvesafb: Fix integer overflow in allocation
  UBIFS: Fix potential integer overflow in allocation
  leds: Use struct_size() in allocation
  Convert intel uncore to struct_size
  ...
2018-06-12 18:28:00 -07:00
Kees Cook 6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds 0725d4e1b8 NFS client updates for Linux 4.18
Highlights include:
 
 Stable fixes:
 - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
 - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
 - Revert an incorrect change to the NFSv4.1 callback channel
 - Fix a bug in the NFSv4.1 sequence error handling
 
 Features and optimisations:
 - Support for piggybacking a LAYOUTGET operation to the OPEN compound
 - RDMA performance enhancements to deal with transport congestion
 - Add proper SPDX tags for NetApp-contributed RDMA source
 - Do not request delegated file attributes (size+change) from the server
 - Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN
 - Optimise away unnecessary lookups for rename targets
 - Misc performance improvements when freeing NFSv4 delegations
 
 Bugfixes and cleanups:
 - Try to fail quickly if proto=rdma
 - Clean up RDMA receive trace points
 - Fix sillyrename to return the delegation when appropriate
 - Misc attribute revalidation fixes
 - Immediately clear the pNFS layout on a file when the server returns ESTALE
 - Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab()
 - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbH8gIAAoJEA4mA3inWBJcpzYQAJYY3ykt9oLQgm/2b/D/weDe
 6890M9W5nIeuZq5soWSpYsZTxqIFbGV4laG/eCTW1gUN1TitSZsoOp7kqhRHXOjq
 Rv3ZvjlZsP2qv2SnzsEmhJsynfyB46d19smSTJhgQ8dnXhaZv04Wsd4krLHx0z6p
 uUUis5Q1m+vL7HsFPp3iUareO/DFKeSkw2cQ2V5ksTIEiAzX7GC+Ex/KKWf82nrJ
 hm7+Nq7rLf1QHJkQvsc3fYCMR4gIzEwUu6F8RyxCoAVgD6O90Hx6NbxnINaHDD4N
 U0nRP5LwCyN9hbPWvwcH7Sn4ePDTos2yj2tFO5NP9btTLDVLFSGYZ2a74d9PRdAf
 9jn6f6juSDwI7T6NXvkHzzkJG6Or9ABAUZo+yX5JoD6lmgOcPUJpLRy6fu7UxAuN
 a5OZ7d9edYpOi0Kys8sDSIlLlxZtFkvybOMVuI3dSHsI+c0g39w8oarpqT2wXWMs
 /ZtFz0FCreHhKkNtz7Z49z1UQHDv/XYM0WkcO+eaeK58RLIEE0pZHoMvPKP63lkI
 nbbgHvBRAu38Jtvvu65Hpb/VpBcqNGM5hjN1cfW/BOqAPKW23s4vWVj+/1silfW/
 uw0MkNrDC9endoALp/YMCcMwPvEw9Awt9y4KjMgfVgSnKwXd0HaSZ2zE6aJU3Wry
 Fy2Tv0e0OH3z9Bi/LNuJ
 =YWSl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message

   - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()

   - Revert an incorrect change to the NFSv4.1 callback channel

   - Fix a bug in the NFSv4.1 sequence error handling

  Features and optimisations:

   - Support for piggybacking a LAYOUTGET operation to the OPEN compound

   - RDMA performance enhancements to deal with transport congestion

   - Add proper SPDX tags for NetApp-contributed RDMA source

   - Do not request delegated file attributes (size+change) from the
     server

   - Optimise away a GETATTR in the lookup revalidate code when doing
     NFSv4 OPEN

   - Optimise away unnecessary lookups for rename targets

   - Misc performance improvements when freeing NFSv4 delegations

  Bugfixes and cleanups:

   - Try to fail quickly if proto=rdma

   - Clean up RDMA receive trace points

   - Fix sillyrename to return the delegation when appropriate

   - Misc attribute revalidation fixes

   - Immediately clear the pNFS layout on a file when the server returns
     ESTALE

   - Return NFS4ERR_DELAY when delegation/layout recalls fail due to
     igrab()

   - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY"

* tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits)
  skip LAYOUTRETURN if layout is invalid
  NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
  NFSv4: Fix a typo in nfs41_sequence_process
  NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
  NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
  NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
  NFSv4.0: Remove transport protocol name from non-UCS client ID
  NFSv4.0: Remove cl_ipaddr from non-UCS client ID
  NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
  NFS: Filter cache invalidation when holding a delegation
  NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
  NFS: Improve caching while holding a delegation
  NFS: Fix attribute revalidation
  NFS: fix up nfs_setattr_update_inode
  NFSv4: Ensure the inode is clean when we set a delegation
  NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
  NFSv4: Don't ask for delegated attributes when adding a hard link
  NFSv4: Don't ask for delegated attributes when revalidating the inode
  NFS: Pass the inode down to the getattr() callback
  NFSv4: Don't request size+change attribute if they are delegated to us
  ...
2018-06-12 10:09:03 -07:00
Olga Kornievskaia 93b7f7ad20 skip LAYOUTRETURN if layout is invalid
Currently, when IO to DS fails, client returns the layout and
retries against the MDS. However, then on umounting (inode eviction)
it returns the layout again.

This is because pnfs_return_layout() was changed in
commit d78471d32b ("pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR")
to always set NFS_LAYOUT_RETURN_REQUESTED so even if we returned
the layout, it will be returned again. Instead, let's also check
if we have already marked the layout invalid.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-12 08:48:04 -04:00
Trond Myklebust f9312a5410 NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
If the server returns NFS4ERR_SEQ_FALSE_RETRY or NFS4ERR_RETRY_UNCACHED_REP,
then it thinks we're trying to replay an existing request. If so, then
let's just bump the sequence ID and retry the operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-09 19:10:31 -04:00
Trond Myklebust 995891006c NFSv4: Fix a typo in nfs41_sequence_process
We want to compare the slot_id to the highest slot number advertised by the
server.

Fixes: 3be0f80b5f ("NFSv4.1: Fix up replays of interrupted requests")
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-09 12:59:55 -04:00
Trond Myklebust fc40724fc6 NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
The correct behaviour for NFSv4 sequence IDs is to wrap around
to the value 0 after 0xffffffff.
See https://tools.ietf.org/html/rfc5661#section-2.10.6.1

Fixes: 5f83d86cf5 ("NFSv4.x: Fix wraparound issues when validing...")
Cc: stable@vger.kernel.org # 4.6+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-09 12:56:30 -04:00
Trond Myklebust ce5624f7e6 NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-08 16:36:10 -04:00
Trond Myklebust 6c34265502 NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
If the attempt to recall the delegation fails because the inode is
in the process of being evicted from cache, then use NFS4ERR_DELAY
to ask the server to retry later.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-08 16:36:10 -04:00
Chuck Lever 025bb9f872 NFSv4.0: Remove transport protocol name from non-UCS client ID
Commit 69dd716c5f ("NFSv4: Add socket proto argument to
setclientid") (2007) added the transport protocol name to the client
ID string, but the patch description doesn't explain why this was
necessary.

At that time, the only transport protocol name that would have been
used is "tcp" (for both IPv4 and IPv6), resulting in no additional
distinctiveness of the client ID string.

Since there is one client instance, the server should recognize it's
state whether the client is connecting via TCP or RDMA. Same client,
same lease.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-06 11:45:52 -04:00
Chuck Lever 848a4eb2e3 NFSv4.0: Remove cl_ipaddr from non-UCS client ID
It is possible for two distinct clients to have the same cl_ipaddr:

 - if the client admin disables callback with clientaddr=0.0.0.0 on
   more than one client

 - if two clients behind separate NATs use the same private subnet
   number

 - if the client admin specifies the same address via clientaddr=
   mount option (pointing the server at the same NAT box, for
   example)

Because of the way the Linux NFSv4.0 client constructs its client
ID string by default, such clients could interfere with each others'
lease state when mounting the same server:

	scnprintf(str, len, "Linux NFSv4.0 %s/%s %s",
		clp->cl_ipaddr,
		rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_ADDR),
		rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_PROTO));

cl_ipaddr is set to the value of the clientaddr= mount option. Two
clients whose addresses are 192.168.3.77 that mount the same server
(whose public IP address is, say, 3.4.5.6) would both generate the
same client ID string when sending a SETCLIENTID:

  Linux NFSv4.0 192.168.3.77/3.4.5.6 tcp

and thus the server would not be able to distinguish the clients'
leases. If both clients are using AUTH_SYS when sending SETCLIENTID
then the server could possibly permit the two clients to interfere
with or purge each others' leases.

To better ensure that Linux's NFSv4.0 client ID strings are distinct
in these cases, remove cl_ipaddr from the client ID string and
replace it with something more likely to be unique. Note that the
replacement looks a lot like the uniform client ID string.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-06 11:45:44 -04:00
Deepa Dinamani 95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Trond Myklebust 977294c719 NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
Fix a compiler warning:
fs/nfs/nfs4proc.c:910:13: warning: 'nfs4_layoutget_release' defined but not used [-Wunused-function]
 static void nfs4_layoutget_release(void *calldata)
             ^~~~~~~~~~~~~~~~~~~~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-05 10:38:39 -04:00
Trond Myklebust 3f0b3cf46e NFS: Filter cache invalidation when holding a delegation
If the client holds a delegation, then ensure we filter out attempts
to invalidate the size, owner, group owner, or mode unless we made the
change, in which case, check that NFS_INO_REVAL_FORCED is set by the
caller.
Always filter out attempts to invalidate the change attribute and
size, since we are authoritative for those.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust 4ebe83af20 NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
If we hold a delegation, we should not need to call
nfs_check_inode_attributes() since we already know which attributes
are valid, and which ones may still need revalidation. The state
of the NFS_INO_REVAL_FORCED flag is therefore irrelevant.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust c80d17c55d NFS: Improve caching while holding a delegation
Make sure that the client completely ignores change attribute and size
changes on the server when it holds a delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust 0b467264d0 NFS: Fix attribute revalidation
Don't mark attributes as invalid just because they have changed. Instead,
for the purposes of adjusting the attribute cache timeout, keep a
separate variable that tracks whether or not a change occurred.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust 6a97d02dfe NFS: fix up nfs_setattr_update_inode
Always try to set the attributes, even if we don't have a valid struct
nfs_fattr.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust 97c2c17af9 NFSv4: Ensure the inode is clean when we set a delegation
If there are attributes that are still invalid when we set a delegation,
then we need to set the NFS_INO_REVAL_FORCED flag.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust 7c6726546c NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
If we hold a delegation, we don't need to care about whether or not
the inode attributes are up to date. We know we can cache the results
of this call regardless.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:20 -04:00
Trond Myklebust 2f28dc385a NFSv4: Don't ask for delegated attributes when adding a hard link
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust 771734f291 NFSv4: Don't ask for delegated attributes when revalidating the inode
Again, when revalidating the inode, we don't need to ask for attributes
for which we are authoritative.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust a841b54dbd NFS: Pass the inode down to the getattr() callback
Allow the getattr() callback to check things like whether or not we hold
a delegation so that it can adjust the attributes that it is asking for.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust 30846df06f NFSv4: Don't request size+change attribute if they are delegated to us
When we hold a delegation, we should not need to request attributes such
as the file size or the change attribute. For some servers, avoiding
asking for these unneeded attributes can improve the overall system
performance.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust ae55e59da0 pnfs: Don't release the sequence slot until we've processed layoutget on open
If the server recalls the layout that was just handed out, we risk hitting
a race as described in RFC5661 Section 2.10.6.3 unless we ensure that we
release the sequence slot after processing the LAYOUTGET operation that
was sent as part of the OPEN compound.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:12 -04:00
Trond Myklebust 32f1c28f3d pnfs: Don't call commit on failed layoutget-on-open
If the layoutget on open call failed, we can't really commit the inode,
so don't bother calling it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:12 -04:00
Trond Myklebust 64294b08f9 pNFS: Don't send LAYOUTGET on OPEN for read, if we already have cached data
If we're only opening the file for reading, and the file is empty and/or
we already have cached data, then heuristically optimise away the
LAYOUTGET.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:12 -04:00
Trond Myklebust 8dc96566c0 NFSv4/pnfs: Don't switch off layoutget-on-open for transient errors
Ensure that we only switch off the LAYOUTGET operation in the OPEN
compound when the server is truly broken, and/or it is complaining
that the compound is too large.
Currently, we end up turning off the functionality permanently,
even for transient errors such as EACCES or ENOSPC.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Trond Myklebust d49e0d5b99 NFSv4/pnfs: Ensure pnfs_parse_lgopen() won't try to parse uninitialised data
We need to ensure that pnfs_parse_lgopen() doesn't try to parse a
struct nfs4_layoutget_res that was not filled by a successful call
to decode_layoutget(). This can happen if we performed a cached open,
or if either the OP_ACCESS or OP_GETATTR operations preceding the
OP_LAYOUTGET in the compound returned an error.

By initialising the 'status' field to NFS4ERR_DELAY, we ensure that
pnfs_parse_lgopen() won't try to interpret the structure.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 30ae2412e9 pnfs: Fix manipulation of NFS_LAYOUT_FIRST_LAYOUTGET
The flag was not always being cleared after LAYOUTGET on OPEN.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman c49b5209f9 pnfs: Add barrier to prevent lgopen using LAYOUTGET during recall
Since the LAYOUTGET on OPEN can be sent without prior inode information,
existing methods to prevent LAYOUTGET from being sent while processing
CB_LAYOUTRECALL don't work.  Track if a recall occurred while LAYOUTGET
was being sent, and if so ignore the results.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 6e01260cee pnfs: Stop attempting LAYOUTGET on OPEN on failure
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 78746a384c pnfs: Add LAYOUTGET to OPEN of an existing file
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Trond Myklebust 29a8bfe52d pNFS: Refactor nfs4_layoutget_release()
Move the actual freeing of the struct nfs4_layoutget into fs/nfs/pnfs.c
where it can be reused by the layoutget on open code.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 2409a976a2 pnfs: Add LAYOUTGET to OPEN of a new file
This triggers when have no pre-existing inode to attach to.
The preexisting case is saved for later.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 5e36e2a941 pnfs: Change pnfs_alloc_init_layoutget_args call signature
Don't send in a layout, instead use the (possibly NULL) inode.

This is needed for LAYOUTGET attached to an OPEN where the inode is not
yet set.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 1b146fcff7 pnfs: Move nfs4_opendata into nfs4_fs.h
It will be needed now by the pnfs code.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 56f487f8c8 pnfs: Add conditional encode/decode of LAYOUTGET within OPEN compound
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman dacb452db8 pnfs: move allocations out of nfs4_proc_layoutget
They work better in the new alloc_init function.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 587f03deb6 pnfs: refactor send_layoutget
Pull out the alloc/init part for eventual reuse by OPEN.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman f86c3ac502 pnfs: Add layout driver flag PNFS_LAYOUTGET_ON_OPEN
Driver can set flag to allow LAYOUTGET to be sent with OPEN.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 3b65a30df9 NFS4: move ctx into nfs4_run_open_task
Preparing to add conditional LAYOUTGET to OPEN rpc, the LAYOUTGET
will need the ctx info.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman 808ba32abe pnfs: Store return value of decode_layoutget for later processing
This will be needed to seperate return value of OPEN and LAYOUTGET
when they are combined into a single RPC.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:10 -04:00
Fred Isaman 34ec9aac7d pnfs: Remove redundant assignment from nfs4_proc_layoutget().
nfs_init_sequence() will clear this for us.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:10 -04:00
Benjamin Coddington a3cf9bca2a NFSv4: Don't add a new lock on an interrupted wait for LOCK
If the wait for a LOCK operation is interrupted, and then the file is
closed, the locks cleanup code will assume that no new locks will be added
to the inode after it has completed.  We already have a mechanism to detect
if there was signal, so let's use that to avoid recreating the local lock
once the RPC completes.  Also skip re-sending the LOCK operation for the
various error cases if we were signaled.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[Trond: Fix inverted test of locks_lock_inode_wait()]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust cf61eb2686 NFSv4: Always clear the pNFS layout when handling ESTALE
If we get an ESTALE error in response to an RPC call operating on the
file on the MDS, we should immediately cancel the layout for that file.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Dave Wysochanski d68894800e NFSv4: Fix possible 1-byte stack overflow in nfs_idmap_read_and_verify_message
In nfs_idmap_read_and_verify_message there is an incorrect sprintf '%d'
that converts the __u32 'im_id' from struct idmap_msg to 'id_str', which
is a stack char array variable of length NFS_UINT_MAXLEN == 11.
If a uid or gid value is > 2147483647 = 0x7fffffff, the conversion
overflows into a negative value, for example:
crash> p (unsigned) (0x80000000)
$1 = 2147483648
crash> p (signed) (0x80000000)
$2 = -2147483648
The '-' sign is written to the buffer and this causes a 1 byte overflow
when the NULL byte is written, which corrupts kernel stack memory.  If
CONFIG_CC_STACKPROTECTOR_STRONG is set we see a stack-protector panic:

[11558053.616565] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa05b8a8c
[11558053.639063] CPU: 6 PID: 9423 Comm: rpc.idmapd Tainted: G        W      ------------ T 3.10.0-514.el7.x86_64 #1
[11558053.641990] Hardware name: Red Hat OpenStack Compute, BIOS 1.10.2-3.el7_4.1 04/01/2014
[11558053.644462]  ffffffff818c7bc0 00000000b1f3aec1 ffff880de0f9bd48 ffffffff81685eac
[11558053.646430]  ffff880de0f9bdc8 ffffffff8167f2b3 ffffffff00000010 ffff880de0f9bdd8
[11558053.648313]  ffff880de0f9bd78 00000000b1f3aec1 ffffffff811dcb03 ffffffffa05b8a8c
[11558053.650107] Call Trace:
[11558053.651347]  [<ffffffff81685eac>] dump_stack+0x19/0x1b
[11558053.653013]  [<ffffffff8167f2b3>] panic+0xe3/0x1f2
[11558053.666240]  [<ffffffff811dcb03>] ? kfree+0x103/0x140
[11558053.682589]  [<ffffffffa05b8a8c>] ? idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.689710]  [<ffffffff810855db>] __stack_chk_fail+0x1b/0x30
[11558053.691619]  [<ffffffffa05b8a8c>] idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.693867]  [<ffffffffa00209d6>] rpc_pipe_write+0x56/0x70 [sunrpc]
[11558053.695763]  [<ffffffff811fe12d>] vfs_write+0xbd/0x1e0
[11558053.702236]  [<ffffffff810acccc>] ? task_work_run+0xac/0xe0
[11558053.704215]  [<ffffffff811fec4f>] SyS_write+0x7f/0xe0
[11558053.709674]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

Fix this by calling the internally defined nfs_map_numeric_to_string()
function which properly uses '%u' to convert this __u32.  For consistency,
also replace the one other place where snprintf is called.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Stephen Johnston <sjohnsto@redhat.com>
Fixes: cf4ab538f1 ("NFSv4: Fix the string length returned by the idmapper")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust d554168f87 NFS: Fix up nfs_post_op_update_inode() to force ctime updates
We do not want to ignore ctime updates that originate from functions
such as link().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust 472f761e11 NFS: Ensure we revalidate the inode correctly after setacl
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust 59a707b0d4 NFS: Ensure we revalidate the inode correctly after remove or rename
We may need to revalidate the change attribute, ctime and the nlinks count.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust 821a868a23 NFS: Set the force revalidate flag if the inode is not completely initialised
Ensure that a delegation doesn't cause us to skip initialising the inode
if it was incomplete when we exited nfs_fhget()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust 3cb3fd6da4 NFS: Fix up sillyrename()
Ensure that we register the fact that the inode ctime has changed.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust ed7e9ad090 NFSv4: Fix sillyrename to return the delegation when appropriate
Ensure that we pass down the inode of the file being deleted so
that we can return any delegation being held.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust 991eedb137 NFSv4: Only pass the delegation to setattr if we're sending a truncate
Even then it isn't really necessary. The reason why we may not want to
pass in a stateid in other cases is that we cannot use the delegation
credential.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker 2f261020b6 NFS: Merge nfs41_free_stateid() with _nfs41_free_stateid()
Having these exist as two functions doesn't seem to add anything useful,
and I think merging them together makes this easier to follow.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker fba83f3411 NFS: Pass "privileged" value to nfs4_init_sequence()
We currently have a separate function just to set this, but I think it
makes more sense to set it at the same time as the other values in
nfs4_init_sequence()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker e9ae1ee2b2 NFS: Move call to nfs4_state_protect() to nfs4_commit_setup()
Rather than doing this in the generic NFS client code.  Let's put this
with the other v4 stuff so it's all in one place.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker fb91fb0ee7 NFS: Move call to nfs4_state_protect_write() to nfs4_write_setup()
This doesn't really need to be in the generic NFS client code, and I
think it makes more sense to keep the v4 code in one place.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
NeilBrown e04bbf6b1b NFS: Avoid quadratic search when freeing delegations.
There are three places that walk all delegation for an nfs_client and
restart whenever they find something interesting - potentially
resulting in a quadratic search:  If there are 10,000 uninteresting
delegations followed by 10,000 interesting one, then the code
skips over 100,000,000 delegations, which can take a noticeable amount
of time.

Of these nfs_delegation_reap_unclaimed() and
nfs_reap_expired_delegations() are only called during unusual events:
a server reboots or reports expired delegations, probably due to a
network partition.  Optimizing these is not particularly important.

The third, nfs_client_return_marked_delegations(), is called
periodically via nfs_expire_unreferenced_delegations().  It could
cause periodic problems on a busy server.

New delegations are added to the end of the list, so if there are
10,000 open files with delegations, and 10,000 more recently opened files
that received delegations but are now closed, then
nfs_client_return_marked_delegations() can take seconds to skip over
the 10,000 open files 10,000 times.  That is a waste of time.

The avoid this waste a place-holder (an inode) is kept when locks are
dropped, so that the place can usually be found again after taking
rcu_readlock().  This place holder ensure that we find the right
starting point in the list of nfs_servers, and makes is probable that
we find the right starting point in the list of delegations.
We might need to occasionally restart at the head of that list.

It might be possible that the place_holder inode could lose its
delegation separately, and then get a new one using the same (freed
and then reallocated) 'struct nfs_delegation'.  Were this to happen,
the new delegation would be at the end of the list and we would miss
returning some other delegations.  This would have the effect of
unnecessarily delaying the return of some unused delegations until the
next time this function is called - typically 90 seconds later.  As
this is not a correctness issue and is vanishingly unlikely to happen,
it does not seem worth addressing.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:14 -04:00
NeilBrown 3ca951b618 NFS: use cond_resched() when restarting walk of delegation list.
In three places we walk the list of delegations for an nfs_client
until an interesting one is found, then we act of that delegation
and restart the walk.

New delegations are added to the end of a list and the interesting
delegations are usually old, so in many case we won't repeat
a long walk over and over again, but it is possible - particularly if
the first server in the list has a large number of uninteresting
delegations.

In each cache the work done on interesting delegations will often
complete without sleeping, so this could loop many times without
giving up the CPU.

So add a cond_resched() at an appropriate point to avoid hogging the
CPU for too long.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 14:59:19 -04:00
NeilBrown f389349142 NFS: slight optimization for walking list for delegations
There are 3 places where we walk the list of delegations
for an nfs_client.
In each case there are two nested loops, one for nfs_servers
and one for nfs_delegations.

When we find an interesting delegation we try to get an active
reference to the server.  If that fails, it is pointless to
continue to look at the other delegation for the server as
we will never be able to get an active reference.
So instead of continuing in the inner loop, break out
and continue in the outer loop.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 14:59:19 -04:00
Trond Myklebust 9f6d44d418 NFS: Optimise away lookups for rename targets
We can optimise away any lookup for a rename target, unless we're
being asked to revalidate a dentry that might be in use.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-28 13:29:19 -04:00
Trond Myklebust 73dd684a4d NFS: If the VFS sets LOOKUP_REVAL then force a lookup of the dentry
If nfs_lookup_revalidate() is called with LOOKUP_REVAL because a
previous path lookup failed, then we ought to force a full lookup
of the component name.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-28 13:29:19 -04:00
Trond Myklebust 479219218f NFS: Optimise away the close-to-open GETATTR when we have NFSv4 OPEN
NFSv4 should not need to perform an extra close-to-open GETATTR as part
of the process of looking up a regular file, since the OPEN call will
do that for us.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-28 13:29:19 -04:00
Deepa Dinamani 0a2dfbecb3 fs: nfs: get rid of memcpys for inode times
Subsequent patches in the series convert inode timestamps
to use struct timespec64 instead of struct timespec as
part of solving the y2038 problem.
This will lead to type mismatch for memcpys.
Use regular assignments instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: trond.myklebust@primarydata.com
2018-05-25 15:31:13 -07:00
Christoph Hellwig c350637227 proc: introduce proc_create_net{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release.  All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Linus Torvalds a1bf4c7da6 NFS client updates for Linux 4.17
Stable bugfixes:
 - xprtrdma: Fix corner cases when handling device removal # v4.12+
 - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
 
 Features:
 - New sunrpc tracepoint for RPC pings
 - Finer grained NFSv4 attribute checking
 - Don't unnecessarily return NFS v4 delegations
 
 Other bugfixes and cleanups:
 - Several other small NFSoRDMA cleanups
 - Improvements to the sunrpc RTT measurements
 - A few sunrpc tracepoint cleanups
 - Various fixes for NFS v4 lock notifications
 - Various sunrpc and NFS v4 XDR encoding cleanups
 - Switch to the ida_simple API
 - Fix NFSv4.1 exclusive create
 - Forget acl cache after setattr operation
 - Don't advance the nfs_entry readdir cookie if xdr decoding fails
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlrNG1IACgkQ18tUv7Cl
 QOvotw//fQoUgQ/AOJGlZo/4ws2mGJN3dfwwKM8xYOnHaxppOYubZRHwvswK8d22
 +XR/Q6IVbUxI3mJluv1L0d9CJT06s3c9CO90McIJbk4CWihGP19bNIY4JiPlzrbv
 4FDiyOvMBej2UXbHX5EzKj0srxyBoEVf3iUAIa6DaHi3c6EIUo6fP3d2eRNJStqd
 WMyZs+nqr2W9biyClxntT7l/Sk+o+4I7M3Oo9pjjS+PiePYdaMrL5T1kPeHaJshF
 GMGXkbvVdqpDRiXX84R9+2/nuSiA15eEnaR94UNvs84oLR3qob3ZhxhudqFdSPrX
 RS6E7m34gY/EaQm/wbB26PZm+3jHd4Pqm5SKLbyFfoCmG6oMwBvXNRJZas1DFaHM
 CMOECvfAr6kixVLkAN0MNQ2Ku/FuJ52OLP1dRLmxsblocnhEPujc6RSz6Ju/v3a0
 adbpmJMA2IoSGgXMu3g1VGnjHfMj7ZmjtpigXVvlcUqQGCL7t4ngh23cpeTQeJ76
 bMwSHUQu18NbmtJjBTE+PIm7mdCrpQD7ZuOPWpK62zxLYUnnv7nm75m84DrDru7d
 XAmrCmdUJNrVWQs6BAtCXgO4PZ6xNGLosb0xTQXTAQYftc+DRJ9SW/VGc0Mp1L9m
 0G0iz++b8cy4Pih5UCDJcCkpjCIvHLcn72zn1kbufWqG3xr2koc=
 =IlWo
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - xprtrdma: Fix corner cases when handling device removal # v4.12+
   - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+

  Features:
   - New sunrpc tracepoint for RPC pings
   - Finer grained NFSv4 attribute checking
   - Don't unnecessarily return NFS v4 delegations

  Other bugfixes and cleanups:
   - Several other small NFSoRDMA cleanups
   - Improvements to the sunrpc RTT measurements
   - A few sunrpc tracepoint cleanups
   - Various fixes for NFS v4 lock notifications
   - Various sunrpc and NFS v4 XDR encoding cleanups
   - Switch to the ida_simple API
   - Fix NFSv4.1 exclusive create
   - Forget acl cache after setattr operation
   - Don't advance the nfs_entry readdir cookie if xdr decoding fails"

* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
  NFS: advance nfs_entry cookie only after decoding completes successfully
  NFSv3/acl: forget acl cache after setattr
  NFSv4.1: Fix exclusive create
  NFSv4: Declare the size up to date after it was set.
  nfs: Use ida_simple API
  NFSv4: Fix the nfs_inode_set_delegation() arguments
  NFSv4: Clean up CB_GETATTR encoding
  NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
  NFSv4: Add a helper to encode/decode struct timespec
  NFSv4: Clean up encode_attrs
  NFSv4; Clean up XDR encoding of type bitmap4
  NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
  SUNRPC: Add a helper for encoding opaque data inline
  SUNRPC: Add helpers for decoding opaque and string types
  NFSv4: Ignore change attribute invalidations if we hold a delegation
  NFS: More fine grained attribute tracking
  NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
  NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
  NFS: Don't force a revalidation of all attributes if change is missing
  NFS: Convert NFS_INO_INVALID flags to unsigned long
  ...
2018-04-12 12:55:50 -07:00
Frank Sorenson 98de9ce6f6 NFS: advance nfs_entry cookie only after decoding completes successfully
In nfs[34]_decode_dirent, the cookie is advanced as soon as it is
read, but decoding may still fail later in the function, returning
an error.  Because the cookie has been advanced, the failing entry
is not re-requested from the server, resulting in a missing directory
entry.

In addition, nfs v3 and v4 read the cookie at different locations
in the xdr_stream, so the behavior of the two can be inconsistent.

Fix these by reading the cookie into a temporary variable, and
only advancing the cookie once the entire entry has been decoded
from the xdr_stream successfully.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
chendt dbc898ae10 NFSv3/acl: forget acl cache after setattr
Sync of ACL with std permissions fail,We need to forget the ACL cache after setattr.

Reproduction:
#!/bin/bash
touch testfile
cat <<EOF >testfile
#!/bin/bash
echo "Test was executed"
EOF
chmod u=rwx testfile
chmod g=rw- testfile
chmod o=r-- testfile

chacl u::r--,g::rwx,o:rw- testfile
chmod u+w testfile
ls -l testfile
chacl -l testfile

Output:
-rw-rwxrw- 1 root root 0 Mar 28 05:29 testfile
testfile [u::r--,g::rwx,o::rw-]

Signed-off-by: chendt.fnst <chendt.fnst@cn.fujitsu.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Kinglong Mee <Kinglong Mee>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 609339c123 NFSv4.1: Fix exclusive create
When we use EXCLUSIVE4_1 mode, the server returns an attribute mask where
all the bits indicate which attributes were set, and where the verifier
was stored. In order to figure out which attribute we have to resend,
we need to clear out the attributes that are set in exclcreat_bitmask.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[Anna: Fixed typo NFS4_CREATE_EXCLUSIVE4 -> NFS4_CREATE_EXCLUSIVE]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust f6cdfa6dd6 NFSv4: Declare the size up to date after it was set.
When we've changed the file size, then ensure we declare it to be
up to date in the inode attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Matthew Wilcox aae5730e2d nfs: Use ida_simple API
Allocate the owner_id when we allocate the state and free it when we free
the state.  That lets us get rid of a gnarly ida_pre_get() / ida_get_new()
loop.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 35156bfff3 NFSv4: Fix the nfs_inode_set_delegation() arguments
Neither nfs_inode_set_delegation() nor nfs_inode_reclaim_delegation() are
generic code. They have no business delving into NFSv4 OPEN xdr structures,
so let's replace the "struct nfs_openres" parameter.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 8b06494624 NFSv4: Clean up CB_GETATTR encoding
Replace the open coded bitmap implementation with a generic one.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 8bcbe7d98c NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
If we hold a delegation, then the results of the ACCESS call are protected
anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 36b3743fef NFSv4: Add a helper to encode/decode struct timespec
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 40a3426c75 NFSv4: Clean up encode_attrs
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 37c88763de NFSv4; Clean up XDR encoding of type bitmap4
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust e8d8aa46be NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust d943f2dd8d NFSv4: Ignore change attribute invalidations if we hold a delegation
Don't bother even recording an invalid change attribute if we hold a
delegation since we already know the state of our attribute cache.
We can rely on the fact that we will pick up a copy from the server
when we return the delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 16e1437517 NFS: More fine grained attribute tracking
Currently, if the NFS_INO_INVALID_ATTR flag is set, for instance by
a call to nfs_post_op_update_inode_locked(), then it will not be cleared
until all the attributes have been revalidated. This means, for instance,
that NFSv4 writes will always force a full attribute revalidation.

Track the ctime, mtime, size and change attribute separately from the
other attributes so that we can have nfs_post_op_update_inode_locked()
set them correctly, and later have the cache consistency bitmask be
able to clear them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust cac88f942d NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
If we managed to revalidate all the attributes, then there is no reason
to mark them as invalid again. We do, however want to ensure that we
set nfsi->attrtimeo correctly.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 783b194c6e NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
If we received weak cache consistency data from the server, then those
attributes are up to date, and there is no reason to mark them as
dirty in the attribute cache.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 8619ddd07b NFS: Don't force a revalidation of all attributes if change is missing
Even if the change attribute is missing, it is still OK to mark the other
attributes as being up to date.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust c01d36457d NFSv4: Don't return the delegation when not needed by NFSv4.x (x>0)
Starting with NFSv4.1, the server is able to deduce the client id from
the SEQUENCE op which means it can always figure out whether or not
the client is holding a delegation on a file that is being changed.
For that reason, RFC5661 does not require a delegation to be unconditionally
recalled on operations such as SETATTR, RENAME, or REMOVE.

Note that for now, we continue to return READ delegations since that is
still expected by the Linux knfsd server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust c135cb39a9 NFS: Remove the unused return_delegation() callback
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 199366f017 NFS: Move the delegation return down into _nfs4_do_setattr()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 977fcc2b0b NFS: Add a delegation return into nfs4_proc_unlink_setup()
Ensure that when we do finally delete the file, then we return the
delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust f2c2c552f1 NFS: Move delegation recall into the NFSv4 callback for rename_setup()
Move the delegation recall out of the generic code, and into the NFSv4
specific callback.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 912678dbc5 NFS: Move the delegation return down into nfs4_proc_remove()
Move the delegation return out of generic code and down into the
NFSv4 specific unlink code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust 9f76827287 NFS: Move the delegation return down into nfs4_proc_link()
Move the delegation return out of generic code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Trond Myklebust f50862423f NFSv4: Fix nfs4_return_incompatible_delegation
The 'fmode' argument can take an FMODE_EXEC value, which we want to
filter out before comparing to the delegation type.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Jeff Layton 571745935b nfs4: wake any lock waiters on successful RECLAIM_COMPLETE
If we have a RECLAIM_COMPLETE with a populated cl_lock_waitq, then
that implies that a reconnect has occurred. Since we can't expect a
CB_NOTIFY_LOCK callback at that point, just wake up the entire queue
so that all the tasks can re-poll for their locks.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Jeff Layton 5656610325 nfs4: don't compare clientid in nfs4_wake_lock_waiter
The task is expected to sleep for a while here, and it's possible that
a new EXCHANGE_ID has occurred in the interim, and we were assigned a
new clientid. Since this is a per-client list, there isn't a lot of
value in vetting the clientid on the incoming request.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Jeff Layton 41a7462018 nfs4: always reset notified flag to false before repolling for lock
We may get a notification and lose the race to another client. Ensure
that we wait again for a notification in that case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10 16:06:22 -04:00
Linus Torvalds 62f8e6c5dc fscache development
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWsdxrvu3V2unywtrAQJVmQ/9Fv8d/Ecdwv5nxVBmN7uA8lOYcHEbZWmd
 FhFQE8qYLjKMo9Fy4tPkBbu1l6CVnetaTRE5qwixACJAftrdjABKJAazGR3Uxief
 0jMSWScrV1XCeRErPcczHcx52Hefl8f1DQdA3zpoF0ewz7CjyxMxkl67bsYJbNKE
 T4ebCu5IJk+5PPwwMM3REKjQbunSXXnzgCLUI2cc0Yf76CTVpx6p+NpxV+2wq0p7
 vym83F68qACAEzNH+oozN7IwqjkWyYOnTtCLiMsh4iq30jP6ohtLom6RcRp7QUxM
 Z9hxgG3NptypuVBO1jKxaQ6XZGgAasYmppOmJ/SoALv2PKsAbxi372lTR4ikceKq
 H4oNTbs5tVmyvu3qFwtLN+vX+GdfaoSUnUG8vTvnCB3tHHtYj7q5QeFE0HaX4QSq
 oLANkCOZU8TJsT30pxsCNYiqc5HK9kaLjUQId9K+xq7mM/IuhtNtBQ+ZpqAh5IxB
 4bXKYLdeJ1myZrkYTa6gcTqeFax3djCBJ3UvjTnuqRZAaQg079WkG84Kdq1ZjDRp
 IQpKQnPX9JGhjW1zqLK1Ay8h+HFPgWR5BBVOaLwImr1mH+ccG0iNIeDjrOc8h6J5
 e60XM/x2dIYxpXyFYAkldbAI24aRg1FNzfniG4rSAPecf3SwWrxg/qK7uujLbJHM
 fKNA80yifHo=
 =ukqs
 -----END PGP SIGNATURE-----

Merge tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache updates from David Howells:
 "Three patches that fix some of AFS's usage of fscache:

   (1) Need to invalidate the cache if a foreign data change is detected
       on the server.

   (2) Move the vnode ID uniquifier (equivalent to i_generation) from
       the auxiliary data to the index key to prevent a race between
       file delete and a subsequent file create seeing the same index
       key.

   (3) Need to retire cookies that correspond to files that we think got
       deleted on the server.

  Four patches to fix some things in fscache and cachefiles:

   (4) Fix a couple of checker warnings.

   (5) Correctly indicate to the end-of-operation callback whether an
       operation completed or was cancelled.

   (6) Add a check for multiple cookie relinquishment.

   (7) Fix a path through the asynchronous write that doesn't wake up a
       waiter for a page if the cache decides not to write that page,
       but discards it instead.

  A couple of patches to add tracepoints to fscache and cachefiles:

   (8) Add tracepoints for cookie operators, object state machine
       execution, cachefiles object management and cachefiles VFS
       operations.

   (9) Add tracepoints for fscache operation management and page
       wrangling.

  And then three development patches:

  (10) Attach the index key and auxiliary data to the cookie, pass this
       information through various fscache-netfs API functions and get
       rid of the callbacks to the netfs to get it.

       This means that the cache can get at this information, even if
       the netfs goes away. It also means that the cache can be lazy in
       updating the coherency data.

  (11) Pass the object data size through various fscache-netfs API
       rather than calling back to the netfs for it, and store the value
       in the object.

       This makes it easier to correctly resize the object, as the size
       is updated on writes to the cache, rather than calling back out
       to the netfs.

  (12) Maintain a catalogue of allocated cookies. This makes it possible
       to catch cookie collision up front rather than down in the bowels
       of the cache being run from a service thread from the object
       state machine.

       This will also make it possible in the future to reconnect to a
       cookie that's not gone dead yet because it's waiting for
       finalisation of the storage and also make it possible to bring
       cookies online if the cache is added after the cookie has been
       obtained"

* tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: Maintain a catalogue of allocated cookies
  fscache: Pass object size in rather than calling back for it
  fscache: Attach the index key and aux data to the cookie
  fscache: Add more tracepoints
  fscache: Add tracepoints
  fscache: Fix hanging wait on page discarded by writeback
  fscache: Detect multiple relinquishment of a cookie
  fscache: Pass the correct cancelled indications to fscache_op_complete()
  fscache, cachefiles: Fix checker warnings
  afs: Be more aggressive in retiring cached vnodes
  afs: Use the vnode ID uniquifier in the cache key not the aux data
  afs: Invalidate cache on server data change
2018-04-07 09:08:24 -07:00
Linus Torvalds 9022ca6b11 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, including Christoph's I_DIRTY patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: move I_DIRTY_INODE to fs.h
  ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
  gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
  fs: fold open_check_o_direct into do_dentry_open
  vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
  vfs: make sure struct filename->iname is word-aligned
  get rid of pointless includes of fs_struct.h
  [poll] annotate SAA6588_CMD_POLL users
2018-04-06 11:07:08 -07:00
David Howells ee1235a9a0 fscache: Pass object size in rather than calling back for it
Pass the object size in to fscache_acquire_cookie() and
fscache_write_page() rather than the netfs providing a callback by which it
can be received.  This makes it easier to update the size of the object
when a new page is written that extends the object.

The current object size is also passed by fscache to the check_aux
function, obviating the need to store it in the aux data.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
2018-04-06 14:05:14 +01:00
David Howells 402cb8dda9 fscache: Attach the index key and aux data to the cookie
Attach copies of the index key and auxiliary data to the fscache cookie so
that:

 (1) The callbacks to the netfs for this stuff can be eliminated.  This
     can simplify things in the cache as the information is still
     available, even after the cache has relinquished the cookie.

 (2) Simplifies the locking requirements of accessing the information as we
     don't have to worry about the netfs object going away on us.

 (3) The cache can do lazy updating of the coherency information on disk.
     As long as the cache is flushed before reboot/poweroff, there's no
     need to update the coherency info on disk every time it changes.

 (4) Cookies can be hashed or put in a tree as the index key is easily
     available.  This allows:

     (a) Checks for duplicate cookies can be made at the top fscache layer
     	 rather than down in the bowels of the cache backend.

     (b) Caching can be added to a netfs object that has a cookie if the
     	 cache is brought online after the netfs object is allocated.

A certain amount of space is made in the cookie for inline copies of the
data, but if it won't fit there, extra memory will be allocated for it.

The downside of this is that live cache operation requires more memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
2018-04-04 13:41:28 +01:00
Linus Torvalds ce6eba3dba Merge branch 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull wait_var_event updates from Ingo Molnar:
 "This introduces the new wait_var_event() API, which is a more flexible
  waiting primitive than wait_on_atomic_t().

  All wait_on_atomic_t() users are migrated over to the new API and
  wait_on_atomic_t() is removed. The migration fixes one bug and should
  result in no functional changes for the other usecases"

* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/wait: Improve __var_waitqueue() code generation
  sched/wait: Remove the wait_on_atomic_t() API
  sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait: Introduce wait_var_event()
2018-04-02 16:50:39 -07:00
Peter Zijlstra 723c921e7d sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:21 +01:00
Linus Torvalds df09348f78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - backport-friendly part of lock_parent() race fix

 - a fix for an assumption in the heurisic used by path_connected() that
   is not true on NFS

 - livelock fixes for d_alloc_parallel()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Teach path_connected to handle nfs filesystems with multiple roots.
  fs: dcache: Use READ_ONCE when accessing i_dir_seq
  fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
  lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
2018-03-15 18:57:14 -07:00
Eric W. Biederman 95dd77580c fs: Teach path_connected to handle nfs filesystems with multiple roots.
On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem.  The subsets can be from
disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
way to find the common root of all directory trees exported form the
server with the same filesystem identifier.

The practical result is that in struct super s_root for nfs s_root is
not necessarily the root of the filesystem.  The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.

This effects the dcache invalidation code in generic_shutdown_super
currently called shrunk_dcache_for_umount and that code for years
has gone through an additional list of dentries that might be dentry
trees that need to be freed to accomodate nfs.

When I wrote path_connected I did not realize nfs was so special, and
it's hueristic for avoiding calling is_subdir can fail.

The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount.  This move can happen either locally or
remotely.  With the remote case requiring that the move directory be cached
before the move and that after the move someone walks the path
to where the move directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.

If someone whose working directory is in the move directory or a
subdirectory and now starts calling .. from the initial mount of nfs
(where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check.  As s_root really is not the root
of the nfs filesystem this heuristic is wrong, and the path may
actually not be connected and path_connected can fail.

The is_subdir function might be cheap enough that we can call it
unconditionally.  Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to.  So I am avoiding that for now.

Filesystems with snapshots such as nilfs and btrfs do something
similar.  But as the directory tree of the snapshots are disjoint
from one another and from the main directory tree rename won't move
things between them and this problem will not occur.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc2 ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-15 18:48:38 -04:00
Trond Myklebust c4f24df942 NFS: Fix unstable write completion
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.

In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: dc4fd9ab01 ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:32 -05:00
Trond Myklebust 9c6376ebdd pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
Ensure that we hold a reference to the layout header when processing
the pNFS return-on-close so that the refcount value does not inadvertently
go to zero.

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.10+
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
2018-03-08 12:56:31 -05:00
Trond Myklebust d9ee65539d NFS: Fix an incorrect type in struct nfs_direct_req
The start offset needs to be of type loff_t.

Fixed: 5fadeb47dc ("nfs: count DIO good bytes correctly with mirroring")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:31 -05:00
Al Viro 304ec482f5 get rid of pointless includes of fs_struct.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-02-22 14:28:50 -05:00
Colin Ian King 1b72040645 NFS: make struct nlmclnt_fl_close_lock_ops static
The structure nlmclnt_fl_close_lock_ops s local to the source and does
not need to be in global scope, so make it static.

Cleans up sparse warning:
fs/nfs/nfs3proc.c:876:33: warning: symbol 'nlmclnt_fl_close_lock_ops' was not
declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-22 12:23:01 -05:00
Bill.Baker@oracle.com ad86f605c5 nfs: system crashes after NFS4ERR_MOVED recovery
nfs4_update_server unconditionally releases the nfs_client for the
source server. If migration fails, this can cause the source server's
nfs_client struct to be left with a low reference count, resulting in
use-after-free.  Also, adjust reference count handling for ELOOP.

NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6
WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]()
	nfs_put_client+0xfa/0x110 [nfs]
	nfs4_run_state_manager+0x30/0x40 [nfsv4]
	kthread+0xd8/0xf0

BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
	nfs4_xdr_enc_write+0x6b/0x160 [nfsv4]
	rpcauth_wrap_req+0xac/0xf0 [sunrpc]
	call_transmit+0x18c/0x2c0 [sunrpc]
	__rpc_execute+0xa6/0x490 [sunrpc]
	rpc_async_schedule+0x15/0x20 [sunrpc]
	process_one_work+0x160/0x470
	worker_thread+0x112/0x540
	? rescuer_thread+0x3f0/0x3f0
	kthread+0xd8/0xf0

This bug was introduced by 32e62b7c ("NFS: Add nfs4_update_server"),
but the fix applies cleanly to 52442f9b ("NFS4: Avoid migration loops")

Reported-by: Helen Chao <helen.chao@oracle.com>
Fixes: 52442f9b11 ("NFS4: Avoid migration loops")
Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-22 12:17:42 -05:00
Trond Myklebust 6d243a2356 NFSv4: Fix broken cast in nfs4_callback_recallany()
Passing a pointer to a unsigned integer to test_bit() is broken.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-02-21 16:35:50 -05:00
Goffredo Baroncelli c472c07bfe iversion: Rename make inode_cmp_iversion{+raw} to inode_eq_iversion{+raw}
The function inode_cmp_iversion{+raw} is counter-intuitive, because it
returns true when the counters are different and false when these are equal.

Rename it to inode_eq_iversion{+raw}, which will returns true when
the counters are equal and false otherwise.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-02-01 08:15:25 -05:00
Linus Torvalds 19e7b5f994 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of misc stuff, without any unifying topic, from various
  people.

  Neil's d_anon patch, several bugfixes, introduction of kvmalloc
  analogue of kmemdup_user(), extending bitfield.h to deal with
  fixed-endians, assorted cleanups all over the place..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  alpha: osf_sys.c: use timespec64 where appropriate
  alpha: osf_sys.c: fix put_tv32 regression
  jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
  dcache: delete unused d_hash_mask
  dcache: subtract d_hash_shift from 32 in advance
  fs/buffer.c: fold init_buffer() into init_page_buffers()
  fs: fold __inode_permission() into inode_permission()
  fs: add RWF_APPEND
  sctp: use vmemdup_user() rather than badly open-coding memdup_user()
  snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
  replace_user_tlv(): switch to vmemdup_user()
  new primitive: vmemdup_user()
  memdup_user(): switch to GFP_USER
  eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
  eventfd: fold eventfd_ctx_read() into eventfd_read()
  eventfd: convert to use anon_inode_getfd()
  nfs4file: get rid of pointless include of btrfs.h
  uvc_v4l2: clean copyin/copyout up
  vme_user: don't use __copy_..._user()
  usx2y: don't bother with memdup_user() for 16-byte structure
  ...
2018-01-31 09:25:20 -08:00
Linus Torvalds efd52b5d36 NFS client updates for Linux 4.16
Highlights include:
 
 Stable bugfixes:
 - Fix breakages in the nfsstat utility due to the inclusion of the NFSv4
   LOOKUPP operation.
 - Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall() due to
   nfs_idmap_legacy_upcall() being called without an 'aux' parameter.
 - Fix a refcount leak in the standard O_DIRECT error path.
 - Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path.
 - Fix CPU latency issues with nfs_commit_release_pages()
 - Fix the LAYOUTUNAVAILABLE error case in the file layout type.
 - NFS: Fix a race between mmap() and O_DIRECT
 
 Features:
 - Support the statx() mask and query flags to enable optimisations when
   the user is requesting only attributes that are already up to date in
   the inode cache, or is specifying the AT_STATX_DONT_SYNC flag.
 - Add a module alias for the SCSI pNFS layout type.
 
 Bugfixes:
 - Automounting when resolving a NFSv4 referral should preserve the RDMA
   transport protocol settings.
 - Various other RDMA bugfixes from Chuck.
 - pNFS block layout fixes.
 - Always set NFS_LOCK_LOST when a lock is lost.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJacHcyAAoJEGcL54qWCgDy5WcP/Aw7GIAVZ+n5B+EIFVvEaWlC
 C++eA8vej433ezKRj8IExeCX1C8OKrQZ4iQ3nqIg0mVVQS/Qk+469OhYP9jDagA+
 tDZeOs3lbl1EUhUWT+GNxw8bZqKGn9fYzcWFjiYeFTjrcfLYfZTW50V0tofjgmlR
 3nQmpPx/56rXE9ZO/EW66HRZWauw7a0hg3/5Ft+F5csqIb3yQOlW8Osp3WClzGBF
 to4lS8/IwHvCn3qWAMuivRaMJDxeKrmoJNQh9Kw1Mw3+vurGAjmKo1a153qKPz4N
 7wjeP+o3ujc/P7WsJLCIgQRimzSm9FZXMqEVmz07+cIhGbERt2yy0RbHev8bpa+U
 3IMj70K9ciPuMZwrAtRAeZL+o9gxlUGUXvTaDUgo4DFgBw9Q5CnMnFn6a725l4h0
 nSZsE+bR8d4l/yEjf77SbTrk7atMLfUG1XnKH20i1CUjtd4CaLLzjn81TlbQrfuI
 XaFdJUUt63dPTIbhPEk7wHFcITkGZiyXhcepgbaXLiDH/3gyZmqTYzJ2EH14sOC5
 NaTueE3ASTiFChvG7jvc89HJN5SN5W11PyzI+GHezx9VkFnPZM3/q2V7oEX2aSld
 tkRlDMO4dVmpAA4LVAAarDr0a8ZsJQOdb+yLn21pbmKNAs1vol7tMfJe57ykEV6v
 WNAgtKJLtZE0Lh1UyEq0
 =5Vv2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:

   - Fix breakages in the nfsstat utility due to the inclusion of the
     NFSv4 LOOKUPP operation

   - Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall()
     due to nfs_idmap_legacy_upcall() being called without an 'aux'
     parameter

   - Fix a refcount leak in the standard O_DIRECT error path

   - Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path

   - Fix CPU latency issues with nfs_commit_release_pages()

   - Fix the LAYOUTUNAVAILABLE error case in the file layout type

   - NFS: Fix a race between mmap() and O_DIRECT

  Features:

   - Support the statx() mask and query flags to enable optimisations
     when the user is requesting only attributes that are already up to
     date in the inode cache, or is specifying the AT_STATX_DONT_SYNC
     flag

   - Add a module alias for the SCSI pNFS layout type

  Bugfixes:

   - Automounting when resolving a NFSv4 referral should preserve the
     RDMA transport protocol settings

   - Various other RDMA bugfixes from Chuck

   - pNFS block layout fixes

   - Always set NFS_LOCK_LOST when a lock is lost"

* tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
  NFS: Fix a race between mmap() and O_DIRECT
  NFS: Remove a redundant call to unmap_mapping_range()
  pnfs/blocklayout: Ensure disk address in block device map
  pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors
  lockd: Fix server refcounting
  SUNRPC: Fix null rpc_clnt dereference in rpc_task_queued tracepoint
  SUNRPC: Micro-optimize __rpc_execute
  SUNRPC: task_run_action should display tk_callback
  sunrpc: Format RPC events consistently for display
  SUNRPC: Trace xprt_timer events
  xprtrdma: Correct some documenting comments
  xprtrdma: Fix "bytes registered" accounting
  xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects
  xprtrdma: Add trace points to instrument QP and CQ access upcalls
  xprtrdma: Add trace points in the client-side backchannel code paths
  xprtrdma: Add trace points for connect events
  xprtrdma: Add trace points to instrument MR allocation and recovery
  xprtrdma: Add trace points to instrument memory invalidation
  xprtrdma: Add trace points in reply decoder path
  xprtrdma: Add trace points to instrument memory registration
  ..
2018-01-30 19:03:48 -08:00
Linus Torvalds a4b7fd7d34 inode->i_version rework for v4.16
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJabwjlAAoJEAAOaEEZVoIVeEEP/R84kZJjlZV/vNmFFvY46jM+
 0hpMHXRNym+nW1Du1CKNkesEUAY8ACAQIyzJh63Q72341QTDdz3+asHwPYRNOqdC
 PgryidPieojkNKQg+h7dmoKYlYh1xiCicvn66Q5PFb9B0lH36twekOK4X1qqJj8Z
 breRmRoFLka9looMSuYgwbErts023fmASalvGum6T0ZM/7F9hUj4O3OsQtKTLUNM
 VQ+gLJTQrUqrgzvWUwq3WTMa9YAaKP4oad8nsglNSpiVLG7WtURr5HokW9hAziqL
 k99Y+K2ni1wZJlNGJAyV7PyEG2ieI5Xn+LzM2RM+SndD1QHF2QXACmSTDYfL51k5
 G2RsKeTZvQPtX4qx9+vnCp/4oV6JduvCaq2Mt8SQb9nYZxKjs85TNLrARJv+85eQ
 zP0OTxlH1Gfu3j36n3cny4XemyMYYF4hCFYfRPqTGst37fgLBtfIfUSQ6jedoCK2
 Xcyb6ukGXMh6If/A7DSy91hvSSPrWSH7TPPsbfLy6o+wUOtpAGR4eXVlEuAiXrzc
 gnoAz85oIMUQae66LrdrPk1NyE59qOb24g/yU5gyRBSpi2+/aoboNCKaD73tgs/C
 XIMwGXLYmqkcud7IBQF0tHHiM+jsEkbSM4LUqRXSnqMdwNnS18Z4Q+JKqpdP0cii
 eRdenDvUfu8Gu1Y9vWBv
 =iihN
 -----END PGP SIGNATURE-----

Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull inode->i_version rework from Jeff Layton:
 "This pile of patches is a rework of the inode->i_version field. We
  have traditionally incremented that field on every inode data or
  metadata change. Typically this increment needs to be logged on disk
  even when nothing else has changed, which is rather expensive.

  It turns out though that none of the consumers of that field actually
  require this behavior. The only real requirement for all of them is
  that it be different iff the inode has changed since the last time the
  field was checked.

  Given that, we can optimize away most of the i_version increments and
  avoid dirtying inode metadata when the only change is to the i_version
  and no one is querying it. Queries of the i_version field are rather
  rare, so we can help write performance under many common workloads.

  This patch series converts existing accesses of the i_version field to
  a new API, and then converts all of the in-kernel filesystems to use
  it. The last patch in the series then converts the backend
  implementation to a scheme that optimizes away a large portion of the
  metadata updates when no one is looking at it.

  In my own testing this series significantly helps performance with
  small I/O sizes. I also got this email for Christmas this year from
  the kernel test robot (a 244% r/w bandwidth improvement with XFS over
  DAX, with 4k writes):

    https://lkml.org/lkml/2017/12/25/8

  A few of the earlier patches in this pile are also flowing to you via
  other trees (mm, integrity, and nfsd trees in particular)".

* tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
  fs: handle inode->i_version more efficiently
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  fs: only set S_VERSION when updating times if necessary
  IMA: switch IMA over to new i_version API
  xfs: convert to new i_version API
  ufs: use new i_version API
  ocfs2: convert to new i_version API
  nfsd: convert to new i_version API
  nfs: convert to new i_version API
  ext4: convert to new i_version API
  ext2: convert to new i_version API
  exofs: switch to new i_version API
  btrfs: convert to new i_version API
  afs: convert to new i_version API
  affs: convert to new i_version API
  fat: convert to new i_version API
  fs: don't take the i_lock in inode_inc_iversion
  fs: new API for handling inode->i_version
  ntfs: remove i_version handling
  ...
2018-01-29 13:33:53 -08:00
Jeff Layton 1eb5d98f16 nfs: convert to new i_version API
For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-29 06:42:21 -05:00
Trond Myklebust e231c6879c NFS: Fix a race between mmap() and O_DIRECT
When locking the file in order to do O_DIRECT on it, we must unmap
any mmapped ranges on the pagecache so that we can flush out the
dirty data.

Fixes: a5864c999d ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2018-01-28 22:00:15 -05:00
Trond Myklebust 128159f292 NFS: Remove a redundant call to unmap_mapping_range()
We don't need to call unmap_mapping_range() prior to calling
nfs_sync_mapping().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-28 09:35:54 -05:00
Benjamin Coddington f34462c3c8 pnfs/blocklayout: Ensure disk address in block device map
It's possible that the device map is smaller than the offset into the device
for the I/O we're adding.  Add a check for it and bail out, otherwise we
risk botching the bio calculations that follow.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
2018-01-25 16:42:35 -05:00
Benjamin Coddington b39604755c pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors
Fixup the field types to match their use.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
2018-01-25 16:42:35 -05:00
Eric Biggers 49686cbbb3 NFS: reject request for id_legacy key without auxdata
nfs_idmap_legacy_upcall() is supposed to be called with 'aux' pointing
to a 'struct idmap', via the call to request_key_with_auxdata() in
nfs_idmap_request_key().

However it can also be reached via the request_key() system call in
which case 'aux' will be NULL, causing a NULL pointer dereference in
nfs_idmap_prepare_pipe_upcall(), assuming that the key description is
valid enough to get that far.

Fix this by making nfs_idmap_legacy_upcall() negate the key if no
auxdata is provided.

As usual, this bug was found by syzkaller.  A simple reproducer using
the command-line keyctl program is:

    keyctl request2 id_legacy uid:0 '' @s

Fixes: 57e62324e4 ("NFS: Store the legacy idmapper result in the keyring")
Reported-by: syzbot+5dfdbcf7b3eb5912abbb@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
2018-01-22 10:05:11 -05:00
Jan Chochol cbebc6ef4f nfs: Do not convert nfs_idmap_cache_timeout to jiffies
Since commit 57e62324e4 ("NFS: Store the legacy idmapper result in the
keyring") nfs_idmap_cache_timeout changed units from jiffies to seconds.
Unfortunately sysctl interface was not updated accordingly.

As a effect updating /proc/sys/fs/nfs/idmap_cache_timeout with some
value will incorrectly multiply this value by HZ.
Also reading /proc/sys/fs/nfs/idmap_cache_timeout will show real value
divided by HZ.

Fixes: 57e62324e4 ("NFS: Store the legacy idmapper result in the keyring")
Signed-off-by: Jan Chochol <jan@chochol.info>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-18 15:10:47 -05:00
Chuck Lever 06e1902456 nfs: Use proper enum definitions for nfs_show_stable
Commit 8224b2734a ("NFS: Add static NFS I/O tracepoints") had a
hack to work around some odd behavior observed with
__print_symbolic. I couldn't ever get it to display NFS_FILE_SYNC
when using TRACE_DEFINE_ENUM macros to set up the enum values.

I tracked down the actual bug that forced me to add the workaround.
That issue will be addressed soon, so replace the hack with a proper
implementation.

Fixes: 8224b2734a ("NFS: Add static NFS I/O tracepoints")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-18 15:01:22 -05:00
Tigran Mkrtchyan 7ff4cff637 nfs41: do not return ENOMEM on LAYOUTUNAVAILABLE
A pNFS server may return LAYOUTUNAVAILABLE error on LAYOUTGET for files
which don't have any layout. In this situation pnfs_update_layout
currently returns NULL. As this NULL is converted into ENOMEM, IO
requests fails instead of falling back to MDS.

Do not return ENOMEM on LAYOUTUNAVAILABLE and let client retry through
MDS.

Fixes 8d40b0f148. I will suggest to backport this fix to affected
stable branches.

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
[trondmy: Use IS_ERR_OR_NULL()]
Fixes: 8d40b0f148 ("NFS filelayout:call GETDEVICEINFO after...")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-18 12:51:31 -05:00
J. Bruce Fields 1b8d97b0a8 NFS: commit direct writes even if they fail partially
If some of the WRITE calls making up an O_DIRECT write syscall fail,
we neglect to commit, even if some of the WRITEs succeed.

We also depend on the commit code to free the reference count on the
nfs_page taken in the "if (request_commit)" case at the end of
nfs_direct_write_completion().  The problem was originally noticed
because ENOSPC's encountered partway through a write would result in a
closed file being sillyrenamed when it should have been unlinked.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-16 10:13:23 -05:00
Arnd Bergmann f96adf1ea0 nfs: remove unused label in nfs_encode_fh()
The only reference to the label got removed, so we now get
a harmless compiler warning:

fs/nfs/export.c: In function 'nfs_encode_fh':
fs/nfs/export.c:58:1: error: label 'out' defined but not used [-Werror=unused-label]

Fixes: aaa1500894 ("nfs: remove dead code from nfs_encode_fh()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-16 10:12:49 -05:00
Chuck Lever 801b564309 nfs: Update server port after referral or migration
After traversing a referral or recovering from a migration event,
ensure that the server port reported in /proc/mounts is updated
to the correct port setting for the new submount.

Reported-by: Helen Chao <helen.chao@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:30 -05:00
Chuck Lever 530ea42192 nfs: Referrals should use the same proto setting as their parent
Helen Chao <helen.chao@oracle.com> noticed that when a user
traverses a referral on an NFS/RDMA mount, the resulting submount
always uses TCP.

This behavior does not match the vers= setting when traversing
a referral (vers=4.1 is preserved). It also does not match the
behavior of crossing from the pseudofs into a real filesystem
(proto=rdma is preserved in that case).

The Linux NFS client does not currently support the
fs_locations_info attribute. The situation is similar for all
NFSv4 servers I know of. Therefore until the community has broad
support for fs_locations_info, when following a referral:

 - First try to connect with RPC-over-RDMA. This will fail quickly
   if the client has no RDMA-capable interfaces.

 - If connecting with RPC-over-RDMA fails, or the RPC-over-RDMA
   transport is not available, use TCP.

Reported-by: Helen Chao <helen.chao@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:30 -05:00
Scott Mayhew ba4a76f703 nfs/pnfs: fix nfs_direct_req ref leak when i/o falls back to the mds
Currently when falling back to doing I/O through the MDS (via
pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
without releasing the reference taken on the dreq
via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
nfs_direct_pgio_init.  It then takes another reference on the dreq via
nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
as a result the requester will become stuck in inode_dio_wait.  Once
that happens, other processes accessing the inode will become stuck as
well.

Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
up correctly by calling hdr->completion_ops->completion() instead of
calling hdr->release() directly.

This can be reproduced (sometimes) by performing "storage failover
takeover" commands on NetApp filer while doing direct I/O from a client.

This can also be reproduced using SystemTap to simulate a failure while
doing direct I/O from a client (from Dave Wysochanski
<dwysocha@redhat.com>):

stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'

Suggested-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 1ca018d28d ("pNFS: Fix a memory leak when attempted pnfs fails")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington b3dce6a2f0 pnfs/blocklayout: handle transient devices
PNFS block/SCSI layouts should gracefully handle cases where block devices
are not available when a layout is retrieved, or the block devices are
removed while the client holds a layout.

While setting up a layout segment, keep a record of an unavailable or
un-parsable block device in cache with a flag so that subsequent layouts do
not spam the server with GETDEVINFO.  We can reuse the current
NFS_DEVICEID_UNAVAILABLE handling with one variation: instead of reusing
the device, we will discard it and send a fresh GETDEVINFO after the
timeout, since the lookup and validation of the device occurs within the
GETDEVINFO response handling.

A lookup of a layout segment that references an unavailable device will
return a segment with the NFS_LSEG_UNAVAILABLE flag set.  This will allow
the pgio layer to mark the layout with the appropriate fail bit, which
forces subsequent IO to the MDS, and prevents spamming the server with
LAYOUTGET, LAYOUTRETURN.

Finally, when IO to a block device fails, look up the block device(s)
referenced by the pgio header, and mark them as unavailable.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington d78471d32b pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR
If there's an error doing I/O to block device, and the client resends the
I/O to the MDS, the MDS must recall the layout from the client before
processing the I/O.  Let's preempt that exchange by returning the layout
before falling back to the MDS when there's an error.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington ad6b0241c9 pnfs/blocklayout: Add module alias for LAYOUT4_SCSI
The blocklayout module contains the client support for both block and SCSI
layouts.  Add a module alias for the SCSI layout type so that the module
will be loaded for SCSI layouts.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington e545735a32 NFS: remove unused offset arg in nfs_pgio_rpcsetup
nfs_pgio_rpcsetup() is always called with an offset of 0, so we should be
able to drop the arguement altogether.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
NeilBrown dce2630c7d NFSv4: always set NFS_LOCK_LOST when a lock is lost.
There are 2 comments in the NFSv4 code which suggest that
SIGLOST should possibly be sent to a process.  In these
cases a lock has been lost.
The current practice is to set NFS_LOCK_LOST so that
read/write returns EIO when a lock is lost.
So change these comments to code when sets NFS_LOCK_LOST.

One case is when lock recovery after apparent server restart
fails with NFS4ERR_DENIED, NFS4ERR_RECLAIM_BAD, or
NFS4ERRO_RECLAIM_CONFLICT.  The other case is when a lock
attempt as part of lease recovery fails with NFS4ERR_DENIED.

In an ideal world, these should not happen.  However I have
a packet trace showing an NFSv4.1 session getting
NFS4ERR_BADSESSION after an extended network parition.  The
NFSv4.1 client treats this like server reboot until/unless
it get NFS4ERR_NO_GRACE, in which case it switches over to
"nograce" recovery mode.  In this network trace, the client
attempts to recover a lock and the server (incorrectly)
reports NFS4ERR_DENIED rather than NFS4ERR_NO_GRACE.  This
leads to the ineffective comment and the client then
continues to write using the OPEN stateid.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
NeilBrown aaa1500894 nfs: remove dead code from nfs_encode_fh()
This code can never be used as the IS_AUTOMOUNT(inode)
case has already been handled.
So remove it to avoid confusion.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust 9ccee940bd Support statx() mask and query flags parameters
Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute
revalidation, and AT_STATX_DONT_SYNC by returning cached attributes
only.

Use the mask to optimise away server revalidation for attributes
that are not being requested by the user.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust 8634ef5e05 NFS: Fix nfsstat breakage due to LOOKUPP
The LOOKUPP operation was inserted into the nfs4_procedures array
rather than being appended, which put /proc/net/rpc/nfs out of
whack, and broke the nfsstat utility.
Fix by moving the LOOKUPP operation to the end of the array, and
by ensuring that it keeps the same length whether or not NFSV4.1
and NFSv4.2 are compiled in.

Fixes: 5b5faaf6df ("nfs4: add NFSv4 LOOKUPP handlers")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust 82571552a0 NFSv4: Convert LOCKU to use nfs4_async_handle_exception()
Convert CLOSE so that it specifies the correct stateid and
inode for the error handling.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust e0dba0128a NFSv4: Convert DELEGRETURN to use nfs4_handle_exception()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust b8b8d22109 NFSv4: Convert CLOSE to use nfs4_async_handle_exception()
Convert CLOSE so that it specifies the correct stateid, state and
inode for the error handling.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:28 -05:00
Trond Myklebust 7f1bda447c NFS: Add a cond_resched() to nfs_commit_release_pages()
The commit list can get very large, and so we need a cond_resched()
in nfs_commit_release_pages() in order to ensure we don't hog the CPU
for excessive periods of time.

Reported-by: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:28 -05:00
Al Viro 6db620012f nfs4file: get rid of pointless include of btrfs.h
should've been killed by "vfs: pull btrfs clone API to vfs layer"...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-30 00:03:39 -05:00
Linus Torvalds d025fbf1a2 NFS client fixes for Linux 4.15-rc4
Stable bugfixes:
 - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
        commit in the case that there were no commit requests.
 - SUNRPC: Fix a race in the receive code path
 
 Other fixes:
 - NFS: Fix a deadlock in nfs client initialization
 - xprtrdma: Fix a performance regression for small IOs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlo0PdMACgkQ18tUv7Cl
 QOvlUg/+KoXWXNwItHIyyegYgRXcAPpaCtdnCjjOP6R9HEJ+clnLcaqDxdDKVWQ/
 oDvEcQcsBpywbUi7vVrvdar4mofwuyjXPpbcZPlDP1Ru4yyAlyylftwIuQW/nzdd
 vX2tZaVf+B9y1XvSD5NI+2EKWmp7MVrPdNhYxAB39TQZnAAvYDFHhywtZ0UR7vJt
 7YVcZoPtKUhg15jhCOr73eaCT0884/tlgedfd6DkDGR6bCtSQC2PySfqq9Lnnl/1
 ruDzzcgTARzSEzvta/uyBRspOLBHeeBhTdQUp79lMfekC4+68Tx6DFWnydIUttuE
 G7LphN6hfbJLF20U/ENb2H8v10WZsKvGEuxM+fp5PXGcIMSlX4qoJUe/egJFiiSL
 IaikgibvfiKmYSJvwdxTlOcr793X2Ej19HNciNjJQp4pviDOdZixgtGvVVHJBmh6
 LYzE5q9jgbW9wQXwTTeWHp/nyqL80NslX0UARYnS2Ua0B96GRCESXqCUFtxK6tKR
 wbYiHzKc4dOfSxpNlKI+FlX63m5oSAmTEii3ODsWZjObbwYHNX2Zqj2cVFiSLCpv
 ZXgmpNL+tL2zBWxPvn6rzYhpaXo++PqlHK7vv2QVBI6XM2J8ztpj5Wr5zneRoJaE
 ejk8nw/mR43bfdQuUGZRKh/Z+FTqL0/2WbDgJMXl09c+zRz7J2c=
 =XhEC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This has two stable bugfixes, one to fix a BUG_ON() when
  nfs_commit_inode() is called with no outstanding commit requests and
  another to fix a race in the SUNRPC receive codepath.

  Additionally, there are also fixes for an NFS client deadlock and an
  xprtrdma performance regression.

  Summary:

  Stable bugfixes:
   - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
     commit in the case that there were no commit requests.
   - SUNRPC: Fix a race in the receive code path

  Other fixes:
   - NFS: Fix a deadlock in nfs client initialization
   - xprtrdma: Fix a performance regression for small IOs"

* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a race in the receive code path
  nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
  xprtrdma: Spread reply processing over more CPUs
  nfs: fix a deadlock in nfs client initialization
2017-12-16 13:12:53 -08:00
Scott Mayhew dc4fd9ab01 nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
If there were no commit requests, then nfs_commit_inode() should not
wait on the commit or mark the inode dirty, otherwise the following
BUG_ON can be triggered:

[ 1917.130762] kernel BUG at fs/inode.c:578!
[ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1]
[ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries
[ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G               ------------ T 3.10.0-768.el7.ppc64 #1
[ 1917.130810] task: c0000005ecd88040 ti: c00000004cea0000 task.ti: c00000004cea0000
[ 1917.130813] NIP: c000000000354178 LR: c000000000354160 CTR: c00000000012db80
[ 1917.130816] REGS: c00000004cea3720 TRAP: 0700   Tainted: G               ------------ T  (3.10.0-768.el7.ppc64)
[ 1917.130820] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI>  CR: 22002822  XER: 20000000
[ 1917.130828] CFAR: c00000000011f594 SOFTE: 1
GPR00: c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750
GPR04: 000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03
GPR08: 0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8
GPR12: c00000000012db80 c000000007b31200 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0
GPR28: d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8
[ 1917.130867] NIP [c000000000354178] .evict+0x1c8/0x350
[ 1917.130871] LR [c000000000354160] .evict+0x1b0/0x350
[ 1917.130873] Call Trace:
[ 1917.130876] [c00000004cea39a0] [c000000000354160] .evict+0x1b0/0x350 (unreliable)
[ 1917.130880] [c00000004cea3a30] [c0000000003558cc] .evict_inodes+0x13c/0x270
[ 1917.130884] [c00000004cea3af0] [c000000000327d20] .kill_anon_super+0x70/0x1e0
[ 1917.130896] [c00000004cea3b80] [d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs]
[ 1917.130900] [c00000004cea3c00] [c000000000328a20] .deactivate_locked_super+0xa0/0x1b0
[ 1917.130903] [c00000004cea3c80] [c00000000035ba54] .cleanup_mnt+0xd4/0x180
[ 1917.130907] [c00000004cea3d10] [c000000000119034] .task_work_run+0x114/0x150
[ 1917.130912] [c00000004cea3db0] [c00000000001ba6c] .do_notify_resume+0xcc/0x100
[ 1917.130916] [c00000004cea3e30] [c00000000000a7b0] .ret_from_except_lite+0x5c/0x60
[ 1917.130919] Instruction dump:
[ 1917.130921] 7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0
[ 1917.130927] 694a0060 7d4a0074 794ad182 694a0001 <0b0a0000> 892d02a4 2f890000 40de0134

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:50 -05:00