Commit Graph

30279 Commits

Author SHA1 Message Date
MITSUNARI Shigeo 7630b661da fs/block_dev.c: page cache wrongly left invalidated after revalidate_disk()
We found that bdev->bd_invalidated was left set once revalidate_disk()
is called, which results in page cache flush every time that device is
open.

Specifically, we found this problem in MD block device.  Once we resize
a MD device, mdadm --monitor periodically flush all page cache for that
device every 60 or 1000 seconds when it opens the device.

This bug lies since at least 3.2.0 till the latest kernel(3.6.2).  Patch
is attached.

The following steps will reproduce the problem.

1. prepair a block device (eg /dev/sdb).

2. create two partitions:

   sudo parted /dev/sdb
   mklabel gpt
   mkpart primary 0% 50%
   mkpart primary 50% 100%

3. create a md device.

   sudo mdadm -C /dev/md/hoge -l 1 -n 2 -e 1.2 --assume-clean --auto=md --symlink=no /dev/sdb1 /dev/sdb2

4. create file system and mount it

   sudo mkfs.ext3 /dev/md/hoge
   sudo mkdir /mnt/test
   sudo mount /dev/md/hoge /mnt/test

5. try to resize the device

   sudo mdadm -G /dev/md/hoge --size=max

6. create a file to fill file cache.

  sudo dd if=/dev/urandom of=/mnt/test/data bs=1M count=10

and verify the current status of file by free command.

7. mdadm monitor will open the md device every 1000 seconds and you
   will find all file cache on the device are cleared.

The timing can be reduced by the following steps.

a) kill mdadm and restart it with --delay option

   /sbin/mdadm --monitor --delay=30 --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog

or open the md device directly.

   sudo dd if=/dev/md/hoge of=/dev/null bs=4096 count=1

Signed-off-by: MITSUNARI Shigeo <herumi@nifty.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:16 -08:00
Jim Somerville 676a0675cf inotify: remove broken mask checks causing unmount to be EINVAL
Running the command:

	inotifywait -e unmount /mnt/disk

immediately aborts with a -EINVAL return code.  This is however a valid
parameter.  This abort occurs only if unmount is the sole event
parameter.  If other event parameters are supplied, then the unmount
event wait will work.

The problem was introduced by commit 44b350fc23 ("inotify: Fix mask
checks").  In that commit, it states:

	The mask checks in inotify_update_existing_watch() and
	inotify_new_watch() are useless because inotify_arg_to_mask()
	sets FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway.

But instead of removing the useless checks, it did this:

	        mask = inotify_arg_to_mask(arg);
	-       if (unlikely(!mask))
	+       if (unlikely(!(mask & IN_ALL_EVENTS)))
	                return -EINVAL;

The problem is that IN_ALL_EVENTS doesn't include IN_UNMOUNT, and other
parts of the code keep IN_UNMOUNT separate from IN_ALL_EVENTS.  So the
check should be:

	if (unlikely(!(mask & (IN_ALL_EVENTS | IN_UNMOUNT))))

But inotify_arg_to_mask(arg) always sets the IN_UNMOUNT bit in the mask
anyway, so the check is always going to pass and thus should simply be
removed.  Also note that inotify_arg_to_mask completely controls what
mask bits get set from arg, there's no way for invalid bits to get
enabled there.

Lets fix it by simply removing the useless broken checks.

Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: <stable@vger.kernel.org>		[2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:16 -08:00
Linus Torvalds 024e4ec185 A few fixes to reduce places where pstore might hang
a system in the crash path. Plus a new mountpoint
 (/sys/fs/pstore ... makes more sense then /dev/pstore).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRJlSgAAoJEKurIx+X31iBHp4P/iTzfOncyI65UaOkKWlBUzAt
 bXoiWVAPll15lJuO0Eq4e58c2UEWy9b52MxgCqK38wqSu2Wkd2JLk/q7LFS3gIxb
 3h2Lx4CdCCZSENfyGuOTs5WNNcekVFt0qReEXlRAQ6QIGSniO9HT1FklVsqzzDro
 HUMoGm0/KSWKuq/ls2MD7y/iH3UYbDQ15ED8n1HxWEbpY0+vxCzCMUzNBSOWUsKb
 HuXeHt06jkvw2DiWuyGBJ10wLHC9c4h37FTdrQFg8YeVBYLZT7boBsnB92Hf12YK
 0P782RPiWCIKI23huVHHgS3Rh7hSlKMLLxvrNgIX+XxRMxxjFZyo1xk2NlhUw8ta
 nGnz7qxqqz+0yCYpF/+0N8ucRH9do64GpCxFXFMTSM9pD4BT7QJAcToswtt3a+Jx
 EJQF0BNQEZCFwlWLoHZD0znuFFL+jv67RKuiytOXRMgKkxca6RrdIIMeKXF4cl8I
 z/qPslK0BDfAs2Go+ApMmJ5G4rGziy+IJDLjuxFIBBaKba+79NWDIFA6IV0BquVq
 idp47OGPTiltb7hOOno+8zRHLQPra7KiWk6Mos2EEb8p+BrrnzspWu1foYEjB4pV
 AIpC5ClEY+MZHXl30htff9PZgnFm1BOEs6OZmzhLTZevme8KhwdFtrWhY2gH0gvs
 OM/0NpeTQJzFhS90AbeC
 =abWF
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore patches from Tony Luck:
 "A few fixes to reduce places where pstore might hang a system in the
  crash path.  Plus a new mountpoint (/sys/fs/pstore ...  makes more
  sense then /dev/pstore)."

Fix up trivial conflict in drivers/firmware/efivars.c

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: Create a convenient mount point for pstore
  efi_pstore: Introducing workqueue updating sysfs
  efivars: Disable external interrupt while holding efivars->lock
  efi_pstore: Avoid deadlock in non-blocking paths
  pstore: Avoid deadlock in panic and emergency-restart path
2013-02-21 09:38:18 -08:00
Linus Torvalds 850cb82b75 dlm for 3.9
This includes a single patch to avoid excessive and
 unnecessary scanning of rsbs to free.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJREX6ZAAoJEDgbc8f8gGmqDJ4QAKFuq+vwBb+21lNVDk1egBTf
 nDcpuntzZfnqgg+VKC5LJm0bBAvgexH2GrQMIPDFPPMgAOFjTxXDvm93wVGeAK6d
 JpkiZxW1a0lyjR+7NMvDFnoIdVJvKw+GezNITiVo/lMghl51NWOQTH8QmSBKdEdJ
 cxwzm/ZV9gBS5bv5hfeq2hCDVD1q5jWohMjYHb+xyx2E/c4z8L8enEMR1yBBrB/4
 qw3DgGRMc0eoWzZrXKt1F6MYtM8QnU/H+WxxmLUYc4SMbClIonivAzzOa0PHO5nV
 YvewWwpo2VVT50EIxARXi0XQzJ0aYbTwo+E7KoG+MzK5sCmLJaP8inH5UnlltvEu
 OyhZUkh7qPNjhUQokP3FlNWidZkZNM+gJW6hmZwXtSGfGH6pe620QBrbghcV3nCh
 QXyENcKUcyDNJQSBQFKV/s4ql4RI+iDS3PfovyOdNqkEDWurWKx0AYcrn2dSNYt1
 wjFPH94P3NOwz4nIabpYiICrEPXLdXw+CCvHz56DgBL7EFzrWAzC97v5arYpouhZ
 NmsnzV8+PGHDWA+NM73fKGokW259pvk/zOPtM13zoyXP18X6adGKUqMpZA7jZge9
 RHzO/+A+3ScMXaXdb3c2P7GYH/iSSJXBmcTg8BceOYxICTq1wh4HEzvm39vEwMop
 3BMloPmzKxzfis69swIu
 =WtiV
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm update from David Teigland:
 "This includes a single patch to avoid excessive and unnecessary
  scanning of rsbs to free."

* tag 'dlm-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: avoid scanning unchanged toss lists
2013-02-21 09:25:23 -08:00
Linus Torvalds 2171ee8f43 NFS client bugfixes for Linux 3.9
- Fix an Oops in the pNFS layoutget code
 - Fix a number of NFSv4 and v4.1 state recovery deadlocks and hangs
   due to the interaction of the session drain lock and state management
   locks.
 - Remove task->tk_xprt, which was hiding a lot of RCU dereferencing bugs
 - Fix a long standing NFSv3 posix lock recovery bug.
 - Revert commit 324d003b0c. It turned out
   that the root cause of the deadlock was due to interactions with the
   workqueues that have now been resolved.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRJUfVAAoJEGcL54qWCgDySIoP/2QC2qNKqfcoIwOqOAR0pu5m
 wHybR67WXt/R6EK17l273apHfnJQb/IFFKRBfAIAW0vy4bJLwHMtutIeUweFWht2
 hmluYqVWAa9lPwaY1YMolo8Mfizaj0cY3QHv2ogF4poY0SLO6L8O/xo8n8nrn/pk
 QJW9tcvz43A2tbW92t833sQMFQSSjs9L2tvxM/BEahLrSEE4dl6WMWscmHQ/IXt/
 Hs3ADc4//dDF36k5lu7tU0qaVNDacjMXCV6Bw0HUCWCwjWzqmPxSlTDKAcer2vpm
 UawKDFTF4X1QYwbtGlNju0xhsSGRg7ZNRVEeRxqSROT280nzpp7C3CjM9sECtM2l
 SR9RqoYfBsEAyWZtsALApuSBZZZHOvUfVJEL1Cm0ZSeEbWSOO/DT07KRVgfb97h4
 hmE7pn5htgXdiGN2YCphMjrqjaoP4aKmULenj1h1jHbDOgrlqp2QLWqg/5NxMsTw
 M3vmYqQWkW60JN6e1pWJbsoyUkSMTVycRNrZ6WgShtd1lrz3tg66rDXYWdIhU6pT
 6qAKrkXnM5JkLoMoImNQod00Vr1ZXpVc8DLoFtug8rGohCdFJhWkYHefqVJ9rFhv
 2S2NpKDsFiA0q3aK0/YAkUnykYDAnLn7BM5SvhJjq8mKru4bP70PrN9LO8aqhE4E
 fFgj2MGG5QwrL0/GMuls
 =dthb
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix an Oops in the pNFS layoutget code

 - Fix a number of NFSv4 and v4.1 state recovery deadlocks and hangs due
   to the interaction of the session drain lock and state management
   locks.

 - Remove task->tk_xprt, which was hiding a lot of RCU dereferencing
   bugs

 - Fix a long standing NFSv3 posix lock recovery bug.

 - Revert commit 324d003b0c ("NFS: add nfs_sb_deactive_async to avoid
   deadlock").  It turned out that the root cause of the deadlock was
   due to interactions with the workqueues that have now been resolved.

* tag 'nfs-for-3.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (22 commits)
  NLM: Ensure that we resend all pending blocking locks after a reclaim
  umount oops when remove blocklayoutdriver first
  sunrpc: silence build warning in gss_fill_context
  nfs: remove kfree() redundant null checks
  NFSv4.1: Don't decode skipped layoutgets
  NFSv4.1: Fix bulk recall and destroy of layouts
  NFSv4.1: Fix an ABBA locking issue with session and state serialisation
  NFSv4: Fix a reboot recovery race when opening a file
  NFSv4: Ensure delegation recall and byte range lock removal don't conflict
  NFSv4: Fix up the return values of nfs4_open_delegation_recall
  NFSv4.1: Don't lose locks when a server reboots during delegation return
  NFSv4.1: Prevent deadlocks between state recovery and file locking
  NFSv4: Allow the state manager to mark an open_owner as being recovered
  SUNRPC: Add missing static declaration to _gss_mech_get_by_name
  Revert "NFS: add nfs_sb_deactive_async to avoid deadlock"
  SUNRPC: Nuke the tk_xprt macro
  SUNRPC: Avoid RCU dereferences in the transport bind and connect code
  SUNRPC: Fix an RCU dereference in xprt_reserve
  SUNRPC: Pass pointers to struct rpc_xprt to the congestion window
  SUNRPC: Fix an RCU dereference in xs_local_rpcbind
  ...
2013-02-21 09:23:01 -08:00
Linus Torvalds 9b9a72a8a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
 "This is one of the smallest collections of patches for the merge
  window for some time.  There are some clean ups relating to the
  transaction code and the shrinker, which are mostly in preparation for
  further development, but also make the code much easier to follow in
  these areas.

  There is a patch which allows the use of ->writepages even in the
  default ordered write mode for all writebacks.  This results in
  sending larger i/os to the block layer, and a subsequent increase in
  performance.  It also reduces the number of different i/o paths by
  one.

  There is also a bug fix reinstating the withdraw ack system which
  somehow got lost when the lock modules were merged into GFS2."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Reinstate withdraw ack system
  GFS2: Get a block reservation before resizing a file
  GFS2: Split glock lru processing into two parts
  GFS2: Use ->writepages for ordered writes
  GFS2: Clean up freeze code
  GFS2: Merge gfs2_attach_bufdata() into trans.c
  GFS2: Copy gfs2_trans_add_bh into new data/meta functions
  GFS2: Split gfs2_trans_add_bh() into two
  GFS2: Merge revoke adding functions
  GFS2: Separate LRU scanning from shrinker
2013-02-21 09:21:23 -08:00
Linus Torvalds 736a4c1177 xfs: update for 3.9-rc1
For 3.9-rc1 there are primarily bugfixes and a few cleanups.
 
 - fix(es) for compound buffers
 - remove unused XFS_TRANS_DEBUG routines
 - fix for dquot soft timer asserts due to overflow of d_blk_softlimit
 - don't zero allocation args structure members after they are memset(0)
 - fix for regression in dir v2 code introduced in commit 20f7e9f3
 - remove obsolete simple_strto<foo>
 - fix return value when filesystem probe finds no XFS magic, a
   regression introduced in 9802182.
 - remove boolean_t typedef completely
 - fix stack switch in __xfs_bmapi_allocate by moving the check for stack
   switch up into xfs_bmapi_write.
 - fix build error due to incomplete boolean_t removal
 - fix oops in _xfs_buf_find by validating that the requested block is
   within the filesystem bounds.
 - limit speculative preallocation near ENOSPC.
 - fix an unmount hang in xfs_wait_buftarg by freeing the
   xfs_buf_log_item in xfs_buf_item_unlock.
 - fix a possible use after free with AIO.
 - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
   regression introduced in fb59581404.
 - replace hardcoded 128 with log header size
 - add memory barrier before wake_up_bit in xfs_ifunlock
 - limit speculative preallocation on sparse files
 - fix xa_lock recursion bug introduced in 90810b9e82
 - fix write verifier for symlinks
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRI/pkAAoJENaLyazVq6ZO8F0QAMUgw+edrht4x/vr0stNpLvL
 /htIhJVhRjWthQaSiUkmlHShNZCgvfKSRPG9OiaMo24QuqardB3Ieh90JTW8b5Iu
 63tQ1kKB++n1PnW7OYSxvUj2WNBHwHlI0ueNUiEFXoD9tQOWcyO63CxA/qnTJB1P
 gulI+R7gWafqD1UpH6kP/xCMn3V3PnCYVdX2LJHuEjQoCydsKgtc9dgvRi4e51kf
 OrxnKof6tAlkpNc1+ZmoQch4NS+jRWv90HGdrxOy/BESC0QYSyfrAwxXIVvkwBfu
 NsQg5oDCp4jlDQt9gjgABhPcWvSGc7JhSjqbXTnweRS4t7SOGGvb46HVrLMKZRID
 MFT8tbc1HXzAl2z2k0S0bRy9HNC4QZg5GJZ/MhXMKUQWZLZf/XPALgQ5q95hEjsc
 MUTudN7Xa8Ay+2BPTI34M3f1Mlzmm68N07rzjw8jmJzjVq8v7LJ1wbmpiswjkHiW
 6u6MSwY3NDhjNI0sA+h6ByX3KwYfBW9Y63QI/8Dld7jDFQ5onxYmkIP+fPe1vhBJ
 51gNm7bKXImEFpykAP5YddCnxMWtFUdIH/Wyu7quMpwv5YSIR+zVd3l4edqAKIgh
 YlJ9bLewWP+4B2L4raotPjdaOxDZfrW2nP0vmVcuQWEbMdANZ8XQTNPCULcoKGtY
 iBCDlWS7wAfH8PYnXlbp
 =/Zui
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.9-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Ben Myers:
 "Primarily bugfixes and a few cleanups:

   - fix(es) for compound buffers

   - remove unused XFS_TRANS_DEBUG routines

   - fix for dquot soft timer asserts due to overflow of d_blk_softlimit

   - don't zero allocation args structure members after they are memset(0)

   - fix for regression in dir v2 code introduced in commit 20f7e9f3

   - remove obsolete simple_strto<foo>

   - fix return value when filesystem probe finds no XFS magic, a
     regression introduced in 9802182.

   - remove boolean_t typedef completely

   - fix stack switch in __xfs_bmapi_allocate by moving the check for
     stack switch up into xfs_bmapi_write.

   - fix build error due to incomplete boolean_t removal

   - fix oops in _xfs_buf_find by validating that the requested block is
     within the filesystem bounds.

   - limit speculative preallocation near ENOSPC.

   - fix an unmount hang in xfs_wait_buftarg by freeing the
     xfs_buf_log_item in xfs_buf_item_unlock.

   - fix a possible use after free with AIO.

   - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
     regression introduced in fb59581404.

   - replace hardcoded 128 with log header size

   - add memory barrier before wake_up_bit in xfs_ifunlock

   - limit speculative preallocation on sparse files

   - fix xa_lock recursion bug introduced in 90810b9e82

   - fix write verifier for symlinks"

Fixed up conflicts in fs/xfs/xfs_buf_item.c (due to bli_format rename in
commit 0f22f9d0cd affecting the removed XFS_TRANS_DEBUG routines in
commit ec47eb6b0b).

* tag 'for-linus-v3.9-rc1' of git://oss.sgi.com/xfs/xfs: (36 commits)
  xfs: xfs_bmap_add_attrfork_local is too generic
  xfs: remove log force from xfs_buf_trylock()
  xfs: recheck buffer pinned status after push trylock failure
  xfs: limit speculative prealloc size on sparse files
  xfs: memory barrier before wake_up_bit()
  xfs: refactor space log reservation for XFS_TRANS_ATTR_SET
  xfs: make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy()
  xfs: make use of XFS_SB_LOG_RES() at xfs_mount_log_sb()
  xfs: make use of XFS_SB_LOG_RES() at xfs_log_sbcount()
  xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk
  xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time
  xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time
  xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time
  xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time
  xfs: calculate xfs_qm_write_sb_changes() space log reservation at mount time
  xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time
  xfs: make use of xfs_calc_buf_res() in xfs_trans.c
  xfs: add a helper to figure out the space log reservation per item
  xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
  xfs: Fix possible use-after-free with AIO
  ...
2013-02-21 09:08:45 -08:00
Linus Torvalds c4bc705e45 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "The biggest part of this pull request is a patch series from Maxim
  Patlasov to optimize scatter-gather direct IO.  There's also the
  addition of a "readdirplus" API, poll events and various fixes and
  cleanups.

  There's a one line change outside of fuse to mm/filemap.c which makes
  the argument of iov_iter_single_seg_count() const, required by Maxim's
  patches."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (22 commits)
  fuse: allow control of adaptive readdirplus use
  Synchronize fuse header with one used in library
  fuse: send poll events
  fuse: don't WARN when nlink is zero
  fuse: avoid out-of-scope stack access
  fuse: bump version for READDIRPLUS
  FUSE: Adapt readdirplus to application usage patterns
  Do not use RCU for current process credentials
  fuse: cleanup fuse_direct_io()
  fuse: optimize __fuse_direct_io()
  fuse: optimize fuse_get_user_pages()
  fuse: pass iov[] to fuse_get_user_pages()
  mm: minor cleanup of iov_iter_single_seg_count()
  fuse: use req->page_descs[] for argpages cases
  fuse: add per-page descriptor <offset, length> to fuse_req
  fuse: rework fuse_do_ioctl()
  fuse: rework fuse_perform_write()
  fuse: rework fuse_readpages()
  fuse: rework fuse_retrieve()
  fuse: categorize fuse_get_req()
  ...
2013-02-21 09:03:54 -08:00
Linus Torvalds 2608e3d0fa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull v9fs updates from Eric Van Hensbergen:
 "Just fixes and simplifications"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Fix atomic_open
  fs/9p: Don't use O_TRUNC flag in TOPEN and TLOPEN request
  locking in fs/9p ->readdir()
2013-02-21 09:02:13 -08:00
Linus Torvalds a0b1c42951 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking update from David Miller:

 1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
    timestamp offset.  From Andrey Vagin.

 2) VMWARE VM VSOCK layer, from Andy King.

 3) Much improved support for virtual functions and SR-IOV in bnx2x,
    from Ariel ELior.

 4) All protocols on ipv4 and ipv6 are now network namespace aware, and
    all the compatability checks for initial-namespace-only protocols is
    removed.  Thanks to Tom Parkin for helping deal with the last major
    holdout, L2TP.

 5) IPV6 support in netpoll and network namespace support in pktgen,
    from Cong Wang.

 6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
    Protocol (MVRP) support, from David Ward.

 7) Compute packet lengths more accurately in the packet scheduler, from
    Eric Dumazet.

 8) Use per-task page fragment allocator in skb_append_datato_frags(),
    also from Eric Dumazet.

 9) Add support for connection tracking labels in netfilter, from
    Florian Westphal.

10) Fix default multicast group joining on ipv6, and add anti-spoofing
    checks to 6to4 and 6rd.  From Hannes Frederic Sowa.

11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
    times, rearrange inet frag datastructures for better cacheline
    locality, and move more operations outside of locking.  From Jesper
    Dangaard Brouer.

12) Instead of strict master <--> slave relationships, allow arbitrary
    scenerios with "upper device lists".  From Jiri Pirko.

13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
    Pirko.

14) Add a BPF filter netfilter match target, from Willem de Bruijn.

15) Orphan and delete a bunch of pre-historic networking drivers from
    Paul Gortmaker.

16) Add TSO support for GRE tunnels, from Pravin B SHelar.  Although
    this still needs some minor bug fixing before it's %100 correct in
    all cases.

17) Handle unresolved IPSEC states like ARP, with a resolution packet
    queue.  From Steffen Klassert.

18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
    Hemminger.  This was long overdue.

19) Support SO_REUSEPORT, from Tom Herbert.

20) Allow locking a socket BPF filter, so that it cannot change after a
    process drops capabilities.

21) Add VLAN filtering to bridge, from Vlad Yasevich.

22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
    the ipv6 routes, from YOSHIFUJI Hideaki.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
  ipv6: fix race condition regarding dst->expires and dst->from.
  net: fix a wrong assignment in skb_split()
  ip_gre: remove an extra dst_release()
  ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
  atl1c: restore buffer state
  net: fix a build failure when !CONFIG_PROC_FS
  net: ipv4: fix waring -Wunused-variable
  net: proc: fix build failed when procfs is not configured
  Revert "xen: netback: remove redundant xenvif_put"
  net: move procfs code to net/core/net-procfs.c
  qmi_wwan, cdc-ether: add ADU960S
  bonding: set sysfs device_type to 'bond'
  bonding: fix bond_release_all inconsistencies
  b44: use netdev_alloc_skb_ip_align()
  xen: netback: remove redundant xenvif_put
  net: fec: Do a sanity check on the gpio number
  ip_gre: propogate target device GSO capability to the tunnel device
  ip_gre: allow CSUM capable devices to handle packets
  bonding: Fix initialize after use for 3ad machine state spinlock
  bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
  ...
2013-02-20 18:58:50 -08:00
Linus Torvalds 8793422fd9 ACPI and power management updates for 3.9-rc1
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
   with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
   Toshi Kani, and Yinghai Lu.
 
 - ACPI power resources handling and ACPI device PM update from
   Rafael J. Wysocki.
 
 - ACPICA update to version 20130117 from Bob Moore and Lv Zheng
   with contributions from Aaron Lu, Chao Guan, Jesper Juhl, and
   Tim Gardner.
 
 - Support for Intel Lynxpoint LPSS from Mika Westerberg.
 
 - cpuidle update from Len Brown including Intel Haswell support, C1
   state for intel_idle, removal of global pm_idle.
 
 - cpuidle fixes and cleanups from Daniel Lezcano.
 
 - cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri
   with contributions from Stratos Karafotis and Rickard Andersson.
 
 - Intel P-states driver for Sandy Bridge processors from
   Dirk Brandewie.
 
 - cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
 
 - cpufreq fixes related to ordering issues between acpi-cpufreq and
   powernow-k8 from Borislav Petkov and Matthew Garrett.
 
 - cpufreq support for Calxeda Highbank processors from Mark Langsdorf
   and Rob Herring.
 
 - cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
   from Shawn Guo.
 
 - cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
   and Inderpal Singh.
 
 - Support for "lightweight suspend" from Zhang Rui.
 
 - Removal of the deprecated power trace API from Paul Gortmaker.
 
 - Assorted updates from Andreas Fleig, Colin Ian King,
   Davidlohr Bueso, Joseph Salisbury, Kees Cook, Li Fei,
   Nishanth Menon, ShuoX Liu, Srinivas Pandruvada, Tejun Heo,
   Thomas Renninger, and Yasuaki Ishimatsu.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJRIsArAAoJEKhOf7ml8uNsD6MP/j7C4NA+GTq6RdwoJt+Yki0K
 9Ep8I4pEuRFoN/oskv24EyQhpGJIk6UxWcJ/DWFBc+1VhmKORta7k2Idv/wlJA77
 s7AcDveA9xcDh+TVfbh87TeuiMSXiSdDZbiaQO+wMizWJAF3F84AnjiAqqqyQcSK
 bA5/Siz/vWlt9PyYDaQtHTVE4lpvPuVcQdYewsdaH2PsmUjvIg/TUzg28CTrdyvv
 eHOdBK9R0/OLQLhzRbL0VOGJ//wEl+HJRO0QEhTKPgdQ1e/VH/4Zu5WSzF8P/x4C
 s2f8U4IKQqulDuDHXtpMpelFm7hRWgsOqZLkcyXLs+0dvSM9CTPO6P0ZaImxUctk
 5daHWEsXUnCErDQawt1mcZP8l6qnxofMQIfLXyPVzvlSnHyToTmrtXa1v2u4AuL/
 hOo4MYWsFNUmRdtGFFGlExGgEDZ4G5NwiYjRBl/6XJ3v4nhnnMbuzxP8scpoe5m1
 8tjroJHZFUUs/mFU/H+oRbHzSzXPmp1sddNaTg4OpVmTn3DDh6ljnFhiItd1Ndw0
 5ldVbSe6ETq5RoK0TbzvQOeVpa9F3JfqbrXLQPqfd2iz/No41LQYG1uShRYuXKuA
 wfEcc+c9VMd3FILu05pGwBnU8VS9VbxTYMz7xDxg6b29Ywnb7u+Q1ycCk2gFYtkS
 E2oZDuyewTJxaskzYsNr
 =wijn
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management updates from Rafael Wysocki:

 - Rework of the ACPI namespace scanning code from Rafael J.  Wysocki
   with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
   Toshi Kani, and Yinghai Lu.

 - ACPI power resources handling and ACPI device PM update from Rafael
   J Wysocki.

 - ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
   contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.

 - Support for Intel Lynxpoint LPSS from Mika Westerberg.

 - cpuidle update from Len Brown including Intel Haswell support, C1
   state for intel_idle, removal of global pm_idle.

 - cpuidle fixes and cleanups from Daniel Lezcano.

 - cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
   contributions from Stratos Karafotis and Rickard Andersson.

 - Intel P-states driver for Sandy Bridge processors from Dirk
   Brandewie.

 - cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.

 - cpufreq fixes related to ordering issues between acpi-cpufreq and
   powernow-k8 from Borislav Petkov and Matthew Garrett.

 - cpufreq support for Calxeda Highbank processors from Mark Langsdorf
   and Rob Herring.

 - cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
   from Shawn Guo.

 - cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
   and Inderpal Singh.

 - Support for "lightweight suspend" from Zhang Rui.

 - Removal of the deprecated power trace API from Paul Gortmaker.

 - Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
   Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
   Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
   Ishimatsu.

* tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
  PM idle: remove global declaration of pm_idle
  unicore32 idle: delete stray pm_idle comment
  openrisc idle: delete pm_idle
  mn10300 idle: delete pm_idle
  microblaze idle: delete pm_idle
  m32r idle: delete pm_idle, and other dead idle code
  ia64 idle: delete pm_idle
  cris idle: delete idle and pm_idle
  ARM64 idle: delete pm_idle
  ARM idle: delete pm_idle
  blackfin idle: delete pm_idle
  sparc idle: rename pm_idle to sparc_idle
  sh idle: rename global pm_idle to static sh_idle
  x86 idle: rename global pm_idle to static x86_idle
  APM idle: register apm_cpu_idle via cpuidle
  cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
  cpufreq / intel_pstate: Change to disallow module build
  tools/power turbostat: display SMI count by default
  intel_idle: export both C1 and C1E
  ACPI / hotplug: Fix concurrency issues and memory leaks
  ...
2013-02-20 11:26:56 -08:00
Linus Torvalds 266d7ad7f4 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer changes from Ingo Molnar:
 "Main changes:

   - ntp: Add CONFIG_RTC_SYSTOHC: a generic RTC driver facility
     complementing the existing CONFIG_RTC_HCTOSYS, which uses NTP to
     keep the hardware clock updated.

   - posix-timers: Fix clock_adjtime to always return timex data on
     success.  This is changing the ABI, but no breakage was expected
     and found - caution is warranted nevertheless.

   - platform persistent clock improvements/cleanups.

   - clockevents: refactor timer broadcast handling to be more generic
     and less duplicated with matching architecture code (mostly ARM
     motivated.)

   - various fixes and cleanups"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/x86/hpet: Use HPET_COUNTER to specify the hpet counter in vread_hpet()
  posix-cpu-timers: Fix nanosleep task_struct leak
  clockevents: Fix generic broadcast for FEAT_C3STOP
  time, Fix setting of hardware clock in NTP code
  hrtimer: Prevent hrtimer_enqueue_reprogram race
  clockevents: Add generic timer broadcast function
  clockevents: Add generic timer broadcast receiver
  timekeeping: Switch HAS_PERSISTENT_CLOCK to ALWAYS_USE_PERSISTENT_CLOCK
  x86/time/rtc: Don't print extended CMOS year when reading RTC
  x86: Select HAS_PERSISTENT_CLOCK on x86
  timekeeping: Add CONFIG_HAS_PERSISTENT_CLOCK option
  rtc: Skip the suspend/resume handling if persistent clock exist
  timekeeping: Add persistent_clock_exist flag
  posix-timers: Fix clock_adjtime to always return timex data on success
  Round the calculated scale factor in set_cyc2ns_scale()
  NTP: Add a CONFIG_RTC_SYSTOHC configuration
  MAINTAINERS: Update John Stultz's email
  time: create __getnstimeofday for WARNless calls
2013-02-19 19:05:45 -08:00
Linus Torvalds d652e1eb8e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Main changes:

   - scheduler side full-dynticks (user-space execution is undisturbed
     and receives no timer IRQs) preparation changes that convert the
     cputime accounting code to be full-dynticks ready, from Frederic
     Weisbecker.

   - Initial sched.h split-up changes, by Clark Williams

   - select_idle_sibling() performance improvement by Mike Galbraith:

        " 1 tbench pair (worst case) in a 10 core + SMT package:

          pre   15.22 MB/sec 1 procs
          post 252.01 MB/sec 1 procs "

  - sched_rr_get_interval() ABI fix/change.  We think this detail is not
    used by apps (so it's not an ABI in practice), but lets keep it
    under observation.

  - misc RT scheduling cleanups, optimizations"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
  cputime: Remove irqsave from seqlock readers
  sched, powerpc: Fix sched.h split-up build failure
  cputime: Restore CPU_ACCOUNTING config defaults for PPC64
  sched/rt: Move rt specific bits into new header file
  sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
  sched: Move sched.h sysctl bits into separate header
  sched: Fix signedness bug in yield_to()
  sched: Fix select_idle_sibling() bouncing cow syndrome
  sched/rt: Further simplify pick_rt_task()
  sched/rt: Do not account zero delta_exec in update_curr_rt()
  cputime: Safely read cputime of full dynticks CPUs
  kvm: Prepare to add generic guest entry/exit callbacks
  cputime: Use accessors to read task cputime stats
  cputime: Allow dynamic switch between tick/virtual based cputime accounting
  cputime: Generic on-demand virtual cputime accounting
  cputime: Move default nsecs_to_cputime() to jiffies based cputime file
  cputime: Librarize per nsecs resolution cputime definitions
  cputime: Avoid multiplication overflow on utime scaling
  context_tracking: Export context state for generic vtime
  ...

Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-19 18:19:48 -08:00
Trond Myklebust 666b3d803a NLM: Ensure that we resend all pending blocking locks after a reclaim
Currently, nlmclnt_lock will break out of the for(;;) loop when
the reclaimer wakes up the blocking lock thread by setting
nlm_lck_denied_grace_period. This causes the lock request to fail
with an ENOLCK error.
The intention was always to ensure that we resend the lock request
after the grace period has expired.

Reported-by: Wangyuan Zhang <Wangyuan.Zhang@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-19 12:18:27 -05:00
Gao feng c2399059a3 net: proc: remove proc_net_remove
proc_net_remove has been replaced by remove_proc_entry.
we can remove it now.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Gao feng b4278c961a net: proc: remove proc_net_fops_create
proc_net_fops_create has been replaced by proc_create,
we can remove it now.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
fanchaoting 5a12cca697 umount oops when remove blocklayoutdriver first
now pnfs client uses block layout, maybe we can remove
blocklayoutdriver first. if we umount later,
it can cause oops in unset_pnfs_layoutdriver.
because nfss->pnfs_curr_ld->clear_layoutdriver is invalid.

reproduce it:
 modprobe  blocklayoutdriver
 mount -t nfs4 -o minorversion=1 pnfsip:/ /mnt/
 rmmod blocklayoutdriver
 umount /mnt

then you can see following

CPU 0
Pid: 17023, comm: umount.nfs4 Tainted: GF          O 3.7.0-rc6-pnfs #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa04cfe6d>]  [<ffffffffa04cfe6d>] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
RSP: 0018:ffff8800022d9e48  EFLAGS: 00010286
RAX: ffffffffa04a1b00 RBX: ffff88000b013800 RCX: 0000000000000001
RDX: ffffffff81ae8ee0 RSI: ffff880001ee94b8 RDI: ffff88000b013800
RBP: ffff8800022d9e58 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880001ee9400
R13: ffff8800105978c0 R14: 00007fff25846c08 R15: 0000000001bba550
FS:  00007f45ae7f0700(0000) GS:ffff880012c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffa04a1b38 CR3: 0000000002c0c000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount.nfs4 (pid: 17023, threadinfo ffff8800022d8000, task ffff880006e48aa0)
Stack:
ffff8800105978c0 ffff88000b013800 ffff8800022d9e78 ffffffffa04cd0ce
ffff8800022d9e78 ffff88000b013800 ffff8800022d9ea8 ffffffffa04755a7
ffff8800022d9ea8 ffff880002f96400 ffff88000b013800 ffff880002f96400
Call Trace:
[<ffffffffa04cd0ce>] nfs4_destroy_server+0x1e/0x30 [nfsv4]
[<ffffffffa04755a7>] nfs_free_server+0xb7/0x150 [nfs]
[<ffffffffa047d4d5>] nfs_kill_super+0x35/0x40 [nfs]
[<ffffffff81178d35>] deactivate_locked_super+0x45/0x70
[<ffffffff8117986a>] deactivate_super+0x4a/0x70
[<ffffffff81193ee2>] mntput_no_expire+0xd2/0x130
[<ffffffff81194d62>] sys_umount+0x72/0xe0
[<ffffffff8154af59>] system_call_fastpath+0x16/0x1b
Code: 06 e1 b8 ea ff ff ff eb 9e 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 87 80 03 00 00 48 89 fb 48 85 c0 74 29 <48> 8b 40 38 48 85 c0 74 02 ff d0 48 8b 03 3e ff 48 04 0f 94 c2
RIP  [<ffffffffa04cfe6d>] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
RSP <ffff8800022d9e48>
CR2: ffffffffa04a1b38
---[ end trace 29f75aaedda058bf ]---

Signed-off-by: fanchaoting<fanchaoting@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-17 15:40:15 -05:00
Tim Gardner 96aa1549af nfs: remove kfree() redundant null checks
smatch analysis:

fs/nfs/getroot.c:130 nfs_get_root() info: redundant null
 check on name calling kfree()

fs/nfs/unlink.c:272 nfs_async_unlink() info: redundant null
 check on devname_garbage calling kfree()

Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-17 15:27:21 -05:00
Weston Andros Adamson 085b7a45c6 NFSv4.1: Don't decode skipped layoutgets
layoutget's prepare hook can call rpc_exit with status = NFS4_OK (0).
Because of this, nfs4_proc_layoutget can't depend on a 0 status to mean
that the RPC was successfully sent, received and parsed.

To fix this, use the result's len member to see if parsing took place.

This fixes the following OOPS -- calling xdr_init_decode() with a buffer length
0 doesn't set the stream's 'p' member and ends up using uninitialized memory
in filelayout_decode_layout.

BUG: unable to handle kernel paging request at 0000000000008050
IP: [<ffffffff81282e78>] memcpy+0x18/0x120
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 1
Modules linked in: nfs_layout_nfsv41_files nfs lockd fscache auth_rpcgss nfs_acl autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_mod ppdev parport_pc parport snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 microcode vmware_balloon i2c_piix4 i2c_core sg shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi [last unloaded: speedstep_lib]

Pid: 1665, comm: flush-0:22 Not tainted 2.6.32-356-test-2 #2 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffff81282e78>]  [<ffffffff81282e78>] memcpy+0x18/0x120
RSP: 0018:ffff88003dfab588  EFLAGS: 00010206
RAX: ffff88003dc42000 RBX: ffff88003dfab610 RCX: 0000000000000009
RDX: 000000003f807ff0 RSI: 0000000000008050 RDI: ffff88003dc42000
RBP: ffff88003dfab5b0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000024
R13: ffff88003dc42000 R14: ffff88003f808030 R15: ffff88003dfab6a0
FS:  0000000000000000(0000) GS:ffff880003420000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000008050 CR3: 000000003bc92000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-0:22 (pid: 1665, threadinfo ffff88003dfaa000, task ffff880037f77540)
Stack:
ffffffffa0398ac1 ffff8800397c5940 ffff88003dfab610 ffff88003dfab6a0
<d> ffff88003dfab5d0 ffff88003dfab680 ffffffffa01c150b ffffea0000d82e70
<d> 000000508116713b 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffffa0398ac1>] ? xdr_inline_decode+0xb1/0x120 [sunrpc]
[<ffffffffa01c150b>] filelayout_decode_layout+0xeb/0x350 [nfs_layout_nfsv41_files]
[<ffffffffa01c17fc>] filelayout_alloc_lseg+0x8c/0x3c0 [nfs_layout_nfsv41_files]
[<ffffffff8150e6ce>] ? __wait_on_bit+0x7e/0x90

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-17 15:24:16 -05:00
Dave Chinner 1e82379b01 xfs: xfs_bmap_add_attrfork_local is too generic
When we are converting local data to an extent format as a result of
adding an attribute, the type of data contained in the local fork
determines the behaviour that needs to occur.

xfs_bmap_add_attrfork_local() already handles the directory data
case specially by using S_ISDIR() and calling out to
xfs_dir2_sf_to_block(), but with verifiers we now need to handle
each different type of metadata specially and different metadata
formats require different verifiers (and eventually block header
initialisation).

There is only a single place that we add and attribute fork to
the inode, but that is in the attribute code and it knows nothing
about the specific contents of the data fork. It is only the case of
local data that is the issue here, so adding code to hadnle this
case in the attribute specific code is wrong. Hence we are really
stuck trying to detect the data fork contents in
xfs_bmap_add_attrfork_local() and performing the correct callout
there.

Luckily the current cases can be determined by S_IS* macros, and we
can push the work off to data specific callouts, but each of those
callouts does a lot of work in common with
xfs_bmap_local_to_extents(). The only reason that this fails for
symlinks right now is is that xfs_bmap_local_to_extents() assumes
the data fork contains extent data, and so attaches a a bmap extent
data verifier to the buffer and simply copies the data fork
information straight into it.

To fix this, allow us to pass a "formatting" callback into
xfs_bmap_local_to_extents() which is responsible for setting the
buffer type, initialising it and copying the data fork contents over
to the new buffer. This allows callers to specify how they want to
format the new buffer (which is necessary for the upcoming CRC
enabled metadata blocks) and hence make xfs_bmap_local_to_extents()
useful for any type of data fork content.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com> 
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-14 17:35:51 -06:00
Brian Foster fa5566e4ff xfs: remove log force from xfs_buf_trylock()
The trylock log force invoked via xfs_buf_item_push() can attempt
to acquire xa_lock, thus leading to a recursion bug when called
with xa_lock held.

This log force was originally added to xfs_buf_trylock() to address
xfsaild stalls due to pinned and stale buffers. Since the addition
of this behavior, the log item pushing code had been reworked to
detect and track pinned items to inform xfsaild to issue a log
force itself when necessary. As such, the log force on trylock
failure is redundant and safe to remove.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-14 17:24:53 -06:00
Brian Foster 5337fe9b10 xfs: recheck buffer pinned status after push trylock failure
The buffer pinned check and trylock sequence in xfs_buf_item_push()
can race with an active transaction on marking the buffer pinned.
This can result in the buffer becoming pinned and stale after the
initial check and the trylock failure, but before the check in
xfs_buf_trylock() that issues a log force. If the log force is
issued from this context, a spinlock recursion occurs on xa_lock.

Prepare xfs_buf_item_push() to handle the race by detecting a
pinned buffer after the trylock failure so xfsaild issues a log
force from a safe context. This, along with various previous fixes,
renders the log force in xfs_buf_trylock() redundant.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-14 17:23:42 -06:00
Dave Chinner a1e16c2666 xfs: limit speculative prealloc size on sparse files
Speculative preallocation based on the current file size works well
for contiguous files, but is sub-optimal for sparse files where the
EOF preallocation can fill holes and result in large amounts of
zeros being written when it is not necessary.

The algorithm is modified to prevent EOF speculative preallocation
from triggering larger allocations on IO patterns of
truncate--to-zero-seek-write-seek-write-....  which results in
non-sparse files for large files. This, unfortunately, is the way cp
now behaves when copying sparse files and so needs to be fixed.

What this code does is that it looks at the existing extent adjacent
to the current EOF and if it determines that it is a hole we disable
speculative preallocation altogether. To avoid the next write from
doing a large prealloc, it takes the size of subsequent
preallocations from the current size of the existing EOF extent.
IOWs, if you leave a hole in the file, it resets preallocation
behaviour to the same as if it was a zero size file.

Example new behaviour:

$ xfs_io -f -c "pwrite 0 31m" \
            -c "pwrite 33m 1m" \
            -c "pwrite 128m 1m" \
            -c "fiemap -v" /mnt/scratch/blah
wrote 32505856/32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.608 GiB/sec and 421432.7439 ops/sec)
wrote 1048576/1048576 bytes at offset 34603008
1 MiB, 256 ops; 0.0000 sec (1.462 GiB/sec and 383233.5329 ops/sec)
wrote 1048576/1048576 bytes at offset 134217728
1 MiB, 256 ops; 0.0000 sec (1.719 GiB/sec and 450704.2254 ops/sec)
/mnt/scratch/blah:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..65535]:      96..65631        65536   0x0
   1: [65536..67583]:  hole              2048
   2: [67584..69631]:  67680..69727      2048   0x0
   3: [69632..262143]: hole             192512
   4: [262144..264191]: 262240..264287    2048   0x1

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-14 17:21:32 -06:00
Trond Myklebust fd9a8d7160 NFSv4.1: Fix bulk recall and destroy of layouts
The current code in pnfs_destroy_all_layouts() assumes that removing
the layout from the server->layouts list is sufficient to make it
invisible to other processes. This ignores the fact that most
users access the layout through the nfs_inode->layout...
There is further breakage due to lack of reference counting of the
layouts, meaning that the whole thing Oopses at the drop of a hat.

The code in initiate_bulk_draining() is almost correct, and can be
used as a model for pnfs_destroy_all_layouts(), so move that
code to pnfs.c, and refactor the code to allow us to choose between
a single filesystem bulk recall, and a recall of all layouts.
Also note that initiate_bulk_draining() currently calls iput() while
holding locks. Fix that too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-14 13:22:50 -05:00
Steven Whitehouse fd95e81cb1 GFS2: Reinstate withdraw ack system
This patch reinstates the ack system which withdraw should be using. It
appears to have been accidentally forgotten when the lock module was
merged into GFS2, due to two different sysfs files having the same name.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-02-13 12:21:40 +00:00
Josh Boyer fb0af3f2b1 pstore: Create a convenient mount point for pstore
Using /dev/pstore as a mount point for the pstore filesystem is slightly
awkward.  We don't normally mount filesystems in /dev/ and the /dev/pstore
file isn't created automatically by anything.  While this method will
still work, we can create a persistent mount point in sysfs.  This will
put pstore on par with things like cgroups and efivarfs.

Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-02-12 13:07:22 -08:00
Trond Myklebust c8da19b986 NFSv4.1: Fix an ABBA locking issue with session and state serialisation
Ensure that if nfs_wait_on_sequence() causes our rpc task to wait for
an NFSv4 state serialisation lock, then we also drop the session slot.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-11 19:04:25 -05:00
Trond Myklebust c21443c2c7 NFSv4: Fix a reboot recovery race when opening a file
If the server reboots after it has replied to our OPEN, but before we
call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread
will not see a stateid for this open, and so will fail to recover it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:14 -05:00
Trond Myklebust 65b62a29f7 NFSv4: Ensure delegation recall and byte range lock removal don't conflict
Add a mutex to the struct nfs4_state_owner to ensure that delegation
recall doesn't conflict with byte range lock removal.

Note that we nest the new mutex _outside_ the state manager reclaim
protection (nfsi->rwsem) in order to avoid deadlocks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:13 -05:00
Trond Myklebust 37380e4264 NFSv4: Fix up the return values of nfs4_open_delegation_recall
Adjust the return values so that they return EAGAIN to the caller in
cases where we might want to retry the delegation recall after
the state recovery has run.
Note that we can't wait and retry in this routine, because the caller
may be the state manager thread.

If delegation recall fails due to a session or reboot related issue,
also ensure that we mark the stateid as delegated so that
nfs_delegation_claim_opens can find it again later.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:13 -05:00
Trond Myklebust d25be546a8 NFSv4.1: Don't lose locks when a server reboots during delegation return
If the server reboots while we are converting a delegation into
OPEN/LOCK stateids as part of a delegation return, the current code
will simply exit with an error. This causes us to lose both
delegation state and locking state (i.e. locking atomicity).

Deal with this by exposing the delegation stateid during delegation
return, so that we can recover the delegation, and then resume
open/lock recovery.

Note that not having to hold the nfs_inode->rwsem across the
calls to nfs_delegation_claim_opens() also fixes a deadlock against
the NFSv4.1 reboot recovery code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:12 -05:00
Trond Myklebust 9a99af494b NFSv4.1: Prevent deadlocks between state recovery and file locking
We currently have a deadlock in which the state recovery thread
ends up blocking due to one of the locks which it is trying to
recover holding the nfs_inode->rwsem.
The situation is as follows: the state recovery thread is
scheduled in order to recover from a reboot. It immediately
drains the session, forcing all ordinary NFSv4.1 calls to
nfs41_setup_sequence() to be put to sleep.  This includes the
file locking process that holds the nfs_inode->rwsem.
When the thread gets to nfs4_reclaim_locks(), it tries to
grab a write lock on nfs_inode->rwsem, and boom...

Fix is to have the lock drop the nfs_inode->rwsem while it is
doing RPC calls. We use a sequence lock in order to signal to
the locking process whether or not a state recovery thread has
run on that inode, in which case it should retry the lock.

Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:12 -05:00
Trond Myklebust c137afabe3 NFSv4: Allow the state manager to mark an open_owner as being recovered
This patch adds a seqcount_t lock for use by the state manager to
signal that an open owner has been recovered. This mechanism will be
used by the delegation, open and byte range lock code in order to
figure out if they need to replay requests due to collisions with
lock recovery.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:11 -05:00
Rafael J. Wysocki a9834cb205 Merge branch 'acpi-pm'
* acpi-pm: (35 commits)
  ACPI / PM: Handle missing _PSC in acpi_bus_update_power()
  ACPI / PM: Do not power manage devices in unknown initial states
  ACPI / PM: Fix acpi_bus_get_device() check in drivers/acpi/device_pm.c
  ACPI / PM: Fix /proc/acpi/wakeup for devices w/o bus or parent
  ACPI / PM: Fix consistency check for power resources during resume
  ACPI / PM: Expose lists of device power resources to user space
  sysfs: Functions for adding/removing symlinks to/from attribute groups
  ACPI / PM: Expose current status of ACPI power resources
  ACPI / PM: Expose power states of ACPI devices to user space
  ACPI / scan: Prevent device add uevents from racing with user space
  ACPI / PM: Fix device power state value after transitions to D3cold
  ACPI / PM: Use string "D3cold" to represent ACPI_STATE_D3_COLD
  ACPI / PM: Sanitize checks in acpi_power_on_resources()
  ACPI / PM: Always evaluate _PSn after setting power resources
  ACPI / PM: Introduce helper for executing _PSn methods
  ACPI / PM: Make acpi_bus_init_power() more robust
  ACPI / PM: Fix build for unusual combination of Kconfig options
  ACPI / PM: remove leading whitespace from #ifdef
  ACPI / PM: Consolidate suspend-specific and hibernate-specific code
  ACPI / PM: Move device power management functions to device_pm.c
  ...
2013-02-11 13:20:56 +01:00
M. Mohan Kumar b6f4bee02f fs/9p: Fix atomic_open
Return EEXISTS if requested file already exists, without this patch open
call will always succeed even if the file exists and user specified
O_CREAT|O_EXCL.

Following test code can be used to verify this patch. Without this patch
executing following test code on 9p mount will result in printing 'test case
failed' always.

main()
{
        int fd;

        /* first create the file */
        fd = open("./file", O_CREAT|O_WRONLY);
        if (fd < 0) {
                perror("open");
                return -1;
        }
        close(fd);

        /* Now opening same file with O_CREAT|O_EXCL should fail */
        fd = open("./file", O_CREAT|O_EXCL);
        if (fd < 0 && errno == EEXIST)
	        printf("test case pass\n");
        else
	        printf("test case failed\n");
        close(fd);
        return 0;
}

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-02-10 16:29:59 -06:00
Aneesh Kumar K.V 03f0e02273 fs/9p: Don't use O_TRUNC flag in TOPEN and TLOPEN request
We do the truncate via setattr request, hence don't pass the O_TRUNC flag in
open request. Without this patch we end up sending zero sized write request
to server when we try to truncate. Some servers (VirtFS) were not handling that
properly.

Reported-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-02-10 16:29:47 -06:00
Al Viro 7ffdea7ea3 locking in fs/9p ->readdir()
... is really excessive.  First of all, ->readdir() is serialized by
file->f_path.dentry->d_inode->i_mutex; playing with file->f_path.dentry->d_lock
is not buying you anything.  Moreover, rdir->mutex is pointless for exactly
the same reason - you'll never see contention on it.

	While we are at it, there's no point in having rdir->buf a pointer -
you have it point just past the end of rdir, so it might as well be a flex
array (and no, it's not a gccism).

	Absolutely untested patch follows:

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-02-10 16:29:33 -06:00
Linus Torvalds 8d19514fad Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've got corner cases for updating i_size that ceph was hitting,
  error handling for quotas when we run out of space, a very subtle
  snapshot deletion race, a crash while removing devices, and one
  deadlock between subvolume creation and the sb_internal code (thanks
  lockdep)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: move d_instantiate outside the transaction during mksubvol
  Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
  Btrfs: fix possible stale data exposure
  Btrfs: fix missing i_size update
  Btrfs: fix race between snapshot deletion and getting inode
  Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
  Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
  Btrfs: do not merge logged extents if we've removed them from the tree
  btrfs: don't try to notify udev about missing devices
2013-02-08 12:06:46 +11:00
Clark Williams 8bd75c77b7 sched/rt: Move rt specific bits into new header file
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:51:08 +01:00
Alex Elder 311f08acde xfs: memory barrier before wake_up_bit()
In xfs_ifunlock() there is a call to wake_up_bit() after clearing
the flush lock on the xfs inode.  This is not guaranteed to be safe,
as noted in the comments above wake_up_bit() beginning with:

    In order for this to function properly, as it uses
    waitqueue_active() internally, some kind of memory
    barrier must be done prior to calling this.


Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-07 09:39:48 -06:00
Eric Wong 634734b63a fuse: allow control of adaptive readdirplus use
For some filesystems (e.g. GlusterFS), the cost of performing a
normal readdir and readdirplus are identical.  Since adaptively
using readdirplus has no benefit for those systems, give
users/filesystems the option to control adaptive readdirplus use.

v2 of this patch incorporates Miklos's suggestion to simplify the code,
as well as improving consistency of macro names and documentation.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-02-07 14:25:44 +01:00
Chris Mason 1a65e24b0b Btrfs: move d_instantiate outside the transaction during mksubvol
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.

btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.

This commit moves the d_instantiate after the transaction closes.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-06 12:11:10 -05:00
Jan Schmidt eb6b88d92c Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.

Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.

Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-06 09:24:40 -05:00
Chris Mason 24f8ebe918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus 2013-02-05 19:24:44 -05:00
Josef Bacik 59fe4f4197 Btrfs: fix possible stale data exposure
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data.  The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size.  Fix this by
checking if the extent ends past the disk_i_size.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:16 -05:00
Josef Bacik 5d1f40202b Btrfs: fix missing i_size update
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data.  The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update.  So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:14 -05:00
Liu Bo 6f1c36055f Btrfs: fix race between snapshot deletion and getting inode
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().

And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.

Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.

(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0).  So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.

Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.

So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:13 -05:00
Miao Xie 843fcf3573 Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:11 -05:00
Miao Xie 0a3404dcff Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:10 -05:00
Josef Bacik 222c81dc38 Btrfs: do not merge logged extents if we've removed them from the tree
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it.  The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:03 -05:00