Commit Graph

44996 Commits

Author SHA1 Message Date
Michal Hocko 8a5c743e30 mm, memcg: use consistent gfp flags during readahead
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov d86133bd39 pipe: account to kmemcg
Pipes can consume a significant amount of system memory, hence they
should be accounted to kmemcg.

This patch marks pipe_inode_info and anonymous pipe buffer page
allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
Note, since a pipe buffer page can be "stolen" and get reused for other
purposes, including mapping to userspace, we clear PageKmemcg thus
resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
is introduced by this patch.

A note regarding anon_pipe_buf_steal implementation.  We allow to steal
the page if its ref count equals 1.  It looks racy, but it is correct
for anonymous pipe buffer pages, because:

 - We lock out all other pipe users, because ->steal is called with
   pipe_lock held, so the page can't be spliced to another pipe from
   under us.

 - The page is not on LRU and it never was.

 - Thus a parallel thread can access it only by PFN. Although this is
   quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
   this is not dangerous, because all such functions do is increase page
   ref count, check if the page is the one they are looking for, and
   decrease ref count if it isn't. Since our page is clean except for
   PageKmemcg mark, which doesn't conflict with other _mapcount users,
   the worst that can happen is we see page_count > 2 due to a transient
   ref, in which case we false-positively abort ->steal, which is still
   fine, because ->steal is not guaranteed to succeed.

Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Brian Foster 9a46b04f16 fs/fs-writeback.c: inode writeback list tracking tracepoints
The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing.  In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.

Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists.  Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.

Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
cc: Josef Bacik <jbacik@fb.com>
Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Dave Chinner 6c60d2b574 fs/fs-writeback.c: add a new writeback list for sync
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync.  This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for.  We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback.  wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback.  Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
piaojun 7d65b27448 ocfs2/cluster: clean up unnecessary assignment for 'ret'
Clean up unnecessary assignment for 'ret'.

Link: http://lkml.kernel.org/r/578C61F6.4080403@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi e81f1c5c4a ocfs2: remove obscure BUG_ON in dlmglue
These BUG_ON(!inode) are obscure because we have already used inode to
get osb.  And actually we can guarantee here inode is valid in the
context.  So we can safely remove them.

Link: http://lkml.kernel.org/r/5776336A.6030104@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi 698d44b43a ocfs2: cleanup implemented prototypes
Several prototypes in inode.h are just defined but not actually
implemented and used, so remove them.

Link: http://lkml.kernel.org/r/57763787.4020706@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi 8ec7b17a66 ocfs2/dlm: fix memory leak of dlm_debug_ctxt
dlm_debug_ctxt->debug_refcnt is initialized to 1 and then increased to 2
by dlm_debug_get in dlm_debug_init.  But dlm_debug_put is called only
once in dlm_debug_shutdown during unregister dlm, which leads to
dlm_debug_ctxt leaked.

Link: http://lkml.kernel.org/r/577BB755.4030900@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joseph Qi a8f24f1b3f ocfs2: cleanup unneeded goto in ocfs2_create_new_inode_locks
The last goto is unneeded, so remove it.

Link: http://lkml.kernel.org/r/576213D3.6080002@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Junxiao Bi 0b492f68bb ocfs2: improve recovery performance
Journal replay will be run when performing recovery for a dead node.  To
avoid the stale cache impact, all blocks of dead node's journal inode
were reloaded from disk.  This hurts the performance.  Check whether one
block is cached before reloading it can improve performance a lot.  In
my test env, the time doing recovery was improved from 120s to 1s.

[akpm@linux-foundation.org: clean up the for loop p_blkno handling]
Link: http://lkml.kernel.org/r/1466155682-24656-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: "Gang He" <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Eric Ren 191df2b513 ocfs2: fix a redundant re-initialization
Obviously, memset() has zeroed the whole struct locking_max_version.
So, it's no need to zero its two fields individually.

Link: http://lkml.kernel.org/r/1463970605-18354-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ross Zwisler 6b524995a7 dax: remote unused fault wrappers
Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b09 ("ext2: Add locking for DAX faults")
   ea3d7209ca ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698edeee ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds 55392c4c06 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "This update provides the following changes:

   - The rework of the timer wheel which addresses the shortcomings of
     the current wheel (cascading, slow search for next expiring timer,
     etc).  That's the first major change of the wheel in almost 20
     years since Finn implemted it.

   - A large overhaul of the clocksource drivers init functions to
     consolidate the Device Tree initialization

   - Some more Y2038 updates

   - A capability fix for timerfd

   - Yet another clock chip driver

   - The usual pile of updates, comment improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
  tick/nohz: Optimize nohz idle enter
  clockevents: Make clockevents_subsys static
  clocksource/drivers/time-armada-370-xp: Fix return value check
  timers: Implement optimization for same expiry time in mod_timer()
  timers: Split out index calculation
  timers: Only wake softirq if necessary
  timers: Forward the wheel clock whenever possible
  timers/nohz: Remove pointless tick_nohz_kick_tick() function
  timers: Optimize collect_expired_timers() for NOHZ
  timers: Move __run_timers() function
  timers: Remove set_timer_slack() leftovers
  timers: Switch to a non-cascading wheel
  timers: Reduce the CPU index space to 256k
  timers: Give a few structs and members proper names
  hlist: Add hlist_is_singular_node() helper
  signals: Use hrtimer for sigtimedwait()
  timers: Remove the deprecated mod_timer_pinned() API
  timers, net/ipv4/inet: Initialize connection request timers as pinned
  timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
  timers, drivers/tty/metag_da: Initialize the poll timer as pinned
  ...
2016-07-25 20:43:12 -07:00
Linus Torvalds 0f657262d5 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Various x86 low level modifications:

   - preparatory work to support virtually mapped kernel stacks (Andy
     Lutomirski)

   - support for 64-bit __get_user() on 32-bit kernels (Benjamin
     LaHaise)

   - (involved) workaround for Knights Landing CPU erratum (Dave Hansen)

   - MPX enhancements (Dave Hansen)

   - mremap() extension to allow remapping of the special VDSO vma, for
     purposes of user level context save/restore (Dmitry Safonov)

   - hweight and entry code cleanups (Borislav Petkov)

   - bitops code generation optimizations and cleanups with modern GCC
     (H. Peter Anvin)

   - syscall entry code optimizations (Paolo Bonzini)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  x86/mm/cpa: Add missing comment in populate_pdg()
  x86/mm/cpa: Fix populate_pgd(): Stop trying to deallocate failed PUDs
  x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
  x86/smp: Remove unnecessary initialization of thread_info::cpu
  x86/smp: Remove stack_smp_processor_id()
  x86/uaccess: Move thread_info::addr_limit to thread_struct
  x86/dumpstack: Rename thread_struct::sig_on_uaccess_error to sig_on_uaccess_err
  x86/uaccess: Move thread_info::uaccess_err and thread_info::sig_on_uaccess_err to thread_struct
  x86/dumpstack: When OOPSing, rewind the stack before do_exit()
  x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm
  x86/dumpstack/64: Handle faults when printing the "Stack: " part of an OOPS
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  x86/mm/cpa: In populate_pgd(), don't set the PGD entry until it's populated
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm: Use pte_none() to test for empty PTE
  x86/mm: Disallow running with 32-bit PTEs to work around erratum
  x86/mm: Ignore A/D bits in pte/pmd/pud_none()
  x86/mm: Move swap offset/type up in PTE to work around erratum
  x86/entry: Inline enter_from_user_mode()
  ...
2016-07-25 15:34:18 -07:00
Linus Torvalds 9d0be76f52 Char/Misc driver patches for 4.8-rc1
Here is the big char/misc driver update for 4.8-rc1.
 
 Not a lot of stuff, but it's all over the place, full details are in the
 shortlog below.  All of these have been in linux-next with no reported
 issues for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iFYEABECABYFAleVPBsPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspEQgAoJOX
 nSWKA7j4JMGy1v+uNIqsgUmUAJsFyS388N+Faa2K4uyp7CYQ6jaAZw==
 =0Ofd
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver update for 4.8-rc1.

  Not a lot of stuff, but it's all over the place, full details are in
  the shortlog.  All of these have been in linux-next with no reported
  issues for a while"

* tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (49 commits)
  lkdtm: silence warnings about function declarations
  lkdtm: hide unused functions
  intel_th: pci: Add Kaby Lake PCH-H support
  intel_th: Fix a deadlock in modprobing
  dsp56k: prevent a harmless underflow
  chardev: add missing line break in pr_warn
  lkdtm: use struct arrays instead of enums
  lkdtm: move jprobe entry points to start of source
  lkdtm: reorganize module paramaters
  lkdtm: rename globals for clarity
  lkdtm: rename "count" to "crash_count"
  lkdtm: remove intentional off-by-one array access
  lkdtm: split remaining logic bug tests to separate file
  lkdtm: split heap corruption tests to separate file
  lkdtm: split memory permissions tests to separate file
  lkdtm: split usercopy tests to separate file
  lkdtm: drop "alloc_size" parameter
  lkdtm: add usercopy test for blocking kernel text
  extcon: adc-jack: add suspend/resume support
  extcon: add missing of_node_put after calling of_parse_phandle
  ...
2016-07-24 16:26:26 -07:00
Linus Torvalds b403f23044 We've got ten patches this time, half of which are related to a plethora
of nasty outcomes when inodes are transitioned from the unlinked state
 to the free state. Small file systems are particularly vulnerable to these
 problems, and it can manifest as mainly hangs, but also file system
 corruption. The patches have been tested for literally many weeks, with a
 very gruelling test, so I have a high level of confidence.
 
 - Andreas Gruenbacher wrote a series of 5 patches for various lockups
   during the transition of inodes from unlinked to free. The main patch
   is titled "Fix gfs2_lookup_by_inum lock inversion" and the other 4 are
   support and cleanup patches related to that.
 - Ben Marzinski contributed 2 patches with regard to a recreatable
   problem when gfs2 tries to write a page to a file that is being
   truncated, resulting in a BUG() in gfs2_remove_from_journal.
   Note that Ben had to export vfs function __block_write_full_page to get
   this to work properly. It's been posted a long time and he talked to
   various VFS people about it, and nobody seemed to mind.
 - I contributed 3 patches. (1) The first one fixes a memory corruptor:
   a race in which one process can overwrite the gl_object pointer set by
   another process, causing kernel panic and other symptoms. (2) The second
   patch fixes another race that resulted in a false-positive BUG_ON. This
   occurred when resource group reservations were freed by one process
   while another process was trying to grab a new reservation in the same
   resource group. (3) The third patch fixes a problem with doing journal
   replay when the journals are not all the same size.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXklXIAAoJENeLYdPf93o7AbIIAImLEixK+4CaItEArAKG9TXv
 WbO+eDJfo6AOtAteB6+MdX2UxXAHJsCY6RmiEIAi5LzlVFiiCgRo4z/QgDARAw3c
 2RxlndElaESh82S27sLiFbgZeY7GZv04C0t6AzMkc830BLXiKMs6bXfeq1fzW8Sf
 AgAneACVsX0faRWo/XDuQcK81dwZ+qdOnR2+FvtOSFl1KgV0BrtnsW7IHv+5MIot
 SREDN7VvSQwQrLgwMlC0PvhwK3KCVvuO9ZziLEPpYJONESJfEmuCpG265+tUJNTw
 dIcW3p/vvgow8fb56fSnAxaeplPLlF9qJCq1M9fWZrKVbHg2uyCZMx4P52Fnmz4=
 =uUVs
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got ten patches this time, half of which are related to a
  plethora of nasty outcomes when inodes are transitioned from the
  unlinked state to the free state.  Small file systems are particularly
  vulnerable to these problems, and it can manifest as mainly hangs, but
  also file system corruption.  The patches have been tested for
  literally many weeks, with a very gruelling test, so I have a high
  level of confidence.

   - Andreas Gruenbacher wrote a series of five patches for various
     lockups during the transition of inodes from unlinked to free.

     The main patch is titled "Fix gfs2_lookup_by_inum lock inversion"
     and the other four are support and cleanup patches related to that.

   - Ben Marzinski contributed two patches with regard to a recreatable
     problem when gfs2 tries to write a page to a file that is being
     truncated, resulting in a BUG() in gfs2_remove_from_journal.

     Note that Ben had to export vfs function __block_write_full_page to
     get this to work properly.  It's been posted a long time and he
     talked to various VFS people about it, and nobody seemed to mind.

   - I contributed 3 patches:
       o The first one fixes a memory corruptor: a race in which one
         process can overwrite the gl_object pointer set by another
         process, causing kernel panic and other symptoms.
       o The second patch fixes another race that resulted in a
         false-positive BUG_ON.  This occurred when resource group
         reservations were freed by one process while another process
         was trying to grab a new reservation in the same resource
         group.
       o The third patch fixes a problem with doing journal replay when
         the journals are not all the same size"

* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
  GFS2: Check rs_free with rd_rsspin protection
  gfs2: writeout truncated pages
  fs: export __block_write_full_page
  gfs2: Lock holder cleanup
  gfs2: Large-filesystem fix for 32-bit systems
  gfs2: Get rid of gfs2_ilookup
  gfs2: Fix gfs2_lookup_by_inum lock inversion
  gfs2: Initialize iopen glock holder for new inodes
  GFS2: don't set rgrp gl_object until it's inserted into rgrp tree
2016-07-24 16:07:52 -07:00
Linus Torvalds 88083e9845 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This contains a fix for a potential crash/corruption issue and another
  where the suid/sgid bits weren't cleared on write"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: verify upper dentry in ovl_remove_and_whiteout()
  ovl: Copy up underlying inode's ->i_mode to overlay inode
  ovl: handle ATTR_KILL*
2016-07-23 14:25:02 +09:00
Maxim Patlasov cfc9fde0b0 ovl: verify upper dentry in ovl_remove_and_whiteout()
The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f3710417 ("ovl: verify upper dentry before unlink and rename").

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-07-22 10:54:20 +02:00
Bob Peterson e1cb6be9e1 GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
Before this patch, if you used gfs2_jadd to add new journals of a
size smaller than the existing journals, replaying those new journals
would withdraw. That's because function gfs2_replay_incr_blk was
using the number of journal blocks (jd_block) from the superblock's
journal pointer. In other words, "My journal's max size" rather than
"the journal we're replaying's size." This patch changes the function
to use the size of the pertinent journal rather than always using the
journal we happen to be using.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-07-21 13:02:44 -05:00
Jann Horn 3e0a396546 xfs: fix type confusion in xfs_ioc_swapext
Without this check, the following XFS_I invocations would return bad
pointers when used on non-XFS inodes (perhaps pointers into preceding
allocator chunks).

This could be used by an attacker to trick xfs_swap_extents into
performing locking operations on attacker-chosen structures in kernel
memory, potentially leading to code execution in the kernel.  (I have
not investigated how likely this is to be usable for an attack in
practice.)

Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-16 06:30:06 +09:00
H.J. Lu 3ebfd81f7f x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
Don't use the same syscall numbers for 2 different syscalls:

 534	x32	preadv			compat_sys_preadv64
 535	x32	pwritev			compat_sys_pwritev64
 534	x32	preadv2			compat_sys_preadv2
 535	x32	pwritev2		compat_sys_pwritev2

Add compat_sys_preadv64v2() and compat_sys_pwritev64v2() so that 64-bit offset
is passed in one 64-bit register on x32, similar to compat_sys_preadv64()
and compat_sys_pwritev64().

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAMe9rOovCMf-RQfx_n1U_Tu_DX1BYkjtFr%3DQ4-_PFVSj9BCzUA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:30:26 +02:00
Fengguang Wu 077e2642fb chardev: add missing line break in pr_warn
To fix super long dmesg error lines like

  CHRDEV "dummy_stm.0" major number 224 goes below the dynamic allocation rangeCHRDEV "dummy_stm.1" major number 223 goes below the dynamic allocation rangeswapper: page allocation failure: order:8, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

After fix, it should look like

  CHRDEV "dummy_stm.0" major number 224 goes below the dynamic allocation range
  CHRDEV "dummy_stm.1" major number 223 goes below the dynamic allocation range
  swapper: page allocation failure: order:8, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

Reported-by: Philip Li <philip.li@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-14 16:21:53 +09:00
Bob Peterson 44f52122a2 GFS2: Check rs_free with rd_rsspin protection
For the last process to close a file opened for write, function
gfs2_rsqa_delete was deleting the file's inode's block reservation
out of the rgrp reservations tree. Then it was checking to make sure
rs_free was 0, but it was performing the check outside the protection
of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
to prevent a race between the process freeing the reservation and
another who is allocating a new set of blocks inside the same rgrp
for the same inode, thus changing its value.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-07-12 11:48:22 -05:00
Linus Torvalds 08d27eb206 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  posix_acl: de-union a_refcount and a_rcu
  nfs_atomic_open(): prevent parallel nfs_lookup() on a negative hashed
  Use the right predicate in ->atomic_open() instances
2016-07-12 16:49:01 +09:00
Linus Torvalds b987c759d2 eCryptfs fixes for 4.7-rc7:
- Provide a more concise fix for CVE-2016-1583
   + Additionally fixes linux-stable regressions caused by the cherry-picking of
     the original fix
 - Some very minor changes that have queued up
   + Fix typos in code comments
   + Remove unnecessary check for NULL before destroying kmem_cache
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJXf8nnAAoJENaSAD2qAscKwXgP/0awhY1z40dL/igP6fPv2ack
 HbqrOjUVO2DzxinvKB3vRLNy93zwESxe8UpwPsl84IJ85zOQjkUkJ8PYk1oyBf0N
 dVWqO11g6AKNZ+VQFspconvMhZATwSrsv8z3BzvwNGLsPhPuUQ+JmbBe8xMdrsZ5
 qVaWswsMtMlhM3p/zFh57vWO64fT1xiabpxSkKpG2LHJN6h6QAQxkfBfa2FuXCsN
 hZIw+ULcUJfdawXGq8lAfcYzbDmFpNt70fFquJgfJHrXFrOuensYfLcWUvhrSNbc
 HZ6imRK9LCG4IKjJTBNmCmBR8ho71yGzdKuup81Eap+2zx2kC7twokS1d5fha8iL
 Kzkx0NMDriY2N+tIfufHYk2IIenFzWG6Yuj0STswtJX4YhQGBc0H6VxcgrxE0PgW
 k1iKUV7jnJGxxN+d6lmV4+fX0vKGgBMsQq1Q76CkYLN1BAvdwz6GnWSfqP8hWz3o
 sNVyNtYh+/TXY8JMWKDBlps7Ib8W88qDW3K7YcAf2VPYAqIWm5Va1MR5m5s+UIeR
 QiCD32X/0PfDp13QRiKAHJ6C9CInyu0r+fF/g8ZMqLuWgLxoahxpr6ML/CnHoGl5
 IcDydJO3/bLBq9If8WxYsOQvVKCa4e7N7o7ZHPKd8U7O39mCGNfbQx7/FlMjtvf6
 +4HAxamUC1ogpLTkpWxI
 =Bt4P
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Provide a more concise fix for CVE-2016-1583:
   - Additionally fixes linux-stable regressions caused by the
     cherry-picking of the original fix

  Some very minor changes that have queued up:
   - Fix typos in code comments
   - Remove unnecessary check for NULL before destroying kmem_cache"

* tag 'ecryptfs-4.7-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: don't allow mmap when the lower fs doesn't support it
  Revert "ecryptfs: forbid opening files without mmap handler"
  ecryptfs: fix spelling mistakes
  eCryptfs: fix typos in comment
  ecryptfs: drop null test before destroy functions
2016-07-08 09:48:28 -07:00
Jeff Mahoney f0fe970df3 ecryptfs: don't allow mmap when the lower fs doesn't support it
There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs.  We shouldn't emulate mmap support on file systems
that don't offer support natively.

CVE-2016-1583

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
[tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2016-07-08 10:35:28 -05:00
Jeff Mahoney 78c4e17241 Revert "ecryptfs: forbid opening files without mmap handler"
This reverts commit 2f36db7100.

It fixed a local root exploit but also introduced a dependency on
the lower file system implementing an mmap operation just to open a file,
which is a bit of a heavy hammer.  The right fix is to have mmap depend
on the existence of the mmap handler instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2016-07-07 18:47:57 -05:00
Linus Torvalds ac904ae6e6 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "Three small fixes that have been queued up and tested for this series:

   - A bug fix for xen-blkfront from Bob Liu, fixing an issue with
     incomplete requests during migration.

   - A fix for an ancient issue in retrieving the IO priority of a
     different PID than self, preventing that task from going away while
     we access it.  From Omar.

   - A writeback fix from Tahsin, fixing a case where we'd call ihold()
     with a zero ref count inode"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix use-after-free in sys_ioprio_get()
  writeback: inode cgroup wb switch should not call ihold()
  xen-blkfront: save uncompleted reqs in blkfront_resume()
2016-07-07 15:34:09 -07:00
Linus Torvalds 4c2a8499a4 Configfs fixes for Linux 4.7:
- a fix from Marek for ppos handling in configfs_write_bin_file,
    which was introduced in Linux 4.5, but didn't have any users
    until recently.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXfjK1AAoJEA+eU2VSBFGDJ0kQAMZHONK7pHwM+IxDUeTsDTfa
 FX+EplF1rLEtmUGOl01XbjgQp7acsP19YWikQfC09+ZjF6Vn1zFAFlNoU3qM+i/2
 zdukzIRBaSM+4w4HFDQ548zGGc8e9mesIIUHrHt6n/nL0OLKTU0XzbmRMXmvXAUJ
 u0nuB0OYlFJEVQFIlDYfG6E2rJy37FPilToonfw+AryVDenRm9iiUt0iMFoA8wPG
 EpogUinelxrKZ+ysOEeibaTGxxLLbd3AbWeUQbkhmsk4FfxuV7GFSfGPbhmJ1LeU
 n5X3LbK8lixG8goGdAW1NYulLnTjprZ6emUL0jwxYmI+MzOP7DUcsLqsGmdh5LEa
 Uw5gzQnzNDOUFR8XFt5CgARUHRXeDyJfzKeWvKFd4JUink6wE9R0yQJ2bPBXu3pY
 7t0P4qQSKWdeGjmYg54/JBamMba7BLQOLZuuiAplTTAt5Dg4tEi9Zuje2sUmBcDn
 3YnG4dnGxPeP9EKCElh1WWtwiRItUKT+YtaqSinL1Rh2j1aWu9WJQ3M1C+s3hFQ3
 vGR/CchllLtP4xmpY9TXEpUbBp4ZnTercWAxLRczi1/kOm3SdMIMUak26U6pTBmN
 QfkEzSx+K5F1wGaoqa2j7MuTMSNsxfT1R0xA6aBid4oOyCoPyHOVJ3mqV0AziyOl
 c7B92HgR/aGFMEevVA1G
 =mZ2H
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.7' of git://git.infradead.org/users/hch/configfs

Pull configfs fix from Christoph Hellwig:
 "A fix from Marek for ppos handling in configfs_write_bin_file, which
  was introduced in Linux 4.5, but didn't have any users until recently"

* tag 'configfs-for-4.7' of git://git.infradead.org/users/hch/configfs:
  configfs: Remove ppos increment in configfs_write_bin_file
2016-07-07 15:32:17 -07:00
Ingo Molnar 4b4b20852d Merge branch 'timers/fast-wheel' into timers/core 2016-07-07 10:35:28 +02:00
Al Viro c94c09535c nfs_atomic_open(): prevent parallel nfs_lookup() on a negative hashed
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-05 16:02:31 -04:00
Al Viro 00699ad857 Use the right predicate in ->atomic_open() instances
->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache.  Use d_in_lookup() to tell one from another, rather
than d_unhashed().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-05 16:02:23 -04:00
Vivek Goyal 07a2daab49 ovl: Copy up underlying inode's ->i_mode to overlay inode
Right now when a new overlay inode is created, we initialize overlay
inode's ->i_mode from underlying inode ->i_mode but we retain only
file type bits (S_IFMT) and discard permission bits.

This patch changes it and retains permission bits too. This should allow
overlay to do permission checks on overlay inode itself in task context.

[SzM] It also fixes clearing suid/sgid bits on write.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
2016-07-04 16:49:48 +02:00
Miklos Szeredi b99c2d9138 ovl: handle ATTR_KILL*
Before 4bacc9c923 ("overlayfs: Make f_path...") file->f_path pointed to
the underlying file, hence suid/sgid removal on write worked fine.

After that patch file->f_path pointed to the overlay file, and the file
mode bits weren't copied to overlay_inode->i_mode.  So the suid/sgid
removal simply stopped working.

The fix is to copy the mode bits, but then ovl_setattr() needs to clear
ATTR_MODE to avoid the BUG() in notify_change().  So do this first, then in
the next patch copy the mode.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
2016-07-04 16:49:48 +02:00
Linus Torvalds 0b295dd5b8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
 "This makes sure userspace filesystems are not broken by the parallel
  lookups and readdir feature"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: serialize dirops by default
2016-07-03 12:02:00 -07:00
Linus Torvalds 236bfd8ed8 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This contains fixes for a dentry leak, a regression in 4.6 noticed by
  Docker users and missing write access checking in truncate"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: warn instead of error if d_type is not supported
  ovl: get_write_access() in truncate
  ovl: fix dentry leak for default_permissions
2016-07-03 11:57:09 -07:00
Vivek Goyal e7c0b5991d ovl: warn instead of error if d_type is not supported
overlay needs underlying fs to support d_type. Recently I put in a
patch in to detect this condition and started failing mount if
underlying fs did not support d_type.

But this breaks existing configurations over kernel upgrade. Those who
are running docker (partially broken configuration) with xfs not
supporting d_type, are surprised that after kernel upgrade docker does
not run anymore.

https://github.com/docker/docker/issues/22937#issuecomment-229881315

So instead of erroring out, detect broken configuration and warn
about it. This should allow existing docker setups to continue
working after kernel upgrade.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 45aebeaf4f ("ovl: Ensure upper filesystem supports d_type")
Cc: <stable@vger.kernel.org> 4.6
2016-07-03 09:39:31 +02:00
Linus Torvalds 48c4565ed6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Tmpfs readdir throughput regression fix (this cycle) + some -stable
  fodder all over the place.

  One missing bit is Miklos' tonight locks.c fix - NFS folks had already
  grabbed that one by the time I woke up ;-)"

[ The locks.c fix came through the nfsd tree just moments ago ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namespace: update event counter when umounting a deleted dentry
  9p: use file_dentry()
  ceph: fix d_obtain_alias() misuses
  lockless next_positive()
  libfs.c: new helper - next_positive()
  dcache_{readdir,dir_lseek}(): don't bother with nested ->d_lock
2016-07-01 15:20:11 -07:00
Linus Torvalds 2728c57fda One fix for lockd soft lookups in an error path, and one fix for file
leases on overlayfs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXdr1UAAoJECebzXlCjuG+QlsQAJiZTmio6k9tupN5+iKsZNL3
 m919ooj8GYsXxlC0OTFfi09dUi1yeF8MEE3j9egk/+qxEtsdmKdEOIy0RVdcLSfd
 HeGjXgLh79hVcGxgyBP+pdax2XhZ3RVisg8F5gTw2GPo+FPFZfrEuO5h7ctn+t45
 MCQ+4yqYqzEhYnoPyo5XKh5Aj6wBWiaTzg3/jSe6uSuSfuBfyaMaBPq7l7ayGra/
 5El+tu61o/SrJ41N2EayWSj/bOFJE92LIuGOh8NdfANuuP70JhxlwgVSldah3CCQ
 6PymXAVcjw0+gJ00mokzKfTJW5FPfasxckMHaOvcFsSONy4rlmrwwqUr9C2AFzTE
 wQGIzibCDYOSI8uF+//Oe+dh8JWp2TF8rfJcmKLyJMIcCq/Xl6cNx1qrq/oSWvuk
 WKOv1otJrPeT31h/s5iLjr/E/Po1eX+d2ySJdvUHYu/5aZwFgWPnVXwJk0s9bLow
 auZU85tPnuz+tbS2pWEK+el7LMgDBdzraVRogMdH1c+m3+G9pr53EzmpYovkZ2X8
 duVJ2Leslyya347TnJAgEY47Fbeu26JaoeIChGVhKcEyCENlcqAJWaGVECrxvs3y
 p/Y2MYMkO8YrCz5wQXPiLFiG4rAc+jSIn1Q+vRGl2Pkel0y7AgJNNMFtANjMCSIO
 pg6BqUjOyKt8cXVy4UW7
 =K0mr
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.7-3' of git://linux-nfs.org/~bfields/linux

Pull lockd/locks fixes from Bruce Fields:
 "One fix for lockd soft lookups in an error path, and one fix for file
  leases on overlayfs"

* tag 'nfsd-4.7-3' of git://linux-nfs.org/~bfields/linux:
  locks: use file_inode()
  lockd: unregister notifier blocks if the service fails to come up completely
2016-07-01 15:18:49 -07:00
Linus Torvalds f3683ccd12 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "1/ Two regression fixes since v4.6: one for the byte order of a sysfs
     attribute (bz121161) and another for QEMU 2.6's NVDIMM _DSM (ACPI
     Device Specific Method) implementation that gets tripped up by new
     auto-probing behavior in the NFIT driver.

  2/ A fix tagged for -stable that stops the kernel from
     clobbering/ignoring changes to the configuration of a 'pfn'
     instance ("struct page" driver).  For example changing the
     alignment from 2M to 1G may silently revert to 2M if that value is
     currently stored on media.

  3/ A fix from Eric for an xfstests failure in dax.  It is not
     currently tagged for -stable since it requires an 8-exabyte file
     system to trigger, and there appear to be no user visible side
     effects"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  nfit: fix format interface code byte order
  dax: fix offset overflow in dax_io
  acpi, nfit: fix acpi_check_dsm() vs zero functions implemented
  libnvdimm, pfn, dax: fix initialization vs autodetect for mode + alignment
2016-07-01 15:15:03 -07:00
Miklos Szeredi 6343a21208 locks: use file_inode()
(Another one for the f_path debacle.)

ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.

The reason is that generic_add_lease() used filp->f_path.dentry->inode
while all the others use file_inode().  This makes a difference for files
opened on overlayfs since the former will point to the overlay inode the
latter to the underlying inode.

So generic_add_lease() added the lease to the overlay inode and
generic_delete_lease() removed it from the underlying inode.  When the file
was released the lease remained on the overlay inode's lock list, resulting
in use after free.

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-07-01 10:24:18 -04:00
Andrey Ulanov e06b933e6d namespace: update event counter when umounting a deleted dentry
- m_start() in fs/namespace.c expects that ns->event is incremented each
  time a mount added or removed from ns->list.
- umount_tree() removes items from the list but does not increment event
  counter, expecting that it's done before the function is called.
- There are some codepaths that call umount_tree() without updating
  "event" counter. e.g. from __detach_mounts().
- When this happens m_start may reuse a cached mount structure that no
  longer belongs to ns->list (i.e. use after free which usually leads
  to infinite loop).

This change fixes the above problem by incrementing global event counter
before invoking umount_tree().

Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
Cc: stable@vger.kernel.org
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-30 23:28:30 -04:00
Miklos Szeredi b403f0e37a 9p: use file_dentry()
v9fs may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash.  In this case it's a NULL pointer dereference in
p9_fid_create().

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.

Reported-by: Alessio Igor Bogani <alessioigorbogani@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Alessio Igor Bogani <alessioigorbogani@gmail.com>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-30 23:28:09 -04:00
Scott Mayhew cb7d224f82 lockd: unregister notifier blocks if the service fails to come up completely
If the lockd service fails to start up then we need to be sure that the
notifier blocks are not registered, otherwise a subsequent start of the
service could cause the same notifier to be registered twice, leading to
soft lockups.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 0751ddf77b "lockd: Register callbacks on the inetaddr_chain..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-06-30 16:35:07 -04:00
Tahsin Erdogan 7452495555 writeback: inode cgroup wb switch should not call ihold()
Asynchronous wb switching of inodes takes an additional ref count on an
inode to make sure inode remains valid until switchover is completed.

However, anyone calling ihold() must already have a ref count on inode,
but in this case inode->i_count may already be zero:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Workqueue: writeback wb_workfn (flush-8:16)
 0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
 0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
 ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
Call Trace:
 [<ffffffff805990af>] dump_stack+0x4d/0x6e
 [<ffffffff80268702>] __warn+0xc2/0xe0
 [<ffffffff802687d8>] warn_slowpath_null+0x18/0x20
 [<ffffffff8035b4ab>] ihold+0x2b/0x30
 [<ffffffff80367ecc>] inode_switch_wbs+0x11c/0x180
 [<ffffffff80369110>] wbc_detach_inode+0x170/0x1a0
 [<ffffffff80369abc>] writeback_sb_inodes+0x21c/0x530
 [<ffffffff80369f7e>] wb_writeback+0xee/0x1e0
 [<ffffffff8036a147>] wb_workfn+0xd7/0x280
 [<ffffffff80287531>] ? try_to_wake_up+0x1b1/0x2b0
 [<ffffffff8027bb09>] process_one_work+0x129/0x300
 [<ffffffff8027be06>] worker_thread+0x126/0x480
 [<ffffffff8098cde7>] ? __schedule+0x1c7/0x561
 [<ffffffff8027bce0>] ? process_one_work+0x300/0x300
 [<ffffffff80280ff4>] kthread+0xc4/0xe0
 [<ffffffff80335578>] ? kfree+0xc8/0x100
 [<ffffffff809903cf>] ret_from_fork+0x1f/0x40
 [<ffffffff80280f30>] ? __kthread_parkme+0x70/0x70
---[ end trace aaefd2fd9f306bc4 ]---

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-30 13:58:41 -06:00
Miklos Szeredi 5c672ab3f0 fuse: serialize dirops by default
Negotiate with userspace filesystems whether they support parallel readdir
and lookup.  Disable parallelism by default for fear of breaking fuse
filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 9902af79c0 ("parallel lookups: actual switch to rwsem")
Fixes: d9b3dbdcfd ("fuse: switch to ->iterate_shared()")
2016-06-30 13:10:49 +02:00
Marek Vasut f8608985f8 configfs: Remove ppos increment in configfs_write_bin_file
The simple_write_to_buffer() already increments the @ppos on success,
see fs/libfs.c simple_write_to_buffer() comment:

"
On success, the number of bytes written is returned and the offset @ppos
advanced by this number, or negative value is returned on error.
"

If the configfs_write_bin_file() is invoked with @count smaller than the
total length of the written binary file, it will be invoked multiple times.
Since configfs_write_bin_file() increments @ppos on success, after calling
simple_write_to_buffer(), the @ppos is incremented twice.

Subsequent invocation of configfs_write_bin_file() will result in the next
piece of data being written to the offset twice as long as the length of
the previous write, thus creating buffer with "holes" in it.

The simple testcase using DTO follows:
  $ mkdir /sys/kernel/config/device-tree/overlays/1
  $ dd bs=1 if=foo.dtbo of=/sys/kernel/config/device-tree/overlays/1/dtbo
Without this patch, the testcase will result in twice as big buffer in the
kernel, which is then passed to the cfs_overlay_item_dtbo_write() .

Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
2016-06-30 11:28:55 +02:00
Linus Torvalds e7bdea7750 NFS client bugfixes for Linux 4.7
Stable bugfixes:
 - Fix _cancel_empty_pagelist
 - Fix a double page unlock
 - Make nfs_atomic_open() call d_drop() on all ->open_context() errors.
 - Fix another OPEN_DOWNGRADE bug
 
 Other bugfixes:
 - Ensure we handle delegation errors in nfs4_proc_layoutget()
 - Layout stateids start out as being invalid
 - Add sparse lock annotations for pnfs_find_alloc_layout
 - Handle bad delegation stateids in nfs4_layoutget_handle_exception
 - Fix up O_DIRECT results
 - Fix potential use after free of state in nfs4_do_reclaim.
 - Mark the layout stateid invalid when all segments are removed
 - Don't let readdirplus revalidate an inode that was marked as stale
 - Fix potential race in nfs_fhget()
 - Fix an unused variable warning
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXdDWkAAoJENfLVL+wpUDrpsUP/1F2zu12r/Zkv3ZEFhShcpQc
 2N1TRD9X7Lruod2pUD95qqjjdw+/vu3LjcyJljrasRaJENijvZ2GQhKkB7xPODlu
 qxZcmnQsH+WmpmJKcqAByAW1czcNGMoMHnt4tV0gG21NH+XUb92fgn+aGeIJDVrK
 Hcd9d8TfnFWO70ZgTUXW/hv0CXwu4MEJhN2JfF4lolbxUkmjLHHLoxSDDm0AdXGC
 EE8f0V9/7xurvOeLe5bQOQXfZPedBydsLNXa1ZacMGKgmBUoRNxJ5yCpPUtcTVBx
 HwbiY+WDFQ7MdKTzUQqqbnrIqKw8Hu4SugIV/vHRqR+Lhc6u29YGOqdU4d2G8IKW
 Nh8MBqS+dDefCkL3TJoE7MpjhP3EOO6HXnv5FjMZLOuu2X2o+Sz3+DkhYCq6pj/g
 fFh480vZfZYaTsfDf1ttvVN8kIvQ+1Uk3LK6aC2EVwgPrv+0OIRu36F0JQQimxOp
 EbYDlhVk7mzH/ZQ31GmPPbSIk+3sm2V58lqXnUMovoqPFPiN3xZDuBcnlyFrrzaI
 NjOvsdVxkOdHWbYZyBQzj16Vo651EYbAAUEwsud70N8C3aCgkTxCZ30Q0v+KqqxU
 pP5kz3zYUdkXQeHxE6T0iXG9fGcv/nGS21hTfT01YJCK7v67K8TYRNMrOEVURVgk
 LSD/CZJXJHVJn1Vr4F7o
 =IGNO
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable bugfixes:
   - Fix _cancel_empty_pagelist
   - Fix a double page unlock
   - Make nfs_atomic_open() call d_drop() on all ->open_context() errors.
   - Fix another OPEN_DOWNGRADE bug

  Other bugfixes:
   - Ensure we handle delegation errors in nfs4_proc_layoutget()
   - Layout stateids start out as being invalid
   - Add sparse lock annotations for pnfs_find_alloc_layout
   - Handle bad delegation stateids in nfs4_layoutget_handle_exception
   - Fix up O_DIRECT results
   - Fix potential use after free of state in nfs4_do_reclaim.
   - Mark the layout stateid invalid when all segments are removed
   - Don't let readdirplus revalidate an inode that was marked as stale
   - Fix potential race in nfs_fhget()
   - Fix an unused variable warning"

* tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix another OPEN_DOWNGRADE bug
  make nfs_atomic_open() call d_drop() on all ->open_context() errors.
  NFS: Fix an unused variable warning
  NFS: Fix potential race in nfs_fhget()
  NFS: Don't let readdirplus revalidate an inode that was marked as stale
  NFSv4.1/pnfs: Mark the layout stateid invalid when all segments are removed
  NFS: Fix a double page unlock
  pnfs_nfs: fix _cancel_empty_pagelist
  nfs4: Fix potential use after free of state in nfs4_do_reclaim.
  NFS: Fix up O_DIRECT results
  NFS/pnfs: handle bad delegation stateids in nfs4_layoutget_handle_exception
  NFSv4.1/pnfs: Add sparse lock annotations for pnfs_find_alloc_layout
  NFSv4.1/pnfs: Layout stateids start out as being invalid
  NFSv4.1/pnfs: Ensure we handle delegation errors in nfs4_proc_layoutget()
2016-06-29 15:30:26 -07:00
Miklos Szeredi 03bea60409 ovl: get_write_access() in truncate
When truncating a file we should check write access on the underlying
inode.  And we should do so on the lower file as well (before copy-up) for
consistency.

Original patch and test case by Aihua Zhang.

 - - >o >o - - test.c - - >o >o - -
#include <stdio.h>
#include <errno.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int ret;

	ret = truncate(argv[0], 4096);
	if (ret != -1) {
		fprintf(stderr, "truncate(argv[0]) should have failed\n");
		return 1;
	}
	if (errno != ETXTBSY) {
		perror("truncate(argv[0])");
		return 1;
	}

	return 0;
}
 - - >o >o - - >o >o - - >o >o - -

Reported-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-06-29 16:03:55 +02:00
Miklos Szeredi a4859d7594 ovl: fix dentry leak for default_permissions
When using the 'default_permissions' mount option, ovl_permission() on
non-directories was missing a dput(alias), resulting in "BUG Dentry still
in use".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8d3095f4ad ("ovl: default permissions")
Cc: <stable@vger.kernel.org> # v4.5+
2016-06-29 08:26:59 +02:00