Commit Graph

7916 Commits

Author SHA1 Message Date
Johannes Weiner 6dbaf22ce1 mm: shmem: save one radix tree lookup when truncating swapped pages
Page cache radix tree slots are usually stabilized by the page lock, but
shmem's swap cookies have no such thing.  Because the overall truncation
loop is lockless, the swap entry is currently confirmed by a tree lookup
and then deleted by another tree lookup under the same tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup.  This also allows removing the
delete-only special case from shmem_radix_tree_replace().
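
The verify-and-delete idea can be sketched in user space as follows.
This is only an illustration, not the kernel code: toy_tree and
toy_delete_item() are hypothetical names, and a flat array stands in for
the radix tree.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 16

/* Hypothetical flat table standing in for the radix tree. */
struct toy_tree {
    void *slot[SLOTS];
};

/*
 * Delete the entry at @index only if it still equals @expected.
 * Returns the removed entry, or NULL if the slot held something else.
 * A single slot access both verifies and deletes.
 */
static void *toy_delete_item(struct toy_tree *t, size_t index, void *expected)
{
    void *cur;

    assert(index < SLOTS);
    cur = t->slot[index];
    if (cur != expected)
        return NULL;
    t->slot[index] = NULL;
    return cur;
}
```

A caller that gets NULL back knows the entry changed under it and can
skip the slot without a second lookup.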

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Vladimir Davydov d5bc5fd3fc mm: vmscan: shrink_slab: rename max_pass -> freeable
The name `max_pass' is misleading, because this variable actually keeps
the estimated number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that.  Rename it to
reflect its actual meaning.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Davidlohr Bueso 8382d914eb mm, hugetlb: improve page-fault scalability
The kernel can currently only handle a single hugetlb page fault at a
time.  This is due to a single mutex that serializes the entire path.
This lock protects from spurious OOM errors under conditions of low
availability of free hugepages.  This problem is specific to hugepages,
because it is normal to want to use every single hugepage in the system
- with normal pages we simply assume there will always be a few spare
pages which can be used temporarily until the race is resolved.

Address this problem by using a table of mutexes, allowing a better
chance of parallelization, where each hugepage is individually
serialized.  The hash key is selected depending on the mapping type.
For shared ones it consists of the address space and file offset being
faulted; while for private ones the mm and virtual address are used.
The size of the table is selected based on a compromise of collisions
and memory footprint of a series of database workloads.
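
A rough user-space sketch of the mutex table is given below.  The table
size, the hash function and every identifier are assumptions made for
illustration; pthread mutexes stand in for kernel mutexes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_FAULT_MUTEXES 64    /* assumed size; must be a power of two */

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

static void fault_mutex_table_init(void)
{
    for (int i = 0; i < NUM_FAULT_MUTEXES; i++) {
        int rc = pthread_mutex_init(&fault_mutex_table[i], NULL);

        assert(rc == 0);
        (void)rc;
    }
}

/* Mix two key halves into a table index (illustrative hash only). */
static unsigned int fault_mutex_hash(uint64_t a, uint64_t b)
{
    uint64_t h = (a * 0x9E3779B97F4A7C15ull) ^ b;

    h ^= h >> 29;
    return (unsigned int)(h & (NUM_FAULT_MUTEXES - 1));
}

/*
 * Shared mappings key on (address space, file offset);
 * private mappings key on (mm, virtual address).
 */
static unsigned int fault_key(int shared, const void *mapping, uint64_t pgoff,
                              const void *mm, uint64_t vaddr)
{
    if (shared)
        return fault_mutex_hash((uintptr_t)mapping, pgoff);
    return fault_mutex_hash((uintptr_t)mm, vaddr);
}
```

Faults on the same page map to the same mutex and stay serialized, while
faults on different pages usually land on different mutexes and can
proceed in parallel.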

Large database workloads that make heavy use of hugepages can be
particularly exposed to this issue, causing start-up times to be
painfully slow.  This patch reduces the startup time of a 10 Gb Oracle
DB (with ~5000 faults) from 37.5 secs to 25.7 secs.  Larger workloads
will naturally benefit even more.

NOTE:
The only downside to this patch, detected by Joonsoo Kim, is that a
small race is possible in private mappings: A child process (with its
own mm, after cow) can instantiate a page that is already being handled
by the parent in a cow fault.  When low on pages, this can trigger spurious
OOMs.  I have not been able to think of an efficient way of handling
this...  but do we really care about such a tiny window? We already
maintain another theoretical race with normal pages.  If not, one
possible way is to maintain the single hash for private mappings --
any workloads that *really* suffer from this scaling problem should
already use shared mappings.

[akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:00 -07:00
Joonsoo Kim 4e35f48385 mm, hugetlb: use vma_resv_map() map types
Until now, we get a resv_map in two ways according to each mapping type.
This makes code dirty and unreadable.  Unify it.

[davidlohr@hp.com: code cleanups]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:59 -07:00
Joonsoo Kim f031dd274c mm, hugetlb: remove resv_map_put
This is a preparation patch to unify the use of vma_resv_map()
regardless of the map type.  This patch prepares it by removing
resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
for all resv_maps.

[davidlohr@hp.com: update changelog]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:59 -07:00
Davidlohr Bueso 7b24d8616b mm, hugetlb: fix race in region tracking
There is a race condition if we map the same file on different processes.
Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only
mmap_sem (exclusively).  This doesn't prevent other tasks from modifying
the region structure, so it can be modified by two processes
concurrently.

To solve this, introduce a spinlock to resv_map and make region
manipulation functions grab it before they do actual work.
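
The locking scheme might look roughly like this in user space.  This is
a sketch under assumptions: a pthread mutex stands in for the spinlock,
regions are kept in an unmerged list, and all toy_* names are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct file_region {
    long from, to;
    struct file_region *next;
};

/* Toy resv_map: the lock lives next to the data it protects. */
struct toy_resv_map {
    pthread_mutex_t lock;           /* stand-in for the kernel spinlock */
    struct file_region *regions;
};

static void toy_resv_map_init(struct toy_resv_map *resv)
{
    int rc = pthread_mutex_init(&resv->lock, NULL);

    assert(rc == 0);
    (void)rc;
    resv->regions = NULL;
}

/* Record [from, to); no merging, to keep the sketch short. */
static int toy_region_add(struct toy_resv_map *resv, long from, long to)
{
    struct file_region *rg = malloc(sizeof(*rg));

    if (!rg)
        return -1;
    rg->from = from;
    rg->to = to;

    pthread_mutex_lock(&resv->lock);
    rg->next = resv->regions;
    resv->regions = rg;
    pthread_mutex_unlock(&resv->lock);
    return 0;
}

/* Count the pages covered by all regions, under the same lock. */
static long toy_region_count(struct toy_resv_map *resv)
{
    long pages = 0;

    pthread_mutex_lock(&resv->lock);
    for (struct file_region *rg = resv->regions; rg; rg = rg->next)
        pages += rg->to - rg->from;
    pthread_mutex_unlock(&resv->lock);
    return pages;
}
```

Because every manipulation function takes the map's own lock, callers no
longer depend on which outer lock (mmap_sem or the instantiation mutex)
happens to be held.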

[davidlohr@hp.com: updated changelog]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:59 -07:00
Joonsoo Kim 1406ec9ba6 mm, hugetlb: improve, cleanup resv_map parameters
To change the protection method for region tracking to a fine-grained one,
we pass the resv_map, instead of list_head, to region manipulation
functions.

This doesn't introduce any functional change, and it is just for
preparing a next step.

[davidlohr@hp.com: update changelog]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:59 -07:00
Joonsoo Kim 9119a41e90 mm, hugetlb: unify region structure handling
Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping.  For MAP_SHARED, we use
address_mapping's private_list, while for MAP_PRIVATE, we use a
resv_map.

Now, we are preparing to change the coarse-grained lock which protects the
region structure to a fine-grained lock, and this difference hinders it.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:59 -07:00
Mel Gorman d26914d117 mm: optimize put_mems_allowed() usage
Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
don't need to evaluate the function if the allocation was in fact
successful, saving an smp_rmb, some loads and comparisons on some relatively
fast paths.

Since the naming, get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.
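
The begin/retry pattern can be sketched in user space as below.  Only
the shape of the loop is taken from the description above; the toy_*
helpers, the counter and the fake allocator are assumptions for
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy sequence counter; a writer would bump it around each update. */
static atomic_uint mems_seq;

static unsigned int toy_read_begin(void)
{
    return atomic_load_explicit(&mems_seq, memory_order_acquire);
}

/* True if a writer ran since @seq was sampled, i.e. the caller should retry. */
static bool toy_read_retry(unsigned int seq)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&mems_seq, memory_order_relaxed) != seq;
}

/* Hypothetical allocator hook: fails @fail_first times, then succeeds. */
static int fail_first;
static int dummy_page;

static void *toy_try_alloc(void)
{
    if (fail_first > 0) {
        fail_first--;
        /* Pretend the allowed-memory mask changed under us. */
        atomic_fetch_add_explicit(&mems_seq, 1, memory_order_release);
        return NULL;
    }
    return &dummy_page;
}

static void *toy_alloc(void)
{
    void *page;
    unsigned int seq;

    do {
        seq = toy_read_begin();
        page = toy_try_alloc();
        /* On success the retry check is never evaluated. */
    } while (!page && toy_read_retry(seq));

    return page;
}
```

The short-circuit in the loop condition is the point: a successful
allocation skips the barrier and the counter comparison entirely.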

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:58 -07:00
David Rientjes 91ca918648 mm, compaction: ignore pageblock skip when manually invoking compaction
The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction (mainly done for debugging) is known to be
expensive, so there's no need to skip pageblocks based on heuristics.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:58 -07:00
Vladimir Davydov 3115cd9145 mm: vmscan: remove shrink_control arg from do_try_to_free_pages()
There is no need to pass a shrink_control struct from
try_to_free_pages() and friends to do_try_to_free_pages() and then to
shrink_zones(), because it is only used in shrink_zones() and the only
field initialized on the top level is gfp_mask, which is always equal to
scan_control.gfp_mask.  So let's move shrink_control initialization to
shrink_zones().

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:58 -07:00
Vladimir Davydov 65ec02cb9a mm: vmscan: move call to shrink_slab() to shrink_zones()
This reduces the indentation level of do_try_to_free_pages() and removes
extra loop over all eligible zones counting the number of on-LRU pages.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Glauber Costa <glommer@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:58 -07:00
Vladimir Davydov 99120b772b mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware.  That is, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good.  Fix this.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:58 -07:00
Li Zefan 8910ae896c kmemleak: change some global variables to int
They don't have to be atomic_t, because they are simple boolean toggles.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Li Zefan 5f3bf19aeb kmemleak: remove redundant code
Remove kmemleak_padding() and kmemleak_release().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Li Zefan c89da70c73 kmemleak: allow freeing internal objects after kmemleak was disabled
Currently if kmemleak is disabled, the kmemleak objects can never be
freed, no matter if it's disabled by a user or due to fatal errors.

Those objects can be a big waste of memory.

    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
  1200264 1197433  99%    0.30K  46164       26    369312K kmemleak_object

With this patch, after kmemleak was disabled you can reclaim memory
with:

	# echo clear > /sys/kernel/debug/kmemleak

Also inform users about this with a printk.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Li Zefan dc9b3f4249 kmemleak: free internal objects only if there're no leaks to be reported
Currently if you stop kmemleak thread before disabling kmemleak,
kmemleak objects will be freed and so you won't be able to check
previously reported leaks.

With this patch, kmemleak objects won't be freed if there're leaks that
can be reported.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:50 -07:00
Jan Kara 5acda9d12d bdi: avoid oops on device removal
After commit 839a8e8660 ("writeback: replace custom worker pool
implementation with unbound workqueue"), when a device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.
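
A minimal sketch of the registered-bit check, assuming a single state
word; the toy_* names and the bit position are hypothetical, and the
locking needed to close the window between the check and the queueing is
left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_BDI_REGISTERED (1u << 0)    /* assumed bit position */

struct toy_bdi {
    atomic_uint state;
    int queued;         /* number of work items accepted */
};

static void toy_bdi_register(struct toy_bdi *bdi)
{
    atomic_fetch_or(&bdi->state, TOY_BDI_REGISTERED);
}

static void toy_bdi_unregister(struct toy_bdi *bdi)
{
    /* Clear the bit first so that late callers stop queueing work. */
    atomic_fetch_and(&bdi->state, ~TOY_BDI_REGISTERED);
}

/* Returns true if the work was queued, false if the device is gone. */
static bool toy_bdi_queue_work(struct toy_bdi *bdi)
{
    if (!(atomic_load(&bdi->state) & TOY_BDI_REGISTERED))
        return false;
    bdi->queued++;
    return true;
}
```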

Fixes: 839a8e8660

Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Derek Basehore 6ca738d60c backing_dev: fix hung task on sync
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes.  The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a work_queue.  This fixes the problem by using
queue_delayed_work() instead.  This is a regression caused by commit
839a8e8660 ("writeback: replace custom worker pool implementation with
unbound workqueue").

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task.  Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future.  This function doesn't change the
timer if it is already armed.
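
The difference between the two calls can be modelled with a toy work
item.  This is a simplified user-space model, not the workqueue
implementation; toy_dwork and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy delayed work item: a pending flag and an absolute expiry time. */
struct toy_dwork {
    bool pending;
    unsigned long expires;
};

/* Like queue_delayed_work(): no effect if the work is already pending. */
static bool toy_queue_delayed(struct toy_dwork *w, unsigned long now,
                              unsigned long delay)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Like mod_delayed_work(): always re-arms the timer with the new delay. */
static bool toy_mod_delayed(struct toy_dwork *w, unsigned long now,
                            unsigned long delay)
{
    bool was_pending = w->pending;

    w->pending = true;
    w->expires = now + delay;
    return was_pending;
}
```

If sync has queued the work for immediate execution, a later
toy_queue_delayed() call with a long delay leaves the expiry untouched,
whereas toy_mod_delayed() would push it out by the full delay.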

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty.  The same problem can happen if the sync work is run on the
rescue worker.

[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zento.linux.org.uk>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benson Leung <bleung@chromium.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Luigi Semenzato <semenzato@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Linus Torvalds 159d8133d0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science -- mostly documentation and comment updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  sparse: fix comment
  doc: fix double words
  isdn: capi: fix "CAPI_VERSION" comment
  doc: DocBook: Fix typos in xml and template file
  Bluetooth: add module name for btwilink
  driver core: unexport static function create_syslog_header
  mmc: core: typo fix in printk specifier
  ARM: spear: clean up editing mistake
  net-sysfs: fix comment typo 'CONFIG_SYFS'
  doc: Insert MODULE_ in module-signing macros
  Documentation: update URL to hfsplus Technote 1150
  gpio: update path to documentation
  ixgbe: Fix format string in ixgbe_fcoe.
  Kconfig: Remove useless "default N" lines
  user_namespace.c: Remove duplicated word in comment
  CREDITS: fix formatting
  treewide: Fix typo in Documentation/DocBook
  mm: Fix warning on make htmldocs caused by slab.c
  ata: ata-samsung_cf: cleanup in header file
  idr: remove unused prototype of idr_free()
2014-04-02 16:23:38 -07:00
Linus Torvalds c6f21243ce Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso changes from Peter Anvin:
 "This is the revamp of the 32-bit vdso and the associated cleanups.

  This adds timekeeping support to the 32-bit vdso that we already have
  in the 64-bit vdso.  Although 32-bit x86 is legacy, it is likely to
  remain in the embedded space for a very long time to come.

  This removes the traditional COMPAT_VDSO support; the configuration
  variable is reused for simply removing the 32-bit vdso, which will
  produce correct results but obviously suffer a performance penalty.
  Only one beta version of glibc was affected, but that version was
  unfortunately included in one OpenSUSE release.

  This is not the end of the vdso cleanups.  Stefani and Andy have
  agreed to continue work for the next kernel cycle; in fact Andy has
  already produced another set of cleanups that came too late for this
  cycle.

  An incidental, but arguably important, change is that this ensures
  that unused space in the VVAR page is properly zeroed.  It wasn't
  before, and would contain whatever garbage was left in memory by BIOS
  or the bootloader.  Since the VVAR page is accessible to user space
  this had the potential of information leaks"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86, vdso: Fix the symbol versions on the 32-bit vDSO
  x86, vdso, build: Don't rebuild 32-bit vdsos on every make
  x86, vdso: Actually discard the .discard sections
  x86, vdso: Fix size of get_unmapped_area()
  x86, vdso: Finish removing VDSO32_PRELINK
  x86, vdso: Move more vdso definitions into vdso.h
  x86: Load the 32-bit vdso in place, just like the 64-bit vdsos
  x86, vdso32: handle 32 bit vDSO larger one page
  x86, vdso32: Disable stack protector, adjust optimizations
  x86, vdso: Zero-pad the VVAR page
  x86, vdso: Add 32 bit VDSO time support for 64 bit kernel
  x86, vdso: Add 32 bit VDSO time support for 32 bit kernel
  x86, vdso: Patch alternatives in the 32-bit VDSO
  x86, vdso: Introduce VVAR marco for vdso32
  x86, vdso: Cleanup __vdso_gettimeofday()
  x86, vdso: Replace VVAR(vsyscall_gtod_data) by gtod macro
  x86, vdso: __vdso_clock_gettime() cleanup
  x86, vdso: Revamp vclock_gettime.c
  mm: Add new func _install_special_mapping() to mmap.c
  x86, vdso: Make vsyscall_gtod_data handling x86 generic
  ...
2014-04-02 12:26:43 -07:00
Li Zhong c800bcd5f5 sparse: fix comment
retmain -> remain

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-04-02 09:16:17 +02:00
Linus Torvalds cf6fafcf05 Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu changes from Tejun Heo:
 "The percpu allocation is now popular enough for the extremely naive
  range allocator to cause scalability issues.

  The existing allocator linearly scanned the allocation map on both
  alloc and free without making use of hint or anything.  Al
  reimplemented the range allocator so that it can use binary search
  instead of linear scan during free and alloc path uses simple hinting
  to avoid scanning in common cases.  Combined, the new allocator
  resolves the scalability issue percpu allocator was showing during
  container benchmark workload"

* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: renew the max_contig if we merge the head and previous block
  percpu: allocation size should be even
  percpu: speed alloc_pcpu_area() up
  percpu: store offsets instead of lengths in ->map[]
  perpcu: fold pcpu_split_block() into the only caller
2014-03-31 15:07:43 -07:00
Linus Torvalds 1f8c538ed6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "There are two memory management related changes, the CMMA support for
  KVM to avoid swap-in of freed pages and the split page table lock for
  the PMD level.  These two come with common code changes in mm/.

  A fix for the long standing theoretical TLB flush problem, this one
  comes with a common code change in kernel/sched/.

  Another set of changes is Heiko's uaccess work, included is the initial
  set of patches with more to come.

  And fixes and cleanups as usual"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
  s390/con3270: optionally disable auto update
  s390/mm: remove unecessary parameter from pgste_ipte_notify
  s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
  s390/mm: fixing comment so that parameter name match
  s390/smp: limit number of cpus in possible cpu mask
  hypfs: Add clarification for "weight_min" attribute
  s390: update defconfigs
  s390/ptrace: add support for PTRACE_SINGLEBLOCK
  s390/perf: make print_debug_cf() static
  s390/topology: Remove call to update_cpu_masks()
  s390/compat: remove compat exec domain
  s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
  s390/appldata_os: fix cpu array size calculation
  s390/checksum: remove memset() within csum_partial_copy_from_user()
  s390/uaccess: remove copy_from_user_real()
  s390/sclp_early: Return correct HSA block count also for zero
  s390: add some drivers/subsystems to the MAINTAINERS file
  s390: improve debug feature usage
  s390/airq: add support for irq ranges
  s390/mm: enable split page table lock for PMD level
  ...
2014-03-31 14:35:30 -07:00
Linus Torvalds 190f918660 Merge branch 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 compat wrapper rework from Heiko Carstens:
 "S390 compat system call wrapper simplification work.

  The intention of this work is to get rid of all hand written assembly
  compat system call wrappers on s390, which perform proper sign or zero
  extension, or pointer conversion of compat system call parameters.
  Instead all of this should be done with C code eg by using Al's
  COMPAT_SYSCALL_DEFINEx() macro.

  Therefore all common code and s390 specific compat system calls have
  been converted to the COMPAT_SYSCALL_DEFINEx() macro.

  In order to generate correct code all compat system calls may only
  have eg compat_ulong_t parameters, but no unsigned long parameters.
  Those patches which change parameter types from unsigned long to
  compat_ulong_t parameters are separate in this series, but shouldn't
  cause any harm.

  The only compat system calls which intentionally have 64 bit
  parameters (preadv64 and pwritev64) in support of the x86/32 ABI
  haven't been changed, but are now only available if an architecture
  defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64.

  System calls which do not have a compat variant but still need proper
  zero extension on s390, like eg "long sys_brk(unsigned long brk)" will
  get a proper wrapper function with the new s390 specific
  COMPAT_SYSCALL_WRAPx() macro:

     COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk);

  which generates the following code (simplified):

     asmlinkage long sys_brk(unsigned long brk);
     asmlinkage long compat_sys_brk(long brk)
     {
         return sys_brk((u32)brk);
     }

  Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines
  includes both linux/syscall.h and linux/compat.h, it will generate
  build errors, if the declaration of sys_brk() doesn't match, or if
  there exists a non-matching compat_sys_brk() declaration.

  In addition this will intentionally result in a link error if
  somewhere else a compat_sys_brk() function exists, which probably
  should have been used instead.  Two more BUILD_BUG_ONs make sure the
  size and type of each compat syscall parameter can be handled
  correctly with the s390 specific macros.

  I converted the compat system calls step by step to verify the
  generated code is correct and matches the previous code.  In fact it
  did not always match, however that was always a bug in the hand
  written asm code.

  In result we get less code, less bugs, and much more sanity checking"
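
The zero-extension wrapper described above can be imitated in user
space.  This is a sketch only: TOY_COMPAT_WRAP1 and the toy_* functions
are hypothetical and omit the build-time checks of the real macro.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the native syscall; simply echoes its argument. */
static long toy_sys_brk(unsigned long brk)
{
    return (long)brk;
}

/*
 * Hypothetical one-argument wrapper generator in the spirit of
 * COMPAT_SYSCALL_WRAP1(): the compat entry point receives a full
 * register and keeps only the low 32 bits (zero extension).
 */
#define TOY_COMPAT_WRAP1(name, type, arg)               \
    static long toy_compat_sys_##name(long arg)         \
    {                                                   \
        return toy_sys_##name((type)(uint32_t)arg);     \
    }

TOY_COMPAT_WRAP1(brk, unsigned long, brk)
```

On a 64-bit build, any garbage in the upper half of the incoming
register is dropped before the native function sees the value.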

* 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits)
  s390/compat: add copyright statement
  compat: include linux/unistd.h within linux/compat.h
  s390/compat: get rid of compat wrapper assembly code
  s390/compat: build error for large compat syscall args
  mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
  ipc/compat: convert to COMPAT_SYSCALL_DEFINE
  fs/compat: convert to COMPAT_SYSCALL_DEFINE
  security/compat: convert to COMPAT_SYSCALL_DEFINE
  mm/compat: convert to COMPAT_SYSCALL_DEFINE
  net/compat: convert to COMPAT_SYSCALL_DEFINE
  kernel/compat: convert to COMPAT_SYSCALL_DEFINE
  fs/compat: optional preadv64/pwrite64 compat system calls
  ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t
  s390/compat: partial parameter conversion within syscall wrappers
  s390/compat: automatic zero, sign and pointer conversion of syscalls
  s390/compat: add sync_file_range and fallocate compat syscalls
  ...
2014-03-31 14:32:17 -07:00
Linus Torvalds 971eae7c99 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Bigger changes:

   - sched/idle restructuring: they are WIP preparation for deeper
     integration between the scheduler and idle state selection, by
     Nicolas Pitre.

   - add NUMA scheduling pseudo-interleaving, by Rik van Riel.

   - optimize cgroup context switches, by Peter Zijlstra.

   - RT scheduling enhancements, by Thomas Gleixner.

  The rest is smaller changes, non-urgent fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
  sched: Clean up the task_hot() function
  sched: Remove double calculation in fix_small_imbalance()
  sched: Fix broken setscheduler()
  sparc64, sched: Remove unused sparc64_multi_core
  sched: Remove unused mc_capable() and smt_capable()
  sched/numa: Move task_numa_free() to __put_task_struct()
  sched/fair: Fix endless loop in idle_balance()
  sched/core: Fix endless loop in pick_next_task()
  sched/fair: Push down check for high priority class task into idle_balance()
  sched/rt: Fix picking RT and DL tasks from empty queue
  trace: Replace hardcoding of 19 with MAX_NICE
  sched: Guarantee task priority in pick_next_task()
  sched/idle: Remove stale old file
  sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
  cpuidle/arm64: Remove redundant cpuidle_idle_call()
  cpuidle/powernv: Remove redundant cpuidle_idle_call()
  sched, nohz: Exclude isolated cores from load balancing
  sched: Fix select_task_rq_fair() description comments
  workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
  sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
  ...
2014-03-31 11:21:19 -07:00
Jianyu Zhan 21ddfd38ee percpu: renew the max_contig if we merge the head and previous block
During pcpu_alloc_area(), we might merge the current head with the
previous block.  Because max_contig was calculated from the size of the
previous block before that block was skipped, and the block's size has
now changed, max_contig must be renewed.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-29 09:29:42 -04:00
Hugh Dickins 7e09e738af mm: fix swapops.h:131 bug if remap_file_pages raced migration
Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
indicating that remove_migration_ptes() failed to find one of the
migration entries that was temporarily inserted.

The problem comes from remap_file_pages()'s switch from vma_interval_tree
(good for inserting the migration entry) to i_mmap_nonlinear list (no good
for locating it again); but can only be a problem if the remap_file_pages()
range does not cover the whole of the vma (zap_pte() clears the range).

remove_migration_ptes() needs a file_nonlinear method to go down the
i_mmap_nonlinear list, applying linear location to look for migration
entries in those vmas too, just in case there was this race.

The file_nonlinear method does need rmap_walk_control.arg to do this;
but it never needed vma passed in - vma comes from its own iteration.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-20 22:09:09 -07:00
Hugh Dickins 887843961c mm: fix bad rss-counter if remap_file_pages raced migration
Fix some "Bad rss-counter state" reports on exit, arising from the
interaction between page migration and remap_file_pages(): zap_pte()
must count a migration entry when zapping it.

And yes, it is possible (though very unusual) to find an anon page or
swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
get_user_pages(write, force) case which COWs even in a shared mapping.

Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-19 16:21:49 -07:00
Stefani Seibold 3935ed6a3a mm: Add new func _install_special_mapping() to mmap.c
The _install_special_mapping() is the new base function for
install_special_mapping(). This function will return a pointer of the
created VMA or an error code in an ERR_PTR().
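
The "pointer or error code in an ERR_PTR()" convention can be illustrated
with a small userspace sketch.  The three helpers below imitate the kernel
helpers of the same names; struct demo_vma, demo_install_mapping() and
demo_install_mapping_compat() are hypothetical stand-ins and are not the
functions touched by this commit.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel helpers: a small negative errno is
 * encoded in the top MAX_ERRNO values of the pointer range. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a VMA; not the kernel's vm_area_struct. */
struct demo_vma {
	unsigned long start;
	unsigned long len;
};

static struct demo_vma demo_slot;

/* Base function: returns the created object or an encoded error. */
static struct demo_vma *demo_install_mapping(unsigned long start,
					     unsigned long len)
{
	if (len == 0)
		return ERR_PTR(-EINVAL);
	demo_slot.start = start;
	demo_slot.len = len;
	return &demo_slot;
}

/* Wrapper for callers that only want 0 or a negative error code. */
static int demo_install_mapping_compat(unsigned long start,
				       unsigned long len)
{
	struct demo_vma *vma = demo_install_mapping(start, len);

	return IS_ERR(vma) ? (int)PTR_ERR(vma) : 0;
}
```

A caller that needs the object itself checks IS_ERR() on the result before
dereferencing it; a caller that only needs the status uses the wrapper.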

This new function will be needed for the vdso 32 bit support to map the
additional vvar and hpet pages into the 32 bit address space. This will be done
with io_remap_pfn_range() and remap_pfn_range(), which require a vm_area_struct.

Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Link: http://lkml.kernel.org/r/1395094933-14252-3-git-send-email-stefani@seibold.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-03-18 12:51:56 -07:00
Viro 2f69fa829c percpu: allocation size should be even
723ad1d90b ("percpu: store offsets instead of lengths in ->map[]")
updated percpu area allocator to use the lowest bit, instead of sign,
to signify whether the area is occupied and forced min align to 2;
unfortunately, it forgot to force the allocation size to be even
causing malfunctions for the very rare odd-sized allocations.

Always force the allocations to be even sized.
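
A minimal sketch of why the size has to be even, assuming the offset|flag
encoding described above: if every area starts at an even offset and has an
even size, the next offset is even too and its lowest bit stays free for the
in-use flag.  demo_even_size() and demo_next_offset() are hypothetical names,
not the kernel's code.

```c
#include <stddef.h>

/* Round an allocation size up to the next even value. */
static size_t demo_even_size(size_t size)
{
	return (size + 1) & ~(size_t)1;
}

/* Offset of the area that follows an area at `off` holding `size` bytes.
 * With an even `off` the result is always even, so bit 0 is never set by
 * the offset itself. */
static int demo_next_offset(int off, size_t size)
{
	return off + (int)demo_even_size(size);
}
```

Without the rounding, a 3-byte allocation at offset 4 would put the next
area at offset 7, which is indistinguishable from "offset 6, in use".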

tj: Wrote patch description.

Original-patch-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-17 16:10:29 -04:00
Ingo Molnar a02ed5e3e0 Merge branch 'sched/urgent' into sched/core
Pick up fixes before queueing up new changes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-03-11 11:34:27 +01:00
Ben Hutchings 2216ee8530 mm/Kconfig: fix URL for zsmalloc benchmark
The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL.  While
we're at it, remove the unnecessary footnote notation.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:20 -07:00
Laura Abbott 2af120bc04 mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block
We received several reports of bad page state when freeing CMA pages
previously allocated with alloc_contig_range:

    BUG: Bad page state in process Binder_A  pfn:63202
    page:d21130b0 count:0 mapcount:1 mapping:  (null) index:0x7dfbf
    page flags: 0x40080068(uptodate|lru|active|swapbacked)

Based on the page state, it looks like the page was still in use.  The
page flags do not make sense for the use case though.  Further debugging
showed that despite alloc_contig_range returning success, at least one
page in the range still remained in the buddy allocator.

There is an issue with isolate_freepages_block.  In strict mode (which
CMA uses), if any pages in the range cannot be isolated,
isolate_freepages_block should return failure 0.  The current check
keeps track of the total number of isolated pages and compares against
the size of the range:

        if (strict && nr_strict_required > total_isolated)
                total_isolated = 0;

After taking the zone lock, if one of the pages in the range is not in
the buddy allocator, we continue through the loop and do not increment
total_isolated.  If in the last iteration of the loop we isolate more
than one page (e.g.  last page needed is a higher order page), the check
for total_isolated may pass and we fail to detect that a page was
skipped.  The fix is to bail out of the loop immediately if we are in
strict mode.  There's no benefit to continuing anyway since we need all
pages to be isolated.  Additionally, drop the error checking based on
nr_strict_required and just check the pfn ranges.  This matches with
what isolate_freepages_range does.
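
The failure mode can be reproduced with a simplified model.  This is only a
sketch: struct demo_page and both functions are hypothetical and ignore
locking and the real page structures.  The first function uses the old
count-based check, the second breaks out in strict mode and checks how far
the scan got.

```c
#include <stddef.h>

/* Hypothetical model of a page: is it in the buddy allocator, and if so
 * what order is the free block that starts here? */
struct demo_page {
	int in_buddy;
	unsigned int order;
};

/* Old behaviour: skip non-buddy pages, then compare the total count. */
static unsigned long demo_isolate_old(const struct demo_page *pages,
				      size_t start, size_t end, int strict)
{
	unsigned long total = 0, required = end - start;
	size_t pfn;

	for (pfn = start; pfn < end; pfn++) {
		unsigned long n;

		if (!pages[pfn].in_buddy)
			continue;
		n = 1UL << pages[pfn].order;
		total += n;
		pfn += n - 1;	/* skip the rest of a higher order block */
	}
	if (strict && required > total)
		total = 0;
	return total;
}

/* Fixed behaviour: stop at the first failure in strict mode and report
 * failure if the scan did not reach the end of the range. */
static unsigned long demo_isolate_fixed(const struct demo_page *pages,
					size_t start, size_t end, int strict)
{
	unsigned long total = 0;
	size_t pfn;

	for (pfn = start; pfn < end; pfn++) {
		unsigned long n;

		if (!pages[pfn].in_buddy) {
			if (strict)
				break;
			continue;
		}
		n = 1UL << pages[pfn].order;
		total += n;
		pfn += n - 1;
	}
	if (strict && pfn < end)
		total = 0;
	return total;
}
```

With a four page range whose first page is not in the buddy allocator and
whose last page starts an order-1 block, the old check counts 1 + 1 + 2 = 4
pages and reports success even though a page was skipped; the fixed variant
reports 0.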

Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:20 -07:00
Johannes Weiner e97ca8e5b8 mm: fix GFP_THISNODE callers and clarify
GFP_THISNODE is for callers that implement their own clever fallback to
remote nodes.  It restricts the allocation to the specified node and
does not invoke reclaim, assuming that the caller will take care of it
when the fallback fails, e.g.  through a subsequent allocation request
without GFP_THISNODE set.

However, many current GFP_THISNODE users only want the node exclusive
aspect of the flag, without actually implementing their own fallback or
triggering reclaim if necessary.  This results in things like page
migration failing prematurely even when there is easily reclaimable
memory available, unless kswapd happens to be running already or a
concurrent allocation attempt triggers the necessary reclaim.

Convert all callsites that don't implement their own fallback strategy
to __GFP_THISNODE.  This restricts the allocation to a single node too, but
at the same time allows the allocator to enter the slowpath, wake
kswapd, and invoke direct reclaim if necessary, to make the allocation
happen when memory is full.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-10 17:26:19 -07:00
Al Viro 3d331ad74f percpu: speed alloc_pcpu_area() up
If we know that first N areas are all in use, we can obviously skip
them when searching for a free one.  And that kind of hint is very
easy to maintain.
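
A sketch of such a hint, under the assumption of a simplified chunk that
only tracks one in-use flag per area.  struct demo_chunk, demo_alloc() and
demo_free() are hypothetical and leave out sizes, alignment and merging.

```c
#define DEMO_NR_AREAS 8

/* Hypothetical chunk: every index below first_free is known to be in use. */
struct demo_chunk {
	int in_use[DEMO_NR_AREAS];
	int first_free;
};

/* Allocate the first free area, starting the scan at the hint.
 * Returns the area index, or -1 if the chunk is full. */
static int demo_alloc(struct demo_chunk *c)
{
	int i;

	for (i = c->first_free; i < DEMO_NR_AREAS; i++) {
		if (!c->in_use[i]) {
			c->in_use[i] = 1;
			/* everything up to and including i is now in use */
			c->first_free = i + 1;
			return i;
		}
	}
	c->first_free = DEMO_NR_AREAS;
	return -1;
}

/* Free an area and pull the hint back if needed. */
static void demo_free(struct demo_chunk *c, int i)
{
	c->in_use[i] = 0;
	if (i < c->first_free)
		c->first_free = i;
}
```

The hint is only ever advanced past areas the scan has just seen in use,
and it is lowered on free, so it never skips a free area.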

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 07:52:26 -05:00
Al Viro 723ad1d90b percpu: store offsets instead of lengths in ->map[]
Current code keeps +-length for each area in chunk->map[].  It has
several unpleasant consequences:
	* even if we know that first 50 areas are all in use, allocation
still needs to go through all those areas just to sum their sizes, just
to get the offset of free one.
	* freeing needs to find the array entry referring to the area
in question; again, the need to sum the sizes until we reach the offset
we are interested in.  Note that offsets are monotonic, so simple
binary search would do here.

	New data representation: array of <offset,in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
In the end we put a sentry entry - <total size, in use>.  The first
entry is <0, flag>; it would be possible to store together the flag
for Nth area and offset for N+1st, but that leads to much hairier code.

In other words, where the old variant would have
	4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
	<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
	0, 5, 13, 16, 21, 32, 133

This commit switches to new data representation and takes care of a couple
of low-hanging fruits in free_pcpu_area() - one is the switch to binary
search, another is not doing two memmove() when one would do.  Speeding
the alloc side up (by keeping track of how many areas in the beginning are
known to be all in use) also becomes possible - that'll be done in the next
commit.
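
The conversion between the two representations, and the binary search the
new one allows, can be sketched as follows.  Both functions are hypothetical
illustrations rather than the kernel's code, and they assume all offsets are
even as stated above.

```c
#include <stddef.h>

/* Convert the old representation (positive = free length, negative =
 * in-use length) into offset|flag entries plus a trailing sentry.
 * `out` must have room for n + 1 entries; returns the entry count. */
static size_t demo_lengths_to_offsets(const int *old, size_t n, int *out)
{
	int off = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int in_use = old[i] < 0;

		out[i] = off | in_use;
		off += in_use ? -old[i] : old[i];
	}
	out[n] = off | 1;	/* sentry: <total size, in use> */
	return n + 1;
}

/* Binary search for the entry that starts at offset `off`.
 * Returns its index, or -1 if no area starts there. */
static int demo_find_area(const int *map, size_t n, int off)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int mid_off = map[mid] & ~1;	/* strip the in-use flag */

		if (mid_off == off)
			return (int)mid;
		if (mid_off < off)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}
```

Feeding the old array 4, -8, -4, 4, -12, 100 through the first function
gives 0, 5, 13, 16, 21, 32, 133, matching the example in the message; the
area at offset 20 is then found at index 4 without summing any lengths.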

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 07:52:26 -05:00
Al Viro 706c16f237 perpcu: fold pcpu_split_block() into the only caller
... and simplify the results a bit.  Makes the next step easier
to deal with - we will be changing the data representation for
chunk->map[] and it's easier to do if the code in question is
not split between pcpu_alloc_area() and pcpu_split_block().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 07:52:26 -05:00
Heiko Carstens 2f2728f6de mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types
In order to allow the COMPAT_SYSCALL_DEFINE macro to generate code that
performs proper zero and sign extension, convert all 64 bit parameters
to their corresponding 32 bit compat counterparts.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2014-03-06 16:30:47 +01:00
Heiko Carstens c93e0f6c89 mm/compat: convert to COMPAT_SYSCALL_DEFINE
Convert all compat system call functions where all parameter types
have a size of four bytes or less, or are pointer types,
to COMPAT_SYSCALL_DEFINE.
The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper
zero and sign extension to 64 bit of all parameters if needed.
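
What "zero and sign extension to 64 bit" means for a single argument can be
shown with plain casts.  This is only a sketch of the idea, not the macro's
expansion; demo_widen_signed() and demo_widen_unsigned() are hypothetical
names, and the upper half of the incoming register is treated as garbage.

```c
#include <stdint.h>

/* A 32 bit caller passed a signed value; the upper 32 bits of the 64 bit
 * register are not trustworthy.  Truncate, then sign extend.  The
 * conversion to int32_t is implementation-defined in ISO C for values
 * above INT32_MAX but wraps on common compilers. */
static int64_t demo_widen_signed(uint64_t reg)
{
	return (int64_t)(int32_t)(uint32_t)reg;
}

/* Same for an unsigned value: truncate, then zero extend. */
static uint64_t demo_widen_unsigned(uint64_t reg)
{
	return (uint64_t)(uint32_t)reg;
}
```

For a register holding 0xdeadbeefffffffff, the signed variant yields -1 and
the unsigned variant yields 0xffffffff; the stale upper half is dropped in
both cases.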

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2014-03-06 16:30:42 +01:00
Johannes Weiner 27329369c9 mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness
Jan Stancek reports manual page migration encountering allocation
failures after some pages when there is still plenty of memory free, and
bisected the problem down to commit 81c0a2bb51 ("mm: page_alloc: fair
zone allocator policy").

The problem is that GFP_THISNODE obeys the zone fairness allocation
batches on one hand, but doesn't reset them and wake kswapd on the other
hand.  After a few of those allocations, the batches are exhausted and
the allocations fail.

Fixing this means either having GFP_THISNODE wake up kswapd, or
GFP_THISNODE not participating in zone fairness at all.  The latter
seems safer as an acute bugfix, we can clean up later.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org>		[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:50 -08:00
Vlastimil Babka 9050d7eba4 mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking
Daniel Borkmann reported a VM_BUG_ON assertion failing:

  ------------[ cut here ]------------
  kernel BUG at mm/mlock.c:528!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ccm arc4 iwldvm [...]
   video
  CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
  Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
  task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
  RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
  Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
  RIP   munlock_vma_pages_range+0x2e0/0x2f0
  ---[ end trace a0088dcf07ae10f2 ]---

because munlock_vma_pages_range() thinks it's unexpectedly in the middle
of a THP page.  This can be reproduced with default config since 3.11
kernels.  A reproducer can be found in the kernel's selftest directory
for networking by running ./psock_tpacket [1].

The problem is that an order=2 compound page (allocated by
alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
by packet_mmap()) and mistaken for a THP page and assumed to be order=9.

The checks for THP in munlock came with commit ff6a6da60b ("mm:
accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
not trigger a bug.  It just makes munlock_vma_pages_range() skip such
compound pages until the next 512-pages-aligned page, when it encounters
a head page.  This is however not a problem for vma's where mlocking has
no effect anyway, but it can distort the accounting.

Since commit 7225522bb4 ("mm: munlock: batch non-THP page isolation
and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
PageTransHuge() check.

This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
list of flags that make vma's non-mlockable and non-mergeable.  The
reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
already on the VM_SPECIAL list, and both are intended for non-LRU pages
where mlocking makes no sense anyway.  Related Lkml discussion can be
found in [2].

 [1] tools/testing/selftests/net/psock_tpacket
 [2] https://lkml.org/lkml/2014/1/10/427

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Reported-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Tested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [3.11.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:48 -08:00
Filipe Brandenburger 4fb1a86fb5 memcg: reparent charges of children before processing parent
Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding
cgroup_mutex which prevents the child from reaching its
mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
all its children (and grandchildren, in the correct order) to have their
charges reparented first.
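
The ordering amounts to a post-order walk of the subtree being offlined.
The sketch below uses a hypothetical struct demo_memcg with a single charge
counter; it models only the ordering, not the real memcg data structures or
locking.

```c
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

/* Hypothetical cgroup node with one charge counter. */
struct demo_memcg {
	long charges;
	struct demo_memcg *parent;
	struct demo_memcg *children[DEMO_MAX_CHILDREN];
	size_t nr_children;
};

/* Link a child under a parent; returns 0, or -1 if the parent is full. */
static int demo_add_child(struct demo_memcg *parent, struct demo_memcg *child)
{
	if (parent->nr_children >= DEMO_MAX_CHILDREN)
		return -1;
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
	return 0;
}

/* Move all charges of one group to its parent, if it has one. */
static void demo_reparent_charges(struct demo_memcg *cg)
{
	if (cg->parent) {
		cg->parent->charges += cg->charges;
		cg->charges = 0;
	}
}

/* Offline a group: descendants are reparented first (deepest first),
 * so the group's own pass has nothing left to wait for. */
static void demo_offline(struct demo_memcg *cg)
{
	size_t i;

	for (i = 0; i < cg->nr_children; i++)
		demo_offline(cg->children[i]);
	demo_reparent_charges(cg);
}
```

Because the parent handles its descendants itself, the result no longer
depends on the order in which the separate offline work items happen to run.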

Fixes: e5fca243ab ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>	[v3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:48 -08:00
Hugh Dickins ce48225fe3 memcg: fix endless loop in __mem_cgroup_iter_next()
Commit 0eef615665 ("memcg: fix css reference leak and endless loop in
mem_cgroup_iter") got the interaction with the commit a few before it
d8ad305597 ("mm/memcg: iteration skip memcgs not yet fully
initialized") slightly wrong, and we didn't notice at the time.

It's elusive, and harder to get than the original, but for a couple of
days before rc1, I several times saw an endless loop similar to that
supposedly being fixed.

This time it was a tighter loop in __mem_cgroup_iter_next(): because we
can get here when our root has already been offlined, and the ordering
of conditions was such that we then just cycled around forever.

Fixes: 0eef615665 ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
David Rientjes 668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Michal Hocko 08088cb9ac memcg: change oom_info_lock to mutex
Kirill has reported the following:

  Task in /test killed as a result of limit of /test
  memory: usage 10240kB, limit 10240kB, failcnt 51
  memory+swap: usage 10240kB, limit 10240kB, failcnt 0
  kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
  Memory cgroup stats for /test:

  BUG: sleeping function called from invalid context at kernel/cpu.c:68
  in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
  2 locks held by memcg_test/66:
   #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
   #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
  CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
  Call Trace:
    __might_sleep+0x16a/0x210
    get_online_cpus+0x1c/0x60
    mem_cgroup_read_stat+0x27/0xb0
    mem_cgroup_print_oom_info+0x260/0x390
    dump_header+0x88/0x251
    ? trace_hardirqs_on+0xd/0x10
    oom_kill_process+0x258/0x3d0
    mem_cgroup_oom_synchronize+0x656/0x6c0
    ? mem_cgroup_charge_common+0xd0/0xd0
    pagefault_out_of_memory+0x14/0x90
    mm_fault_error+0x91/0x189
    __do_page_fault+0x48e/0x580
    do_page_fault+0xe/0x10
    page_fault+0x22/0x30

which complains that mem_cgroup_read_stat cannot be called from an atomic
context but mem_cgroup_print_oom_info takes a spinlock.  Change
oom_info_lock to a mutex.

This was introduced by 947b3dd1a8 ("memcg, oom: lock
mem_cgroup_print_oom_info").

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-25 15:25:44 -08:00
Kirill A. Shutemov 9845cbbd11 mm, thp: fix infinite loop on memcg OOM
Masayoshi Mizuma reported a bug with the hang of an application under
the memcg limit.  It happens on write-protection fault to huge zero page.

If we successfully allocate a huge page to replace zero page but hit the
memcg limit we need to split the zero page with split_huge_page_pmd()
and fall back to small pages.

The other part of the problem is that VM_FAULT_OOM has special meaning
in do_huge_pmd_wp_page() context.  __handle_mm_fault() expects the page
to be split if it sees VM_FAULT_OOM and it will retry page fault
handling.  This causes an infinite loop if the page was not split.

do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
to allocate one small page, so fallback to small pages will not help.

The solution for this part is to replace VM_FAULT_OOM with
VM_FAULT_FALLBACK if fallback is required.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-25 15:25:44 -08:00
Kirill A. Shutemov 33b6c7765f mm, hwpoison: release page on PageHWPoison() in __do_fault()
It seems we forget to release page after detecting HW error.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-25 15:25:42 -08:00
Thomas Gleixner d97a860c4f Merge branch 'linus' into sched/core
Reason: Bring back upstream modification to resolve conflicts

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21 21:37:09 +01:00
Konstantin Weitz 45961722f8 mm: add support for discard of unused ptes
In a virtualized environment and given an appropriate interface the guest
can mark pages as unused while they are free (for the s390 implementation
see git commit 45e576b1c3 "guest page hinting light"). For the host
the unused state is a property of the pte.

This patch adds the primitive 'pte_unused' and code to the host swap out
handler so that pages marked as unused by all mappers are not swapped out
but discarded instead, thus saving one IO for swap out and potentially
another one for swap in.
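
The decision the swap out handler makes can be sketched as a small
predicate.  This is a simplified, hypothetical model: the real check is done
per pte via the 'pte_unused' primitive, while the array of flags and the
names below are illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_reclaim_action {
	DEMO_SWAP_OUT,	/* page content must be written to swap */
	DEMO_DISCARD	/* page content can simply be dropped */
};

/* pte_unused[i] says whether mapper i has its pte marked as unused.
 * The page is discarded only if every mapper agrees; with no mappers
 * at all this model also answers "discard". */
static enum demo_reclaim_action
demo_reclaim_action_for(const bool *pte_unused, size_t nr_mappers)
{
	size_t i;

	for (i = 0; i < nr_mappers; i++)
		if (!pte_unused[i])
			return DEMO_SWAP_OUT;
	return DEMO_DISCARD;
}
```

A single mapper that still considers the page in use is enough to force the
regular swap out path.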

[ Martin Schwidefsky: patch reordering and simplification ]

Signed-off-by: Konstantin Weitz <konstantin.weitz@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-02-21 08:50:18 +01:00