linux/mm
Michal Hocko 79dfdaccd1 memcg: make oom_lock 0 and 1 based rather than counter
Commit 867578cb ("memcg: fix oom kill behavior") introduced a oom_lock
counter which is incremented by mem_cgroup_oom_lock when we are about to
handle memcg OOM situation.  mem_cgroup_handle_oom falls back to a sleep
if oom_lock > 1 to prevent from multiple oom kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock called from the
same function.

This works correctly but it can lead to serious starvations when we have
many processes triggering OOM and many CPUs available for them (I have
tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
processes that are blocked on the mutex.  While A releases the mutex and
calls mem_cgroup_out_of_memory others will wake up (one after another)
and increase the counter and fall into sleep (memcg_oom_waitq).

Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
decreases oom_lock and wakes other tasks (if releasing memory by
somebody else - e.g.  killed process - hasn't done it yet).

A testcase would look like:
  Assume malloc XXX is a program allocating XXX Megabytes of memory
  which touches all allocated pages in a tight loop
  # swapoff SWAP_DEVICE
  # cgcreate -g memory:A
  # cgset -r memory.oom_control=0   A
  # cgset -r memory.limit_in_bytes= 200M
  # for i in `seq 100`
  # do
  #     cgexec -g memory:A   malloc 10 &
  # done

The main problem here is that all processes still race for the mutex and
there is no guarantee that we will get counter back to 0 for those that
got back to mem_cgroup_handle_oom.  In the end the whole convoy
in/decreases the counter but we do not get to 1 that would enable
killing so nothing useful can be done.  The time is basically unbounded
because it highly depends on scheduling and ordering on mutex (I have
seen this taking hours...).

This patch replaces the counter by a simple {un}lock semantic.  As
mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have to
make sure that nobody else races with us which is guaranteed by the
memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a
subtree which is already locked: hierarchy:

          A
        /   \
       B     \
      /\      \
     C  D     E

B - C - D tree might be already locked.  While we want to enable locking
E subtree because OOM situations cannot influence each other we
definitely do not want to allow locking A.

Therefore we have to refuse lock if any subtree is already locked and
clear up the lock for all nodes that have been set up to the failure
point.

On the other hand we have to make sure that the rest of the world will
recognize that a group is under OOM even though it doesn't have a lock.
Therefore we have to introduce under_oom variable which is incremented
and decremented for the whole subtree when we enter resp.  leave
mem_cgroup_handle_oom.  under_oom, unlike oom_lock, doesn't need be
updated under memcg_oom_mutex because its users only check a single
group and they use atomic operations for that.

This can be checked easily by the following test case:

  # cgcreate -g memory:A
  # cgset -r memory.use_hierarchy=1 A
  # cgset -r memory.oom_control=1   A
  # cgset -r memory.limit_in_bytes= 100M
  # cgset -r memory.memsw.limit_in_bytes= 100M
  # cgcreate -g memory:A/B
  # cgset -r memory.oom_control=1 A/B
  # cgset -r memory.limit_in_bytes=20M
  # cgset -r memory.memsw.limit_in_bytes=20M
  # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
  # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A

While B gets oom_lock A will not get it.  Both of them go into sleep and
wait for an external action.  We can make the limit higher for A to
enforce waking it up

  # cgset -r memory.memsw.limit_in_bytes=300M A
  # cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't have oom_lock.

Finally, the unlock path is very easy because we always unlock only the
subtree we have locked previously while we always decrement under_oom.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:42 -07:00
..
Kconfig mm Kconfig typo: cleancacne -> cleancache 2011-06-10 14:47:52 +02:00
Kconfig.debug mm: debug-pagealloc: fix kconfig dependency warning 2011-03-22 17:44:02 -07:00
Makefile mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
backing-dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback 2011-07-26 10:39:54 -07:00
bootmem.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
bounce.c bounce: call flush_dcache_page() after bounce_copy_vec() 2010-09-09 18:57:25 -07:00
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2 2011-06-15 20:04:02 -07:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c devres: fix possible use after free 2011-07-25 20:57:14 -07:00
fadvise.c readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM 2010-03-06 11:26:25 -08:00
failslab.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
filemap.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback 2011-07-26 10:39:54 -07:00
filemap_xip.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
fremap.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
highmem.c mm,x86: fix kmap_atomic_push vs ioremap_32.c 2010-10-27 18:03:05 -07:00
huge_memory.c mm/huge_memory.c: minor lock simplification in __khugepaged_exit 2011-07-25 20:57:09 -07:00
hugetlb.c mm: hugetlb: fix coding style issues 2011-07-25 20:57:09 -07:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c mm: convert mm->cpu_vm_cpumask into cpumask_var_t 2011-05-25 08:39:21 -07:00
internal.h mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c kmemleak: Do not return a pointer to an object that kmemleak did not get 2011-05-19 17:35:28 +01:00
ksm.c ksm: fix NULL pointer dereference in scan_get_next_rmap_item() 2011-06-15 20:04:02 -07:00
maccess.c maccess,probe_kernel: Make write/read src const void * 2011-05-25 19:56:23 -04:00
madvise.c fs: kill i_alloc_sem 2011-07-20 20:47:46 -04:00
memblock.c mm/memblock.c: avoid abuse of RED_INACTIVE 2011-07-25 20:57:09 -07:00
memcontrol.c memcg: make oom_lock 0 and 1 based rather than counter 2011-07-26 16:49:42 -07:00
memory-failure.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
memory.c mm/futex: fix futex writes on archs with SW tracking of dirty & young 2011-07-25 20:57:11 -07:00
memory_hotplug.c mm: extend memory hotplug API to allow memory hotplug in virtual machines 2011-07-25 20:57:08 -07:00
mempolicy.c mm: proc: move show_numa_map() to fs/proc/task_mmu.c 2011-05-25 08:39:34 -07:00
mempool.c mm: remove broken 'kzalloc' mempool 2009-09-22 07:17:35 -07:00
migrate.c migrate: don't account swapcache as shmem 2011-06-16 15:01:24 -07:00
mincore.c thp: mincore transparent hugepage support 2011-01-13 17:32:44 -08:00
mlock.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
mm_init.c mm: mminit_loglevel cannot be __meminitdata anymore 2008-08-20 15:40:30 -07:00
mmap.c mmap: fix and tidy up overcommit page arithmetic 2011-07-25 20:57:09 -07:00
mmu_context.c exit: fix oops in sync_mm_rss 2010-03-24 16:31:21 -07:00
mmu_notifier.c thp: mmu_notifier_test_young 2011-01-13 17:32:46 -08:00
mmzone.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
mprotect.c thp: mprotect: transparent huge page support 2011-01-13 17:32:44 -08:00
mremap.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high() 2011-05-25 08:39:31 -07:00
nommu.c mmap: fix and tidy up overcommit page arithmetic 2011-07-25 20:57:09 -07:00
oom_kill.c oom: remove references to old badness() function 2011-07-25 20:57:09 -07:00
page-writeback.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback 2011-07-26 10:39:54 -07:00
page_alloc.c mm: page allocator: reconsider zones for allocation after direct reclaim 2011-07-25 20:57:10 -07:00
page_cgroup.c mm/page_cgroup.c: simplify code by using SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macros 2011-07-25 20:57:09 -07:00
page_io.c block: kill off REQ_UNPLUG 2011-03-10 08:52:27 +01:00
page_isolation.c mm: page_isolation: codeclean fix comment and rm unneeded val init 2010-10-26 16:52:11 -07:00
pagewalk.c pagewalk: fix code comment for THP 2011-07-25 20:57:09 -07:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: remove gfp mask from pcpu_get_vm_areas 2011-01-13 17:32:34 -08:00
percpu.c Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-05-24 11:53:42 -07:00
pgtable-generic.c mm/pgtable-generic.c: fix CONFIG_SWAP=n build 2011-01-26 10:49:58 +10:00
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
quicklist.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
readahead.c readahead: readahead page allocations are OK to fail 2011-05-25 08:39:25 -07:00
rmap.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback 2011-07-26 10:39:54 -07:00
shmem.c Merge 'akpm' patch series 2011-07-25 21:00:19 -07:00
slab.c Merge branch 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 2011-07-22 12:44:30 -07:00
slob.c slob/lockdep: Fix gfp flags passed to lockdep 2011-06-07 21:38:07 +03:00
slub.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-07-25 13:56:39 -07:00
sparse-vmemmap.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
sparse.c mm: make some struct page's const 2011-07-25 20:57:07 -07:00
swap.c mm: batch activate_page() to reduce lock contention 2011-05-25 08:39:37 -07:00
swap_state.c block: remove per-queue plugging 2011-03-10 08:52:07 +01:00
swapfile.c fs: seq_file - add event counter to simplify poll() support 2011-07-20 20:47:50 -04:00
thrash.c mm: swap-token: add a comment for priority aging 2011-07-25 20:57:08 -07:00
truncate.c mm: pincer in truncate_inode_pages_range 2011-07-25 20:57:10 -07:00
util.c mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
vmalloc.c vmalloc,rcu: Convert call_rcu(rcu_free_vb) to kfree_rcu() 2011-07-20 14:10:18 -07:00
vmscan.c memcg: consolidate memory cgroup lru stat functions 2011-07-26 16:49:42 -07:00
vmstat.c mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur 2011-05-25 08:39:09 -07:00