linux/mm
Michal Hocko d8dc595ce3 memcg: do not hang on OOM when killed by userspace OOM access to memory reserves
Eric has reported that he can see task(s) stuck in memcg OOM handler
regularly.  The only way out is to

	echo 0 > $GROUP/memory.oom_control

His usecase is:

- Setup a hierarchy with memory and the freezer (disable kernel oom and
  have a process watch for oom).

- In that memory cgroup add a process with one thread per cpu.

- In one thread slowly allocate once per second I think it is 16M of ram
  and mlock and dirty it (just to force the pages into ram and stay
  there).

- When oom is achieved loop:
  * attempt to freeze all of the tasks.
  * if frozen send every task SIGKILL, unfreeze, remove the directory in
    cgroupfs.

Eric has then pinpointed the issue to be memcg specific.

All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received fatal signal will bypass the charge and should
continue on their way out.  The tricky part is that the exit path might
trigger a page fault (e.g.  exit_robust_list), thus the memcg charge,
while its memcg is still under OOM because nobody has released any charges
yet.

Unlike with the in-kernel OOM handler the exiting task doesn't get
TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
and falls to the memcg OOM again without any way out of it as there are no
fatal signals pending anymore.

This patch fixes the issue by checking PF_EXITING early in
mem_cgroup_try_charge and bypass the charge same as if it had fatal
signal pending or TIF_MEMDIE set.

Normally exiting tasks (aka not killed) will bypass the charge now but
this should be OK as the task is leaving and will release memory and
increasing the memory pressure just to release it in a moment seems
dubious wasting of cycles.  Besides that charges after exit_signals should
be rare.

I am bringing this patch again (rebased on the current mmotm tree). I
hope we can move forward finally. If there is still an opposition then
I would really appreciate a concurrent approach so that we can discuss
alternatives.

http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
to the followup discussion when the patch has been dropped from the mmotm
last time.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:01 -07:00
..
Kconfig hugetlb: restrict hugepage_migration_support() to x86_64 2014-06-04 16:53:51 -07:00
Kconfig.debug
Makefile block: move mm/bounce.c to block/ 2014-05-19 20:01:52 -06:00
backing-dev.c arch: Mass conversion of smp_mb__*() 2014-04-18 14:20:48 +02:00
balloon_compaction.c
bootmem.c
cleancache.c
compaction.c mm/compaction: cleanup isolate_freepages() 2014-06-04 16:54:00 -07:00
debug-pagealloc.c
dmapool.c mm: Fix printk typo in dmapool.c 2014-05-05 15:44:47 +02:00
early_ioremap.c mm: create generic early_ioremap() support 2014-04-07 16:36:15 -07:00
fadvise.c
failslab.c
filemap.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-03 12:57:53 -07:00
filemap_xip.c
fremap.c mm: softdirty: make freshly remapped file pages being softdirty unconditionally 2014-06-04 16:53:56 -07:00
frontswap.c
highmem.c
huge_memory.c mm/huge_memory.c: complete conversion to pr_foo() 2014-06-04 16:53:58 -07:00
hugetlb.c hugetlb: add support for gigantic page allocation at runtime 2014-06-04 16:53:59 -07:00
hugetlb_cgroup.c cgroup: drop const from @buffer of cftype->write_string() 2014-03-19 10:23:54 -04:00
hwpoison-inject.c
init-mm.c
internal.h mm/readahead.c: inline ra_submit 2014-04-07 16:35:58 -07:00
interval_tree.c
iov_iter.c take iov_iter stuff to mm/iov_iter.c 2014-04-01 23:19:30 -04:00
kmemcheck.c
kmemleak-test.c
kmemleak.c mem-hotplug: implement get/put_online_mems 2014-06-04 16:53:59 -07:00
ksm.c
list_lru.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
maccess.c
madvise.c mm: madvise: fix MADV_WILLNEED on shmem swapouts 2014-05-23 09:37:29 -07:00
memblock.c memblock: introduce memblock_alloc_range() 2014-06-04 16:53:57 -07:00
memcontrol.c memcg: do not hang on OOM when killed by userspace OOM access to memory reserves 2014-06-04 16:54:01 -07:00
memory-failure.c mem-hotplug: implement get/put_online_mems 2014-06-04 16:53:59 -07:00
memory.c x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels 2014-06-04 16:53:55 -07:00
memory_hotplug.c mem-hotplug: implement get/put_online_mems 2014-06-04 16:53:59 -07:00
mempolicy.c mm, mempolicy: remove per-process flag 2014-04-07 16:35:54 -07:00
mempool.c mm/mempool: warn about __GFP_ZERO usage 2014-06-04 16:53:58 -07:00
migrate.c mm: fix swapops.h:131 bug if remap_file_pages raced migration 2014-03-20 22:09:09 -07:00
mincore.c mm + fs: prepare for non-page entries in page cache radix trees 2014-04-03 16:21:00 -07:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-04-07 16:35:57 -07:00
mm_init.c
mmap.c mm/mmap.c: remove the first mapping check 2014-06-04 16:54:01 -07:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c mm: move mmu notifier call from change_protection to change_pmd_range 2014-04-07 16:35:50 -07:00
mremap.c mm, thp: close race between mremap() and split_huge_page() 2014-05-11 17:55:48 +09:00
msync.c
nobootmem.c mm/nobootmem.c: mark function as static 2014-04-03 16:21:02 -07:00
nommu.c mm: fix 'ERROR: do not initialise globals to 0 or NULL' and coding style 2014-04-07 16:35:55 -07:00
oom_kill.c
page-writeback.c mm/page-writeback.c: fix divide by zero in pos_ratio_polynom 2014-05-06 13:04:58 -07:00
page_alloc.c mm: debug: make bad_range() output more usable and readable 2014-06-04 16:54:00 -07:00
page_cgroup.c mm/page_cgroup.c: mark functions as static 2014-04-03 16:21:02 -07:00
page_io.c
page_isolation.c
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree() 2014-04-14 16:18:06 -04:00
pgtable-generic.c
process_vm_access.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-04-12 14:49:50 -07:00
quicklist.c
readahead.c mm/readahead.c: inline ra_submit 2014-04-07 16:35:58 -07:00
rmap.c mm: softdirty: don't forget to save file map softdiry bit on unmap 2014-06-04 16:53:56 -07:00
shmem.c mm: Initialize error in shmem_file_aio_read() 2014-04-13 14:10:26 -07:00
slab.c slab: get_online_mems for kmem_cache_{create,destroy,shrink} 2014-06-04 16:53:59 -07:00
slab.h slab: get_online_mems for kmem_cache_{create,destroy,shrink} 2014-06-04 16:53:59 -07:00
slab_common.c slab: get_online_mems for kmem_cache_{create,destroy,shrink} 2014-06-04 16:53:59 -07:00
slob.c slab: get_online_mems for kmem_cache_{create,destroy,shrink} 2014-06-04 16:53:59 -07:00
slub.c slab: get_online_mems for kmem_cache_{create,destroy,shrink} 2014-06-04 16:53:59 -07:00
sparse-vmemmap.c
sparse.c mm: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:35:54 -07:00
swap.c mm/swap.c: clean up *lru_cache_add* functions 2014-06-04 16:54:00 -07:00
swap_state.c
swapfile.c
truncate.c mm: filemap: update find_get_pages_tag() to deal with shadow entries 2014-05-06 13:04:59 -07:00
util.c nick kvfree() from apparmor 2014-05-06 14:02:53 -04:00
vmacache.c mm,vmacache: optimize overflow system-wide flushing 2014-06-04 16:53:57 -07:00
vmalloc.c mm/vmalloc.c: enhance vm_map_ram() comment 2014-04-07 16:35:55 -07:00
vmpressure.c
vmscan.c mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL 2014-06-04 16:54:01 -07:00
vmstat.c mm,vmacache: add debug data 2014-06-04 16:53:57 -07:00
workingset.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
zbud.c
zsmalloc.c zsmalloc: Fix CPU hotplug callback registration 2014-03-20 13:43:45 +01:00
zswap.c Merge branch 'akpm' (incoming from Andrew) 2014-04-07 16:38:06 -07:00