Michal Hocko 3b38722efd memcg, vmscan: integrate soft reclaim tighter with zone shrinking code
This patchset has been sitting out of tree for quite some time without any
objections.  I would be really happy if it made it into 3.12.  I do not
want to push it too hard but I think this work is basically ready and
waiting longer doesn't help.

The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is now done in two passes.  First it tries to do the soft
limit reclaim and falls back to reclaim-all mode if no group is over the
limit or no pages have been scanned.  The second pass happens at the same
priority, so the only time we waste is the memcg tree walk, which has been
updated in the third step to have only negligible overhead.
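
For illustration, the two-pass structure boils down to something like the
following (a simplified sketch rather than the exact mm/vmscan.c code; the
boolean parameter of __shrink_zone is illustrative):

  static void shrink_zone(struct zone *zone, struct scan_control *sc)
  {
          unsigned long nr_scanned = sc->nr_scanned;

          /* First pass: only memcgs eligible for soft limit reclaim. */
          __shrink_zone(zone, sc, true);

          /*
           * Nothing was eligible or nothing was scanned - fall back to
           * reclaiming all groups at the same priority.
           */
          if (sc->nr_scanned == nr_scanned)
                  __shrink_zone(zone, sc, false);
  }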

As a bonus we get rid of a _lot_ of code and soft reclaim no longer stands
out like before, when it wasn't integrated into the zone shrinking code
and reclaimed at priority 0 (the testing results show that some workloads
suffer from such an aggressive reclaim).  The clean up is in a separate
patch because I felt it would be easier to review that way.

The second step is soft limit reclaim integration into targeted reclaim.
It should be rather straightforward.  Soft limit has been used only for
the global reclaim so far, but it makes sense for any kind of pressure
coming from up the hierarchy, including targeted reclaim.

The third step (patches 4-8) addresses the tree walk overhead by enhancing
memcg iterators to enable skipping whole subtrees and tracking the number
of children over their soft limit at each level of the hierarchy.  This
information is updated the same way the old soft limit tree was updated
(from memcg_check_events), so we shouldn't see any additional overhead.
In fact mem_cgroup_update_soft_limit is much simpler than the tree
manipulation done previously.
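
A minimal sketch of the counter maintenance, assuming a per-memcg atomic
counter; the field names (children_in_excess, soft_contributed) and the
helper body are illustrative rather than the exact code:

  /*
   * Called from memcg_check_events() whenever the group's soft limit
   * state may have changed; propagate the change up the hierarchy so
   * iterators can later skip subtrees with no child in excess.
   */
  static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
  {
          struct mem_cgroup *parent = memcg;
          bool over = res_counter_soft_limit_excess(&memcg->res) > 0;

          if (over == memcg->soft_contributed)
                  return;
          memcg->soft_contributed = over;

          while ((parent = parent_mem_cgroup(parent)))
                  atomic_add(over ? 1 : -1, &parent->children_in_excess);
  }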

__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
mem_cgroup_iter, so the decision whether a particular group should be
visited is made at the iterator level, which also allows us to skip the
whole subtree (if there is no child in excess).  This reduces the tree
walk overhead considerably.
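
Schematically, the predicate walks from the group towards the reclaim root
and uses the per-level counter from above to tell the iterator whether the
whole subtree can be skipped (the three-way return value and its names are
illustrative):

  /*
   * Visit @memcg if it or any ancestor up to @root is over its soft
   * limit; otherwise skip it, and skip the whole subtree as well if
   * none of its children is in excess.
   */
  static enum mem_cgroup_filter_t
  mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
                                   struct mem_cgroup *root)
  {
          struct mem_cgroup *parent = memcg;

          if (res_counter_soft_limit_excess(&memcg->res))
                  return VISIT;

          while ((parent = parent_mem_cgroup(parent))) {
                  if (res_counter_soft_limit_excess(&parent->res))
                          return VISIT;
                  if (parent == root)
                          break;
          }

          if (!atomic_read(&memcg->children_in_excess))
                  return SKIP_TREE;
          return SKIP;
  }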

* TEST 1
========

My primary test case was a parallel kernel build with 2 groups (make
running with -j8 with a distribution .config in a separate cgroup without
any hard limit) on a 32 CPU machine booted with 1GB memory, with both
builds bound to Node 0 CPUs via taskset.

I was mostly interested in 2 setups: Default - no soft limit set - and 0
soft limit set for both groups.  The first one should tell us whether the
rework regresses the default behavior while the second one should show us
improvements in an extreme case where both workloads are always over the
soft limit.

/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after a fresh boot without any other load on the
system.

base is mmotm-2013-07-18-16-40
rework is all 8 patches applied on top of base

* No-limit
User
no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
System
no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
Elapsed
no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6

The results are within noise. Elapsed time has a bigger variance but the
average looks good.

* 0-limit
User
0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
System
0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
Elapsed
0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6

The improvement is really huge here (even bigger than with my previous
testing, and I suspect this highly depends on the storage).  Page fault
statistics tell us at least part of the story:

Minor
0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
Major
0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6

Same as with my previous testing, Minor faults are more or less within
noise but the Major fault count is way below the base kernel.

While this looks like a nice win, it is fair to say that the 0-limit
configuration is quite artificial, so I was playing with 0-no-limit
loads as well.

* TEST 2
========

The following results are from 2 groups configuration on a 16GB machine
(single NUMA node).

- A running stream IO (dd if=/dev/zero of=local.file bs=1024) writing
  2*TotalMem with 0 soft limit.
- B running a mem_eater which consumes TotalMem-1G without any limit.  The
  mem_eater consumes the memory in 100 chunks with a 1s nap after each
  mmap+populate so that both loads have a chance to fight for the memory
  (sketched below).
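
The mem_eater load is essentially the following (a reconstruction based on
the description above, not the exact program used):

  #define _GNU_SOURCE
  #include <stddef.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Allocate `total' bytes in 100 chunks, napping 1s after each one. */
  static void mem_eater(size_t total)
  {
          size_t chunk = total / 100;
          int i;

          for (i = 0; i < 100; i++) {
                  void *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                                 -1, 0);
                  if (p == MAP_FAILED)
                          break;
                  /* MAP_POPULATE pre-faults the chunk (mmap+populate). */
                  sleep(1);
          }
  }

  int main(void)
  {
          mem_eater((size_t)15 << 30);    /* TotalMem-1G on the 16GB machine */
          return 0;
  }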

The expected result is that B shouldn't be reclaimed and A shouldn't see
a big drop in elapsed time.

User
base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
System
base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
Elapsed
base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3

System time improved slightly, as did Elapsed.  My previous testing has
shown worse numbers, but this again seems to depend on the storage speed.

My theory is that the writeback doesn't catch up and the prio-0 soft
reclaim falls into waiting on writeback pages too often in the base
kernel.  The patched kernel doesn't do that because the soft reclaim is
done from the kswapd/direct reclaim context.  This can be seen nicely on
the following graphs: A's usage_in_bytes regularly drops very low.

All 3 runs:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
and a detail of a single run:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png

mem_eater seems to be doing better as well. It gets to the full
allocation size faster as can be seen on the following graph:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png

/proc/meminfo collected during the test also shows that the rework kernel
hasn't swapped nearly as much (in fact almost not at all):
base: max: 123900 K avg: 56388.29 K
rework: max: 300 K avg: 128.68 K

kswapd and direct reclaim statistics are of no use unfortunately, because
soft reclaim is not accounted properly; the counters are hidden by
global_reclaim() checks in the base kernel.

* TEST 3
========

Another test used the same configuration as TEST2, except the stream IO
was replaced by a single kbuild (16 parallel jobs bound to Node0 CPUs,
same as in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only
200MB left.

Kbuild did better with the rework kernel here as well:
User
base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
System
base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
Elapsed
base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
Minor
base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
Major
base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3

Again we can see a significant improvement in Elapsed (it also seems to
be more stable), a huge drop in Major page faults, and much more swapping
in the base kernel:
base: max: 583736 K avg: 112547.43 K
rework: max: 4012 K avg: 124.36 K

Graphs from all three runs show the variability of the kbuild quite
nicely.  It even seems that it took longer after every run with the base
kernel, which would be quite surprising as the source tree for the build
is removed and caches are dropped after each run, so the build operates
on freshly extracted sources every time.
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png

My other testing shows that this is just a matter of timing and other runs
behave differently; the std for Elapsed time is similarly ~50.  An example
of three other runs:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png

So to wrap this up: the series is still doing well and improves soft
limit reclaim.

The testing results for a bunch of cgroups with both stream IO and kbuild
loads can be found in "memcg: track children in soft limit excess to
improve soft limit".

This patch:

Memcg soft reclaim has traditionally been triggered from the global
reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
then picked up the group which exceeded the soft limit the most and
reclaimed it with priority 0 to reclaim at least SWAP_CLUSTER_MAX pages.

The infrastructure requires per-node-zone trees which hold over-limit
groups and keep them up-to-date (via memcg_check_events), which is not
cost free.  Although this overhead hasn't turned out to be a bottleneck,
the implementation is suboptimal because mem_cgroup_update_tree has no
idea which zones consumed memory over the limit, so we could easily end
up having a group on a node-zone tree with only a few pages from that
node-zone.

This patch doesn't try to fix node-zone tree management because
integrating soft reclaim into zone shrinking seems much easier and more
appropriate for several reasons.  First of all, priority-0 reclaim was a
crude hack which might lead to big stalls if the group's LRUs are big and
hard to reclaim (e.g.  a lot of dirty/writeback pages).  Soft reclaim
should also be applicable to targeted reclaim, which is awkward right now
without additional hacks.  Last but not least, the whole infrastructure
eats quite some code.

After this patch shrink_zone is done in 2 passes.  First it tries to do
the soft reclaim if appropriate (only for global reclaim for now to stay
compatible with the original behavior) and falls back to ignoring the
soft limit if no group is eligible for soft reclaim or nothing has been
scanned during the first pass.  Only groups which are over their soft
limit, or which have a parent up the hierarchy over the limit, are
considered eligible during the first pass.

The soft limit tree, which is no longer necessary, will be removed in a
follow-up patch to keep this patch smaller and easier to review.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:00 -07:00