linux/mm
Johannes Weiner a983b5ebee mm: memcontrol: fix excessive complexity in memory.stat reporting
We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.

Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy.  The complexity is this:

	nr_cgroups * nr_stat_items * nr_possible_cpus

where the stat items are ~70 at this point.  With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters.  This doesn't scale.
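As a quick check of the arithmetic above, the sketch below multiplies out the three factors. It assumes exactly 70 stat items, standing in for the "~70" in the text; the function name is hypothetical.

```c
#include <assert.h>

/* Number of per-cpu counters a top-level memory.stat read has to sum,
 * following nr_cgroups * nr_stat_items * nr_possible_cpus. */
unsigned long counters_to_aggregate(unsigned long nr_cgroups,
                                    unsigned long nr_stat_items,
                                    unsigned long nr_possible_cpus)
{
	return nr_cgroups * nr_stat_items * nr_possible_cpus;
}

/* The example setup (128 cgroups, ~70 items, 128 CPUs) really does
 * exceed a million counters; 70 is an assumed exact item count. */
static_assert(128UL * 70UL * 128UL > 1000000UL,
	      "example setup aggregates over a million counters");
```

For 128 cgroups, 70 items and 128 CPUs this comes to 1,146,880 counters per read, and every one of them must be visited on each read of the top-level memory.stat.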

Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.

This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.
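A minimal userspace model of the batching scheme described above might look like the following. It is a sketch, not the kernel code: the CPU count, the batch value of 32, and the names `mod_stat()`/`read_stat()` are hypothetical, and plain arrays indexed by a CPU number stand in for real per-cpu variables. Each CPU accumulates deltas locally and only spills them into the shared atomic counter once the local value exceeds the batch size, so a reader loads one atomic instead of walking every CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical values for this userspace model. */
#define SIM_NR_CPUS 4
#define STAT_BATCH 32L	/* assumed to mirror the page_counter charge batch */

/* Shared source of truth for one stat item. */
static atomic_long shared_stat = 0;
/* Per-cpu batching buffers; each slot is only touched by "its" CPU. */
static long percpu_pending[SIM_NR_CPUS];

/* Apply @delta on @cpu; flush to the shared atomic once the local
 * buffer exceeds the batch size in either direction. */
void mod_stat(int cpu, long delta)
{
	long x;

	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	x = percpu_pending[cpu] + delta;
	if (labs(x) > STAT_BATCH) {
		atomic_fetch_add(&shared_stat, x);
		x = 0;
	}
	percpu_pending[cpu] = x;
}

/* Reading is O(1): no loop over CPUs. Unflushed per-cpu deltas can
 * leave the shared value transiently negative, so clamp at zero. */
long read_stat(void)
{
	long x = atomic_load(&shared_stat);

	return x < 0 ? 0 : x;
}
```

Each CPU's unflushed remainder stays within plus or minus `STAT_BATCH`, so the reader trades a bounded error for not having to visit every CPU.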

Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error.  That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g.  NR_FREE_PAGES in the allocator).  Neither is
true for the memory controller.

Use the same static batch size we already use for page_counter updates
during charging.  The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.
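The 128k figure follows from the batch size. The sketch below assumes 4 KiB pages and a 32-page batch (the charge batch used by kernels of this era, treated here as an assumption); the helper names are hypothetical. Because a CPU only flushes once its local value exceeds the batch, at most one full batch can sit unflushed on each CPU, so the worst case for a single counter scales with the number of CPUs.

```c
#include <assert.h>

/* Assumptions: 4 KiB pages and a 32-page batch. */
#define ASSUMED_PAGE_SIZE 4096UL
#define ASSUMED_BATCH_PAGES 32UL

/* Largest delta one CPU can hold without flushing. */
unsigned long percpu_error_bytes(void)
{
	return ASSUMED_BATCH_PAGES * ASSUMED_PAGE_SIZE;
}

/* Worst-case drift of one stat counter if every CPU holds a full batch. */
unsigned long worst_case_error_bytes(unsigned long nr_cpus)
{
	return nr_cpus * percpu_error_bytes();
}

static_assert(ASSUMED_BATCH_PAGES * ASSUMED_PAGE_SIZE == 128UL * 1024UL,
	      "32 pages of 4 KiB is 128 KiB per CPU");
```

On the 128-CPU machine from the earlier example, that is at most 16 MiB of drift per counter, regardless of how much memory the cgroup holds.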

[hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls]
  Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:36 -08:00
kasan kasan: use %px to print addresses instead of %p 2017-11-29 12:13:16 +11:00
backing-dev.c Revert "bdi: add error handle for bdi_debug_register" 2017-12-21 10:01:30 -07:00
balloon_compaction.c
bootmem.c
cleancache.c
cma_debug.c
cma.c
cma.h
compaction.c mm, compaction: remove unneeded pageblock_skip_persistent() checks 2017-11-17 16:10:00 -08:00
debug_page_ref.c
debug.c mm/debug.c: provide useful debugging information for VM_BUG 2018-01-04 16:45:09 -08:00
dmapool.c
early_ioremap.c mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep 2017-12-11 14:54:44 +01:00
fadvise.c
failslab.c
filemap.c mm/filemap.c: remove include of hardirq.h 2018-01-31 17:18:36 -08:00
frame_vector.c mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()' 2017-12-14 16:00:48 -08:00
frontswap.c
gup_benchmark.c mm: add infrastructure for get_user_pages_fast() benchmarking 2017-11-17 16:10:04 -08:00
gup.c Merge branch 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-01-31 10:01:08 -08:00
highmem.c
hmm.c Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" 2017-12-15 18:53:22 -08:00
huge_memory.c Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" 2017-12-15 18:53:22 -08:00
hugetlb_cgroup.c
hugetlb.c mm: show total hugetlb memory consumption in /proc/meminfo 2018-01-31 17:18:36 -08:00
hwpoison-inject.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
init-mm.c
internal.h Revert "mm, thp: Do not make pmd/pud dirty without a reason" 2017-11-29 09:01:01 -08:00
interval_tree.c
Kconfig mm: relax deferred struct page requirements 2018-01-31 17:18:36 -08:00
Kconfig.debug
khugepaged.c Revert "mm, thp: Do not make pmd/pud dirty without a reason" 2017-11-29 09:01:01 -08:00
kmemleak-test.c
kmemleak.c mm: kmemleak: remove unused hardirq.h 2018-01-31 17:18:36 -08:00
ksm.c mm/ksm: Remove now-redundant smp_read_barrier_depends() 2017-12-04 10:52:56 -08:00
list_lru.c
maccess.c
madvise.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
Makefile mm: add infrastructure for get_user_pages_fast() benchmarking 2017-11-17 16:10:04 -08:00
memblock.c
memcontrol.c mm: memcontrol: fix excessive complexity in memory.stat reporting 2018-01-31 17:18:36 -08:00
memory_hotplug.c mm: drop hotplug lock from lru_add_drain_all() 2018-01-31 17:18:36 -08:00
memory-failure.c signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr 2018-01-23 12:17:48 -06:00
memory.c mm/memory.c: release locked page in do_swap_page() 2018-01-19 10:09:40 -08:00
mempolicy.c mm/mempolicy: add nodes_empty check in SYSC_migrate_pages 2018-01-31 17:18:36 -08:00
mempool.c
memtest.c
migrate.c Revert "mm, thp: Do not make pmd/pud dirty without a reason" 2017-11-29 09:01:01 -08:00
mincore.c
mlock.c mm: Eliminate cond_resched_rcu_qs() in favor of cond_resched() 2017-11-28 16:00:28 -08:00
mm_init.c
mmap.c mm, oom_reaper: fix memory corruption 2017-12-14 16:00:49 -08:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c mm/mprotect: add a cond_resched() inside change_pmd_range() 2018-01-04 16:45:09 -08:00
mremap.c
msync.c
nobootmem.c
nommu.c
oom_kill.c mm, oom_reaper: fix memory corruption 2017-12-14 16:00:49 -08:00
page_alloc.c mm: split deferred_init_range into initializing and freeing parts 2018-01-31 17:18:36 -08:00
page_counter.c
page_ext.c
page_idle.c
page_io.c block: convert to bio_first_bvec_all & bio_first_page_all 2018-01-06 09:18:00 -07:00
page_isolation.c
page_owner.c mm/page_owner.c: remove drain_all_pages from init_early_allocated_pages 2018-01-19 10:09:40 -08:00
page_poison.c
page_vma_mapped.c mm, page_vma_mapped: Introduce pfn_in_hpage() 2018-01-22 12:15:57 -08:00
page-writeback.c Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" 2017-11-29 18:40:43 -08:00
pagewalk.c
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c percpu: hack to let the CRIS architecture to boot until they clean up 2017-11-27 12:53:12 -08:00
pgtable-generic.c
process_vm_access.c
quicklist.c
readahead.c
rmap.c
rodata_test.c
shmem.c Rename superblock flags (MS_xyz -> SB_xyz) 2017-11-27 13:05:09 -08:00
slab_common.c mm/slab_common.c: make calculate_alignment() static 2018-01-31 17:18:35 -08:00
slab.c mm/slab.c: remove redundant assignments for slab_state 2018-01-31 17:18:35 -08:00
slab.h mm/slab_common.c: make calculate_alignment() static 2018-01-31 17:18:35 -08:00
slob.c
slub.c slub: remove obsolete comments of put_cpu_partial() 2018-01-31 17:18:36 -08:00
sparse-vmemmap.c
sparse.c mm/sparse.c: wrong allocation for mem_section 2018-01-04 16:45:09 -08:00
swap_cgroup.c
swap_slots.c
swap_state.c
swap.c mm: drop hotplug lock from lru_add_drain_all() 2018-01-31 17:18:36 -08:00
swapfile.c ipc, kernel, mm: annotate ->poll() instances 2017-11-27 16:20:05 -05:00
truncate.c
usercopy.c
userfaultfd.c
util.c new primitive: vmemdup_user() 2018-01-07 13:06:15 -05:00
vmacache.c
vmalloc.c
vmpressure.c
vmscan.c mm: use sc->priority for slab shrink targets 2018-01-31 17:18:36 -08:00
vmstat.c
workingset.c
z3fold.c mm/z3fold.c: use kref to prevent page free/compact race 2017-11-17 16:10:00 -08:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc.c: include fs.h 2018-01-04 16:45:09 -08:00
zswap.c zswap: same-filled pages handling 2018-01-31 17:18:36 -08:00