linux/mm
Kemi Wang 3a321d2a3d mm: change the call sites of numa statistics items
Patch series "Separate NUMA statistics from zone statistics", v2.

Each page allocation updates a set of per-zone statistics with a call to
zone_statistics().  As discussed in 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed.  This significant overhead in cache bouncing caused by zone
counters (NUMA associated counters) update in parallel in multi-threaded
page allocation (pointed out by Dave Hansen).

A link to the MM summit slides:
  http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

To mitigate this overhead, this patchset separates NUMA statistics from
zone statistics framework, and update NUMA counter threshold to a fixed
size of MAX_U16 - 2, as a small threshold greatly increases the update
frequency of the global counter from local per cpu counter (suggested by
Ying Huang).  The rationality is that these statistics counters don't
need to be read often, unlike other VM counters, so it's not a problem
to use a large threshold and make readers more expensive.

With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
below) for per single page allocation and reclaim on Jesper's
page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effects (only
move the numa stats to show behind zone page stats, see the first patch
for details).

I did an experiment of single page allocation and reclaim concurrently
using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
server (88 processors with 126G memory) with different size of threshold
of pcp counter.

Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
  https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

   Threshold   CPU cycles    Throughput(88 threads)
      32        799         241760478
      64        640         301628829
      125       537         358906028 <==> system by default
      256       468         412397590
      512       428         450550704
      4096      399         482520943
      20000     394         489009617
      30000     395         488017817
      65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
      N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics

This patch (of 3):

In this patch, NUMA statistics is separated from zone statistics
framework, all the call sites of NUMA stats are changed to use
numa-stats-specific functions, it does not have any functionality change
except that the number of NUMA stats is shown behind zone page stats
when users *read* the zone info.

E.g. cat /proc/zoneinfo
    ***Base***                           ***With this patch***
nr_free_pages 3976                         nr_free_pages 3976
nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
nr_zone_active_anon 0                      nr_zone_active_anon 0
nr_zone_inactive_file 0                    nr_zone_inactive_file 0
nr_zone_active_file 0                      nr_zone_active_file 0
nr_zone_unevictable 0                      nr_zone_unevictable 0
nr_zone_write_pending 0                    nr_zone_write_pending 0
nr_mlock     0                             nr_mlock     0
nr_page_table_pages 0                      nr_page_table_pages 0
nr_kernel_stack 0                          nr_kernel_stack 0
nr_bounce    0                             nr_bounce    0
nr_zspages   0                             nr_zspages   0
numa_hit 0                                *nr_free_cma  0*
numa_miss 0                                numa_hit     0
numa_foreign 0                             numa_miss    0
numa_interleave 0                          numa_foreign 0
numa_local   0                             numa_interleave 0
numa_other   0                             numa_local   0
*nr_free_cma 0*                            numa_other 0
    ...                                        ...
vm stats threshold: 10                     vm stats threshold: 10
    ...                                        ...

The next patch updates the numa stats counter size and threshold.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Ying Huang <ying.huang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:47 -07:00
..
kasan Merge branch 'linus' into locking/core, to pick up fixes 2017-08-10 12:20:53 +02:00
Kconfig mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
Kconfig.debug mm: enable page poisoning early at boot 2017-05-03 15:52:10 -07:00
Makefile mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
backing-dev.c bdi: Drop 'parent' argument from bdi_register[_va]() 2017-04-20 12:09:55 -06:00
balloon_compaction.c mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY 2017-09-08 18:26:46 -07:00
bootmem.c
cleancache.c fs: switch ->s_uuid to uuid_t 2017-06-05 16:59:12 +02:00
cma.c cma: fix calculation of aligned offset 2017-07-10 16:32:32 -07:00
cma.h cma: Store a name in the cma structure 2017-04-18 20:41:12 +02:00
cma_debug.c mm/cma_debug.c: fix stack corruption due to sprintf usage 2017-08-18 15:32:02 -07:00
compaction.c mm, compaction: skip over holes in __reset_isolation_suitable 2017-07-06 16:24:32 -07:00
debug.c mm: make tlb_flush_pending global 2017-08-10 15:54:07 -07:00
debug_page_ref.c
dmapool.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
early_ioremap.c x86/mm: Add support to access boot related data in the clear 2017-07-18 11:38:02 +02:00
fadvise.c
failslab.c
filemap.c Merge branch 'akpm' (patches from Andrew) 2017-09-06 20:49:49 -07:00
frame_vector.c treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00
frontswap.c
gup.c mm/device-public-memory: device memory cache coherent with CPU 2017-09-08 18:26:46 -07:00
highmem.c
hmm.c mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
huge_memory.c mm: soft-dirty: keep soft-dirty bits over thp migration 2017-09-08 18:26:45 -07:00
hugetlb.c powerpc updates for 4.14 2017-09-07 10:15:40 -07:00
hugetlb_cgroup.c
hwpoison-inject.c mm: hwpoison: call shake_page() unconditionally 2017-05-03 15:52:12 -07:00
init-mm.c
internal.h mm, oom: do not rely on TIF_MEMDIE for memory reserves access 2017-09-06 17:27:30 -07:00
interval_tree.c
khugepaged.c mm: make PR_SET_THP_DISABLE immediately active 2017-07-10 16:32:31 -07:00
kmemcheck.c mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU 2017-04-18 11:42:36 -07:00
kmemleak-test.c
kmemleak.c mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects 2017-07-06 16:24:34 -07:00
ksm.c mm/ksm.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
list_lru.c mm/list_lru.c: fix list_lru_count_node() to be race free 2017-07-10 16:32:33 -07:00
maccess.c
madvise.c mm/device-public-memory: device memory cache coherent with CPU 2017-09-08 18:26:46 -07:00
memblock.c mm/memblock.c: reversed logic in memblock_discard() 2017-08-25 16:12:46 -07:00
memcontrol.c mm/device-public-memory: device memory cache coherent with CPU 2017-09-08 18:26:46 -07:00
memory-failure.c x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages 2017-08-17 10:30:49 +02:00
memory.c mm/memory.c: remove reduntant check for write access 2017-09-08 18:26:47 -07:00
memory_hotplug.c mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory 2017-09-08 18:26:46 -07:00
mempolicy.c mm: remove useless vma parameter to offset_il_node 2017-09-08 18:26:46 -07:00
mempool.c sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
memtest.c
migrate.c mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
mincore.c mm: remove shmem_mapping() shmem_zero_setup() duplicates 2017-02-24 17:46:56 -08:00
mlock.c mlock: fix mlock count can not decrease in race condition 2017-06-02 15:07:38 -07:00
mm_init.c
mmap.c mm: oom: let oom_reap_task and exit_mmap run concurrently 2017-09-06 17:27:30 -07:00
mmu_context.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
mmu_notifier.c mm/mmu_notifier: kill invalidate_page 2017-08-31 16:13:00 -07:00
mmzone.c
mprotect.c mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory 2017-09-08 18:26:46 -07:00
mremap.c mm: thp: check pmd migration entry in common path 2017-09-08 18:26:45 -07:00
msync.c
nobootmem.c mm: discard memblock data later 2017-08-18 15:32:01 -07:00
nommu.c mm: rename global_page_state to global_zone_page_state 2017-09-06 17:27:29 -07:00
oom_kill.c mm: oom: let oom_reap_task and exit_mmap run concurrently 2017-09-06 17:27:30 -07:00
page-writeback.c mm: rename global_page_state to global_zone_page_state 2017-09-06 17:27:29 -07:00
page_alloc.c mm: change the call sites of numa statistics items 2017-09-08 18:26:47 -07:00
page_counter.c
page_ext.c mm, page_ext: periodically reschedule during page_ext_init() 2017-09-06 17:27:26 -07:00
page_idle.c mm/page_idle.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
page_io.c Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block 2017-09-07 11:59:42 -07:00
page_isolation.c mm: unify new_node_page and alloc_migrate_target 2017-07-10 16:32:31 -07:00
page_owner.c mm, page_owner: don't grab zone->lock for init_pages_in_zone() 2017-09-06 17:27:26 -07:00
page_poison.c mm: enable page poisoning early at boot 2017-05-03 15:52:10 -07:00
page_vma_mapped.c mm/migrate: support un-addressable ZONE_DEVICE page in migration 2017-09-08 18:26:46 -07:00
pagewalk.c mm/hugetlb: add size parameter to huge_pte_offset() 2017-07-06 16:24:34 -07:00
percpu-internal.h percpu: skip chunks if the alloc does not fit in the contig hint 2017-07-26 17:41:05 -04:00
percpu-km.c percpu: replace area map allocator with bitmap 2017-07-26 17:41:05 -04:00
percpu-stats.c percpu: add first_bit to keep track of the first free in the bitmap 2017-07-26 17:41:05 -04:00
percpu-vm.c percpu: fix static checker warnings in pcpu_destroy_chunk 2017-06-29 11:23:38 -04:00
percpu.c percpu: update header to contain bitmap allocator explanation. 2017-07-26 17:41:06 -04:00
pgtable-generic.c mm: thp: enable thp migration in generic path 2017-09-08 18:26:45 -07:00
process_vm_access.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> 2017-03-02 08:42:28 +01:00
quicklist.c
readahead.c
rmap.c mm/migrate: support un-addressable ZONE_DEVICE page in migration 2017-09-08 18:26:46 -07:00
rodata_test.c mm: remove rodata_test_data export, add pr_fmt 2017-05-03 15:52:09 -07:00
shmem.c mm, swap: VMA based swap readahead 2017-09-06 17:27:29 -07:00
slab.c mm: memcontrol: account slab stats per lruvec 2017-07-06 16:24:35 -07:00
slab.h locking/lockdep: Rework FS_RECLAIM annotation 2017-08-10 12:29:03 +02:00
slab_common.c mm: allow slab_nomerge to be set at build time 2017-07-06 16:24:31 -07:00
slob.c locking/lockdep: Rework FS_RECLAIM annotation 2017-08-10 12:29:03 +02:00
slub.c mm/slub.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
sparse-vmemmap.c mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations 2017-09-06 17:27:26 -07:00
sparse.c mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations 2017-09-06 17:27:26 -07:00
swap.c mm/device-public-memory: device memory cache coherent with CPU 2017-09-08 18:26:46 -07:00
swap_cgroup.c mm, THP, swap: delay splitting THP during swap out 2017-07-06 16:24:31 -07:00
swap_slots.c mm/swap_slots.c: don't disable preemption while taking the per-CPU cache 2017-07-10 16:32:32 -07:00
swap_state.c mm, swap: add sysfs interface for VMA based swap readahead 2017-09-06 17:27:29 -07:00
swapfile.c swap: choose swap device according to numa node 2017-09-06 17:27:30 -07:00
truncate.c mm/truncate.c: fix THP handling in invalidate_mapping_pages() 2017-07-10 16:32:32 -07:00
usercopy.c mm/usercopy: Drop extra is_vmalloc_or_module() check 2017-04-05 12:30:18 -07:00
userfaultfd.c userfaultfd: shmem: wire up shmem_mfill_zeropage_pte 2017-09-06 17:27:28 -07:00
util.c mm: rename global_page_state to global_zone_page_state 2017-09-06 17:27:29 -07:00
vmacache.c sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
vmalloc.c mm/vmalloc.c: don't reinvent the wheel but use existing llist API 2017-09-06 17:27:29 -07:00
vmpressure.c mm, vmpressure: pass-through notification support 2017-07-10 16:32:31 -07:00
vmscan.c mm, THP, swap: add THP swapping out fallback counting 2017-09-06 17:27:28 -07:00
vmstat.c mm: change the call sites of numa statistics items 2017-09-08 18:26:47 -07:00
workingset.c mm: memcontrol: per-lruvec stats infrastructure 2017-07-06 16:24:35 -07:00
z3fold.c z3fold: use per-cpu unbuddied lists 2017-09-06 17:27:30 -07:00
zbud.c
zpool.c
zsmalloc.c mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY 2017-09-08 18:26:46 -07:00
zswap.c mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare() 2017-07-06 16:24:35 -07:00