linux/kernel
Davidlohr Bueso efb5fea230 mm: per-thread vma caching
commit 615d6e8756 upstream.

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-09 12:21:29 -07:00
..
cpu sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding 2014-01-13 17:38:55 +01:00
debug mm: per-thread vma caching 2014-10-09 12:21:29 -07:00
events perf: fix perf bug in fork() 2014-10-09 12:21:26 -07:00
gcov gcov: reuse kbasename helper 2013-11-13 12:09:34 +09:00
irq genirq: Sanitize spurious interrupt detection of threaded irqs 2014-06-30 20:12:00 -07:00
locking rtmutex: Plug slow unlock race 2014-06-30 20:11:58 -07:00
power PM / sleep: Use valid_state() for platform-dependent sleep states only 2014-10-05 14:52:24 -07:00
printk printk: rename printk_sched to printk_deferred 2014-08-07 14:52:37 -07:00
rcu Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-01-28 08:38:04 -08:00
sched sched: Fix sched_setparam() policy == -1 logic 2014-09-05 16:34:13 -07:00
time alarmtimer: Lock k_itimer during timer callback 2014-10-05 14:52:22 -07:00
trace ring-buffer: Fix infinite spin in reading buffer 2014-10-09 12:21:27 -07:00
.gitignore Ignore generated file kernel/x509_certificate_list 2013-12-10 18:21:34 +00:00
Kconfig.freezer
Kconfig.hz kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS 2013-11-15 09:32:22 +09:00
Kconfig.locks locking/mutex: Disable optimistic spinning on some architectures 2014-07-28 08:06:03 -07:00
Kconfig.preempt
Makefile KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y 2013-12-13 15:59:11 +00:00
acct.c fs: Fix hang with BSD accounting on frozen filesystem 2013-05-04 14:57:58 -04:00
async.c async: rename and redefine async_func_ptr 2013-03-12 13:59:14 -07:00
audit.c CAPABILITIES: remove undefined caps from all processes 2014-09-17 09:19:09 -07:00
audit.h audit: Use struct net not pid_t to remember the network namespce to reply in 2014-02-28 04:04:33 -08:00
audit_tree.c inotify: Fix reporting of cookies for inotify events 2014-02-18 11:17:17 +01:00
audit_watch.c inotify: Fix reporting of cookies for inotify events 2014-02-18 11:17:17 +01:00
auditfilter.c audit: Update kdoc for audit_send_reply and audit_list_rules_send 2014-03-08 15:31:54 -08:00
auditsc.c audit: remove superfluous new- prefix in AUDIT_LOGIN messages 2014-07-09 11:18:28 -07:00
backtracetest.c
bounds.c mm: do not allocate page->ptl dynamically, if spinlock_t fits to long 2013-12-20 12:25:45 -08:00
capability.c CAPABILITIES: remove undefined caps from all processes 2014-09-17 09:19:09 -07:00
cgroup.c cgroup: fix unbalanced locking 2014-10-05 14:52:17 -07:00
cgroup_freezer.c cgroup: replace cftype->read_seq_string() with cftype->seq_show() 2013-12-05 12:28:04 -05:00
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
configs.c proc: Supply PDE attribute setting accessor functions 2013-05-01 17:29:18 -04:00
context_tracking.c context_tracking: Wrap static key check into more intuitive function name 2013-12-02 20:43:14 +01:00
cpu.c sched: Fix hotplug vs. set_cpus_allowed_ptr() 2014-06-11 11:54:11 -07:00
cpu_pm.c
cpuset.c mm: optimize put_mems_allowed() usage 2014-10-09 12:21:28 -07:00
crash_dump.c
cred.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-12-18 10:55:28 -08:00
delayacct.c kernel/delayacct.c: remove redundant checking in __delayacct_add_tsk() 2013-11-13 12:09:12 +09:00
dma.c
elfcore.c switch elf_core_write_extra_phdrs() to dump_emit() 2013-11-09 00:16:23 -05:00
exec_domain.c
exit.c exit: call disassociate_ctty() before exit_task_namespaces() 2014-04-26 17:19:07 -07:00
extable.c kernel/extable: fix address-checks for core_kernel and init areas 2013-11-28 09:49:41 -08:00
fork.c mm: per-thread vma caching 2014-10-09 12:21:29 -07:00
freezer.c libata, freezer: avoid block device removal while system is frozen 2013-12-19 13:50:32 -05:00
futex.c futex: Unlock hb->lock in futex_wait_requeue_pi() error path 2014-10-05 14:52:19 -07:00
futex_compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-02-23 18:50:11 -08:00
groups.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
hrtimer.c hrtimer: Set expiry time before switch_hrtimer_base() 2014-06-07 10:28:12 -07:00
hung_task.c hung_task: Display every hung task warning 2014-01-25 12:13:33 +01:00
irq_work.c Merge branch 'nohz/printk-v8' into irq/core 2013-02-05 00:48:46 +01:00
itimer.c
jump_label.c static_key: WARN on usage before jump_label_init was called 2013-10-19 19:45:35 -04:00
kallsyms.c kernel: kallsyms: memory override issue, need check destination buffer length 2013-04-15 15:17:26 +09:30
kcmp.c kcmp: fix standard comparison bug 2014-10-05 14:52:20 -07:00
kexec.c powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode 2014-06-07 10:28:28 -07:00
kmod.c execve: use 'struct filename *' for executable name passing 2014-02-05 12:54:53 -08:00
kprobes.c kprobes: use KSYM_NAME_LEN to size identifier buffers 2013-11-13 12:09:26 +09:00
ksysfs.c kdump: fix exported size of vmcoreinfo note 2014-01-23 16:37:03 -08:00
kthread.c kthread: fix return value of kthread_create() upon SIGKILL. 2014-06-30 20:11:53 -07:00
latencytop.c
module-internal.h KEYS: Separate the kernel signature checking keyring from module signing 2013-09-25 17:17:01 +01:00
module.c module: remove warning about waiting module removal. 2014-06-07 10:28:11 -07:00
module_signing.c keys: change asymmetric keys to use common hash definitions 2013-10-25 17:15:18 -04:00
notifier.c
nsproxy.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2013-09-07 14:35:32 -07:00
padata.c padata: Fix wrong usage of rcu_dereference() 2013-12-05 21:28:42 +08:00
panic.c panic: Make panic_timeout configurable 2013-11-26 12:12:26 +01:00
params.c params: improve standard definitions 2013-12-04 14:09:46 +10:30
pid.c pidns: fix free_pid() to handle the first fork failure 2013-09-30 14:31:03 -07:00
pid_namespace.c pid_namespace: pidns_get() should check task_active_pid_ns() != NULL 2014-04-26 17:19:04 -07:00
posix-cpu-timers.c posix-timers: Convert abuses of BUG_ON to WARN_ON 2013-12-09 16:56:29 +01:00
posix-timers.c posix-timers: Remove unused variable 2013-04-18 12:51:19 +02:00
profile.c mm: fix GFP_THISNODE callers and clarify 2014-03-10 17:26:19 -07:00
ptrace.c exec/ptrace: fix get_dumpable() incorrect tests 2013-11-13 12:09:33 +09:00
range.c range: Do not add new blank slot with add_range_with_merge 2013-06-18 11:32:10 -05:00
reboot.c kexec: migrate to reboot cpu 2013-12-18 19:04:50 -08:00
relay.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
res_counter.c memcg: reduce function dereference 2013-09-12 15:38:02 -07:00
resource.c kernel/resource.c: remove the unneeded assignment in function __find_resource 2013-07-03 16:08:06 -07:00
seccomp.c seccomp: allow BPF_XOR based ALU instructions. 2013-03-26 11:07:19 +11:00
signal.c kernel/signal.c: change do_signal_stop/do_sigaction to use while_each_thread() 2014-01-23 16:37:02 -08:00
smp.c kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path 2014-09-17 09:19:09 -07:00
smpboot.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
smpboot.h
softirq.c Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-31 09:02:51 -08:00
stacktrace.c
stop_machine.c stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus() 2014-03-11 11:33:47 +01:00
sys.c kernel/sys.c: k_getrusage() can use while_each_thread() 2014-01-23 16:37:02 -08:00
sys_ni.c unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE 2013-05-09 13:46:38 -04:00
sysctl.c mm, pcp: allow restoring percpu_pagelist_fraction default 2014-07-09 11:18:26 -07:00
sysctl_binary.c kernel/sysctl_binary.c: use scnprintf() instead of snprintf() 2013-11-13 12:09:33 +09:00
system_certificates.S KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
system_keyring.c KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
task_work.c task_work: documentation 2013-09-11 15:58:27 -07:00
taskstats.c genetlink: only pass array to genl_register_family_with_ops() 2013-11-19 16:39:05 -05:00
test_kprobes.c kernel/: rename random32() to prandom_u32() 2013-04-29 18:28:42 -07:00
time.c jiffies: Fix timeval conversion to jiffies 2014-10-09 12:21:27 -07:00
timeconst.bc kernel: Replace timeconst.pl with a bc script 2013-02-16 23:17:25 +01:00
timer.c timer: Prevent overflow in apply_slack 2014-06-07 10:28:09 -07:00
tracepoint.c tracepoint: Do not waste memory on mods with no tracepoints 2014-05-31 13:20:28 -07:00
tsacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
uid16.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
up.c kernel: provide a __smp_call_function_single stub for !CONFIG_SMP 2013-11-15 09:32:22 +09:00
user-return-notifier.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
user.c KEYS: fix uninitialized persistent_keyring_register_sem 2013-12-13 15:59:11 +00:00
user_namespace.c user namespace: fix incorrect memory barriers 2014-04-26 17:19:02 -07:00
utsname.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
utsname_sysctl.c kernel/utsname_sysctl.c: put get/get_uts() into CONFIG_PROC_SYSCTL code block 2013-02-27 19:10:22 -08:00
watchdog.c kernel/watchdog.c: remove preemption restrictions when restarting lockup detector 2014-07-06 18:57:27 -07:00
workqueue.c workqueue: zero cpumask of wq_numa_possible_cpumask on init 2014-07-17 16:21:03 -07:00
workqueue_internal.h sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00