linux/fs
Davidlohr Bueso efb5fea230 mm: per-thread vma caching
commit 615d6e8756 upstream.

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-09 12:21:29 -07:00
..
9p Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-01-28 08:38:04 -08:00
adfs adfs: delayed freeing of sbi 2013-10-24 23:43:27 -04:00
affs fs/affs/super.c: bugfix / double free 2014-06-07 10:28:16 -07:00
afs afs: proc cells and rootcell are writeable 2014-02-01 10:59:39 -08:00
autofs4 autofs: fix lockref lookup 2014-06-07 10:28:20 -07:00
befs befs: iget_locked() doesn't return an ERR_PTR 2014-01-25 03:14:38 -05:00
bfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
btrfs Btrfs: fix crash on endio of reading corrupted block 2014-09-05 16:34:16 -07:00
cachefiles Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
ceph ceph: clear directory's completeness when creating file 2014-06-07 10:28:20 -07:00
cifs CIFS: Fix SMB2 readdir error handling 2014-10-09 12:21:27 -07:00
coda coda_revalidate_inode(): switch to passing inode... 2013-11-09 00:16:21 -05:00
configfs configfs: fix race between dentry put and lookup 2013-11-21 16:42:27 -08:00
cramfs cramfs: take headers to fs/cramfs 2014-01-25 03:13:02 -05:00
debugfs debugfs: Fix corrupted loop in debugfs_remove_recursive 2014-09-05 16:34:15 -07:00
devpts devpts: plug the memory leak in kill_sb 2013-11-13 12:09:36 +09:00
dlm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2014-01-25 11:17:34 -08:00
ecryptfs ecryptfs: fix failure handling in ->readlink() 2014-01-25 03:13:00 -05:00
efivarfs consolidate simple ->d_delete() instances 2013-11-15 22:04:17 -05:00
efs efs: get rid of ->put_super() 2014-01-25 03:13:02 -05:00
exofs exofs: Print less in r4w 2014-01-23 18:54:14 +02:00
exportfs exportfs: fix quadratic behavior in filehandle lookup 2013-11-09 00:16:38 -05:00
ext2 ext2/3/4: use generic posix ACL infrastructure 2014-01-25 23:58:19 -05:00
ext3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-01-28 08:38:04 -08:00
ext4 jbd2: fix descriptor block size handling errors with journal_csum 2014-09-05 16:34:17 -07:00
f2fs Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block 2014-01-30 11:19:05 -08:00
fat fat: rcu-delay unloading nls and freeing sbi 2013-10-24 23:43:28 -04:00
freevxfs
fscache FS-Cache: Handle removal of unadded object to the fscache_object_list rb tree 2014-02-17 13:47:35 -08:00
fuse fuse: ignore entry-timeout on LOOKUP_REVAL 2014-07-28 08:05:57 -07:00
gfs2 GFS2: fix d_splice_alias() misuses 2014-10-05 14:52:22 -07:00
hfs fs/hfs/btree.h: remove duplicate defines 2013-11-13 12:09:32 +09:00
hfsplus hfsplus: add HFSX subfolder count support 2014-03-10 17:26:21 -07:00
hostfs um: hostfs: make functions static 2014-01-26 11:51:09 +01:00
hpfs hpfs: optimize quad buffer loading 2014-02-02 16:24:07 -08:00
hppfs clean up scary strncpy(dst, src, strlen(src)) uses 2013-07-03 16:07:41 -07:00
hugetlbfs hugetlb: ensure hugepage access is denied if hugepages are not supported 2014-10-09 12:21:27 -07:00
isofs isofs: Fix unbounded recursion when processing relocated directories 2014-09-05 16:34:12 -07:00
jbd jbd: Revise KERN_EMERG error messages 2013-12-04 12:27:46 +01:00
jbd2 jbd2: fix descriptor block size handling errors with journal_csum 2014-09-05 16:34:17 -07:00
jffs2 jffs2: remove from wait queue after schedule() 2014-04-26 17:19:06 -07:00
jfs jfs: set i_ctime when setting ACL 2014-02-13 15:56:05 -06:00
kernfs kernfs: add back missing error check in kernfs_fop_mmap() 2014-06-07 10:28:08 -07:00
lockd lockd: fix rpcbind crash on lockd startup failure 2014-10-05 14:52:20 -07:00
logfs Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block 2014-01-30 11:19:05 -08:00
minix fs/minix: Drop dependency on H8300 2013-09-16 18:20:25 -07:00
ncpfs ncpfs: rcu-delay unload_nls() and freeing ncp_server 2013-10-24 23:43:28 -04:00
nfs NFSv4: Fix another bug in the close/open_downgrade code 2014-10-05 14:52:15 -07:00
nfs_common
nfsd svcrdma: Select NFSv4.1 backchannel transport based on forward channel 2014-09-05 16:34:18 -07:00
nilfs2 nilfs2: fix data loss with mmap() 2014-10-05 14:52:21 -07:00
nls nls: have register_nls() set ->owner 2014-01-25 03:14:05 -05:00
notify fs/notify: don't show f_handle if exportfs_encode_inode_fh failed 2014-10-05 14:52:21 -07:00
ntfs fix O_SYNC|O_APPEND syncing the wrong range on write() 2014-02-09 15:18:09 -05:00
ocfs2 ocfs2/dlm: do not get resource spinlock if lockres is new 2014-10-05 14:52:21 -07:00
omfs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
openpromfs
proc mm: per-thread vma caching 2014-10-09 12:21:29 -07:00
pstore pstore: Don't allow high traffic options on fragile devices 2013-12-20 13:12:01 -08:00
qnx4 qnx4: clean qnx4_fill_super() up 2014-01-25 03:13:03 -05:00
qnx6
quota quota: missing lock in dqcache_shrink_scan() 2014-07-28 08:05:58 -07:00
ramfs fs/ramfs: move ramfs_aops to inode.c 2014-01-23 16:36:58 -08:00
reiserfs reiserfs: call truncate_setsize under tailpack mutex 2014-07-06 18:57:29 -07:00
romfs romfs: fix returm err while getting inode in fill_super 2014-01-23 16:37:04 -08:00
squashfs Squashfs: fix failure to unlock pages on decompress error 2013-11-24 01:02:50 +00:00
sysfs sysfs: make sure read buffer is zeroed 2014-06-07 10:28:24 -07:00
sysv sysv: Add forgotten superblock lock init for v7 fs 2013-09-29 22:02:02 -04:00
ubifs UBIFS: Remove incorrect assertion in shrink_tnc() 2014-07-06 18:57:26 -07:00
udf udf: Avoid infinite loop when processing indirect ICBs 2014-10-09 12:21:26 -07:00
ufs truncate: drop 'oldsize' truncate_pagecache() parameter 2013-09-12 15:38:02 -07:00
xfs xfs: don't zero partial page cache pages during O_DIRECT writes 2014-09-17 09:19:25 -07:00
Kconfig fs: remove generic_acl 2014-01-26 08:26:40 -05:00
Kconfig.binfmt
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-01-28 08:38:04 -08:00
aio.c aio: block exit_aio() until all context requests are completed 2014-10-05 14:52:24 -07:00
anon_inodes.c vfs: Allocate anon_inode_inode in anon_inode_init() 2014-03-27 09:52:54 -07:00
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-16 13:40:32 -07:00
bad_inode.c
binfmt_aout.c dump_skip(): dump_seek() replacement taking coredump_params 2013-11-09 00:16:26 -05:00
binfmt_elf.c fs: binfmt_elf: remove unused defines INTERPRETER_NONE and INTERPRETER_ELF 2014-01-23 16:36:58 -08:00
binfmt_elf_fdpic.c elf{,_fdpic} coredump: get rid of pointless if (siginfo->si_signo) 2013-11-09 00:16:30 -05:00
binfmt_em86.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world 2014-02-21 15:56:36 -08:00
bio.c block: Fix cloning of discard/write same bios 2014-02-11 08:40:45 -07:00
block_dev.c a trivial writeback fix 2013-09-13 23:06:40 -04:00
buffer.c Fix nasty 32-bit overflow bug in buffer i/o code. 2014-10-05 14:52:22 -07:00
char_dev.c Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block 2013-11-14 12:08:14 +09:00
compat.c
compat_binfmt_elf.c
compat_ioctl.c fs/compat_ioctl.c: fix an underflow issue (harmless) 2014-01-21 16:19:42 -08:00
coredump.c coredump: fix the setting of PF_DUMPCORE 2014-07-31 12:52:56 -07:00
dcache.c vfs: fix bad hashing of dentries 2014-09-17 09:19:29 -07:00
dcookies.c fs/compat: fix lookup_dcookie() parameter handling 2014-01-29 16:22:40 -08:00
direct-io.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
drop_caches.c shrinker: add node awareness 2013-09-10 18:56:31 -04:00
eventfd.c eventfd_ctx_fdget(): use fdget() instead of fget() 2014-01-25 03:13:04 -05:00
eventpoll.c eventpoll: fix uninitialized variable in epoll_ctl 2014-10-05 14:52:20 -07:00
exec.c mm: per-thread vma caching 2014-10-09 12:21:29 -07:00
fcntl.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
fhandle.c
file.c vfs: Don't let __fdget_pos() get FMODE_PATH files 2014-03-23 00:03:12 -04:00
file_table.c don't bother with {get,put}_write_access() on non-regular files 2014-05-31 13:20:29 -07:00
filesystems.c
fs-writeback.c bdi: avoid oops on device removal 2014-04-26 17:19:05 -07:00
fs_struct.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
inode.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-16 13:40:32 -07:00
internal.h get rid of s_files and files_lock 2013-11-09 00:16:20 -05:00
ioctl.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
ioprio.c
libfs.c consolidate simple ->d_delete() instances 2013-11-15 22:04:17 -05:00
locks.c locks: allow __break_lease to sleep even when break_time is 0 2014-05-13 13:32:53 +02:00
mbcache.c fs: convert fs shrinkers to new scan/count API 2013-09-10 18:56:31 -04:00
mount.h switch mnt_hash to hlist 2014-03-30 19:18:51 -04:00
mpage.c block: Abstract out bvec iterator 2013-11-23 22:33:47 -08:00
namei.c don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() 2014-10-05 14:52:21 -07:00
namespace.c fix copy_tree() regression 2014-09-17 09:19:23 -07:00
no-block.c
open.c don't bother with {get,put}_write_access() on non-regular files 2014-05-31 13:20:29 -07:00
pipe.c fs/pipe.c: skip file_update_time on frozen fs 2014-01-23 16:37:00 -08:00
pnode.c get rid of propagate_umount() mistakenly treating slaves as busy. 2014-09-17 09:19:22 -07:00
pnode.h smarter propagate_mnt() 2014-05-06 07:59:36 -07:00
posix_acl.c posix_acl: handle NULL ACL in posix_acl_equiv_mode 2014-06-07 10:28:16 -07:00
proc_namespace.c fs/proc_namespace.c: simplify testing nsp and nsp->mnt_ns 2014-01-23 16:37:02 -08:00
read_write.c vfs: atomic f_pos access in llseek() 2014-03-23 00:03:12 -04:00
readdir.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
select.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
seq_file.c seq_file: always clear m->count when we free m->buf 2013-11-18 19:07:53 -08:00
signalfd.c
splice.c fuse: fix pipe_buf_operations 2014-01-22 19:36:57 +01:00
stack.c
stat.c vfs: split out vfs_getattr_nosec 2013-11-09 00:16:31 -05:00
statfs.c vfs: allow O_PATH file descriptors for fstatfs() 2013-10-12 13:12:31 -07:00
super.c fs: Don't return 0 from get_anon_bdev 2014-05-31 13:20:31 -07:00
sync.c Revert "writeback: do not sync data dirtied after sync start" 2014-02-22 02:02:28 +01:00
timerfd.c
utimes.c locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
xattr.c