linux/fs
Johannes Weiner 4b85afbdac mm: zero-seek shrinkers
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.

Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.

This is a bogus setting.  A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.

In most cases, the behavior in practice is still fine.  Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems.  But there are scenarios where
that's not true.

Our database workloads suffer from two of those.  For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it.  The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting.  This is a huge waste of memory.

These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.

This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches.  This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.

It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.

Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
..
9p Pull request for inclusion in 4.19, take two 2018-08-17 17:27:58 -07:00
adfs adfs: use timespec64 for time conversion 2018-08-22 10:52:51 -07:00
affs affs: fix potential memory leak when parsing option 'prefix' 2018-05-28 12:36:41 +02:00
afs Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-10-19 11:03:06 -07:00
autofs Merge branch 'akpm' (patches from Andrew) 2018-08-22 12:34:08 -07:00
befs fix a series of Documentation/ broken file name references 2018-06-15 18:10:01 -03:00
bfs
btrfs Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-10-25 12:55:31 -07:00
cachefiles cachefiles: fix the race between cachefiles_bury_object() and rmdir(2) 2018-10-18 11:32:21 +02:00
ceph ceph: avoid a use-after-free in ceph_destroy_options() 2018-09-06 16:18:04 +02:00
cifs smb3: fix lease break problem introduced by compounding 2018-10-02 18:54:09 -05:00
coda vfs: change inode times to use struct timespec64 2018-06-05 16:57:31 -07:00
configfs configfs: fix registered group removal 2018-07-17 06:14:07 -07:00
cramfs cramfs: convert to use vmf_insert_mixed 2018-10-26 16:25:19 -07:00
crypto crypto: speck - remove Speck 2018-09-04 11:35:03 +08:00
debugfs Revert "debugfs: inode: debugfs_create_dir uses mode permission from parent" 2018-06-12 20:52:16 -07:00
devpts devpts: Convert to new IDA API 2018-08-21 23:54:17 -04:00
dlm treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
ecryptfs ecryptfs_rename(): verify that lower dentries are still OK after lock_rename() 2018-10-09 23:33:17 -04:00
efivarfs efivars: Call guid_parse() against guid_t type of variable 2018-07-22 14:13:44 +02:00
efs
exofs exofs: use bio_clone_fast in _write_mirror 2018-07-24 14:43:20 -06:00
exportfs
ext2 ext2, dax: set ext2_dax_aops for dax files 2018-09-19 15:03:04 +02:00
ext4 Further restructure ext4 documentation; fix up ext4's delayed 2018-10-24 17:42:24 +01:00
f2fs f2fs: fix to keep project quota consistent 2018-10-22 17:54:48 -07:00
fat fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters() 2018-10-13 09:31:03 +02:00
freevxfs
fscache fscache: Fix out of bound read in long cookie keys 2018-10-18 11:32:21 +02:00
fuse fuse update for 4.19 2018-08-21 18:47:36 -07:00
gfs2 We've got 18 patches for this merge window, none of which are very major. 2018-10-24 17:30:39 +01:00
hfs hfs: prevent crash on exit from failed search 2018-08-23 18:48:42 -07:00
hfsplus hfsplus: prevent crash on exit from failed search 2018-08-23 18:48:42 -07:00
hostfs vfs: discard ATTR_ATTR_FLAG 2018-08-17 16:20:28 -07:00
hpfs hpfs: remove unnecessary checks on the value of r when assigning error code 2018-08-25 12:42:33 -07:00
hugetlbfs mm: zero out the vma in vma_init() 2018-08-22 10:52:44 -07:00
isofs isofs: reject hardware sector size > 2048 bytes 2018-08-21 11:37:41 +02:00
jbd2 jbd2: fix use after free in jbd2_log_do_checkpoint() 2018-10-05 18:44:40 -04:00
jffs2 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2018-10-24 11:22:39 +01:00
jfs jfs: remove redundant dquot_initialize() in jfs_evict_inode() 2018-09-20 09:28:49 -05:00
kernfs mm: zero-seek shrinkers 2018-10-26 16:26:33 -07:00
lockd nfsd: fix leaked file lock with nfs exported overlayfs 2018-08-09 16:11:21 -04:00
minix
nfs NFS client bugfixes for Linux 4.19 2018-09-14 19:25:28 -10:00
nfs_common
nfsd vfs: swap names of {do,vfs}_clone_file_range() 2018-09-24 10:54:01 +02:00
nilfs2 nilfs2: convert to SPDX license tags 2018-09-04 16:45:02 -07:00
nls
notify fsnotify: fix ignore mask logic in fsnotify() 2018-09-03 14:57:41 +02:00
ntfs ntfs: mft: remove VLA usage 2018-08-17 16:20:27 -07:00
ocfs2 ocfs2: remove set but not used variable 'rb' 2018-10-26 16:25:18 -07:00
omfs
openpromfs
orangefs orangefs: no need to check for service_operation returns > 0 2018-10-18 14:05:46 -04:00
overlayfs ovl: fix format of setxattr debug 2018-10-04 14:49:10 +02:00
proc mm: zero-seek shrinkers 2018-10-26 16:26:33 -07:00
pstore pstore improvements: 2018-10-24 14:42:02 +01:00
qnx4
qnx6
quota fs/quota: Fix spectre gadget in do_quotactl 2018-08-22 18:17:48 +02:00
ramfs
reiserfs reiserfs: fix broken xattr handling (heap corruption, bad retval) 2018-08-22 10:52:50 -07:00
romfs
squashfs Squashfs: Compute expected length from inode size rather than block length 2018-08-02 09:34:02 -07:00
sysfs Driver core patches for 4.19-rc1 2018-08-18 11:44:53 -07:00
sysv fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp 2018-08-22 10:52:51 -07:00
tracefs tracefs: Annotate tracefs_ops with __ro_after_init 2018-07-31 11:32:44 -04:00
ubifs ubifs: Fix WARN_ON logic in exit path 2018-10-13 11:05:02 +02:00
udf udf: Fix mounting of Win7 created UDF filesystems 2018-08-24 11:13:32 +02:00
ufs fs/ufs: use ktime_get_real_seconds for sb and cg timestamps 2018-08-17 16:20:27 -07:00
xfs xfs: cancel COW blocks before swapext 2018-10-18 17:21:55 +11:00
aio.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
anon_inodes.c anon_inode_getfile(): switch to alloc_file_pseudo() 2018-07-12 10:04:27 -04:00
attr.c fs: Fix attr.c kernel-doc 2018-07-03 16:44:45 -04:00
bad_inode.c get rid of 'opened' argument of ->atomic_open() - part 3 2018-07-12 10:04:20 -04:00
binfmt_aout.c
binfmt_elf_fdpic.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
binfmt_elf.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c turn filp_clone_open() into inline wrapper for dentry_open() 2018-07-10 23:29:03 -04:00
binfmt_script.c
block_dev.c for-4.19/block-20180812 2018-08-14 10:23:25 -07:00
buffer.c blkcg: associate writeback bios with a blkg 2018-09-21 20:29:11 -06:00
char_dev.c
compat_binfmt_elf.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
compat_ioctl.c Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-10-25 12:48:22 -07:00
compat.c ncpfs: remove compat functionality 2018-06-05 19:23:26 +02:00
coredump.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
d_path.c
dax.c filesystem-dax: Fix dax_layout_busy_page() livelock 2018-10-08 11:38:44 -07:00
dcache.c dcache: allocate external names from reclaimable kmalloc caches 2018-10-26 16:26:32 -07:00
dcookies.c
direct-io.c
drop_caches.c
eventfd.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
eventpoll.c fs/eventpoll.c: simplify ep_is_linked() callers 2018-08-22 10:52:49 -07:00
exec.c vfs: require i_size <= SIZE_MAX in kernel_read_file() 2018-10-10 12:56:14 -04:00
fcntl.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
fhandle.c
file_table.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
file.c
filesystems.c
fs_pin.c
fs_struct.c
fs-writeback.c
inode.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
internal.h overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
ioctl.c vfs: swap names of {do,vfs}_clone_file_range() 2018-09-24 10:54:01 +02:00
iomap.c fs/iomap.c: change return type to vm_fault_t 2018-10-26 16:25:18 -07:00
Kconfig autofs: remove left-over autofs4 stubs 2018-06-11 08:22:34 -07:00
Kconfig.binfmt kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
libfs.c
locks.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
Makefile autofs: remove left-over autofs4 stubs 2018-06-11 08:22:34 -07:00
mbcache.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
mount.h
mpage.c mpage: mpage_readpages() should submit IO as read-ahead 2018-08-17 16:20:29 -07:00
namei.c namei: allow restricted O_CREAT of FIFOs and regular files 2018-08-23 18:48:43 -07:00
namespace.c x86/fault: BUG() when uaccess helpers fault on kernel addresses 2018-09-03 15:12:09 +02:00
no-block.c
nsfs.c
open.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
pipe.c Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-08-13 19:58:36 -07:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-25 11:14:36 -07:00
readdir.c
select.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
seq_file.c fs/seq_file.c: simplify seq_file iteration code and interface 2018-08-17 16:20:28 -07:00
signalfd.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
splice.c Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-06-16 16:21:50 +09:00
stack.c
stat.c y2038: Remove newstat family from default syscall set 2018-08-29 15:42:20 +02:00
statfs.c kernel: add kcompat_sys_{f,}statfs64() 2018-07-12 14:49:48 +01:00
super.c Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax 2018-08-26 11:48:42 -07:00
sync.c
timerfd.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
userfaultfd.c userfaultfd: disable irqs when taking the waitqueue lock 2018-10-26 16:25:18 -07:00
utimes.c y2038: utimes: Rework #ifdef guards for compat syscalls 2018-08-29 15:42:23 +02:00
xattr.c sysfs: Do not return POSIX ACL xattrs via listxattr 2018-09-18 07:30:48 -04:00