commit 9fab5619bd
Author: Hugh Dickins <hugh@veritas.com>
Date:   2009-04-01 08:59:15 -07:00

shmem: writepage directly to swap
Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
swap loads benefit, and a catastrophic interaction between SLUB and some
flash storage is avoided.

shmem_writepage() has always been peculiar in making no attempt to write:
it has just transferred a shmem page from file cache to swap cache, then
let that page make its way around the LRU again before being written and
freed.
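
To make that concrete, the tail of shmem_writepage() ran roughly along
these lines (a simplified sketch of the 2.6.28-era code, not the exact
source):

	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
		remove_from_page_cache(page);
		shmem_swp_set(info, entry, swap.val);
		shmem_swp_unmap(entry);
		spin_unlock(&info->lock);
		swap_duplicate(swap);
		page_cache_release(page);	/* pagecache ref */
		set_page_dirty(page);		/* dirty in swap cache... */
		unlock_page(page);		/* ...and back onto the LRU */
		return 0;			/* no write issued here */
	}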

The idea was that people use tmpfs because they want those pages to stay
in RAM; so although we give it an overflow to swap, we should resist
writing too soon, giving those pages a second chance before they can be
reclaimed.

That was always questionable, and I've toyed with this patch for
years, but never had a clear justification to depart from the original
design.

It became more questionable in 2.6.28, when the split LRU patches classed
shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
itself gives them more resistance to reclaim than normal file pages.  I
prepared this patch for 2.6.29, but the merge window arrived before I'd
completed gathering statistics to justify sending it in.

Then, while comparing SLQB against SLUB and running SLUB on a laptop
I'd habitually used with SLAB, I found SLUB ran my tmpfs kbuild
swapping tests five times slower than SLAB or SLQB; other machines
were slower too, but nowhere near so bad.  Simpler "cp -a" swapping
tests showed the same.

slub_max_order=0 brings sanity to all, but heavy swapping is too far from
normal to justify such a tuning.  The crucial factor on that laptop turns
out to be that I'm using an SD card for swap.  What happens is this:

By default, SLUB uses order-2 pages for shmem_inode_cache (and many
other filesystems' inode caches), so creating tmpfs files under memory
pressure brings lumpy reclaim into play.  One subpage of that order-2
block is chosen from the bottom of the LRU as usual, then the other
three are picked out from their random positions on the LRUs.
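
As a toy user-space illustration (the pfn and block size here are
made-up assumptions, not taken from the patch), this shows how
choosing one page from the LRU tail commits lumpy reclaim to the other
three pages of its naturally aligned order-2 buddy block:

#include <stdio.h>

int main(void)
{
	unsigned long target = 0x12346;	/* hypothetical pfn from the LRU tail */
	int order = 2;			/* SLUB's order for shmem_inode_cache */
	unsigned long base = target & ~((1UL << order) - 1);
	unsigned long pfn;

	for (pfn = base; pfn < base + (1UL << order); pfn++)
		printf("pfn 0x%lx: %s\n", pfn, pfn == target ?
			"taken from the bottom of the LRU" :
			"picked from wherever it sits on the LRU");
	return 0;
}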

In a tmpfs load, many of these pages will be ones which already passed
through shmem_writepage, so already have swap allocated.  And though their
offsets on swap were probably allocated sequentially, now that the pages
are picked off at random, their swap offsets are scattered.

But the flash storage on the SD card is very sensitive to having its
writes merged: once swap is written at scattered offsets, performance
falls apart.  Rotating disk seeks increase too, but less disastrously.
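
A toy illustration of the cost (the offsets are invented; the point is
that only writes to adjacent offsets can be merged into one request):

#include <stdio.h>
#include <stddef.h>

/* one request per break in contiguity: adjacent writes merge */
static int nr_requests(const unsigned long *off, size_t n)
{
	int reqs = n ? 1 : 0;
	size_t i;

	for (i = 1; i < n; i++)
		if (off[i] != off[i - 1] + 1)
			reqs++;
	return reqs;
}

int main(void)
{
	unsigned long seq[]  = { 100, 101, 102, 103 };	/* as first allocated */
	unsigned long scat[] = { 517, 23, 904, 212 };	/* after lumpy reclaim */

	printf("sequential offsets: %d write request(s)\n", nr_requests(seq, 4));
	printf("scattered offsets:  %d write request(s)\n", nr_requests(scat, 4));
	return 0;
}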

So: stop giving shmem/tmpfs pages a second pass around the LRU; write
them out to swap as soon as their swap has been allocated.
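
In code terms, the essential change to that tail of shmem_writepage()
is just this (a sketch of the idea, not the exact diff):

-	set_page_dirty(page);
-	unlock_page(page);
+	swap_writepage(page, wbc);
 	return 0;

swap_writepage() marks the page as under writeback, unlocks it and
submits the bio, so the page completes its I/O now instead of circling
the LRU again.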

It's surely possible to devise an artificial load which runs faster the
old way, one whose sizing is such that the tmpfs pages on their second
pass are the ones that are wanted again, and other pages not.

But I've not yet found such a load: on all machines, under the loads I've
tried, immediate swap_writepage speeds up shmem swapping: especially when
using the SLUB allocator (and more effectively than slub_max_order=0), but
also with the others; and it also reduces the variance between runs.  How
much faster varies widely: a factor of five is rare, 5% is common.

One load which might have suffered: imagine a swapping shmem load in a
limited mem_cgroup on a machine with plenty of memory.  Before 2.6.29 the
swapcache was not charged, and such a load would have run quickest with
the shmem swapcache never written to swap.  But now swapcache is charged,
so even this load benefits from shmem_writepage directly to swap.

Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
it's silly because that will never get called; but refactoring shmem.c
sensibly according to CONFIG_SWAP will be a separate task.
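
For reference, such a stub can be nothing more than an inline no-op,
along these lines (approximate, and unreachable without CONFIG_SWAP
since no swap is ever allocated):

#ifndef CONFIG_SWAP
static inline int swap_writepage(struct page *page,
				 struct writeback_control *wbc)
{
	return 0;
}
#endif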

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>