linux/Documentation
Mel Gorman 90afa5de6f vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.

The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall.  The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequently, making the CPU spin at near 100%.

This patchset intends to address that bug and to bring the behaviour of
zone_reclaim() more in line with expectations, fixing further problems
that were noticed during investigation.  It is based on top of mmotm and
takes advantage of Kosaki's work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
	scan should go ahead. The broken heuristic is what was causing the
	malloc() stall as it uselessly scanned the LRU constantly. Currently,
	zone_reclaim() assumes zone_reclaim_mode is 1 and historically it
	could not deal with tmpfs pages at all. This fixes up the heuristic so
	that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means
	the zone is marked full. This is not always true. It could have
	failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each
	time the zone_reclaim scan-avoidance heuristics fail. If that
	counter is rapidly increasing, then zone_reclaim_mode should be
	set to 0 as a temporary resolution and a bug reported because
	the scan-avoidance heuristic is still broken.
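
The distinction Patch 2 draws, between a zone that is genuinely full and
a scan that was never attempted, can be sketched as below.  This is only
an illustration: the enum, its member names and both helper functions are
hypothetical and are not taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes; the names are illustrative assumptions. */
enum reclaim_status {
	RECLAIM_STATUS_NOSCAN,	/* scan skipped: GFP mask or mode unsuitable */
	RECLAIM_STATUS_FULL,	/* scan ran and reclaimed nothing */
	RECLAIM_STATUS_SOME,	/* reclaimed fewer pages than wanted */
	RECLAIM_STATUS_SUCCESS,	/* reclaimed at least as many as wanted */
};

/* Classify one reclaim attempt (hypothetical helper). */
static enum reclaim_status classify_attempt(bool scan_allowed,
					    unsigned long reclaimed,
					    unsigned long wanted)
{
	assert(wanted > 0);

	if (!scan_allowed)
		return RECLAIM_STATUS_NOSCAN;
	if (reclaimed == 0)
		return RECLAIM_STATUS_FULL;
	return reclaimed >= wanted ? RECLAIM_STATUS_SUCCESS
				   : RECLAIM_STATUS_SOME;
}

/* Only a scan that ran and found nothing justifies marking the zone full. */
static bool should_mark_zone_full(enum reclaim_status status)
{
	return status == RECLAIM_STATUS_FULL;
}
```

With this split, a caller that could not scan at all leaves the zone
eligible for later attempts instead of treating it as full.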

This patch:

On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim.  On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
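
As a rough aid to reading a zone_reclaim_mode value, the sketch below
decodes its three bits: 1 enables zone reclaim, 2 allows writing out
dirty pages and 4 allows swapping pages.  The helper
describe_reclaim_mode() is a hypothetical userspace function written for
this illustration; it is not a kernel interface.

```c
#include <assert.h>
#include <stdio.h>

#define RECLAIM_ZONE	(1u << 0)	/* zone reclaim enabled */
#define RECLAIM_WRITE	(1u << 1)	/* may write out dirty pages */
#define RECLAIM_SWAP	(1u << 2)	/* may swap pages */

/*
 * Format the three mode bits into buf and return buf.
 * Hypothetical helper, for illustration only.
 */
static const char *describe_reclaim_mode(unsigned int mode,
					 char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	snprintf(buf, len, "zone=%d write=%d swap=%d",
		 !!(mode & RECLAIM_ZONE),
		 !!(mode & RECLAIM_WRITE),
		 !!(mode & RECLAIM_SWAP));
	return buf;
}
```

The value to decode can be read from /proc/sys/vm/zone_reclaim_mode; the
default of 1 mentioned above has only the first bit set.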

There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
proper detection can manifest as high CPU usage as the LRU list is scanned
uselessly.

Historically, once enabled, it depended on NR_FILE_PAGES, which may
include swapcache pages that the reclaim_mode cannot deal with.  Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed, such as swapcache, and made a calculation
based on the inactive, active and mapped files.  This is far superior when
zone_reclaim_mode==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
the reclaim_mode, it considers NR_FILE_PAGES as potential candidates;
otherwise it uses NR_{IN,}ACTIVE_PAGES - NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.
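
The calculation described above can be sketched in plain C as follows.
This is a userspace illustration under stated assumptions, not the code
from the patch: struct zone_counters and its field names are hypothetical
stand-ins for the per-zone counters, and both function names are chosen
for this example.

```c
#include <assert.h>

#define RECLAIM_ZONE	(1u << 0)
#define RECLAIM_WRITE	(1u << 1)
#define RECLAIM_SWAP	(1u << 2)

/* Hypothetical snapshot of the per-zone counters named in the text. */
struct zone_counters {
	unsigned long nr_file_pages;
	unsigned long nr_inactive_file;
	unsigned long nr_active_file;
	unsigned long nr_file_mapped;
	unsigned long nr_file_dirty;
};

/* File LRU pages minus mapped pages, clamped at zero to avoid underflow. */
static unsigned long unmapped_file_pages(const struct zone_counters *z)
{
	unsigned long file_lru = z->nr_inactive_file + z->nr_active_file;

	return file_lru > z->nr_file_mapped ?
		file_lru - z->nr_file_mapped : 0;
}

/* Estimate how many page cache pages are reclaimable in this mode. */
static unsigned long pagecache_reclaimable(const struct zone_counters *z,
					   unsigned int mode)
{
	unsigned long candidates;
	unsigned long excluded = 0;

	assert(z);

	/* With RECLAIM_SWAP every file page is a candidate. */
	if (mode & RECLAIM_SWAP)
		candidates = z->nr_file_pages;
	else
		candidates = unmapped_file_pages(z);

	/* Without RECLAIM_WRITE dirty pages cannot be cleaned. */
	if (!(mode & RECLAIM_WRITE))
		excluded += z->nr_file_dirty;

	/* Guard against excluding more pages than there are candidates. */
	if (excluded > candidates)
		excluded = candidates;

	return candidates - excluded;
}
```

For a zone with 1000 file pages, 300 inactive and 200 active file LRU
pages, 100 mapped and 50 dirty, mode 1 gives 350 candidates while mode 7
gives 1000, which shows how strongly the estimate depends on the mode.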

[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:45 -07:00
ABI Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block 2009-06-11 11:10:35 -07:00
DocBook Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-15 03:02:23 -07:00
PCI
RCU trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
accounting Documentation/accounting/getdelays.c: fix endless loop 2009-01-15 16:39:37 -08:00
acpi
aoe
arm [ARM] S3C24XX: GPIO: Change to macros for GPIO numbering 2009-05-18 16:26:03 +01:00
auxdisplay
blackfin
block trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
blockdev mflash: initial support 2009-04-07 08:12:38 +02:00
cdrom
cgroups memcg: fix documentation 2009-04-13 15:04:33 -07:00
connector
console
cpu-freq
cpuidle
cris
crypto
development-process docs: Encourage better changelogs in the development process document 2009-06-04 10:32:49 -06:00
device-mapper
driver-model trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
dvb V4L/DVB (11138): get_dvb_firmware: add support for downloading the cx2584x firmware for pvrusb2 2009-03-30 12:43:31 -03:00
early-userspace
fault-injection
fb trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
filesystems oom: move oom_adj value from task_struct to mm_struct 2009-06-16 19:47:43 -07:00
firmware_class
frv
hwmon hwmon: Update documentation on fan_max 2009-06-01 13:46:50 +02:00
i2c i2c-ocores: Can add I2C devices to the bus 2009-06-13 10:39:28 +01:00
i2o
ia64
ide ide: preserve Host Protected Area by default (v2) 2009-06-07 13:52:52 +02:00
infiniband IPoIB: Document newish features 2009-04-08 13:52:01 -07:00
input Input: multitouch - augment event semantics documentation 2009-05-23 09:53:26 -07:00
ioctl V4L/DVB (10870a): remove all references for video_decoder.h 2009-03-30 12:43:15 -03:00
isdn isdn: extend INTERFACE.CAPI document 2009-06-08 00:45:52 -07:00
ja_JP
kbuild kconfig: resort the documentation of the environment variables 2009-06-09 22:37:47 +02:00
kdump trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
ko_KR
laptops trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
lguest lguest: add support for indirect ring entries 2009-06-12 22:27:13 +09:30
m68k
make
mips
misc-devices drivers/misc/isl29003.c: driver for the ISL29003 ambient light sensor 2009-04-01 08:59:18 -07:00
mn10300 trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
mtd trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
namespaces
netlabel
networking Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-15 03:02:23 -07:00
parisc
pcmcia
power Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-06-14 13:46:25 -07:00
powerpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-06-15 09:40:05 -07:00
prctl
s390 trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
scheduler trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
scsi trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
serial
sh
sound Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-06-14 13:46:25 -07:00
sparc
spi spi: documentation: emphasise spi_master.setup() semantics 2009-04-21 13:41:50 -07:00
sysctl vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim 2009-06-16 19:47:45 -07:00
telephony
thermal
timers trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
trace trivial: Remove the hyphen from git commands 2009-06-12 18:01:51 +02:00
uml
usb trivial: usb: fix missing space typo in doc 2009-06-12 18:01:51 +02:00
video4linux trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
vm pagemap: add page-types tool 2009-06-16 19:47:38 -07:00
w1
watchdog
wimax
x86 Merge branch 'linus' into x86/mce3 2009-06-11 23:31:52 +02:00
zh_CN
00-INDEX
BUG-HUNTING
Changes Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next 2009-06-14 14:12:18 -07:00
CodingStyle trivial: fix typo milisecond/millisecond for documentation and source comments. 2009-06-12 18:01:46 +02:00
DMA-API.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
DMA-ISA-LPC.txt
DMA-attributes.txt
DMA-mapping.txt dma-mapping: update the old macro DMA_nBIT_MASK related documentations 2009-04-07 08:31:12 -07:00
HOWTO
IO-mapping.txt
IPMI.txt
IRQ-affinity.txt
IRQ.txt
Intel-IOMMU.txt
Makefile
ManagementStyle
SAK.txt
SELinux.txt
SM501.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
SecurityBugs
Smack.txt smack: implement logging V3 2009-04-14 09:00:23 +10:00
SubmitChecklist
SubmittingDrivers
SubmittingPatches Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-06-14 13:46:25 -07:00
VGA-softcursor.txt
applying-patches.txt
atomic_ops.txt
bad_memory.txt
basic_profiling.txt
binfmt_misc.txt
braille-console.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
bt8xxgpio.txt
c2port.txt
cachetlb.txt
cpu-hotplug.txt
cpu-load.txt
cputopology.txt
credentials.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt trivial: Documentation/dell_rbu.txt: fix typos 2009-06-12 18:01:50 +02:00
devices.txt lanana: assign a device name and numbering for MAX3100 2009-04-07 08:44:05 -07:00
dmaengine.txt
dontdiff
dynamic-debug-howto.txt
edac.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
eisa.txt
email-clients.txt
exception.txt
feature-removal-schedule.txt Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-15 03:02:23 -07:00
futex-requeue-pi.txt futex: add requeue-pi documentation 2009-05-09 07:12:50 +02:00
gpio.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
highuid.txt
hw_random.txt
ics932s401
initrd.txt
io-mapping.txt
io_ordering.txt
iostats.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt kernel-doc: restrict syntax for private: and public: 2009-05-02 15:36:10 -07:00
kernel-docs.txt
kernel-parameters.txt Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2009-06-15 09:32:52 -07:00
keys-request-key.txt
keys.txt
kmemleak.txt kmemleak: Add documentation on the memory leak detector 2009-06-11 17:03:29 +01:00
kobject.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
kprobes.txt kprobes: support kretprobe and jprobe per-probe disabling 2009-04-07 08:31:08 -07:00
kref.txt
ldm.txt
leds-class.txt
local_ops.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
lockdep-design.txt locking: Documentation: lockdep-design.txt, fix note of state bits 2009-04-26 18:21:24 +02:00
lockstat.txt
logo.gif Revert "linux.conf.au 2009: Tuz" 2009-04-27 12:00:27 -07:00
logo.txt Revert "linux.conf.au 2009: Tuz" 2009-04-27 12:00:27 -07:00
magic-number.txt
markers.txt
mca.txt
md.txt Documentation/md.txt update 2009-03-31 15:18:37 +11:00
memory-barriers.txt sched: Document memory barriers implied by sleep/wake-up primitives 2009-04-29 14:15:55 +02:00
memory-hotplug.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
memory.txt
mono.txt
mutex-design.txt
nmi_watchdog.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt
parport-lowlevel.txt
parport.txt
pi-futex.txt
pnp.txt
preempt-locking.txt
printk-formats.txt
prio_tree.txt
rbtree.txt trivial: rbtree.txt: fix rb_entry() parameters in sample code 2009-06-12 18:01:47 +02:00
rfkill.txt rfkill: document /dev/rfkill 2009-06-03 14:06:15 -04:00
robust-futex-ABI.txt
robust-futexes.txt
rt-mutex-design.txt
rt-mutex.txt
rtc.txt
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
slow-work.txt Document the slow work thread pool 2009-04-03 16:42:35 +01:00
sparse.txt Documentation: explain the difference between __bitwise and __bitwise__ 2009-04-11 08:18:11 +02:00
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
svga.txt
sysfs-rules.txt Doc/sysfs-rules: Swap the order of the words so the sentence makes more sense 2009-05-08 19:22:20 -07:00
sysrq.txt Merge branch 'tracing/core-v2' into tracing-for-linus 2009-04-02 00:49:02 +02:00
tomoyo.txt tomoyo: add Documentation/tomoyo.txt 2009-04-14 09:14:58 +10:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
video-output.txt
volatile-considered-harmful.txt
voyager.txt
zorro.txt