linux/mm
Mel Gorman 215ddd6664 mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.

There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.

This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
..
Kconfig mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
Kconfig.debug mm: debug-pagealloc: fix kconfig dependency warning 2011-03-22 17:44:02 -07:00
Makefile mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
backing-dev.c backing-dev: Kill set but not used var in bdi_debug_stats_show() 2011-05-20 21:23:37 +02:00
bootmem.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
bounce.c
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2 2011-06-15 20:04:02 -07:00
debug-pagealloc.c
dmapool.c
fadvise.c
failslab.c
filemap.c more conservative S_NOSEC handling 2011-06-03 18:24:58 -04:00
filemap_xip.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
fremap.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
highmem.c
huge_memory.c mm: remove khugepaged double thp vmstat update with CONFIG_NUMA=n 2011-06-15 20:03:58 -07:00
hugetlb.c mm: fix negative commitlimit when gigantic hugepages are allocated 2011-06-15 20:04:01 -07:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c mm: convert mm->cpu_vm_cpumask into cpumask_var_t 2011-05-25 08:39:21 -07:00
internal.h mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
kmemcheck.c
kmemleak-test.c
kmemleak.c kmemleak: Do not return a pointer to an object that kmemleak did not get 2011-05-19 17:35:28 +01:00
ksm.c ksm: fix NULL pointer dereference in scan_get_next_rmap_item() 2011-06-15 20:04:02 -07:00
maccess.c maccess,probe_kernel: Make write/read src const void * 2011-05-25 19:56:23 -04:00
madvise.c
memblock.c mm/memblock: properly handle overlaps and fix error path 2011-03-22 17:44:09 -07:00
memcontrol.c mm: move shmem prototypes to shmem_fs.h 2011-06-27 18:00:12 -07:00
memory-failure.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
memory.c mm: move vmtruncate_range to truncate.c 2011-06-27 18:00:12 -07:00
memory_hotplug.c mm, hotplug: protect zonelist building with zonelists_mutex 2011-06-22 21:06:48 -07:00
mempolicy.c mm: proc: move show_numa_map() to fs/proc/task_mmu.c 2011-05-25 08:39:34 -07:00
mempool.c
migrate.c migrate: don't account swapcache as shmem 2011-06-16 15:01:24 -07:00
mincore.c
mlock.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
mm_init.c
mmap.c mm: get rid of the most spurious find_vma_prev() users 2011-06-16 00:35:09 -07:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c
mremap.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
msync.c
nobootmem.c memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high() 2011-05-25 08:39:31 -07:00
nommu.c nommu: add page alignment to mmap 2011-05-25 08:39:38 -07:00
oom_kill.c oom: replace PF_OOM_ORIGIN with toggling oom_score_adj 2011-05-25 08:39:10 -07:00
page-writeback.c Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block 2011-03-24 10:16:26 -07:00
page_alloc.c Revert "mm: fail GFP_DMA allocations when ZONE_DMA is not configured" 2011-06-02 06:11:24 +09:00
page_cgroup.c memcg: fix init_page_cgroup nid with sparsemem 2011-06-15 20:04:01 -07:00
page_io.c block: kill off REQ_UNPLUG 2011-03-10 08:52:27 +01:00
page_isolation.c
pagewalk.c pagewalk: only split huge pages when necessary 2011-03-22 17:44:04 -07:00
percpu-km.c
percpu-vm.c
percpu.c Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-05-24 11:53:42 -07:00
pgtable-generic.c
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
quicklist.c
readahead.c readahead: readahead page allocations are OK to fail 2011-05-25 08:39:25 -07:00
rmap.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
shmem.c tmpfs: add shmem_read_mapping_page_gfp 2011-06-27 18:00:12 -07:00
slab.c SLAB: Record actual last user of freed objects. 2011-06-03 19:33:50 +03:00
slob.c
slub.c slub: always align cpu_slab to honor cmpxchg_double requirement 2011-06-03 19:33:49 +03:00
sparse-vmemmap.c
sparse.c Fix common misspellings 2011-03-31 11:26:23 -03:00
swap.c mm: batch activate_page() to reduce lock contention 2011-05-25 08:39:37 -07:00
swap_state.c block: remove per-queue plugging 2011-03-10 08:52:07 +01:00
swapfile.c mm: move shmem prototypes to shmem_fs.h 2011-06-27 18:00:12 -07:00
thrash.c vmscan: implement swap token priority aging 2011-06-15 20:03:59 -07:00
truncate.c mm: fix assertion mapping->nrpages == 0 in end_writeback() 2011-06-27 18:00:13 -07:00
util.c mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
vmalloc.c Merge branch 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen 2011-05-26 19:01:15 -07:00
vmscan.c mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully 2011-07-08 21:14:43 -07:00
vmstat.c mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur 2011-05-25 08:39:09 -07:00