linux

Mel Gorman c67fe3752a mm: compaction: Abort async compaction if locks are contended or taking too long	2012-08-21 16:45:03 -07:00

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straight-forward and in his own words;

	The systems in question have 24 SAS drives spread across 3 HBAs,
	running 24 Ceph OSD instances, one per drive. FWIW these servers
	are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
	Ceph Linux clients doing dd simultaneously to a Ceph file system
	backed by 12 of these servers.

Early in the test everything looks fine

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r  b swpd    free buff     cache si so    bi      bo     in     cs us sy id wa st
   31 15    0  287216  576  38606628  0  0     2    1158      2     14  1  3 95  0  0
   27 15    0  225288  576  38583384  0  0    18 2222016 203357 134876 11 56 17 15  0
   28 17    0  219256  576  38544736  0  0    11 2305932 203141 146296 11 49 23 17  0
    6 18    0  215596  576  38552872  0  0     7 2363207 215264 166502 12 45 22 20  0
   22 18    0  226984  576  38596404  0  0     3 2445741 223114 179527 12 43 23 22  0

and then it goes to pot

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
    r  b swpd    free buff     cache si so    bi      bo     in     cs us sy id wa st
  163  8    0  464308  576  36791368  0  0    11   22210    866    536  3 13 79  4  0
  207 14    0  917752  576  36181928  0  0   712 1345376 134598  47367  7 90  1  2  0
  123 12    0  685516  576  36296148  0  0   429 1386615 158494  60077  8 84  5  3  0
  123 12    0  598572  576  36333728  0  0  1107 1233281 147542  62351  7 84  5  4  0
  622  7    0  660768  576  36118264  0  0   557 1345548 151394  59353  7 85  4  3  0
  223 11    0  283960  576  36463868  0  0    46 1107160 121846  33006  6 93  1  1  0

Note that system CPU usage is very high and blocks being written out have
dropped by 42%. He analysed this with perf and found

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone->lock and zone->lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
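
The abort-on-contention idea described in the commit message can be illustrated with a small user-space analogy. This is only a sketch under stated assumptions, not the kernel code: it uses a pthread mutex in place of zone->lock and zone->lru_lock, it models only the "lock is contended" condition (not the "process needs to be scheduled" check), and the names `struct sketch_ctl`, `sketch_checklock()` and `sketch_demo()` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state a compaction caller carries around. */
struct sketch_ctl {
    bool sync;       /* true: caller may block; false: async, give up if busy */
    bool contended;  /* set when an async caller aborted because of contention */
};

/*
 * Try to take the lock. Returns true with the lock held. Returns false,
 * without the lock, when an async caller found it busy; the caller is
 * expected to abort its work and report the contention upwards.
 */
static bool sketch_checklock(pthread_mutex_t *lock, struct sketch_ctl *ctl)
{
    if (pthread_mutex_trylock(lock) == 0)
        return true;              /* uncontended: acquired immediately */

    if (!ctl->sync) {
        ctl->contended = true;    /* async: record the reason and bail out */
        return false;
    }

    pthread_mutex_lock(lock);     /* sync: waiting is acceptable */
    return true;
}

/* Single-threaded walk-through; returns the number of async aborts seen. */
static int sketch_demo(void)
{
    pthread_mutex_t lock;
    struct sketch_ctl ctl = { .sync = false, .contended = false };
    int aborts = 0;

    pthread_mutex_init(&lock, NULL);

    pthread_mutex_lock(&lock);            /* simulate a busy lock */
    if (!sketch_checklock(&lock, &ctl))
        aborts++;                         /* async attempt gives up */
    assert(ctl.contended);
    pthread_mutex_unlock(&lock);

    if (sketch_checklock(&lock, &ctl))    /* lock is free now */
        pthread_mutex_unlock(&lock);
    else
        aborts++;

    pthread_mutex_destroy(&lock);
    return aborts;
}
```

In this sketch a caller that sees `false` stops and inspects `contended`, which mirrors how the commit message describes the page allocator failing a THP allocation when compaction was aborted due to contention.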
..
Kconfig	mm: factor out memory isolate functions	2012-07-31 18:42:45 -07:00
Kconfig.debug	mm: more intensive memory corruption debugging	2012-01-10 16:30:42 -08:00
Makefile	mm: factor out memory isolate functions	2012-07-31 18:42:45 -07:00
backing-dev.c	vfs: kill write_super and sync_supers	2012-08-04 01:24:44 +04:00
bootmem.c	bootmem: make ___alloc_bootmem_node_nopanic() really nopanic	2012-07-17 16:21:29 -07:00
bounce.c	bounce: allow use of bounce pool via config option	2012-07-18 16:40:35 -04:00
cleancache.c	->encode_fh() API change	2012-05-29 23:28:33 -04:00
compaction.c	mm: compaction: Abort async compaction if locks are contended or taking too long	2012-08-21 16:45:03 -07:00
debug-pagealloc.c	mm, x86: Remove debug_pagealloc_enabled	2011-12-06 09:24:07 +01:00
dmapool.c	…
fadvise.c	mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()	2012-07-31 18:42:42 -07:00
failslab.c	switch debugfs to umode_t	2012-01-03 22:54:56 -05:00
filemap.c	fs: Protect write paths by sb_start_write - sb_end_write	2012-07-31 09:45:47 +04:00
filemap_xip.c	fs: Protect write paths by sb_start_write - sb_end_write	2012-07-31 09:45:47 +04:00
fremap.c	…
frontswap.c	mm/frontswap: cleanup doc and comment error	2012-07-23 11:16:20 -04:00
highmem.c	mm: add support for direct_IO to highmem pages	2012-07-31 18:42:47 -07:00
huge_memory.c	mm/memcg: apply add/del_page to lruvec	2012-05-29 16:22:28 -07:00
hugetlb.c	mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables	2012-07-31 18:42:50 -07:00
hugetlb_cgroup.c	hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrate	2012-07-31 18:42:41 -07:00
hwpoison-inject.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
init-mm.c	…
internal.h	mm: compaction: Abort async compaction if locks are contended or taking too long	2012-08-21 16:45:03 -07:00
kmemcheck.c	…
kmemleak-test.c	…
kmemleak.c	kmemleak: Disable early logging when kmemleak is off by default	2012-01-20 16:57:05 +00:00
ksm.c	ksm: cleanup: introduce find_mergeable_vma()	2012-03-21 17:54:59 -07:00
maccess.c	…
madvise.c	mm: Hold a file reference in madvise_remove	2012-07-06 10:34:38 -07:00
memblock.c	mm/memblock.c:memblock_double_array(): cosmetic cleanups	2012-07-31 18:42:41 -07:00
memcontrol.c	memcg: add mem_cgroup_from_css() helper	2012-07-31 18:42:49 -07:00
memory-failure.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
memory.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2012-08-01 10:26:23 -07:00
memory_hotplug.c	mm/hotplug: free zone->pageset when a zone becomes empty	2012-07-31 18:42:44 -07:00
mempolicy.c	Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux	2012-07-30 11:32:24 -07:00
mempool.c	mempool: add @gfp_mask to mempool_create_node()	2012-06-25 11:53:47 +02:00
migrate.c	mm: memcg: fix compaction/migration failing due to memcg limits	2012-07-31 18:42:48 -07:00
mincore.c	mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode	2012-03-21 17:54:54 -07:00
mlock.c	vm: avoid using find_vma_prev() unnecessarily	2012-03-06 18:23:36 -08:00
mm_init.c	…
mmap.c	mm: change nr_ptes BUG_ON to WARN_ON	2012-08-21 16:45:02 -07:00
mmu_context.c	mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()	2012-03-21 17:54:59 -07:00
mmu_notifier.c	mm: mmu_notifier: fix freed page still mapped in secondary MMU	2012-07-31 18:42:49 -07:00
mmzone.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
mprotect.c	Merge branch 'akpm' (Andrew's patch-bomb)	2012-03-22 09:04:48 -07:00
mremap.c	mm: account the total_vm in the vm_stat_account()	2012-07-31 18:42:39 -07:00
msync.c	…
nobootmem.c	memblock: free allocated memblock_reserved_regions later	2012-07-11 16:04:50 -07:00
nommu.c	nommu: fix compilation of nommu.c	2012-06-04 17:17:31 -04:00
oom_kill.c	mm, memcg: move all oom handling to memcontrol.c	2012-07-31 18:42:45 -07:00
page-writeback.c	vfs: kill write_super and sync_supers	2012-08-04 01:24:44 +04:00
page_alloc.c	mm: compaction: Abort async compaction if locks are contended or taking too long	2012-08-21 16:45:03 -07:00
page_cgroup.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
page_io.c	mm: add support for direct_IO to highmem pages	2012-07-31 18:42:47 -07:00
page_isolation.c	memory-hotplug: fix kswapd looping forever problem	2012-07-31 18:42:45 -07:00
pagewalk.c	mm: fix kernel-doc warnings	2012-06-20 14:39:36 -07:00
percpu-km.c	…
percpu-vm.c	mm: fix kernel-doc warnings	2012-06-20 14:39:36 -07:00
percpu.c	kmemleak: Fix the kmemleak tracking of the percpu areas with !SMP	2012-05-09 10:13:29 -07:00
pgtable-generic.c	arch/tile: allow building Linux with transparent huge pages enabled	2012-05-25 12:48:21 -04:00
prio_tree.c	…
process_vm_access.c	aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()	2012-05-31 17:49:32 -07:00
quicklist.c	…
readahead.c	mm: move readahead syscall to mm/readahead.c	2012-05-29 16:22:23 -07:00
rmap.c	mm: remove swap token code	2012-05-29 16:22:19 -07:00
shmem.c	tmpfs: distribute interleave better across nodes	2012-07-31 18:42:50 -07:00
slab.c	mm: micro-optimise slab to avoid a function call	2012-07-31 18:42:46 -07:00
slab.h	mm, sl[aou]b: Use a common mutex definition	2012-07-09 12:13:41 +03:00
slab_common.c	mm: Fix build warning in kmem_cache_create()	2012-07-30 13:15:40 +03:00
slob.c	slob: Fix early boot kernel crash	2012-07-12 10:13:22 +03:00
slub.c	mm: slub: optimise the SLUB fast path to avoid pfmemalloc checks	2012-07-31 18:42:45 -07:00
sparse-vmemmap.c	…
sparse.c	mm/sparse: remove index_init_lock	2012-07-31 18:42:49 -07:00
swap.c	mm: add support for direct_IO to highmem pages	2012-07-31 18:42:47 -07:00
swap_state.c	mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages	2012-07-31 18:42:47 -07:00
swapfile.c	mm: swapfile: clean up unuse_pte race handling	2012-07-31 18:42:48 -07:00
truncate.c	mm/fs: remove truncate_range	2012-05-29 16:22:23 -07:00
util.c	new helper: vm_mmap_pgoff()	2012-06-01 10:37:18 -04:00
vmalloc.c	mm: make vb_alloc() more foolproof	2012-07-31 18:42:39 -07:00
vmscan.c	memcg: gix memory accounting scalability in shrink_page_list	2012-07-31 18:42:49 -07:00
vmstat.c	mm: account for the number of times direct reclaimers get throttled	2012-07-31 18:42:46 -07:00