linux/mm
Hugh Dickins 087c99b15f shmem: fix splicing from a hole while it's punched
commit b1a366500b upstream.

shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: a faulted page then
appears page_mapped(), and needs unmap_mapping_range() and i_mmap_mutex
to be unmapped again.

But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.

shmem_file_splice_read() is the main other user of SGP_CACHE, which can
instantiate shmem pagecache pages in the read-only case (without holding
i_mutex, so perhaps concurrently with a hole-punch).  Probably it's
silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
ought to be safe, but might bring surprises - not a change to be rushed.

shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it could
be provoked to keep hole-punch busy or not.

We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use over
a range raises questions - should it actually be per page? can these get
starved themselves?

The origin of this part of the problem is my v3.1 commit d0823576bf
("mm: pincer in truncate_inode_pages_range"), once it was duplicated
into shmem.c.  It seemed like a nice idea at the time, to ensure
(barring RCU lookup fuzziness) that there's an instant when the entire
hole is empty; but the indefinitely repeated scans to ensure that make
it vulnerable.

Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.
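
The difference between the two modes can be illustrated with a small
user-space model.  This is only a sketch, not the code in mm/shmem.c:
model_file, racing_fill(), find_entries(), undo_range() and BATCH are
hypothetical names, and the "racer" is a deterministic stand-in for
other CPUs instantiating pages behind the scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH 4	/* hypothetical stand-in for a lookup batch size */

/* Hypothetical model of a file's page cache: true means a page is present. */
struct model_file {
	bool   *present;
	size_t  nslots;
	size_t  refills_left;	/* how often the simulated racer may refill */
};

/* Simulated racer: re-instantiate the first slot once the scan has passed it. */
static void racing_fill(struct model_file *f, size_t start, size_t index)
{
	if (f->refills_left && index > start && !f->present[start]) {
		f->present[start] = true;
		f->refills_left--;
	}
}

/* Collect up to BATCH populated indices in [index, end); returns the count. */
static size_t find_entries(const struct model_file *f, size_t index,
			   size_t end, size_t *found)
{
	size_t n = 0;

	for (; index < end && n < BATCH; index++)
		if (f->present[index])
			found[n++] = index;
	return n;
}

/* Empty [start, end); returns the number of lookups that were needed. */
size_t undo_range(struct model_file *f, size_t start, size_t end,
		  bool truncating)
{
	size_t found[BATCH];
	size_t index = start;
	size_t lookups = 0;

	assert(start <= end && end <= f->nslots);
	for (;;) {
		size_t n = find_entries(f, index, end, found);

		lookups++;
		if (n == 0) {
			/* All gone, or hole-punch: a single pass is enough. */
			if (index == start || !truncating)
				break;
			/* Truncating: restart to make sure all are gone. */
			index = start;
			continue;
		}
		for (size_t i = 0; i < n; i++) {
			index = found[i];
			f->present[index] = false;
		}
		racing_fill(f, start, index);	/* racer strikes behind the scan */
		index++;
	}
	return lookups;
}
```

In this model a hole-punch call returns after one pass even while the
racer keeps refilling, possibly leaving a freshly instantiated page in
the range; a truncating call keeps rescanning until a lookup from start
finds nothing.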

Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.

But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov; and for the
case of page swizzled back to swap, as suggested by Johannes Weiner.
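
The retry can be pictured as a compare-and-delete that fails when the
slot no longer holds what the lookup saw, after which the same index is
visited again.  The following is a rough user-space sketch under that
assumption, not the kernel implementation: enum entry,
remove_if_unchanged(), punch_with_retry() and the swizzle_at parameter
are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum entry { ENTRY_EMPTY, ENTRY_PAGE, ENTRY_SWAP };

/* Compare-and-delete: empty the slot only if it still holds what was seen. */
static bool remove_if_unchanged(enum entry *slot, enum entry seen)
{
	if (*slot != seen)
		return false;
	*slot = ENTRY_EMPTY;
	return true;
}

/*
 * Make a single pass over [start, end).  swizzle_at is the index at which
 * a simulated racer flips page<->swap once, between lookup and removal
 * (pass a value >= end to disable it).  Returns the number of retries.
 */
size_t punch_with_retry(enum entry *slots, size_t start, size_t end,
			size_t swizzle_at)
{
	size_t retries = 0;
	bool swizzled = false;

	for (size_t index = start; index < end; index++) {
		enum entry seen = slots[index];	/* snapshot from the lookup */

		if (seen == ENTRY_EMPTY)
			continue;
		if (index == swizzle_at && !swizzled) {
			/* Racer: swap becomes page, or page becomes swap. */
			slots[index] = (seen == ENTRY_SWAP) ? ENTRY_PAGE
							    : ENTRY_SWAP;
			swizzled = true;
		}
		if (!remove_if_unchanged(&slots[index], seen)) {
			retries++;
			/*
			 * Step back so the loop's index++ revisits this slot.
			 * Unsigned wraparound at index 0 is well defined.
			 */
			index--;
		}
	}
	return retries;
}
```

Without the step back, a slot that changed type between lookup and
removal would be skipped in a single-pass scan; the earlier indefinite
rescanning would have caught it on a later pass.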

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:05:58 -07:00
Kconfig parisc,metag: Do not hardcode maximum userspace stack size 2014-07-17 16:21:03 -07:00
Kconfig.debug
Makefile zsmalloc: move it under mm 2014-01-30 16:56:55 -08:00
backing-dev.c bdi: avoid oops on device removal 2014-04-26 17:19:05 -07:00
balloon_compaction.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
bootmem.c
bounce.c
cleancache.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
compaction.c mm/compaction: make isolate_freepages start at pageblock boundary 2014-06-07 10:28:17 -07:00
debug-pagealloc.c
dmapool.c
fadvise.c
failslab.c
filemap.c fix O_SYNC|O_APPEND syncing the wrong range on write() 2014-02-09 15:18:09 -05:00
filemap_xip.c
fremap.c mm: fix bad rss-counter if remap_file_pages raced migration 2014-03-19 16:21:49 -07:00
frontswap.c
highmem.c
huge_memory.c thp: close race between split and zap huge pages 2014-05-31 13:20:30 -07:00
hugetlb.c hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry 2014-07-09 11:18:26 -07:00
hugetlb_cgroup.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
hwpoison-inject.c mm/hwpoison: add '#' to hwpoison_inject 2014-01-21 16:19:48 -08:00
init-mm.c
internal.h mm: page_alloc: spill to remote nodes before waking kswapd 2014-05-06 07:59:35 -07:00
interval_tree.c
kmemcheck.c
kmemleak-test.c
kmemleak.c
ksm.c mm: close PageTail race 2014-03-04 07:55:47 -08:00
list_lru.c
maccess.c
madvise.c
memblock.c memblock: add limit checking to memblock_virt_alloc 2014-01-29 16:22:40 -08:00
memcontrol.c memcg: reparent charges of children before processing parent 2014-03-04 07:55:48 -08:00
memory-failure.c mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO) 2014-06-30 20:11:53 -07:00
memory.c mm/numa: Remove BUG_ON() in __handle_mm_fault() 2014-07-09 11:18:29 -07:00
memory_hotplug.c mm/memory_hotplug.c: move register_memory_resource out of the lock_memory_hotplug 2014-01-23 16:36:52 -08:00
mempolicy.c cpuset,mempolicy: fix sleeping function called from invalid context 2014-07-17 16:21:03 -07:00
mempool.c
migrate.c mm: fix swapops.h:131 bug if remap_file_pages raced migration 2014-03-20 22:09:09 -07:00
mincore.c mm: do_mincore() cleanup 2014-01-23 16:36:52 -08:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-05-06 07:59:35 -07:00
mm_init.c mm: bring back /sys/kernel/mm 2014-01-27 21:02:39 -08:00
mmap.c mm: ignore VM_SOFTDIRTY on VMA merging 2014-01-23 16:36:53 -08:00
mmu_context.c
mmu_notifier.c mm: audit/fix non-modular users of module_init in core code 2014-01-23 16:36:52 -08:00
mmzone.c
mprotect.c mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit 2014-02-17 11:19:36 +11:00
mremap.c mm, thp: close race between mremap() and split_huge_page() 2014-06-07 10:28:10 -07:00
msync.c
nobootmem.c mm/nobootmem: free_all_bootmem again 2014-01-23 16:36:52 -08:00
nommu.c mm: add overcommit_kbytes sysctl variable 2014-01-21 16:19:44 -08:00
oom_kill.c mm, oom: base root bonus on current usage 2014-01-30 16:56:56 -08:00
page-writeback.c ext4: fix data integrity sync in ordered mode 2014-06-30 20:11:55 -07:00
page_alloc.c mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER 2014-07-09 11:18:28 -07:00
page_cgroup.c Merge branch 'akpm' (incoming from Andrew) 2014-01-21 19:05:45 -08:00
page_io.c Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block 2014-01-30 11:19:05 -08:00
page_isolation.c
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree() 2014-06-07 10:28:22 -07:00
pgtable-generic.c
process_vm_access.c
quicklist.c
readahead.c mm/readahead.c: fix do_readahead() for no readpage(s) 2014-01-29 16:22:40 -08:00
rmap.c mm: fix sleeping function warning from __put_anon_vma 2014-06-30 20:11:53 -07:00
shmem.c shmem: fix splicing from a hole while it's punched 2014-07-28 08:05:58 -07:00
slab.c slab: fix oops when reading /proc/slab_allocators 2014-07-09 11:18:29 -07:00
slab.h memcg, slab: RCU protect memcg_params for root caches 2014-01-23 16:36:51 -08:00
slab_common.c slab: fix wrong retval on kmem_cache_create_memcg error path 2014-01-29 16:22:40 -08:00
slob.c
slub.c slub: do not assert not having lock in removing freed partial 2014-02-10 16:01:42 -08:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
swap.c mm: close PageTail race 2014-03-04 07:55:47 -08:00
swap_state.c swap: add a simple detector for inappropriate swapin readahead 2014-02-06 13:48:51 -08:00
swapfile.c mm/swap: fix race on swap_info reuse between swapoff and swapon 2014-02-06 13:48:51 -08:00
truncate.c
util.c mm: add overcommit_kbytes sysctl variable 2014-01-21 16:19:44 -08:00
vmalloc.c Revert "mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}" 2014-01-27 21:02:39 -08:00
vmpressure.c arm, pm, vmpressure: add missing slab.h includes 2014-02-03 13:24:01 -05:00
vmscan.c mm: vmscan: clear kswapd's special reclaim powers before exiting 2014-06-30 20:11:54 -07:00
vmstat.c mm, x86: Account for TLB flushes only when debugging 2014-01-25 09:10:41 +01:00
zbud.c
zsmalloc.c zsmalloc: add copyright 2014-01-30 16:56:55 -08:00
zswap.c mm/zswap.c: change params from hidden to ro 2014-01-23 16:36:50 -08:00