linux

Aaron Lu 5d1904204c mremap: fix race between mremap() and page cleanning

Prior to 3.15, there was a race between zap_pte_range() and page_mkclean() where writes to a page could be lost. Dave Hansen discovered by inspection that there is a similar race between move_ptes() and page_mkclean().

We've been able to reproduce the issue by enlarging the race window with a msleep(), but have not been able to hit it without modifying the code. So, we think it's a real issue, but is difficult or impossible to hit in practice.

The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And this patch is to fix the race between page_mkclean() and mremap().

Here is one possible way to hit the race: suppose a process mmapped a file with READ | WRITE and SHARED, it has two threads and they are bound to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread 1 did a write to addr X so that CPU1 now has a writable TLB for addr X on it. Thread 2 starts mremaping from addr X to Y while thread 1 cleaned the page and then did another write to the old addr X again. The 2nd write from thread 1 could succeed but the value will get lost.

        thread 1                           thread 2
     (bound to CPU1)                    (bound to CPU2)

1: write 1 to addr X to get a
   writeable TLB on this CPU

                                        2: mremap starts

                                        3: move_ptes emptied PTE for addr X
                                           and setup new PTE for addr Y and
                                           then dropped PTL for X and Y

4: page laundering for N by doing
   fadvise FADV_DONTNEED. When done,
   pageframe N is deemed clean.

5: write 2 to addr X

                                        6: tlb flush for addr X

7: munmap (Y, pagesize) to make the
   page unmapped

8: fadvise with FADV_DONTNEED again
   to kick the page off the pagecache

9: pread the page from file to verify
   the value. If 1 is there, it means
   we have lost the written 2.

the write may or may not cause segmentation fault, it depends on if the TLB is still on the CPU.

Please note that this is only one specific way of how the race could occur, it didn't mean that the race could only occur in exact the above config, e.g. more than 2 threads could be involved and fadvise() could be done in another thread, etc.

For anonymous pages, they could race between mremap() and page reclaim:

THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge PMD gets unmapped/splitted/pagedout before the flush tlb happened for the old huge PMD in move_page_tables() and we could still write data to it. The normal anonymous page has similar situation.

To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and if any, did the flush before dropping the PTL. If we did the flush for every move_ptes()/move_huge_pmd() call then we do not need to do the flush in move_pages_tables() for the whole range. But if we didn't, we still need to do the whole range flush.

Alternatively, we can track which part of the range is flushed in move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole range in move_page_tables(). But that would require multiple tlb flushes for the different sub-ranges and should be less efficient than the single whole range flush.

KBuild test on my Sandybridge desktop doesn't show any noticeable change.

v4.9-rc4:
real 5m14.048s
user 32m19.800s
sys 4m50.320s

With this commit:
real 5m13.888s
user 32m19.330s
sys 4m51.200s

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-11-17 09:46:56 -08:00
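
The nine-step interleaving and the fix described in the commit message can be followed in a small toy model. This is only a sketch and not kernel code: the class ToyMM, its methods, and the flag flush_dirty_under_lock are hypothetical names invented for illustration, the TLB is reduced to a set of addresses, and the model assumes that a store through a cached writable TLB entry reaches the page without anything recording it as dirty. Steps run sequentially in the order listed in the commit message, with no real threads.

```python
"""Toy model of the mremap()/page-cleaning race. Not kernel code."""

X, Y = "X", "Y"  # symbolic addresses, as in the commit message


class ToyMM:
    def __init__(self, flush_dirty_under_lock):
        # Hypothetical switch: True models the fixed behaviour.
        self.flush_dirty_under_lock = flush_dirty_under_lock
        self.disk = 0            # value stored in the file
        self.page = 0            # value held in the page-cache page
        self.page_dirty = False  # page-cache dirty flag
        self.ptes = {X: {"dirty": False, "writable": False}}  # after mmap
        self.tlb = set()         # addresses CPU1 caches as writable

    def write(self, addr, value):
        """Thread 1 store. Returns True if the store reached the page."""
        if addr in self.tlb:
            # Model assumption: a cached writable entry bypasses the page
            # table, so this store is not recorded as dirty anywhere.
            self.page = value
            return True
        pte = self.ptes.get(addr)
        if pte is None:
            return False  # no mapping left: stands in for a segfault
        pte["writable"] = True
        pte["dirty"] = True
        self.tlb.add(addr)
        self.page = value
        return True

    def move_ptes(self, old, new):
        """Steps 2-3. Returns True if the caller must still flush `old`."""
        pte = self.ptes.pop(old)
        self.ptes[new] = pte
        if self.flush_dirty_under_lock and pte["dirty"]:
            self.tlb.discard(old)  # flush before the lock is dropped
            return False
        return True  # flush is left to the caller (whole-range flush)

    def launder(self):
        """Clean every mapping that can be found, then write back."""
        for addr, pte in self.ptes.items():
            if pte["dirty"]:
                self.page_dirty = True
            pte["dirty"] = False
            pte["writable"] = False
            self.tlb.discard(addr)
        if self.page_dirty:
            self.disk = self.page
            self.page_dirty = False

    def munmap(self, addr):
        pte = self.ptes.pop(addr)
        if pte["dirty"]:
            self.page_dirty = True


def run_scenario(flush_dirty_under_lock):
    """Replay steps 1-9; returns (second_write_succeeded, value_read_back)."""
    mm = ToyMM(flush_dirty_under_lock)
    mm.write(X, 1)                     # 1: write 1, CPU1 caches X
    need_flush = mm.move_ptes(X, Y)    # 2-3: PTE moved from X to Y
    mm.launder()                       # 4: page deemed clean
    second_write_ok = mm.write(X, 2)   # 5: write 2 to the old address
    if need_flush:
        mm.tlb.discard(X)              # 6: deferred flush for X
    mm.munmap(Y)                       # 7
    mm.launder()                       # 8: write back if dirty ...
    mm.page = None                     #    ... then drop the page
    return second_write_ok, mm.disk    # 9: read the value back


if __name__ == "__main__":
    print("deferred flush:  ", run_scenario(False))
    print("flush under lock:", run_scenario(True))
```

In this model the deferred-flush run reports that the second write succeeded while the file still holds 1, which is the lost write from step 9. With the flush done before the lock is dropped, the second write finds no mapping at X and fails instead of being silently discarded.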
..
kasan	kprobes: Unpoison stack in jprobe_return() for KASAN	2016-10-16 11:02:31 +02:00
Kconfig	Allow KASAN and HOTPLUG_MEMORY to co-exist when doing build testing	2016-10-27 16:23:01 -07:00
Kconfig.debug	PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO	2016-09-13 02:35:27 +02:00
Makefile	Disable the __builtin_return_address() warning globally after all	2016-10-12 10:23:41 -07:00
backing-dev.c	…
balloon_compaction.c	…
bootmem.c	mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping	2016-10-11 15:06:33 -07:00
cleancache.c	…
cma.c	mm/cma.c: check the max limit for cma allocation	2016-11-11 08:12:37 -08:00
cma.h	…
cma_debug.c	…
compaction.c	mm, compaction: restrict fragindex to costly orders	2016-10-07 18:46:29 -07:00
debug.c	mm: clarify why we avoid page_mapcount() for slab pages in dump_page()	2016-10-07 18:46:29 -07:00
debug_page_ref.c	…
dmapool.c	…
early_ioremap.c	…
fadvise.c	…
failslab.c	…
filemap.c	mm/filemap: don't allow partially uptodate page for pipes	2016-11-11 08:12:37 -08:00
frame_vector.c	mm: replace get_vaddr_frames() write/force parameters with gup_flags	2016-10-19 08:11:24 -07:00
frontswap.c	…
gup.c	mm: unexport __get_user_pages()	2016-10-24 19:13:20 -07:00
highmem.c	…
huge_memory.c	mremap: fix race between mremap() and page cleanning	2016-11-17 09:46:56 -08:00
hugetlb.c	mm/hugetlb: fix huge page reservation leak in private mapping error paths	2016-11-11 08:12:37 -08:00
hugetlb_cgroup.c	…
hwpoison-inject.c	…
init-mm.c	…
internal.h	mm, compaction: make full priority ignore pageblock suitability	2016-10-07 18:46:29 -07:00
interval_tree.c	…
khugepaged.c	mm, thp: fix leaking mapped pte in __collapse_huge_page_swapin()	2016-09-19 15:36:16 -07:00
kmemcheck.c	…
kmemleak-test.c	…
kmemleak.c	mm: kmemleak: scan .data.ro_after_init	2016-11-11 08:12:37 -08:00
ksm.c	mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()	2016-10-07 18:46:29 -07:00
list_lru.c	mm/list_lru.c: avoid error-path NULL pointer deref	2016-10-27 18:43:42 -07:00
maccess.c	…
madvise.c	…
memblock.c	mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping	2016-10-11 15:06:33 -07:00
memcontrol.c	mm: memcontrol: do not recurse in direct reclaim	2016-10-27 18:43:43 -07:00
memory-failure.c	mm: hwpoison: fix thp split handling in memory_failure()	2016-11-11 08:12:37 -08:00
memory.c	mm: replace access_process_vm() write parameter with gup_flags	2016-10-19 08:31:25 -07:00
memory_hotplug.c	mm: remove unused variable in memory hotplug	2016-10-27 15:49:12 -07:00
mempolicy.c	mm: replace get_user_pages() write/force parameters with gup_flags	2016-10-19 08:11:43 -07:00
mempool.c	…
memtest.c	…
migrate.c	mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE	2016-10-07 18:46:29 -07:00
mincore.c	mm, swap: use offset of swap entry as key of swap cache	2016-10-07 18:46:28 -07:00
mlock.c	mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT)	2016-10-07 18:46:28 -07:00
mm_init.c	…
mmap.c	mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb	2016-10-07 18:46:29 -07:00
mmu_context.c	…
mmu_notifier.c	…
mmzone.c	…
mprotect.c	mm/numa: Remove duplicated include from mprotect.c	2016-10-19 17:28:48 +02:00
mremap.c	mremap: fix race between mremap() and page cleanning	2016-11-17 09:46:56 -08:00
msync.c	…
nobootmem.c	mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping	2016-10-11 15:06:33 -07:00
nommu.c	mm: unexport __get_user_pages()	2016-10-24 19:13:20 -07:00
oom_kill.c	oom: print nodemask in the oom report	2016-10-07 18:46:29 -07:00
page-writeback.c	mm: don't use radix tree writeback tags for pages in swap cache	2016-10-07 18:46:28 -07:00
page_alloc.c	mm: remove extra newline from allocation stall warning	2016-11-11 08:12:37 -08:00
page_counter.c	…
page_ext.c	mm/page_ext: support extra space allocation by page_ext user	2016-10-07 18:46:27 -07:00
page_idle.c	…
page_io.c	mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE()	2016-10-07 18:46:29 -07:00
page_isolation.c	mm/page_isolation: fix typo: "paes" -> "pages"	2016-10-07 18:46:29 -07:00
page_owner.c	mm/page_owner: don't define fields on struct page_ext by hard-coding	2016-10-07 18:46:27 -07:00
page_poison.c	…
pagewalk.c	…
percpu-km.c	…
percpu-vm.c	…
percpu.c	mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()	2016-10-05 11:52:55 -04:00
pgtable-generic.c	…
process_vm_access.c	mm: remove write/force parameters from __get_user_pages_unlocked()	2016-10-18 14:13:37 -07:00
quicklist.c	…
readahead.c	…
rmap.c	…
shmem.c	shmem: fix pageflags after swapping DMA32 object	2016-11-11 08:12:37 -08:00
slab.c	mm/slab: improve performance of gathering slabinfo stats	2016-10-27 18:43:43 -07:00
slab.h	mm/slab: improve performance of gathering slabinfo stats	2016-10-27 18:43:43 -07:00
slab_common.c	memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLAB	2016-11-11 08:12:37 -08:00
slob.c	…
slub.c	slub: Convert to hotplug state machine	2016-09-06 18:30:20 +02:00
sparse-vmemmap.c	…
sparse.c	…
swap.c	thp: reduce usage of huge zero page's atomic counter	2016-10-07 18:46:28 -07:00
swap_cgroup.c	…
swap_state.c	mm, swap: use offset of swap entry as key of swap cache	2016-10-07 18:46:28 -07:00
swapfile.c	swapfile: fix memory corruption via malformed swapfile	2016-11-11 08:12:37 -08:00
truncate.c	…
usercopy.c	mm: usercopy: Check for module addresses	2016-09-20 16:07:39 -07:00
userfaultfd.c	…
util.c	Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-10-22 09:39:10 -07:00
vmacache.c	mm: unrig VMA cache hit ratio	2016-10-07 18:46:27 -07:00
vmalloc.c	mm: consolidate warn_alloc_failed users	2016-10-07 18:46:29 -07:00
vmpressure.c	…
vmscan.c	mm: memcontrol: do not recurse in direct reclaim	2016-10-27 18:43:43 -07:00
vmstat.c	seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char	2016-10-07 18:46:30 -07:00
workingset.c	mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()	2016-09-30 15:26:52 -07:00
z3fold.c	…
zbud.c	…
zpool.c	…
zsmalloc.c	…
zswap.c	…