linux

Mel Gorman 4b62bbcc86 mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa

commit `8b272b3cbb` upstream.

: A user reported a bug against a distribution kernel while running a
: proprietary workload described as "memory intensive that is not swapping"
: that is expected to apply to mainline kernels. The workload is
: read/write/modifying ranges of memory and checking the contents. They
: reported that within a few hours a bad PMD would be reported followed
: by a memory corruption where expected data was all zeros. A partial
: report of the bad PMD looked like
:
: [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
: [ 5195.341184] ------------[ cut here ]------------
: [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
: ....
: [ 5195.410033] Call Trace:
: [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
: [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
: [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
: [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
: [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
: [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
: [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
:
: Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
: was a false detection. The bug does not trigger if automatic NUMA
: balancing or transparent huge pages is disabled.
:
: The bug is due to a race in change_pmd_range between a pmd_trans_huge and
: pmd_none_or_clear_bad check without any locks held. During the
: pmd_trans_huge check, a parallel protection update under lock can have
: cleared the PMD and filled it with a prot_numa entry between the transhuge
: check and the pmd_none_or_clear_bad check.
:
: While this could be fixed with heavy locking, it's only necessary to make
: a copy of the PMD on the stack during change_pmd_range and avoid races. A
: new helper is created for this as the check is quite subtle and the
: existing similar helper is not suitable. This passed 154 hours of
: testing (usually triggers between 20 minutes and 24 hours) without
: detecting bad PMDs or corruption. A basic test of an autonuma-intensive
: workload showed no significant change in behaviour.

Although Mel withdrew the patch on the face of LKML comment
https://lkml.org/lkml/2017/4/10/922, the race window aforementioned is
still open, and we have reports of Linpack test reporting bad residuals
after the bad PMD warning is observed. In addition to that, bad
rss-counter and non-zero pgtables assertions are triggered on mm teardown
for the task hitting the bad PMD.

 host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
 ....
 host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

The issue is observed on a v4.18-based distribution kernel, but the race
window is expected to be applicable to mainline kernels, as well.

[akpm@linux-foundation.org: fix comment typo, per Rafael]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2020-03-12 13:00:19 +01:00
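The race described in the commit message can be illustrated with a small userspace sketch. This is not the kernel code: the entry encoding, the flag names, and every function below are hypothetical stand-ins chosen for illustration. It contrasts a check that reads a shared entry twice with one that copies the entry to the stack once and classifies only the copy, which is the general idea the commit message describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding for illustration; not the kernel's real PMD layout. */
#define ENTRY_NONE   UINT64_C(0)
#define FLAG_PRESENT UINT64_C(1)  /* entry points at a lower-level table */
#define FLAG_HUGE    UINT64_C(2)  /* entry maps a huge page directly */

typedef _Atomic uint64_t entry_t;

enum verdict { SKIP, HANDLE_HUGE, WALK_TABLE, BAD };

/* Classify a value that has already been copied out of the shared slot. */
static enum verdict classify(uint64_t v)
{
    if (v == ENTRY_NONE)
        return SKIP;
    if (v & FLAG_HUGE)
        return HANDLE_HUGE;
    if (v & FLAG_PRESENT)
        return WALK_TABLE;
    return BAD;
}

/*
 * Model of the racy shape: the huge check and the none/bad check each
 * perform their own read. 'first' and 'second' are the values those two
 * reads observe, so an interleaving can be replayed deterministically.
 */
static enum verdict check_two_reads(uint64_t first, uint64_t second)
{
    if (first & FLAG_HUGE)
        return HANDLE_HUGE;
    /* A parallel updater may have rewritten the slot between the reads. */
    if (second == ENTRY_NONE)
        return SKIP;
    if (second & FLAG_PRESENT)
        return WALK_TABLE;
    return BAD; /* a valid huge entry lands here: false detection */
}

/* Snapshot shape: read the slot once, then decide using only the copy. */
static enum verdict check_snapshot(entry_t *slot)
{
    uint64_t copy = atomic_load_explicit(slot, memory_order_relaxed);
    return classify(copy);
}

/* Replays the interleaving: slot cleared, then refilled with a huge entry. */
static void demo(void)
{
    entry_t slot = ENTRY_NONE;

    /* Two reads straddling the update misreport a valid entry as bad. */
    assert(check_two_reads(ENTRY_NONE, FLAG_HUGE) == BAD);

    /* A single snapshot sees either the old or the new value, never a mix. */
    assert(check_snapshot(&slot) == SKIP);
    atomic_store_explicit(&slot, FLAG_HUGE, memory_order_relaxed);
    assert(check_snapshot(&slot) == HANDLE_HUGE);
}
```

Calling `demo()` replays the interleaving without threads. The sketch only shows why a single read removes the false "bad" verdict; it says nothing about the locking or memory-ordering details the real page-table code must respect.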
..
kasan
backing-dev.c	memcg: fix a crash in wb_workfn when a device disappears	2020-02-11 04:35:11 -08:00
balloon_compaction.c
cleancache.c
cma_debug.c
cma.c
cma.h
compaction.c
debug_page_ref.c
debug.c	mm/debug.c: always print flags in dump_page()	2020-03-05 16:43:51 +01:00
dmapool.c
early_ioremap.c
fadvise.c
failslab.c
filemap.c	mm: drop mmap_sem before calling balance_dirty_pages() in write fault	2020-01-09 10:19:55 +01:00
frame_vector.c
frontswap.c
gup_benchmark.c	mm/gup: fix memory leak in __gup_benchmark_ioctl	2020-01-09 10:20:00 +01:00
gup.c	mm/gup: allow FOLL_FORCE for get_user_pages_fast()	2020-03-05 16:43:51 +01:00
highmem.c
hmm.c
huge_memory.c	mm, thp: fix defrag setting if newline is not used	2020-03-05 16:43:51 +01:00
hugetlb_cgroup.c
hugetlb.c	mm/hugetlb: defer freeing of huge pages if in non-task context	2020-01-09 10:20:07 +01:00
hwpoison-inject.c
init-mm.c
internal.h	mm: drop mmap_sem before calling balance_dirty_pages() in write fault	2020-01-09 10:19:55 +01:00
interval_tree.c
Kconfig
Kconfig.debug
khugepaged.c
kmemleak-test.c
kmemleak.c
ksm.c
list_lru.c
maccess.c	uaccess: Add non-pagefault user-space write function	2020-01-17 19:48:40 +01:00
madvise.c
Makefile
memblock.c
memcontrol.c	mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()	2020-02-28 17:22:20 +01:00
memfd.c
memory_hotplug.c	mm/memory_hotplug: fix remove_memory() lockdep splat	2020-02-11 04:35:12 -08:00
memory-failure.c
memory.c	mm: drop mmap_sem before calling balance_dirty_pages() in write fault	2020-01-09 10:19:55 +01:00
mempolicy.c	mm/mempolicy.c: fix out of bounds write in mpol_parse_str()	2020-02-05 21:22:40 +00:00
mempool.c
memremap.c	mm/memory_hotplug: shrink zones when offlining memory	2020-01-09 10:19:56 +01:00
memtest.c
migrate.c	mm: move_pages: report the number of non-attempted pages	2020-02-11 04:35:13 -08:00
mincore.c
mlock.c
mm_init.c
mmap.c	mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()	2020-02-28 17:22:21 +01:00
mmu_context.c
mmu_gather.c	mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush	2020-02-11 04:35:42 -08:00
mmu_notifier.c
mmzone.c
mprotect.c	mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa	2020-03-12 13:00:19 +01:00
mremap.c	mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()	2020-02-28 17:22:21 +01:00
msync.c
nommu.c
oom_kill.c	mm/oom: fix pgtables units mismatch in Killed process message	2020-01-09 10:19:57 +01:00
page_alloc.c	mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section	2020-02-11 04:35:42 -08:00
page_counter.c
page_ext.c
page_idle.c
page_io.c
page_isolation.c
page_owner.c
page_poison.c
page_vma_mapped.c
page-writeback.c	mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()	2020-01-23 08:22:41 +01:00
pagewalk.c
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c
pgtable-generic.c
process_vm_access.c
readahead.c
rmap.c
rodata_test.c
shmem.c	mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment	2020-01-23 08:22:39 +01:00
shuffle.c
shuffle.h
slab_common.c	mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid	2020-01-23 08:22:39 +01:00
slab.c	mm, debug_pagealloc: don't rely on static keys too early	2020-01-23 08:22:40 +01:00
slab.h
slob.c
slub.c	mm, debug_pagealloc: don't rely on static keys too early	2020-01-23 08:22:40 +01:00
sparse-vmemmap.c
sparse.c	mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM	2020-02-28 17:22:20 +01:00
swap_cgroup.c
swap_slots.c
swap_state.c
swap.c
swapfile.c
truncate.c
usercopy.c
userfaultfd.c
util.c
vmacache.c
vmalloc.c	mm, debug_pagealloc: don't rely on static keys too early	2020-01-23 08:22:40 +01:00
vmpressure.c
vmscan.c	mm/vmscan.c: don't round up scan size for online memory cgroup	2020-02-28 17:22:20 +01:00
vmstat.c
workingset.c
z3fold.c
zbud.c
zpool.c
zsmalloc.c	mm/zsmalloc.c: fix the migrated zspage statistics.	2020-01-09 10:19:56 +01:00
zswap.c