linux/mm
Gu Zheng 20272b1747 mm/memory_hotplug.c: set zone->wait_table to null after freeing it
commit 85bd839983 upstream.

Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[<ffffffff810dff80>]  [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8  EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS:  00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
      unlock_page+0x6d/0x70
      generic_write_end+0x53/0xb0
      xfs_vm_write_end+0x29/0x80 [xfs]
      generic_perform_write+0x10a/0x1e0
      xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
      xfs_file_write_iter+0x79/0x120 [xfs]
      __vfs_write+0xd4/0x110
      vfs_write+0xac/0x1c0
      SyS_write+0x58/0xd0
      system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP  [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
     RSP <ffff880017b97be8>
    CR2: ffffc90008963690

Reproduce method (re-add a node)::
  Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

This seems an use-after-free problem, and the root cause is
zone->wait_table was not set to *NULL* after free it in
try_offline_node.

When hot re-add a node, we will reuse the pgdat of it, so does the zone
struct, and when add pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized.  The
judgement of zone initialized is based on zone->wait_table:

	static inline bool zone_is_initialized(struct zone *zone)
	{
		return !!zone->wait_table;
	}

so if we do not set the zone->wait_table to *NULL* after free it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-22 17:01:23 -07:00
..
Kconfig parisc,metag: Do not hardcode maximum userspace stack size 2014-07-17 16:21:03 -07:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
Makefile mm: per-thread vma caching 2014-10-09 12:21:29 -07:00
backing-dev.c bdi: avoid oops on device removal 2014-04-26 17:19:05 -07:00
balloon_compaction.c mm: print more details for bad_page() 2014-01-23 16:36:50 -08:00
bootmem.c mm/bootmem.c: remove unused local `map' 2013-11-13 12:09:09 +09:00
bounce.c block: Convert bio_for_each_segment() to bvec_iter 2013-11-23 22:33:49 -08:00
cleancache.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
compaction.c mm/compaction: fix wrong order check in compact_finished() 2015-03-18 13:31:23 +01:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap.c mm: get rid of radix tree gfp mask for pagecache_get_page 2015-01-29 17:40:53 -08:00
filemap_xip.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
fremap.c mm: fix bad rss-counter if remap_file_pages raced migration 2014-03-19 16:21:49 -07:00
frontswap.c mm: frontswap: invalidate expired data on a dup-store failure 2014-12-16 09:34:26 -08:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode 2015-01-29 17:40:52 -08:00
hugetlb.c mm/hugetlb: add migration entry check in __unmap_hugepage_range 2015-03-18 13:31:23 +01:00
hugetlb_cgroup.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
hwpoison-inject.c mm/hwpoison: add '#' to hwpoison_inject 2014-01-21 16:19:48 -08:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm, compaction: properly signal and act upon lock and need_sched() contention 2014-11-21 09:23:07 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c mm: kmemleak: avoid false negatives on vmalloc'ed objects 2013-11-13 12:09:07 +09:00
ksm.c vm: add VM_FAULT_SIGSEGV handling support 2015-04-29 10:31:55 +02:00
list_lru.c mm: list_lru: fix almost infinite loop causing effective livelock 2013-10-30 12:57:46 -07:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: madvise: fix MADV_WILLNEED on shmem swapouts 2014-11-21 09:23:06 -08:00
memblock.c memblock, memhotplug: fix wrong type in memblock_find_in_range_node(). 2014-10-05 14:52:17 -07:00
memcontrol.c mm: memcontrol: do not iterate uninitialized memcgs 2014-11-14 09:00:08 -08:00
memory-failure.c mm: soft-offline: fix num_poisoned_pages counting on concurrent events 2015-05-17 09:53:49 -07:00
memory.c vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS 2015-04-29 10:31:55 +02:00
memory_hotplug.c mm/memory_hotplug.c: set zone->wait_table to null after freeing it 2015-06-22 17:01:23 -07:00
mempolicy.c mm, numa: really disable NUMA balancing by default on single node machines 2015-06-06 08:19:38 -07:00
mempool.c mm/mempool.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) 2013-09-11 15:58:14 -07:00
migrate.c mm: fix direct reclaim writeback regression 2014-11-21 09:23:07 -08:00
mincore.c mm + fs: prepare for non-page entries in page cache radix trees 2014-11-21 09:23:06 -08:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-05-06 07:59:35 -07:00
mm_init.c mm: bring back /sys/kernel/mm 2014-01-27 21:02:39 -08:00
mmap.c mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() 2015-03-18 13:31:23 +01:00
mmu_context.c mm: remove old aio use_mm() comment 2013-05-07 18:38:27 -07:00
mmu_notifier.c mm: audit/fix non-modular users of module_init in core code 2014-01-23 16:36:52 -08:00
mmzone.c mm: numa: Change page last {nid,pid} into {cpu,pid} 2013-10-09 14:47:45 +02:00
mprotect.c mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit 2014-02-17 11:19:36 +11:00
mremap.c mm, thp: close race between mremap() and split_huge_page() 2014-06-07 10:28:10 -07:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c mm/nobootmem: free_all_bootmem again 2014-01-23 16:36:52 -08:00
nommu.c mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() 2015-03-18 13:31:23 +01:00
oom_kill.c OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-14 09:00:01 -08:00
page-writeback.c writeback: use |1 instead of +1 to protect against div by zero 2015-05-17 09:53:49 -07:00
page_alloc.c mm: when stealing freepages, also take pages created by splitting buddy page 2015-03-18 13:31:23 +01:00
page_cgroup.c cgroup/kmemleak: add kmemleak_free() for cgroup deallocations. 2014-11-14 09:00:07 -08:00
page_io.c Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block 2014-01-30 11:19:05 -08:00
page_isolation.c mm: memory-hotplug: enable memory hotplug to handle hugepage 2013-09-11 15:57:48 -07:00
pagewalk.c mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range 2015-02-11 14:54:47 +08:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c percpu: perform tlb flush after pcpu_map_pages() failure 2014-10-05 14:52:20 -07:00
percpu.c Revert "percpu: free percpu allocation info for uniprocessor system" 2014-11-14 08:59:45 -08:00
pgtable-generic.c mm: fix TLB flush race between migration, and change_protection_range 2013-12-18 19:04:51 -08:00
process_vm_access.c Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys 2013-03-12 11:05:45 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c mm/readahead.c: inline ra_submit 2014-11-21 09:23:06 -08:00
rmap.c mm: fix anon_vma_clone() error treatment 2014-12-16 09:34:26 -08:00
shmem.c shmem: fix init_page_accessed use to stop !PageLRU bug 2015-01-29 17:40:52 -08:00
slab.c mm: optimize put_mems_allowed() usage 2014-10-09 12:21:28 -07:00
slab.h memcg, slab: RCU protect memcg_params for root caches 2014-01-23 16:36:51 -08:00
slab_common.c slab_common: fix the check for duplicate slab names 2014-07-31 12:52:55 -07:00
slob.c mm/sl[aou]b: Move kmallocXXX functions to common code 2013-09-04 20:51:33 +03:00
slub.c mm: optimize put_mems_allowed() usage 2014-10-09 12:21:28 -07:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
swap.c mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated 2015-01-29 17:40:52 -08:00
swap_state.c mm: page_alloc: convert hot/cold parameter and immediate callers to bool 2015-01-29 17:40:51 -08:00
swapfile.c swap: change swap_list_head to plist, add swap_avail_head 2014-10-09 12:21:28 -07:00
truncate.c mm + fs: prepare for non-page entries in page cache radix trees 2014-11-21 09:23:06 -08:00
util.c vm_is_stack: use for_each_thread() rather then buggy while_each_thread() 2014-09-05 16:34:18 -07:00
vmacache.c mm: don't pointlessly use BUG_ON() for sanity check 2014-10-09 12:21:29 -07:00
vmalloc.c vmalloc: use rcu list iterator to reduce vmap_area_lock contention 2015-01-29 17:40:52 -08:00
vmpressure.c mm/vmpressure.c: fix race in vmpressure_work_fn() 2014-12-16 09:34:26 -08:00
vmscan.c mm: move zone->pages_scanned into a vmstat counter 2015-01-29 17:40:53 -08:00
vmstat.c mm: vmscan: only update per-cpu thresholds for online CPU 2015-01-29 17:40:53 -08:00
zbud.c mm/zbud: fix some trivial typos in comments 2013-09-11 15:57:35 -07:00
zsmalloc.c zsmalloc: add copyright 2014-01-30 16:56:55 -08:00
zswap.c mm/zswap.c: change params from hidden to ro 2014-01-23 16:36:50 -08:00