linux/arch/powerpc/mm
Srivatsa S. Bhat d4edc5b6c4 powerpc: Fix the setup of CPU-to-Node mappings during CPU online
On POWER platforms, the hypervisor can notify the guest kernel about dynamic
changes in the cpu-numa associativity (VPHN topology update). Hence the
cpu-to-node mappings that we got from the firmware during boot, may no longer
be valid after such updates. This is handled using the arch_update_cpu_topology()
hook in the scheduler, and the sched-domains are rebuilt according to the new
mappings.

But unfortunately, at the moment, CPU hotplug ignores these updated mappings
and instead queries the firmware for the cpu-to-numa relationships and uses
them during CPU online. So the kernel can end up assigning wrong NUMA nodes
to CPUs during subsequent CPU hotplug online operations (after booting).

Further, a particularly problematic scenario can result from this bug:
On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
threads per core. The switch to Single-Threaded (ST) mode is performed by
offlining all except the first CPU thread in each core. Switching back to
SMT mode involves onlining those other threads back, in each core.

Now consider this scenario:

1. During boot, the kernel gets the cpu-to-node mappings from the firmware
   and assigns the CPUs to NUMA nodes appropriately, during CPU online.

2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
   communicates this update to the kernel. The kernel in turn updates its
   cpu-to-node associations and rebuilds its sched domains. Everything is
   fine so far.

3. Now, the user switches the machine from SMT to ST mode (say, by running
   ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
   core.

4. The user then tries to switch back from ST to SMT mode (say, by running
   ppc64_cpu --smt=4), and this involves onlining those threads back. Since
   CPU hotplug ignores the new mappings, it queries the firmware and tries to
   associate the newly onlined sibling threads to the old NUMA nodes. This
   results in sibling threads within the same core getting associated with
   different NUMA nodes, which is incorrect.

   The scheduler's build-sched-domains code gets thoroughly confused with this
   and enters an infinite loop and causes soft-lockups, as explained in detail
   in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings).

So to fix this, use the numa_cpu_lookup_table to remember the updated
cpu-to-node mappings, and use them during CPU hotplug online operations.
Further, we also need to ensure that all threads in a core are assigned to a
common NUMA node, irrespective of whether all those threads were online during
the topology update. To achieve this, we take care not to use cpu_sibling_mask()
since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
and set up the mappings manually using the 'threads_per_core' value for that
particular platform. This helps us ensure that we don't hit this bug with any
combination of CPU hotplug and SMT mode switching.

Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-15 13:58:37 +11:00
..
40x_mmu.c memblock: Remove rmo_size, burry it in arch/powerpc where it belongs 2010-08-05 12:56:08 +10:00
44x_mmu.c powerpc: Delete __cpuinit usage from all users 2013-07-01 11:10:36 +10:00
Makefile powerpc/THP: Add code to handle HPTE faults for hugepages 2013-06-21 16:01:56 +10:00
dma-noncoherent.c mm/arch: use __free_reserved_page() to simplify the code 2013-11-13 12:09:03 +09:00
fault.c arch: mm: pass userspace fault flag to generic fault handler 2013-09-12 15:38:01 -07:00
fsl_booke_mmu.c powerpc/fsl-booke: Fixup calc_cam_sz to support MMU v2 2012-03-15 12:12:19 -05:00
gup.c powerpc: Fix __get_user_pages_fast() irq handling 2013-11-21 10:33:37 +11:00
hash_low_32.S powerpc: Use CURRENT_THREAD_INFO instead of open coded assembly 2012-07-11 14:18:22 +10:00
hash_low_64.S powerpc/mm: Free up _PAGE_COHERENCE for numa fault use later 2013-12-09 11:40:28 +11:00
hash_native_64.c powerpc: Book 3S MMU little endian support 2013-10-11 16:48:26 +11:00
hash_utils_64.c powerpc/mm: Free up _PAGE_COHERENCE for numa fault use later 2013-12-09 11:40:28 +11:00
highmem.c mm: fix race in kunmap_atomic() 2010-10-27 18:03:05 -07:00
hugepage-hash64.c powerpc/mm: Free up _PAGE_COHERENCE for numa fault use later 2013-12-09 11:40:28 +11:00
hugetlbpage-book3e.c powerpc/booke: Only check for hugetlb in flush if vma != NULL 2013-11-22 16:57:29 -06:00
hugetlbpage-hash64.c powerpc/mm: Free up _PAGE_COHERENCE for numa fault use later 2013-12-09 11:40:28 +11:00
hugetlbpage.c mm: remove obsolete comments about page table lock 2013-11-13 12:09:03 +09:00
icswx.c powerpc: Fix typo "CONFIG_ICSWX_PID" 2013-04-18 13:03:54 +10:00
icswx.h powerpc/icswx: Fix race condition with IPI setting ACOP 2012-03-07 17:06:09 +11:00
icswx_pid.c powerpc: Split ICSWX ACOP and PID processing 2011-11-25 14:11:27 +11:00
init_32.c powerpc/8xx: Fixing memory init issue with CONFIG_PIN_TLB 2013-10-28 21:11:22 -05:00
init_64.c powerpc: Prepare to support kernel handling of IOMMU map/unmap 2013-10-11 17:24:39 +11:00
mem.c powerpc: Fix memory hotplug with sparse vmemmap 2013-10-03 17:21:38 +10:00
mmap.c mm: remove free_area_cache 2013-07-10 18:11:34 -07:00
mmu_context_hash32.c powerpc: include export.h for files using EXPORT_SYMBOL/THIS_MODULE 2011-10-31 19:30:38 -04:00
mmu_context_hash64.c powerpc: Reduce PTE table memory wastage 2013-04-30 16:00:07 +10:00
mmu_context_nohash.c powerpc: Delete __cpuinit usage from all users 2013-07-01 11:10:36 +10:00
mmu_decl.h powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map 2011-10-11 23:30:41 -05:00
numa.c powerpc: Fix the setup of CPU-to-Node mappings during CPU online 2014-01-15 13:58:37 +11:00
pgtable.c powerpc: Delete non-required instances of include <linux/init.h> 2014-01-15 13:46:44 +11:00
pgtable_32.c powerpc: handle pgtable_page_ctor() fail 2013-11-15 09:32:18 +09:00
pgtable_64.c powerpc: Delete non-required instances of include <linux/init.h> 2014-01-15 13:46:44 +11:00
ppc_mmu_32.c memblock: Remove rmo_size, burry it in arch/powerpc where it belongs 2010-08-05 12:56:08 +10:00
slb.c powerpc: Fix little endian lppaca, slb_shadow and dtl_entry 2013-08-14 15:33:35 +10:00
slb_low.S powerpc: Rename USER_ESID_BITS* to ESID_BITS* 2013-03-17 12:45:44 +11:00
slice.c powerpc: ppc64 address space capped at 32TB, mmap randomisation disabled 2013-11-21 10:33:41 +11:00
stab.c powerpc/mm: Remove uses of abs_to_virt() and virt_to_abs() 2012-09-05 15:19:31 +10:00
subpage-prot.c powerpc: Fix a number of sparse warnings 2013-08-14 11:50:24 +10:00
tlb_hash32.c powerpc: include export.h for files using EXPORT_SYMBOL/THIS_MODULE 2011-10-31 19:30:38 -04:00
tlb_hash64.c powerpc: Delete non-required instances of include <linux/init.h> 2014-01-15 13:46:44 +11:00
tlb_low_64e.S powerpc/booke64: Use SPRG0/3 scratch for bolted TLB miss & crit int 2012-09-05 15:35:52 +10:00
tlb_nohash.c powerpc: Move the patch_exception to a common place 2013-12-02 14:06:54 +11:00
tlb_nohash_low.S powerpc/47x: Use the new ppc-opcode infrastructure 2012-11-15 12:59:24 +11:00