linux/kernel/sched
Valentin Schneider f2323c374e sched/topology: Assert non-NUMA topology masks don't (partially) overlap
[ Upstream commit ccf74128d6 ]

topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.

In his example (8 CPUs, 2-node system), we end up with:
  MC span for CPU3 == 3-7
  MC span for CPU4 == 4-7

The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:

  3 -> 4 -> 5 -> 6 -> 7
  ^                  /
   `----------------'

And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

  3 -> 4 -> 5 -> 6 -> 7
       ^             /
	`-----------'

which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.

There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.

Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.

Testing
=======

Dietmar managed to reproduce this using the following qemu incantation:

  $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
  -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
  cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
  node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):

8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;

+		if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+			continue;
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

8<---

[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com

Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:52 +01:00
..
Makefile psi: pressure stall information for CPU, memory, and IO 2018-10-26 16:26:32 -07:00
autogroup.c sched/autogroup: Make autogroup_path() always available 2019-06-24 19:23:40 +02:00
autogroup.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
clock.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
completion.c sched/Documentation: Update wake_up() & co. memory-barrier guarantees 2018-07-17 09:30:34 +02:00
core.c sched/core: Fix size of rq::uclamp initialization 2020-02-24 08:36:51 +01:00
cpuacct.c sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
cpudeadline.c Linux 5.2-rc5 2019-06-17 12:12:27 +02:00
cpudeadline.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
cpufreq.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-31 16:46:06 +01:00
cpufreq_schedutil.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-31 16:46:06 +01:00
cpupri.c Linux 5.2-rc5 2019-06-17 12:12:27 +02:00
cpupri.h sched/headers: Simplify and clean up header usage in the scheduler 2018-03-04 12:39:29 +01:00
cputime.c sched/vtime: Fix guest/system mis-accounting on task switch 2019-10-09 12:38:03 +02:00
deadline.c sched/core: Further clarify sched_class::set_next_task() 2020-01-26 10:01:03 +01:00
debug.c Linux 5.2-rc6 2019-06-24 19:19:53 +02:00
fair.c sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util() 2020-01-26 10:01:08 +01:00
features.h sched/fair: Replace source_load() & target_load() with weighted_cpuload() 2019-06-03 11:49:39 +02:00
idle.c sched/core: Further clarify sched_class::set_next_task() 2020-01-26 10:01:03 +01:00
isolation.c sched/isolation: Prefer housekeeping CPU in local node 2019-07-25 15:51:55 +02:00
loadavg.c sched: loadavg: make calc_load_n() public 2018-10-26 16:26:32 -07:00
membarrier.c membarrier: Fix RCU locking bug caused by faulty merge 2019-10-01 21:27:50 +02:00
pelt.c sched/debug: Add new tracepoint to track PELT at se level 2019-06-24 19:23:42 +02:00
pelt.h sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity() 2019-06-24 19:23:39 +02:00
psi.c psi: Fix a division error in psi poll() 2020-01-12 12:21:36 +01:00
rt.c sched/core: Further clarify sched_class::set_next_task() 2020-01-26 10:01:03 +01:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched/core: Further clarify sched_class::set_next_task() 2020-01-26 10:01:03 +01:00
stats.c proc: introduce proc_create_seq{,_data} 2018-05-16 07:23:35 +02:00
stats.h sched/stats: Fix unlikely() use of sched_info_on() 2019-07-25 15:51:55 +02:00
stop_task.c sched/core: Further clarify sched_class::set_next_task() 2020-01-26 10:01:03 +01:00
swait.c kernel/sched/: remove caller signal_pending branch predictions 2019-01-04 13:13:48 -08:00
topology.c sched/topology: Assert non-NUMA topology masks don't (partially) overlap 2020-02-24 08:36:52 +01:00
wait.c sched/wait: Deduplicate code with do-while 2019-06-24 19:23:40 +02:00
wait_bit.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00