linux/kernel/sched
Rik van Riel 6f9aad0bc3 sched/numa: Only consider less busy nodes as numa balancing destinations
Changeset a43455a1d5 ("sched/numa: Ensure task_numa_migrate() checks
the preferred node") fixes an issue where workloads would never
converge on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly
staying spread out across the whole system. This leads to a reduction
in available memory bandwidth, and usable CPU cache, with predictable
performance problems.

The root cause appears to be an interaction between the load balancer
and NUMA balancing, where the short-term load seen by the load
balancer differs from the long-term load the NUMA balancing code would
like to base its decisions on.

Simply reverting a43455a1d5 would re-introduce the non-convergence
of workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1d5 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the load
balancer and NUMA balancing by only ever considering a less loaded
node as a destination for NUMA balancing, regardless of whether the
task is trying to move to its preferred node or to another node.
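
The gist of that check can be sketched as follows; the struct and
helper names here are illustrative assumptions, not the exact kernel
code. A candidate node is rejected unless its capacity-normalized load
is below the source node's, and the same test applies whether or not
the candidate is the preferred node:

	/*
	 * Minimal sketch (illustrative names, not the kernel's own):
	 * a node is only acceptable as a NUMA balancing destination if
	 * its load, corrected for compute capacity, is lower than the
	 * source node's.
	 */
	#include <stdbool.h>

	struct node_stats {
		unsigned long load;		/* summed runnable load on the node */
		unsigned long compute_capacity;	/* total CPU capacity of the node */
	};

	static bool dst_less_busy(const struct node_stats *src,
				  const struct node_stats *dst)
	{
		/*
		 * Compare src->load / src->compute_capacity against
		 * dst->load / dst->compute_capacity by cross-multiplying,
		 * avoiding integer division and its rounding.
		 */
		return src->load * dst->compute_capacity >
		       dst->load * src->compute_capacity;
	}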

This patch also addresses an issue introduced by 095bebf61a
("sched/numa: Do not move past the balance point if unbalanced"),
where a system with a single runnable thread would never migrate that
thread closer to its memory.

In a test where the main thread creates a large memory area and spawns
a worker thread to iterate over that memory (the worker having been
placed on another node by select_task_rq_fair), after which the main
thread goes to sleep and waits for the worker to loop over all the
memory, the worker thread is now migrated to where the memory is,
instead of all the memory being migrated over as before.
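
Such a test can be sketched in a few lines of userspace C; the
allocation size, pass count, and page size below are illustrative
assumptions, and the real test relied on select_task_rq_fair placing
the worker on another node rather than on any explicit binding:

	/* Illustrative test sketch, not the original reproducer. */
	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>

	#define AREA_SIZE	(4UL << 30)	/* 4 GB working set (assumed size) */
	#define PASSES		100

	static void *worker(void *arg)
	{
		volatile char *mem = arg;
		unsigned long i, pass;

		/*
		 * Repeatedly touch every page; NUMA balancing should
		 * eventually migrate this thread next to the memory it
		 * keeps faulting on.
		 */
		for (pass = 0; pass < PASSES; pass++)
			for (i = 0; i < AREA_SIZE; i += 4096)
				(void)mem[i];
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;
		char *mem = malloc(AREA_SIZE);

		/* Fault the memory in near the main thread. */
		memset(mem, 1, AREA_SIZE);

		/* The worker may be placed on another node by the scheduler. */
		pthread_create(&tid, NULL, worker, mem);

		/* Main thread blocks; only the worker remains runnable. */
		pthread_join(tid, NULL);
		free(mem);
		return 0;
	}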

Jirka has run a number of performance tests on several systems:
single-instance SpecJBB 2005 performance is 7-15% higher on a 4-node
system, with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one instance per node), linpack, and
stream see little or no change with the revert of 095bebf61a and this
patch.

Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07 15:57:45 +02:00
Makefile sched: Move the loadavg code to a more obvious location 2015-05-08 12:04:12 +02:00
auto_group.c sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00
auto_group.h sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00
clock.c kernel/sched/clock.c: add another clock for use with the soft lockup watchdog 2015-02-12 18:54:13 -08:00
completion.c sched/completion: Serialize completion_done() with complete() 2015-02-18 14:27:40 +01:00
core.c preempt: Use preempt_schedule_context() as the official tracing preemption point 2015-06-07 15:57:42 +02:00
cpuacct.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
cpuacct.h sched/cpuacct: Initialize root cpuacct earlier 2013-04-10 13:54:20 +02:00
cpudeadline.c sched/deadline: Remove cpu_active_mask from cpudl_find() 2015-02-04 07:52:29 +01:00
cpudeadline.h sched/deadline: Modify cpudl::free_cpus to reflect rd->online 2015-01-30 19:39:16 +01:00
cpupri.c Merge commit '3cf2f34' into sched/core, to fix build error 2014-06-12 13:46:37 +02:00
cpupri.h sched/cpupri: Remove unnecessary definitions in cpupri.h 2014-11-16 10:58:59 +01:00
cputime.c sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00
deadline.c sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00
debug.c sched: Track group sched_entity usage contributions 2015-03-27 09:35:58 +01:00
fair.c sched/numa: Only consider less busy nodes as numa balancing destinations 2015-06-07 15:57:45 +02:00
features.h sched/rt: Use IPI to trigger RT task push migration instead of pulling 2015-03-23 10:55:22 +01:00
idle.c cpuidle: Run tick_broadcast_exit() with disabled interrupts 2015-04-29 15:19:21 +02:00
idle_task.c sched: Provide update_curr callbacks for stop/idle scheduling classes 2014-11-23 14:14:40 -08:00
loadavg.c sched: Move the loadavg code to a more obvious location 2015-05-08 12:04:12 +02:00
rt.c sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00
sched.h sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00
stats.c sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:37 -08:00
stats.h sched, timer: Use the atomic task_cputime in thread_group_cputimer 2015-05-08 12:17:46 +02:00
stop_task.c sched: Provide update_curr callbacks for stop/idle scheduling classes 2014-11-23 14:14:40 -08:00
wait.c sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() 2015-05-08 12:11:32 +02:00