linux/kernel/sched
Peter Zijlstra 10e2f1acd0 sched/core: Rewrite and improve select_idle_siblings()
select_idle_siblings() is a known pain point for a number of
workloads; it either does too much or not enough and sometimes just
gets it plain wrong.

This rewrite attempts to address a number of issues (but sadly not
all).

The current code does an unconditional sched_domain iteration, with
the intent of finding an idle core (on SMT hardware). The problems
this patch tries to address are:

 - it's pointless to look for idle cores if the machine is really busy;
   at that point you're just wasting cycles.

 - its behaviour is inconsistent between SMT and !SMT hardware in
   that !SMT hardware ends up doing a scan for any idle CPU in the LLC
   domain, while SMT hardware does a scan for idle cores and, if that
   fails, falls back to a scan for idle threads on the 'target' core.

The new code replaces the sched_domain scan with 3 explicit scans:

 1) search for an idle core in the LLC
 2) search for an idle CPU in the LLC
 3) search for an idle thread in the 'target' core

where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
heuristics to skip the step.
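
For illustration, the combined flow looks roughly like the sketch
below. This is a simplified stand-in rather than the kernel code: the
toy topology (NR_CPUS, the cpu_idle[] array, a single has_idle_cores
bool) is assumed, and the three scan helpers, named after the
select_idle_core()/select_idle_cpu()/select_idle_smt() helpers the
patch introduces, are stubbed out here and sketched per step below.

#include <stdbool.h>

#define NR_CPUS 8                       /* toy LLC with 8 CPUs */

static bool cpu_idle[NR_CPUS];          /* stand-in for per-CPU idle state */
static bool cpu_has_smt = true;         /* assumed SMT topology */
static bool has_idle_cores;             /* models sd_llc_shared->has_idle_cores */

static bool scan_worthwhile(void)       /* step 2 cut-off, sketched below */
{
        return true;
}

static int select_idle_core(int target) { return -1; }  /* step 1, see below */
static int select_idle_cpu(int target)  { return -1; }  /* step 2, see below */
static int select_idle_smt(int target)  { return -1; }  /* step 3, see below */

/* Three explicit scans replace the old sched_domain iteration. */
static int select_idle_sibling(int target)
{
        int cpu = -1;

        if (cpu_idle[target])                   /* fast path: target is idle */
                return target;

        if (cpu_has_smt && has_idle_cores)      /* 1) idle core in the LLC */
                cpu = select_idle_core(target);

        if (cpu < 0 && scan_worthwhile())       /* 2) any idle CPU in the LLC */
                cpu = select_idle_cpu(target);

        if (cpu < 0 && cpu_has_smt)             /* 3) idle thread in target core */
                cpu = select_idle_smt(target);

        return cpu >= 0 ? cpu : target;         /* nothing idle: stay on target */
}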

Step 1) is conditional on sd_llc_shared->has_idle_cores; when a CPU
goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
siblings of the CPU going idle. Similarly, we clear
sd_llc_shared->has_idle_cores when we fail to find an idle core.
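
A minimal sketch of that hysteresis, on the same toy topology as
above (the helper names here are illustrative; in the patch the flag
lives in the shared LLC sched_domain data and the idle-entry hook is
called from the idle path):

#include <stdbool.h>

#define NR_CPUS      8
#define SMT_PER_CORE 2                  /* assumed: 2 SMT threads per core */

static bool cpu_idle[NR_CPUS];
static bool has_idle_cores;             /* models sd_llc_shared->has_idle_cores */

/* Called when @cpu enters idle: only its own core's siblings are checked. */
static void update_idle_core(int cpu)
{
        int first = cpu - (cpu % SMT_PER_CORE), sibling;

        if (has_idle_cores)             /* already known to have an idle core */
                return;

        for (sibling = first; sibling < first + SMT_PER_CORE; sibling++)
                if (sibling != cpu && !cpu_idle[sibling])
                        return;         /* a sibling is still busy */

        has_idle_cores = true;          /* this whole core just went idle */
}

/* Called when a full LLC scan fails to find an idle core. */
static void clear_idle_cores(void)
{
        has_idle_cores = false;         /* suppress scan 1 until a core idles */
}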

Step 2) tracks the average cost of the scan and compares this to the
average idle time estimate for the CPU doing the wakeup. There is a
significant fudge factor involved to deal with the variability of the
averages; hackbench in particular was sensitive to this.
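
Roughly, the cut-off behaves like the sketch below; the averaging and
the factor shown are illustrative assumptions, while the patch itself
keeps a per-domain average scan cost and compares it against the
waking CPU's avg_idle estimate:

#include <stdbool.h>
#include <stdint.h>

struct wakeup_stats {
        uint64_t avg_idle;        /* ns: estimated idle time of the waking CPU */
        uint64_t avg_scan_cost;   /* ns: running average cost of one LLC scan */
};

/* Only scan the LLC when the expected idle time clearly exceeds the cost. */
static bool idle_cpu_scan_worthwhile(const struct wakeup_stats *st)
{
        /*
         * Generous fudge factor (value is illustrative): both averages
         * are noisy, and hackbench was especially sensitive to this.
         */
        return st->avg_idle > 4 * st->avg_scan_cost;
}

/* Fold the measured cost of the last scan into a simple running average. */
static void update_scan_cost(struct wakeup_stats *st, uint64_t this_cost)
{
        st->avg_scan_cost = (7 * st->avg_scan_cost + this_cost) / 8;
}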

Step 3) is unconditional; we assume (also per step 1) that scanning
all SMT siblings in a core is 'cheap'.
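
Scan 3 is bounded by the core width, which is why it can stay
unconditional; a minimal sketch on the toy topology used above:

#include <stdbool.h>

#define NR_CPUS      8
#define SMT_PER_CORE 2                  /* assumed: 2 SMT threads per core */

static bool cpu_idle[NR_CPUS];

/* 3) Look only at the SMT siblings of the target CPU. */
static int select_idle_smt(int target)
{
        int first = target - (target % SMT_PER_CORE), cpu;

        /* At most SMT_PER_CORE iterations: no LLC-wide walk here. */
        for (cpu = first; cpu < first + SMT_PER_CORE; cpu++)
                if (cpu_idle[cpu])
                        return cpu;

        return -1;                      /* no idle thread in the target core */
}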

With this, SMT systems gain step 2, which cures a few benchmarks --
notably one from Facebook.

One 'feature' of the sched_domain iteration, which we preserve in the
new code, is that it starts scanning from the 'target' CPU instead of
walking the cpumask in CPU id order. This reduces the chance of
multiple CPUs in the LLC that are scanning for idle at the same time
all ganging up on the same CPU. The downside is that tasks can end up
hopping across the LLC for no apparent reason.
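
The preserved scan order amounts to starting the walk at the target
CPU and wrapping around the LLC, roughly as below (illustrative; the
real code walks a cpumask spanning the LLC domain):

#include <stdbool.h>

#define NR_CPUS 8                       /* toy LLC with 8 CPUs */

static bool cpu_idle[NR_CPUS];

/* Start at @target and wrap, so concurrent scans start at different points. */
static int scan_llc_from(int target)
{
        int i, cpu;

        for (i = 0; i < NR_CPUS; i++) {
                cpu = (target + i) % NR_CPUS;
                if (cpu_idle[cpu])
                        return cpu;
        }
        return -1;                      /* no idle CPU in the LLC */
}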

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 11:03:09 +02:00
File | Latest commit | Date
Makefile | cpufreq: schedutil: New governor based on scheduler utilization data | 2016-04-02 01:09:12 +02:00
auto_group.c | sched/core: Move the sched_to_prio[] arrays out of line | 2015-12-04 10:34:46 +01:00
auto_group.h | sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() | 2015-05-08 12:11:32 +02:00
clock.c | sched/clock: Make local_clock()/cpu_clock() inline | 2016-04-13 12:25:22 +02:00
completion.c | sched/completion: Serialize completion_done() with complete() | 2015-02-18 14:27:40 +01:00
core.c | sched/core: Rewrite and improve select_idle_siblings() | 2016-09-30 11:03:09 +02:00
cpuacct.c | sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together | 2016-07-09 13:56:15 +02:00
cpuacct.h | sched/cpuacct: Simplify the cpuacct code | 2016-03-21 11:00:28 +01:00
cpudeadline.c | sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear() | 2016-09-05 13:29:43 +02:00
cpudeadline.h | sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear() | 2016-09-05 13:29:43 +02:00
cpufreq.c | cpufreq: sched: Helpers to add and remove update_util hooks | 2016-04-02 01:08:43 +02:00
cpufreq_schedutil.c | cpufreq: schedutil: map raw required frequency to driver frequency | 2016-07-21 22:28:21 +02:00
cpupri.c | sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed | 2016-05-12 09:55:35 +02:00
cpupri.h | sched/cpupri: Remove unnecessary definitions in cpupri.h | 2014-11-16 10:58:59 +01:00
cputime.c | sched/cputime: Improve scalability by not accounting thread group tasks pending runtime | 2016-08-18 11:53:46 +02:00
deadline.c | sched/deadline: Fix the intention to re-evalute tick dependency for offline CPU | 2016-09-05 13:29:45 +02:00
debug.c | sched/debug: Remove several CONFIG_SCHEDSTATS guards | 2016-09-05 13:29:47 +02:00
fair.c | sched/core: Rewrite and improve select_idle_siblings() | 2016-09-30 11:03:09 +02:00
features.h | sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define | 2015-09-13 09:52:55 +02:00
idle.c | Merge branch 'sched/urgent' into sched/core, to pick up fixes | 2016-06-14 11:04:13 +02:00
idle_task.c | sched/core: Rewrite and improve select_idle_siblings() | 2016-09-30 11:03:09 +02:00
loadavg.c | sched/core: Correct off by one bug in load migration calculation | 2016-07-13 14:58:20 +02:00
rt.c | sched/core: Provide a tsk_nr_cpus_allowed() helper | 2016-05-12 09:55:36 +02:00
sched.h | sched/core: Rewrite and improve select_idle_siblings() | 2016-09-30 11:03:09 +02:00
stats.c | sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks | 2015-02-13 21:21:37 -08:00
stats.h | sched/debug: Rename 'schedstat_val()' -> 'schedstat_val_or_zero()' | 2016-09-05 13:29:46 +02:00
stop_task.c | locking/lockdep, sched/core: Implement a better lock pinning scheme | 2016-05-05 09:23:59 +02:00
swait.c | wait.[ch]: Introduce the simple waitqueue (swait) implementation | 2016-02-25 11:27:16 +01:00
wait.c | sched/wait: Introduce init_wait_entry() | 2016-09-30 10:54:03 +02:00