commit 62470419e9 ("sched: Implement smarter wake-affine logic")
Author: Michael Wang

The wake-affine scheduler feature currently always tries to pull the
wakee close to the waker. In theory this should be beneficial if the
waker's CPU caches hot data for the wakee, and it is also beneficial
in the extreme ping-pong, high context-switch-rate case.

Testing shows it can benefit hackbench by up to 15%.

However, the feature is somewhat blind, and some workloads such as
pgbench suffer from it. The decision itself is also algorithmically
time-consuming.

Testing shows it can hurt pgbench by up to 50%, far more than the
benefit it brings in the best case.

So wake-affine should be smarter and realize when to stop its
thankless effort of trying to find a suitable CPU to wake on.

This patch introduces 'wakee_flips', which will be increased each
time the task flips (switches) its wakee target.

So a high 'wakee_flips' value means the task has more than one
wakee, and the bigger the number, the higher the wakeup frequency.
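
As a rough sketch, the bookkeeping on the wakeup path could look like
the snippet below (the helper and field names are illustrative
assumptions, not quoted from the patch):

	/*
	 * Called in the waker's context; 'p' is the task being woken up.
	 * Waking a different task than last time counts as one 'flip'.
	 */
	static void record_wakee(struct task_struct *p)
	{
		if (current->last_wakee != p) {
			current->last_wakee = p;
			current->wakee_flips++;
		}
	}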

Now, when making the decision on whether or not to pull, pay attention
to wakees with a high 'wakee_flips'. Pulling such a task may benefit
the wakee, but it also implies that the waker will face cruel
competition later on; how cruel, or how short-lived, depends on the
story behind 'wakee_flips'. Either way, the waker suffers.

Furthermore, if the waker also has a high 'wakee_flips', that implies
that multiple tasks rely on it; the waker's increased latency would
then hurt all of them, so pulling the wakee looks like a bad deal.

Thus, as 'waker->wakee_flips / wakee->wakee_flips' grows, so does the
expected cost of pulling.

The patch therefore helps the wake-affine feature to stop its pulling
work when:

	wakee->wakee_flips > factor &&
	waker->wakee_flips > (factor * wakee->wakee_flips)

The 'factor' here is the number of CPUs in the current CPU's NUMA node,
so a bigger node leads to more pulling, since the test for giving up
becomes harder to pass.
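
Put together, the bail-out test could be sketched roughly as below (the
helper name and the exact way 'factor' is obtained are assumptions for
illustration):

	/* Return 1 when an affine pull no longer looks worthwhile. */
	static int wake_wide(struct task_struct *p)
	{
		/* 'factor': number of CPUs in the current CPU's NUMA node */
		unsigned int factor = cpumask_weight(cpumask_of_node(
					cpu_to_node(smp_processor_id())));

		/* 'p' is the wakee, 'current' is the waker */
		if (p->wakee_flips > factor &&
		    current->wakee_flips > (factor * p->wakee_flips))
			return 1;

		return 0;
	}

When this test fires, the affine pull is skipped and the normal wakeup
balancing decides where the wakee runs.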

After applying the patch, pgbench shows improvements of up to 40% and no regressions.

Tested on a 12-CPU x86 server with tip 3.10.0-rc7.

The percentages in the final column highlight the areas with the
biggest wins; all other areas improved as well:

	pgbench		    base	smart

	| db_size | clients |  tps  |	|  tps  |
	+---------+---------+-------+   +-------+
	| 22 MB   |       1 | 10598 |   | 10796 |
	| 22 MB   |       2 | 21257 |   | 21336 |
	| 22 MB   |       4 | 41386 |   | 41622 |
	| 22 MB   |       8 | 51253 |   | 57932 |
	| 22 MB   |      12 | 48570 |   | 54000 |
	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
	| 7484 MB |       1 |  8951 |   |  9193 |
	| 7484 MB |       2 | 19233 |   | 19240 |
	| 7484 MB |       4 | 37239 |   | 37302 |
	| 7484 MB |       8 | 46087 |   | 50018 |
	| 7484 MB |      12 | 42054 |   | 48763 |
	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
	| 15 GB   |       1 |  8845 |   |  9104 |
	| 15 GB   |       2 | 19094 |   | 19162 |
	| 15 GB   |       4 | 36979 |   | 36983 |
	| 15 GB   |       8 | 46087 |   | 49977 |
	| 15 GB   |      12 | 41901 |   | 48591 |
	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-23 12:18:41 +02:00