linux

History

Yuyang Du 9d89c257df sched/fair: Rewrite runnable load and utilization average tracking The idea of runnable load average (let runnable time contribute to weight) was proposed by Paul Turner and Ben Segall, and it is still followed by this rewrite. This rewrite aims to solve the following issues: 1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated at the granularity of an entity at a time, which results in the cfs_rq's load average is stale or partially updated: at any time, only one entity is up to date, all other entities are effectively lagging behind. This is undesirable. To illustrate, if we have n runnable entities in the cfs_rq, as time elapses, they certainly become outdated: t0: cfs_rq { e1_old, e2_old, ..., en_old } and when we update: t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old } t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old } ... We solve this by combining all runnable entities' load averages together in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based on the fact that if we regard the update as a function, then: w * update(e) = update(w * e) and update(e1) + update(e2) = update(e1 + e2), then w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2) therefore, by this rewrite, we have an entirely updated cfs_rq at the time we update it: t1: update cfs_rq { e1_new, e2_new, ..., en_new } t2: update cfs_rq { e1_new, e2_new, ..., en_new } ... 2. cfs_rq's load average is different between top rq->cfs_rq and other task_group's per CPU cfs_rqs in whether or not blocked_load_average contributes to the load. The basic idea behind runnable load average (the same for utilization) is that the blocked state is taken into account as opposed to only accounting for the currently runnable state. Therefore, the average should include both the runnable/running and blocked load averages. This rewrite does that. In addition, we also combine runnable/running and blocked averages of all entities into the cfs_rq's average, and update it together at once. This is based on the fact that: update(runnable) + update(blocked) = update(runnable + blocked) This significantly reduces the code as we don't need to separately maintain/update runnable/running load and blocked load. 3. How task_group entities' share is calculated is complex and imprecise. We reduce the complexity in this rewrite to allow a very simple rule: the task_group's load_avg is aggregated from its per CPU cfs_rqs's load_avgs. Then group entity's weight is simply proportional to its own cfs_rq's load_avg / task_group's load_avg. To illustrate, if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then, task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share To sum up, this rewrite in principle is equivalent to the current one, but fixes the issues described above. Turns out, it significantly reduces the code complexity and hence increases clarity and efficiency. In addition, the new averages are more smooth/continuous (no spurious spikes and valleys) and updated more consistently and quickly to reflect the load dynamics. As a result, we have less load tracking overhead, better performance, and especially better power efficiency due to more balanced load. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>		2015-08-03 12:21:29 +02:00
..
acpi	Additional ACPICA material for v4.2-rc1	2015-07-02 17:11:28 -07:00
asm-generic	sched/preempt: Fix cond_resched_lock() and cond_resched_softirq()	2015-08-03 12:21:24 +02:00
clocksource	…
crypto	crypto: rng - Do not free default RNG when it becomes unused	2015-06-22 15:49:18 +08:00
drm	drm: use kvfree() in drm_free_large()	2015-06-30 19:44:59 -07:00
dt-bindings	The changes to the common clock framework for 4.2 are dominated by new	2015-07-01 19:22:00 -07:00
keys	…
kvm	…
linux	sched/fair: Rewrite runnable load and utilization average tracking	2015-08-03 12:21:29 +02:00
math-emu	…
media	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds	2015-07-01 19:09:11 -07:00
memory	…
misc	…
net	net: Kill sock->sk_protinfo	2015-06-28 16:55:44 -07:00
pcmcia	…
ras	tracing: add trace event for memory-failure	2015-06-24 17:49:43 -07:00
rdma	…
rxrpc	…
scsi	SCSI misc on 20150622	2015-06-23 15:55:44 -07:00
soc	Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm	2015-06-26 12:20:00 -07:00
sound	ASoC: Further updates for v4.2	2015-06-22 11:32:41 +02:00
target	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending	2015-07-04 14:13:43 -07:00
trace	sched: Introduce the 'trace_sched_waking' tracepoint	2015-08-03 12:21:22 +02:00
uapi	virtio/vhost: cross endian support	2015-07-03 16:02:25 -07:00
video	Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux	2015-06-26 13:18:51 -07:00
xen	…
Kbuild	…