sched: Describe CFS load-balancer

Add some scribbles on how and why the load-balancer works.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra 2012-07-03 13:53:26 +02:00 committed by Ingo Molnar
parent f4e26b120b
commit e9c84cb8d5
1 changed files with 116 additions and 2 deletions

@@ -3456,8 +3456,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
#ifdef CONFIG_SMP
/**************************************************
* Fair scheduling class load-balancing methods:
*/
* Fair scheduling class load-balancing methods.
*
* BASICS
*
* The purpose of load-balancing is to achieve the same basic fairness the
* per-cpu scheduler provides, namely provide a proportional amount of compute
* time to each task. This is expressed in the following equation:
*
* W_i,n/P_i == W_j,n/P_j for all i,j (1)
*
* Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
* W_i,0 is defined as:
*
* W_i,0 = \Sum_j w_i,j (2)
*
* Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
* is derived from the nice value as per prio_to_weight[].
*
* The weight average is an exponential decay average of the instantaneous
* weight:
*
* W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0 (3)
*
* P_i is the cpu power (or compute capacity) of cpu i; typically it is the
* fraction of 'recent' time available for SCHED_OTHER task execution, but it
* can also include other factors [XXX].
*
* To achieve this balance we define a measure of imbalance which follows
* directly from (1):
*
* imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j } (4)
*
* We then move tasks around to minimize the imbalance. In the continuous
* function space it is obvious this converges; in the discrete case we get
* a few fun cases generally called infeasible weight scenarios.
*
* [XXX expand on:
* - infeasible weights;
* - local vs global optima in the discrete case. ]
*
*
* SCHED DOMAINS
*
* In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
* for all i,j solution, we create a tree of cpus that follows the hardware
* topology where each level pairs two lower groups (or better). This results
* in O(log n) layers. Furthermore we reduce the number of cpus going up the
* tree to only the first of the previous level and we decrease the frequency
* of load-balance at each level inversely proportional to the number of cpus
* in the groups.
*
* This yields:
*
*   log_2 n      1     n
*    \Sum     { --- * --- * 2^i } = O(n)  (5)
*    i = 0      2^i   2^i
*     |          |     |     `- size of each group
*     |          |     `- number of cpus doing load-balance
*     |          `- freq
*     `- sum over all levels
*
* Coupled with a limit on how many tasks we can migrate every balance pass,
* this makes (5) the runtime complexity of the balancer.
*
* An important property here is that each CPU is still (indirectly) connected
* to every other cpu in at most O(log n) steps:
*
* The adjacency matrix of the resulting graph is given by:
*
*           log_2 n
*   A_i,j = \Union  (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
*            k = 0
*
* And you'll find that:
*
* A^(log_2 n)_i,j != 0 for all i,j (7)
*
* Showing there's indeed a path between every cpu in at most O(log n) steps.
* The task movement gives a factor of O(m), giving a convergence complexity
* of:
*
* O(nm log n), n := nr_cpus, m := nr_tasks (8)
*
*
* WORK CONSERVING
*
* In order to avoid CPUs going idle while there's still work to do, new idle
* balancing is more aggressive and has the newly idle cpu iterate up the domain
* tree itself instead of relying on other CPUs to bring it work.
*
* This adds some complexity to both (5) and (8) but it reduces the total idle
* time.
*
* [XXX more?]
*
*
* CGROUPS
*
* Cgroups make a horror show out of (2); instead of a simple sum we get:
*
*                                s_k,i
*   W_i,0 = \Sum_j \Prod_k w_k * -----  (9)
*                                 S_k
*
* Where
*
* s_k,i = \Sum_j w_i,j,k and S_k = \Sum_i s_k,i (10)
*
* w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
*
* The big problem is S_k: it's a global sum needed to compute a local (W_i)
* property.
*
* [XXX write more on how we solve this.. _after_ merging pjt's patches that
* rewrite all of this once again.]
*/
static unsigned long __read_mostly max_load_balance_interval = HZ/10;
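
For readers who want to experiment with equations (2) through (4) from the BASICS section of the comment, here is a minimal user-space C sketch. It is not the kernel's implementation: struct cpu_load, SCALE and all function names are made up for illustration, cpu power is assumed to use a fixed-point scale where 1024 means one full cpu, and avg(W/P) is taken as the plain mean of the per-cpu ratios, which is only one possible reading of (4).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed fixed-point scale: power == SCALE means one full cpu. */
#define SCALE 1024UL

/* Hypothetical per-cpu state for illustration; not the kernel's struct rq. */
struct cpu_load {
	unsigned long avg;	/* W_i,n: decayed weight average, eq. (3) */
	unsigned long power;	/* P_i: compute capacity, in units of SCALE */
};

/* Equation (2): W_i,0 is the sum of the runnable task weights on one cpu. */
unsigned long instantaneous_weight(const unsigned long *w, size_t nr)
{
	unsigned long sum = 0;
	size_t j;

	for (j = 0; j < nr; j++)
		sum += w[j];
	return sum;
}

/* Equation (3): W' = ((2^n - 1) * W + W_0) / 2^n, in integer arithmetic. */
unsigned long decay_average(unsigned long avg, unsigned long inst,
			    unsigned int n)
{
	assert(n < 16);	/* arbitrary bound chosen here to keep the product small */
	return (avg * ((1UL << n) - 1) + inst) >> n;
}

/* W_i/P_i as a fixed-point value. */
unsigned long scaled_load(const struct cpu_load *c)
{
	assert(c->power > 0);
	return c->avg * SCALE / c->power;
}

/*
 * Equation (4): imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }.
 * Assumption: avg(W/P) is the mean of the per-cpu scaled loads.
 */
unsigned long imbalance(const struct cpu_load *cpus, size_t nr,
			size_t i, size_t j)
{
	unsigned long sum = 0, mean, hi, lo;
	size_t k;

	assert(nr > 0 && i < nr && j < nr);
	for (k = 0; k < nr; k++)
		sum += scaled_load(&cpus[k]);
	mean = sum / nr;

	hi = scaled_load(&cpus[i]);
	if (hi < mean)
		hi = mean;
	lo = scaled_load(&cpus[j]);
	if (lo > mean)
		lo = mean;
	return hi - lo;	/* hi >= mean >= lo, so this cannot underflow */
}
```

With two cpus of equal power and averaged weights 3072 and 1024, imbalance(cpus, 2, 0, 1) evaluates to 2048, while imbalance(cpus, 2, 1, 0) is 0, because cpu 1 already sits below the mean and cpu 0 above it.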