linux/kernel
Ken Chen 908a7c1b9b sched: fix improper load balance across sched domain
We recently discovered a nasty performance bug in the kernel CPU load
balancer that hit us with a 50% performance regression.

When tasks are assigned, via cpu affinity, to a subset of CPUs that
spans sched_domains (either ccNUMA node or the new multi-core domain),
the kernel fails to perform proper load balancing at these domains,
because several pieces of logic in find_busiest_group() misidentify
the busiest sched group within a given domain. This leads to
inadequate load balancing and causes a 50% performance hit.

To give you a concrete example, on a dual-core, 2-socket NUMA system,
there are 4 logical CPUs, organized as:

CPU0 attaching sched-domain:
 domain 0: span 0003  groups: 0001 0002
 domain 1: span 000f  groups: 0003 000c
CPU1 attaching sched-domain:
 domain 0: span 0003  groups: 0002 0001
 domain 1: span 000f  groups: 0003 000c
CPU2 attaching sched-domain:
 domain 0: span 000c  groups: 0004 0008
 domain 1: span 000f  groups: 000c 0003
CPU3 attaching sched-domain:
 domain 0: span 000c  groups: 0008 0004
 domain 1: span 000f  groups: 000c 0003
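
The span and groups values above are hexadecimal CPU bitmasks, where
bit n stands for CPU n: 0003 is cpu0 and cpu1, 000c is cpu2 and cpu3,
and 000f is all four. A minimal sketch for reading them follows; the
helper names mask_weight() and mask_has_cpu() are made up for this
illustration and are not kernel functions.

```c
#include <assert.h>

/* Hypothetical helper: number of CPUs set in a sched-domain bitmask. */
static int mask_weight(unsigned int mask)
{
    int n = 0;

    for (; mask != 0; mask >>= 1)
        n += (int)(mask & 1u);
    return n;
}

/* Hypothetical helper: nonzero if CPU number `cpu` is set in `mask`. */
static int mask_has_cpu(unsigned int mask, int cpu)
{
    assert(cpu >= 0 && cpu < 32);
    return (int)((mask >> cpu) & 1u);
}
```

For instance, mask_weight(0x000f) gives 4 for the top-level domain,
and mask_has_cpu(0x000c, 2) is true because cpu2 belongs to the
second socket's group.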

Suppose I run 2 tasks with CPU affinity set to 0x5.  There are
situations where cpu0 has a run queue length of 2 while cpu2 is idle.
The kernel load balancer is unable to spread these two tasks over
cpu0 and cpu2 because at least three pieces of logic in
find_busiest_group() heavily bias load balancing towards power saving
mode.  For example, while determining the "busiest" variable, the
kernel only sets it when "sum_nr_running > group_capacity".  This test
is flawed because "sum_nr_running" is not necessarily the same as the
sum of tasks allowed to run within the sched group.  The end result is
that the kernel "thinks" everything is balanced, when in reality there
is an imbalance that leaves one CPU over-subscribed and another idle.
Two other pieces of logic in the same function have a similar effect.
The nastiness of this bug is that the kernel is unable to get unstuck
from this broken state.  From what we've seen in our environment, the
kernel stays stuck in the imbalanced state for extended periods of
time, and it is also very easy for the kernel to get into that state
(it's pretty much 100% reproducible for us).
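
The arithmetic behind the stuck state can be sketched in a few lines.
This is a simplified model, not the kernel code: it assumes a group
capacity of one task per CPU and ignores the load comparisons that
find_busiest_group() also performs. The names count_cpus(),
old_check() and usable_capacity() are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: count the CPUs set in a bitmask. */
static int count_cpus(unsigned int mask)
{
    int n = 0;

    for (; mask != 0; mask >>= 1)
        n += (int)(mask & 1u);
    return n;
}

/* The quoted test in isolation: a group qualifies as busiest only
 * when it runs more tasks than its capacity. */
static int old_check(int sum_nr_running, int group_capacity)
{
    return sum_nr_running > group_capacity;
}

/* Assumed alternative view: capacity counted only over the CPUs in
 * the group that the affinity mask actually permits. */
static int usable_capacity(unsigned int group_mask, unsigned int affinity)
{
    assert(group_mask != 0);
    return count_cpus(group_mask & affinity);
}
```

With both tasks queued on cpu0, group 0003 has sum_nr_running = 2 and,
under the one-task-per-CPU assumption, a capacity of 2, so
old_check(2, 2) is false and the group is never picked. Affinity 0x5
leaves only cpu0 usable in that group, so usable_capacity(0x3, 0x5) is
1 and the same two tasks do exceed what the group can really run.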

So we propose the following fix: add additional logic in
find_busiest_group() to detect an intrinsic imbalance within the
busiest group.  When such a condition is detected, load balancing goes
into spread mode instead of the default grouping mode.
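
One way such detection could work is sketched below. It is an
illustration under stated assumptions, not the patch itself: it
compares the highest and lowest per-CPU load inside a group and flags
the group when they differ by more than one task's worth of load. The
constant TASK_LOAD and the functions group_is_imbalanced() and
pick_as_busiest() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed load contributed by one task; illustrative value only. */
#define TASK_LOAD 1024UL

/* Flag a group whose per-CPU loads differ by more than one task. */
static int group_is_imbalanced(const unsigned long *cpu_load, size_t ncpus)
{
    unsigned long max, min;
    size_t i;

    assert(cpu_load != NULL && ncpus > 0);
    max = min = cpu_load[0];
    for (i = 1; i < ncpus; i++) {
        if (cpu_load[i] > max)
            max = cpu_load[i];
        if (cpu_load[i] < min)
            min = cpu_load[i];
    }
    return (max - min) > TASK_LOAD;
}

/* Sketch of a relaxed selection rule: over capacity OR imbalanced. */
static int pick_as_busiest(unsigned long sum_nr_running,
                           unsigned long group_capacity,
                           int imbalanced)
{
    return sum_nr_running > group_capacity || imbalanced;
}
```

In the example above, group 0003 has two tasks on cpu0 and none on
cpu1, so its per-CPU loads are {2048, 0} under this model. The group
is flagged as imbalanced and becomes eligible as busiest even though
sum_nr_running does not exceed group_capacity.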

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17 16:55:11 +02:00
..
irq request_irq: fix DEBUG_SHIRQ handling 2007-08-31 01:42:23 -07:00
power hibernation doesn't even build on frv - tons of helpers are missing 2007-09-26 09:22:04 -07:00
time time: introduce xtime_seconds 2007-10-16 10:01:50 -07:00
.gitignore
Kconfig.hz
Kconfig.preempt [PATCH] sched: arch preempt notifier mechanism 2007-07-26 13:40:43 +02:00
Makefile user namespace: add the framework 2007-07-16 09:05:47 -07:00
acct.c Cleanup non-arch xtime uses, use get_seconds() or current_kernel_time(). 2007-07-25 10:09:20 -07:00
audit.c [NET]: make netlink user -> kernel interface synchronious 2007-10-10 21:15:29 -07:00
audit.h Audit: add TTY input auditing 2007-07-16 09:05:47 -07:00
auditfilter.c [PATCH] allow audit filtering on bit & operations 2007-07-22 09:57:02 -04:00
auditsc.c SUNRPC: Convert rpc_pipefs to use the generic filesystem notification hooks 2007-10-09 17:15:26 -04:00
capability.c
compat.c signal/timer/event: timerfd compat code 2007-05-11 08:29:36 -07:00
configs.c
cpu.c PM: Fix dependencies of CONFIG_SUSPEND and CONFIG_HIBERNATION 2007-08-31 01:42:22 -07:00
cpuset.c cpuset: remove sched domain hooks from cpusets 2007-10-16 09:43:09 -07:00
delayacct.c sched: clean up schedstats, cnt -> count 2007-10-15 17:00:12 +02:00
die_notifier.c
dma.c
exec_domain.c
exit.c sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields 2007-10-15 17:00:19 +02:00
extable.c
fork.c sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields 2007-10-15 17:00:19 +02:00
futex.c robust futex thread exit race 2007-10-01 07:52:23 -07:00
futex_compat.c robust futex thread exit race 2007-10-01 07:52:23 -07:00
hrtimer.c [KTIME]: Introduce ktime_sub_ns and ktime_sub_us 2007-10-10 16:48:12 -07:00
itimer.c
kallsyms.c kallsyms: make KSYM_NAME_LEN include space for trailing '\0' 2007-07-17 10:23:03 -07:00
kexec.c
kfifo.c is_power_of_2: kernel/kfifo.c 2007-07-16 09:05:50 -07:00
kmod.c Restore call_usermodehelper_pipe() behaviour 2007-09-11 17:21:20 -07:00
kprobes.c kprobes: support kretprobe blacklist 2007-10-16 09:43:10 -07:00
ksysfs.c sched: group scheduling, sysfs tunables 2007-10-15 17:00:14 +02:00
kthread.c kthread: silence bogus section mismatch warning 2007-07-31 15:39:42 -07:00
latency.c
lockdep.c lockdep: syscall exit check 2007-10-11 22:11:12 +02:00
lockdep_internals.h
lockdep_proc.c lockdep: Avoid /proc/lockdep & lock_stat infinite output 2007-10-11 22:11:11 +02:00
module.c Fix Off-by-one in /sys/module/*/refcnt 2007-08-22 14:35:35 -07:00
mutex-debug.c
mutex-debug.h
mutex.c lockdep: fixup mutex annotations 2007-10-11 22:11:12 +02:00
mutex.h
nsproxy.c [NET]: Add network namespace clone & unshare support. 2007-10-10 16:52:46 -07:00
panic.c Report that kernel is tainted if there was an OOPS 2007-07-17 10:23:02 -07:00
params.c modules: better error messages when modules fail to load due to a sysfs problem. 2007-07-30 14:25:23 -07:00
pid.c namespace: ensure clone_flags are always stored in an unsigned long 2007-07-16 09:05:48 -07:00
posix-cpu-timers.c sched: make posix-cpu-timers use CFS's accounting information 2007-07-09 18:51:58 +02:00
posix-timers.c more low-hanging fruits - kernel, fs, lib signedness 2007-10-14 12:41:52 -07:00
printk.c slow down printk during boot 2007-10-16 09:42:49 -07:00
profile.c Memoryless nodes: Allow profiling data to fall back to other nodes 2007-10-16 09:42:58 -07:00
ptrace.c m32r: convert to generic sys_ptrace 2007-10-16 09:43:04 -07:00
rcupdate.c lockdep: annotate rcu_read_{,un}lock{,_bh} 2007-10-11 22:11:12 +02:00
rcutorture.c Freezer: make kernel threads nonfreezable by default 2007-07-17 10:23:02 -07:00
relay.c Fix a use after free bug in kernel->userspace relay file support 2007-07-31 15:39:42 -07:00
resource.c memory unplug: memory hotplug cleanup 2007-10-16 09:43:01 -07:00
rtmutex-debug.c FUTEX: Tidy up the code 2007-07-16 09:05:49 -07:00
rtmutex-debug.h
rtmutex-tester.c Freezer: make kernel threads nonfreezable by default 2007-07-17 10:23:02 -07:00
rtmutex.c FUTEX: Tidy up the code 2007-07-16 09:05:49 -07:00
rtmutex.h
rtmutex_common.h FUTEX: Tidy up the code 2007-07-16 09:05:49 -07:00
rwsem.c lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex 2007-07-19 10:04:49 -07:00
sched.c sched: fix improper load balance across sched domain 2007-10-17 16:55:11 +02:00
sched_debug.c Make scheduler debug file operations const 2007-10-15 17:00:19 +02:00
sched_fair.c sched: reintroduce cache-hot affinity 2007-10-15 17:00:18 +02:00
sched_idletask.c sched: mark scheduling classes as const 2007-10-15 17:00:12 +02:00
sched_rt.c sched: tidy up SCHED_RR 2007-10-15 17:00:13 +02:00
sched_stats.h sched: clean up schedstats, cnt -> count 2007-10-15 17:00:12 +02:00
seccomp.c make seccomp zerocost in schedule 2007-07-16 09:05:50 -07:00
signal.c fix bogus reporting of signals by audit 2007-10-07 16:28:43 -07:00
softirq.c [KERNEL]: Unexport raise_softirq_irqoff 2007-10-10 16:49:18 -07:00
softlockup.c Freezer: make kernel threads nonfreezable by default 2007-07-17 10:23:02 -07:00
spinlock.c lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex 2007-07-19 10:04:49 -07:00
srcu.c
stacktrace.c
stop_machine.c Fix stop_machine_run problem with naughty real time process 2007-07-16 09:05:41 -07:00
sys.c Fix SMP poweroff hangs 2007-10-01 07:52:23 -07:00
sys_ni.c diskquota: 32bit quota tools on 64bit architectures 2007-07-16 09:05:48 -07:00
sysctl.c hugetlb: Add hugetlb_dynamic_pool sysctl 2007-10-16 09:43:02 -07:00
taskstats.c taskstats: add context-switch counters 2007-07-16 09:05:46 -07:00
time.c time: introduce xtime_seconds 2007-10-16 10:01:50 -07:00
timer.c Pull ia64-clocksource into release branch 2007-07-20 11:26:47 -07:00
tsacct.c Cleanup non-arch xtime uses, use get_seconds() or current_kernel_time(). 2007-07-25 10:09:20 -07:00
uid16.c
user.c sched: generate uevents for user creation/destruction 2007-10-15 17:00:18 +02:00
user_namespace.c Fix user namespace exiting OOPs 2007-09-19 11:24:18 -07:00
utsname.c Fix UTS corruption during clone(CLONE_NEWUTS) 2007-09-19 11:24:17 -07:00
utsname_sysctl.c remove CONFIG_UTS_NS and CONFIG_IPC_NS 2007-07-16 09:05:47 -07:00
wait.c
workqueue.c fix bogus hotplug cpu warning 2007-08-27 10:27:48 -07:00