linux/kernel
Tejun Heo 4c16bd327c workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly.  Changing pool association may involve allocating new
pools which may fail.  To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node is executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this change the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control.  Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
    by Lai.

v3: The previous version incorrectly made a workqueue spanning
    multiple nodes spread work items over all online CPUs when some of
    its nodes don't have any desired cpus.  Reimplemented so that NUMA
    affinity is properly updated as CPUs go up and down.  This problem
    was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
    however, wq may be freed at any time after dfl_pwq is put making
    the clearing use-after-free.  Clear wq->dfl_pwq before putting it.

v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
    @pwq_tbl after success.  Fixed.

    Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
    application of new attrs is excluded via CPU hotplug.  Removed.

    Documentation on CPU affinity guarantee on CPU_DOWN added.

    All changes are suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
..
debug KGDB/KDB fixes and cleanups 2013-03-02 08:31:39 -08:00
events hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
gcov kernel/gcov: remove depends on CONFIG_EXPERIMENTAL 2013-01-11 11:39:33 -08:00
irq Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
power Driver core patches for 3.9-rc1 2013-02-21 12:05:51 -08:00
sched sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY 2013-03-19 13:45:20 -07:00
time Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2013-02-28 19:48:26 -08:00
trace ImgTec Meta architecture changes for v3.9-rc1 2013-03-03 12:06:09 -08:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking: Adjust spin lock inlining Kconfig options 2012-09-13 17:56:13 +02:00
Kconfig.preempt locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage 2012-03-23 13:18:57 +01:00
Makefile Merge branch 'akpm' (final batch from Andrew) 2013-02-27 20:58:09 -08:00
acct.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
async.c Merge branch 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2013-02-19 22:01:33 -08:00
audit.c kernel/audit.c: avoid negative sleep durations 2013-01-11 14:54:56 -08:00
audit.h audit: optimize audit_compare_dname_path 2012-10-12 00:32:02 -04:00
audit_tree.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
audit_watch.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
auditfilter.c audit: fix auditfilter.c kernel-doc warnings 2013-01-10 14:35:23 -08:00
auditsc.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
backtracetest.c
bounds.c memcg: remove direct page_cgroup-to-page pointer 2011-03-23 19:46:28 -07:00
capability.c userns: Teach inode_capable to understand inodes whose uids map to other namespaces. 2012-05-15 14:59:24 -07:00
cgroup.c sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY 2013-03-19 13:45:20 -07:00
cgroup_freezer.c cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() 2012-11-19 08:13:38 -08:00
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-02-23 18:50:11 -08:00
configs.c kernel/configs.c: include MODULE_*() when CONFIG_IKCONFIG_PROC=n 2011-07-25 20:57:15 -07:00
context_tracking.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 18:19:48 -08:00
cpu.c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 19:04:55 -08:00
cpu_pm.c kernel/cpu_pm.c: fix various typos 2012-05-31 17:49:27 -07:00
cpuset.c sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY 2013-03-19 13:45:20 -07:00
crash_dump.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
cred.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-12-18 10:55:28 -08:00
delayacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
dma.c Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
elfcore.c
exec_domain.c
exit.c coredump: use a freezable_schedule for the coredump_finish wait 2013-02-27 19:10:11 -08:00
extable.c extable: Skip sorting if sorted at build time. 2012-04-19 15:06:55 -07:00
fork.c fork: unshare: remove dead code 2013-02-27 19:10:12 -08:00
freezer.c freezer: change ptrace_stop/do_signal_stop to use freezable_schedule() 2012-10-26 14:27:49 -07:00
futex.c more file_inode() open-coded instances 2013-02-27 16:59:05 -05:00
futex_compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-02-23 18:50:11 -08:00
groups.c userns: Convert in_group_p and in_egroup_p to use kgid_t 2012-05-03 03:29:33 -07:00
hrtimer.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 19:05:45 -08:00
hung_task.c hung task debugging: Inject NMI when hung and going to panic 2012-04-25 12:39:25 +02:00
irq_work.c Merge branch 'nohz/printk-v8' into irq/core 2013-02-05 00:48:46 +01:00
itimer.c itimer: Use printk_once instead of WARN_ONCE 2012-04-10 11:00:30 +02:00
jump_label.c jump_label: Export jump_label_rate_limit() 2012-08-06 19:00:35 +03:00
kallsyms.c vsprintf: fix %ps on non symbols when using kallsyms 2012-05-29 16:22:32 -07:00
kcmp.c kcmp: include linux/ptrace.h 2012-12-20 17:40:19 -08:00
kexec.c kexec: avoid freeing NULL pointer in image_crash_alloc() 2013-02-27 19:10:12 -08:00
kmod.c Merge branch 'master' into for-3.9-async 2013-01-23 09:31:01 -08:00
kprobes.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ksysfs.c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-12-11 18:10:49 -08:00
kthread.c sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY 2013-03-19 13:45:20 -07:00
latencytop.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
lglock.c brlocks/lglocks: turn into functions 2012-05-29 23:28:41 -04:00
lockdep.c lockdep: check that no locks held at freeze time 2013-02-27 19:10:11 -08:00
lockdep_internals.h
lockdep_proc.c lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name() 2012-10-24 12:39:09 +02:00
lockdep_states.h
modsign_certificate.S MODSIGN: Avoid using .incbin in C source 2012-12-14 13:06:44 +10:30
modsign_pubkey.c keys: use keyring_alloc() to create module signing keyring 2012-12-20 17:40:21 -08:00
module-internal.h MODSIGN: Move the magic string to the end of a module and eliminate the search 2012-10-19 17:30:40 -07:00
module.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
module_signing.c MODSIGN: Don't use enum-type bitfields in module signature info block 2012-12-05 11:27:24 +10:30
mutex-debug.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mutex-debug.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
mutex.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
mutex.h mutex: Use p->on_cpu for the adaptive spin 2011-04-14 08:52:33 +02:00
notifier.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
nsproxy.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
padata.c padata: use __this_cpu_read per-cpu helper 2012-12-06 17:16:23 +08:00
panic.c taint: add explicit flag to show whether lock dep is still OK. 2013-01-21 17:17:57 +10:30
params.c params: replace printk(KERN_<LVL>...) with pr_<lvl>(...) 2012-05-04 17:28:18 -07:00
pid.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
pid_namespace.c pidns: Stop pid allocation when init dies 2012-12-25 16:10:05 -08:00
posix-cpu-timers.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 19:05:45 -08:00
posix-timers.c posix-timers: convert to idr_alloc() 2013-02-27 19:10:19 -08:00
printk.c Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2013-02-25 16:46:44 -08:00
profile.c profiling: Remove unused timer hook 2013-01-24 15:37:26 +01:00
ptrace.c uprobes: Add exports for module use 2013-02-08 17:47:13 +01:00
range.c range: fix bogus misuse of module.h to get printk() 2011-10-31 09:20:11 -04:00
rcu.h rcu: Provide RCU CPU stall warnings for tiny RCU 2013-01-28 22:06:21 -08:00
rcupdate.c Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutiny.c Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutiny_plugin.h rcu: Provide RCU CPU stall warnings for tiny RCU 2013-01-28 22:06:21 -08:00
rcutorture.c rcu: Allow rcutorture to be built at low optimization levels 2013-02-04 12:18:20 -08:00
rcutree.c Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutree.h Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutree_plugin.h rcu: Make rcu_nocb_poll an early_param instead of module_param 2013-01-08 14:12:19 -08:00
rcutree_trace.c rcu: Separate accounting of callbacks from callback-free CPUs 2012-11-16 10:05:57 -08:00
relay.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
res_counter.c res_counter: return amount of charges after res_counter_uncharge() 2012-12-18 15:02:12 -08:00
resource.c kernel/resource.c: fix stack overflow in __reserve_region_with_split() 2012-10-06 03:05:31 +09:00
rtmutex-debug.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex-debug.h
rtmutex-tester.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex.h
rtmutex_common.h
rwsem.c lockdep, rwsem: provide down_write_nest_lock() 2013-01-11 14:54:55 -08:00
seccomp.c seccomp: Make syscall skipping and nr changes more consistent 2012-10-02 21:14:29 +10:00
semaphore.c semaphore: fix improper comment reference to mutex 2012-04-05 17:15:55 -07:00
signal.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-03-02 19:32:06 -08:00
smp.c smp: make smp_call_function_many() use logic similar to smp_call_function_single() 2013-02-21 17:22:20 -08:00
smpboot.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
smpboot.h smpboot: Provide infrastructure for percpu hotplug threads 2012-08-13 17:01:07 +02:00
softirq.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-02-20 18:58:50 -08:00
spinlock.c locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage 2012-03-23 13:18:57 +01:00
srcu.c srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock() 2013-02-07 15:19:36 -08:00
stacktrace.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
stop_machine.c stop_machine: Use smpboot threads 2013-02-14 15:29:38 +01:00
sys.c usermodehelper: cleanup/fix __orderly_poweroff() && argv_free() 2013-02-27 19:10:09 -08:00
sys_ni.c module: add syscall to load module from fd 2012-12-14 13:05:22 +10:30
sysctl.c Initial ARC Linux port with some fixes on top for 3.9-rc1 2013-03-02 07:58:56 -08:00
sysctl_binary.c sysctl: fix null checking in bin_dn_node_address() 2013-02-27 19:10:21 -08:00
task_work.c task_work: task_work_add() should not succeed after exit_task_work() 2012-09-13 16:47:34 +02:00
taskstats.c taskstats: cgroupstats_user_cmd() may leak on error 2012-10-06 03:05:31 +09:00
test_kprobes.c
time.c time: don't inline EXPORT_SYMBOL functions 2013-02-21 17:22:19 -08:00
timeconst.bc kernel: Replace timeconst.pl with a bc script 2013-02-16 23:17:25 +01:00
timer.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 18:19:48 -08:00
tracepoint.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
tsacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
uid16.c userns: Convert setting and getting uid and gid system calls to use kuid and kgid 2012-05-03 03:28:41 -07:00
up.c kernel: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
user-return-notifier.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
user.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
user_namespace.c userns: Allow any uid or gid mappings that don't overlap. 2013-01-26 22:12:04 -08:00
utsname.c kernel/utsname.c: fix wrong comment about clone_uts_ns() 2013-02-27 19:10:22 -08:00
utsname_sysctl.c kernel/utsname_sysctl.c: put get/get_uts() into CONFIG_PROC_SYSCTL code block 2013-02-27 19:10:22 -08:00
wait.c propagate name change to comments in kernel source 2012-12-06 10:39:54 +01:00
watchdog.c Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-22 19:25:09 -08:00
workqueue.c workqueue: implement NUMA affinity for unbound workqueues 2013-04-01 11:23:36 -07:00
workqueue_internal.h workqueue: directly restore CPU affinity of workers from CPU_ONLINE 2013-03-19 13:45:21 -07:00