linux/kernel
Gautham R Shenoy 8558f8f816 rcu: fix hotplug vs rcu race
Dhaval Giani reported this warning during cpu hotplug stress-tests:

| On running kernel compiles in parallel with cpu hotplug:
|
| WARNING: at arch/x86/kernel/smp.c:118
| native_smp_send_reschedule+0x21/0x36()
| Modules linked in:
| Pid: 27483, comm: cc1 Not tainted 2.6.26-rc7 #1
| [...]
|  [<c0110355>] native_smp_send_reschedule+0x21/0x36
|  [<c014fe8f>] force_quiescent_state+0x47/0x57
|  [<c014fef0>] call_rcu+0x51/0x6d
|  [<c01713b3>] __fput+0x130/0x158
|  [<c0171231>] fput+0x17/0x19
|  [<c016fd99>] filp_close+0x4d/0x57
|  [<c016fdff>] sys_close+0x5c/0x97

IMHO the warning is a spurious one.

cpu_online_map is updated by _cpu_down() using stop_machine_run().
Since force_quiescent_state() is invoked from an irqs-disabled section,
stop_machine_run() won't be executing while a cpu is executing
force_quiescent_state(). Hence cpu_online_map is stable while we're
in the irqs-disabled section.

However, a cpu might have been offlined _just_ before we disabled irqs
while entering force_quiescent_state(), and the rcu subsystem might not
yet have handled the CPU_DEAD notification, leaving the offlined cpu's
bit set in rcp->cpumask.

Hence use cpumask = (rcp->cpumask & cpu_online_map) to prevent sending
smp_send_reschedule() to an offlined CPU.
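
The masking step can be illustrated outside the kernel. The following is
a minimal userspace sketch, not the patch itself: cpu_bits_t,
safe_resched_targets() and cpu_in_mask() are hypothetical names, and a
plain 32-bit integer stands in for cpumask_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: bit n set means CPU n (max 32 CPUs). */
typedef uint32_t cpu_bits_t;

/*
 * Return the CPUs that may safely receive a reschedule IPI: those still
 * pending in the RCU batch mask that are also marked online.
 */
static cpu_bits_t safe_resched_targets(cpu_bits_t rcp_cpumask,
				       cpu_bits_t online_map)
{
	return rcp_cpumask & online_map;
}

/* Test whether a given CPU is a member of a mask. */
static int cpu_in_mask(unsigned int cpu, cpu_bits_t mask)
{
	assert(cpu < 32);	/* illustrative limit of this sketch */
	return (int)((mask >> cpu) & 1u);
}
```

With CPUs 0-3 pending in the batch mask and CPU 2 already offlined, the
intersection drops CPU 2, so no reschedule would be sent to it.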

Here's the timeline:

CPU_A						 CPU_B
--------------------------------------------------------------
cpu_down():					.
.					   	.
.						.
stop_machine(): /* disables preemption,		.
		 * and irqs */			.
.						.
.						.
take_cpu_down();				.
.						.
.						.
.						.
cpu_disable(); /*this removes cpu 		.
		*from cpu_online_map 		.
		*/				.
.						.
.						.
restart_machine(); /* enables irqs */		.
------WINDOW DURING WHICH rcp->cpumask is stale ---------------
.						call_rcu();
.						/* disables irqs here */
.						.force_quiescent_state();
.CPU_DEAD:					.for_each_cpu(rcp->cpumask)
.						.   smp_send_reschedule();
.						.
.						.   WARN_ON() for offlined CPU!
.
.
.
rcu_cpu_notify:
.
-------- WINDOW ENDS ------------------------------------------
rcu_offline_cpu() /* Which calls cpu_quiet()
		   * which removes
		   * cpu from rcp->cpumask.
		   */
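
The window in the timeline can be modelled as two bitmasks that are
updated at different times. This is a simplified userspace sketch under
that assumption; struct hotplug_model and the model_*() helpers are
hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t cpu_bits_t;	/* hypothetical stand-in for cpumask_t */

struct hotplug_model {
	cpu_bits_t online_map;	/* models cpu_online_map */
	cpu_bits_t rcp_cpumask;	/* models rcp->cpumask */
};

/* Models cpu_disable(): only the online map is updated at this point. */
static void model_cpu_disable(struct hotplug_model *m, unsigned int cpu)
{
	assert(cpu < 32);
	m->online_map &= ~((cpu_bits_t)1 << cpu);
}

/* Models the later CPU_DEAD handling, where the RCU mask catches up. */
static void model_rcu_offline_cpu(struct hotplug_model *m, unsigned int cpu)
{
	assert(cpu < 32);
	m->rcp_cpumask &= ~((cpu_bits_t)1 << cpu);
}

/* Offlined CPUs that RCU still lists as pending: nonzero inside the window. */
static cpu_bits_t model_stale_cpus(const struct hotplug_model *m)
{
	return m->rcp_cpumask & ~m->online_map;
}
```

Between model_cpu_disable() and model_rcu_offline_cpu(), the stale set
is nonzero; that interval corresponds to the window marked above.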

If a new batch was started just before calling stop_machine_run(), the
"to-be-offlined" cpu is still present in rcp->cpumask.

During a cpu-offline, from take_cpu_down(), we queue an rt-prio idle
task as the next task to be picked by the scheduler. We also call
cpu_disable() which will disable any further interrupts and remove the
cpu's bit from the cpu_online_map.

Once the stop_machine_run() successfully calls take_cpu_down(), it calls
schedule(). That's the last time a schedule is called on the offlined
cpu, and hence the last time when rdp->passed_quiesc will be set to 1
through rcu_qsctr_inc().

But cpu_quiet() will be called on this cpu only when the next
RCU_SOFTIRQ occurs on this CPU. So at this time, the offlined CPU
is still set in rcp->cpumask.

Now coming back to the idle_task which truly offlines the CPU, it does
check for a pending RCU and raises the softirq, since it will find
rdp->passed_quiesc to be 0 in this case. However, since the cpu is
offline, I am not sure if the softirq will trigger on the CPU.

Even if it doesn't, rcu_offline_cpu() will find that rcp->completed
is not the same as rcp->cur, which means that our cpu could be holding
up the grace period progression. Hence we call cpu_quiet() and move
ahead.
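
The check described above can be sketched in userspace as follows. This
is a simplified model, not the kernel code: struct ctrlblk_model and the
model_*() helpers are hypothetical, and the rule that a batch completes
once the mask is empty is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t cpu_bits_t;	/* hypothetical stand-in for cpumask_t */

struct ctrlblk_model {
	long cur;		/* models rcp->cur */
	long completed;		/* models rcp->completed */
	cpu_bits_t cpumask;	/* models rcp->cpumask */
};

/* Models cpu_quiet(): remove the cpu from the pending mask. */
static void model_cpu_quiet(struct ctrlblk_model *rcp, unsigned int cpu)
{
	assert(cpu < 32);
	rcp->cpumask &= ~((cpu_bits_t)1 << cpu);
	if (rcp->cpumask == 0)
		rcp->completed = rcp->cur;	/* assumption: batch is done */
}

/* Models the offline path: only act if a grace period is in flight. */
static void model_offline_check(struct ctrlblk_model *rcp, unsigned int cpu)
{
	if (rcp->completed != rcp->cur)
		model_cpu_quiet(rcp, cpu);
}
```

When completed already equals cur, no batch is outstanding and the mask
is left untouched; otherwise the offlined cpu is removed so that it
cannot hold up the grace period.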

But because of the window explained in the timeline, we could still have
a call_rcu() before the RCU subsystem executes its CPU_DEAD
notification, and we send smp_send_reschedule() to an offlined cpu while
trying to force the quiescent states. The appended patch adds comments
and avoids checking for an offlined cpu every time.

Reported-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russel <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-01 09:27:17 +02:00
..
irq genirq: reenable a nobody cared disabled irq when a new driver arrives 2008-05-02 13:40:34 +02:00
power Merge branches 'release', 'acpica', 'bugzilla-10224', 'bugzilla-9772', 'bugzilla-9916', 'ec', 'eeepc', 'idle', 'misc', 'pm-legacy', 'sysfs-links-2.6.26', 'thermal', 'thinkpad' and 'video' into release 2008-04-30 13:58:00 -04:00
time clocksource: allow read access to available/current_clocksource 2008-05-03 18:11:48 +02:00
.gitignore
acct.c
audit_tree.c [PATCH] list_for_each_rcu must die: audit 2008-05-17 03:30:23 -04:00
audit.c [patch 1/1] audit_send_reply(): fix error-path memory leak 2008-05-17 03:30:22 -04:00
audit.h
auditfilter.c Merge branch 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current 2008-04-29 11:41:22 -07:00
auditsc.c
backtracetest.c
bounds.c
capability.c capabilities: remain source compatible with 32-bit raw legacy capability support. 2008-05-31 16:36:16 -07:00
cgroup_debug.c
cgroup.c cgroups: remove node_ prefix_from ns subsystem 2008-05-24 09:56:14 -07:00
compat.c ntp: support for TAI 2008-05-01 08:03:59 -07:00
configs.c
cpu.c kernel: replace remaining __FUNCTION__ occurrences 2008-04-30 08:29:54 -07:00
cpuset.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2008-06-20 12:37:13 -07:00
delayacct.c
dma.c
exec_domain.c
exit.c signals: fix sigqueue_free() vs __exit_signal() race 2008-05-24 09:56:10 -07:00
extable.c
fork.c [PATCH] dup_fd() fixes, part 1 2008-05-16 17:22:26 -04:00
futex_compat.c
futex.c futexes: fix fault handling in futex_lock_pi 2008-06-23 13:31:15 +02:00
hrtimer.c hrtimer: remove duplicate helper function 2008-05-03 18:11:48 +02:00
itimer.c
kallsyms.c
Kconfig.hz
Kconfig.preempt
kexec.c kexec: make extended crashkernel= syntax less confusing 2008-05-01 08:04:00 -07:00
kfifo.c
kgdb.c kgdb: use common ascii helpers and put_unaligned_be32 helper 2008-05-28 12:49:56 -05:00
kmod.c [PATCH] split linux/file.h 2008-05-01 13:08:16 -04:00
kprobes.c kprobes: fix error checking of batch registration 2008-06-12 18:05:40 -07:00
ksysfs.c
kthread.c Deprecate find_task_by_pid() 2008-04-30 08:29:48 -07:00
latencytop.c
lockdep_internals.h
lockdep_proc.c
lockdep.c Subject: lockdep: include all lock classes in all_lock_classes 2008-02-25 23:03:02 +01:00
Makefile sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2008-05-05 23:56:18 +02:00
marker.c make marker_debug static 2008-04-30 08:29:49 -07:00
module.c modules: proper cleanup of kobject without CONFIG_SYSFS 2008-05-23 13:09:33 +10:00
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
notifier.c
ns_cgroup.c
nsproxy.c
panic.c
params.c
pid_namespace.c pidns: make pid->level and pid_ns->level unsigned 2008-04-30 08:29:49 -07:00
pid.c pids: introduce change_pid() helper 2008-04-30 08:29:48 -07:00
pm_qos_params.c
posix-cpu-timers.c remove div_long_long_rem 2008-05-01 08:03:58 -07:00
posix-timers.c signals: join send_sigqueue() with send_group_sigqueue() 2008-04-30 08:29:36 -07:00
printk.c printk: don't read beyond string arguments' terminating zero 2008-04-30 08:29:52 -07:00
profile.c
ptrace.c make generic sys_ptrace unconditional 2008-05-01 10:21:54 -07:00
rcuclassic.c rcu: fix hotplug vs rcu race 2008-07-01 09:27:17 +02:00
rcupdate.c
rcupreempt_trace.c
rcupreempt.c rcupreempt: remove export of rcu_batches_completed_bh 2008-06-19 09:45:37 +02:00
rcutorture.c
relay.c splice: fix sendfile() issue with relay 2008-05-28 14:49:27 +02:00
res_counter.c
resource.c
rtmutex_common.h
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c
rtmutex.h
rwsem.c
sched_clock.c sched: fix sched_clock_cpu() 2008-05-29 11:29:19 +02:00
sched_debug.c revert ("sched: fair-group: SMP-nice for group scheduling") 2008-05-29 11:28:57 +02:00
sched_fair.c sched: stop wake_affine from causing serious imbalance 2008-05-29 11:29:20 +02:00
sched_features.h
sched_idletask.c sched: make rt_sched_class, idle_sched_class static 2008-05-05 23:56:17 +02:00
sched_rt.c sched: rt-group: fix RR buglet 2008-06-19 09:06:59 +02:00
sched_stats.h sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue 2008-06-19 14:15:28 +02:00
sched.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2008-06-20 12:37:13 -07:00
seccomp.c
semaphore.c Revert "semaphore: fix" 2008-05-10 20:43:22 -07:00
signal.c posix timers: discard SI_TIMER signals on exec 2008-05-26 10:37:07 -07:00
softirq.c Fix cpu hotplug problem in softirq code 2008-05-01 08:03:58 -07:00
softlockup.c softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression 2008-06-19 09:45:38 +02:00
spinlock.c
srcu.c
stacktrace.c
stop_machine.c stop_machine: make stop_machine_run more virtualization friendly 2008-05-23 13:09:34 +10:00
sys_ni.c
sys.c sys_prctl(): fix return of uninitialized value 2008-05-24 09:56:13 -07:00
sysctl_check.c
sysctl.c [PATCH] avoid multiplication overflows and signedness issues for max_fds 2008-05-16 17:22:52 -04:00
taskstats.c Use find_task_by_vpid in taskstats 2008-04-30 08:29:48 -07:00
test_kprobes.c
time.c Make constants in kernel/timeconst.h fixed 64 bits 2008-05-02 16:18:42 -07:00
timeconst.pl Make constants in kernel/timeconst.h fixed 64 bits 2008-05-02 16:18:42 -07:00
timer.c debugobjects: add timer specific object debugging code 2008-04-30 08:29:53 -07:00
tsacct.c
uid16.c
user_namespace.c eCryptfs: make key module subsystem respect namespaces 2008-04-29 08:06:07 -07:00
user.c alloc_uid: cleanup 2008-04-30 08:29:53 -07:00
utsname_sysctl.c
utsname.c kernel: explicitly include required header files under kernel/ 2008-04-29 08:06:04 -07:00
wait.c
workqueue.c workqueue: remove redundant function invocation 2008-05-01 08:04:02 -07:00