From bb7e5ce7dde6f42e7793bd6cf4b1eb71a20145aa Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 7 Aug 2017 17:23:20 -0700 Subject: [PATCH 01/24] documentation: RCU grace-period memory ordering guarantees This commit provides text and diagrams showing how Tree RCU implements its grace-period memory ordering guarantees. Signed-off-by: Paul E. McKenney Cc: Jonathan Corbet Cc: Linus Torvalds --- .../Memory-Ordering/Tree-RCU-Diagram.html | 9 + .../Tree-RCU-Memory-Ordering.html | 707 +++ .../TreeRCU-callback-invocation.svg | 486 ++ .../TreeRCU-callback-registry.svg | 655 +++ .../Memory-Ordering/TreeRCU-dyntick.svg | 700 +++ .../Memory-Ordering/TreeRCU-gp-cleanup.svg | 1126 ++++ .../Design/Memory-Ordering/TreeRCU-gp-fqs.svg | 1309 +++++ .../Memory-Ordering/TreeRCU-gp-init-1.svg | 656 +++ .../Memory-Ordering/TreeRCU-gp-init-2.svg | 656 +++ .../Memory-Ordering/TreeRCU-gp-init-3.svg | 632 ++ .../RCU/Design/Memory-Ordering/TreeRCU-gp.svg | 5135 +++++++++++++++++ .../Memory-Ordering/TreeRCU-hotplug.svg | 775 +++ .../RCU/Design/Memory-Ordering/TreeRCU-qs.svg | 1095 ++++ .../Design/Memory-Ordering/rcu_node-lock.svg | 229 + 14 files changed, 14170 insertions(+) create mode 100644 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html create mode 100644 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html new file mode 100644 index 000000000000..e5b42a798ff3 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html @@ -0,0 +1,9 @@ + + + A Diagram of TREE_RCU's Grace-Period Memory Ordering + + +

TreeRCU-gp.svg + + diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html new file mode 100644 index 000000000000..8651b0b4fd79 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html @@ -0,0 +1,707 @@ + + + A Tour Through TREE_RCU's Grace-Period Memory Ordering + + +

August 8, 2017

+

This article was contributed by Paul E. McKenney

+ +

Introduction

+ +

This document gives a rough visual overview of how Tree RCU's +grace-period memory ordering guarantee is provided. + +

    +
  1. + What Is Tree RCU's Grace Period Memory Ordering Guarantee? +
  2. + Tree RCU Grace Period Memory Ordering Building Blocks +
  3. + Tree RCU Grace Period Memory Ordering Components +
  4. Putting It All Together +
+ +

+What Is Tree RCU's Grace Period Memory Ordering Guarantee?

+ +

RCU grace periods provide extremely strong memory-ordering guarantees +for non-idle non-offline code. +Any code that happens after the end of a given RCU grace period is guaranteed +to see the effects of all accesses prior to the beginning of that grace +period that are within RCU read-side critical sections. +Similarly, any code that happens before the beginning of a given RCU grace +period is guaranteed to see the effects of all accesses following the end +of that grace period that are within RCU read-side critical sections. + +

This guarantee is particularly pervasive for synchronize_sched(), +for which RCU-sched read-side critical sections include any region +of code for which preemption is disabled. +Given that each individual machine instruction can be thought of as +an extremely small region of preemption-disabled code, one can think of +synchronize_sched() as smp_mb() on steroids. + +

RCU updaters use this guarantee by splitting their updates into +two phases, one of which is executed before the grace period and +the other of which is executed after the grace period. +In the most common use case, phase one removes an element from +a linked RCU-protected data structure, and phase two frees that element. +For this to work, any readers that have witnessed state prior to the +phase-one update (in the common case, removal) must not witness state +following the phase-two update (in the common case, freeing). + +

The RCU implementation provides this guarantee using a network +of lock-based critical sections, memory barriers, and per-CPU +processing, as is described in the following sections. + +

+Tree RCU Grace Period Memory Ordering Building Blocks

+ +

The workhorse for RCU's grace-period memory ordering is the +critical section for the rcu_node structure's +->lock. +These critical sections use helper functions for lock acquisition, including +raw_spin_lock_rcu_node(), +raw_spin_lock_irq_rcu_node(), and +raw_spin_lock_irqsave_rcu_node(). +Their lock-release counterparts are +raw_spin_unlock_rcu_node(), +raw_spin_unlock_irq_rcu_node(), and +raw_spin_unlock_irqrestore_rcu_node(), +respectively. +For completeness, a +raw_spin_trylock_rcu_node() +is also provided. +The key point is that the lock-acquisition functions, including +raw_spin_trylock_rcu_node(), all invoke +smp_mb__after_unlock_lock() immediately after successful +acquisition of the lock. + +

Therefore, for any given rcu_node struction, any access +happening before one of the above lock-release functions will be seen +by all CPUs as happening before any access happening after a later +one of the above lock-acquisition functions. +Furthermore, any access happening before one of the +above lock-release function on any given CPU will be seen by all +CPUs as happening before any access happening after a later one +of the above lock-acquisition functions executing on that same CPU, +even if the lock-release and lock-acquisition functions are operating +on different rcu_node structures. +Tree RCU uses these two ordering guarantees to form an ordering +network among all CPUs that were in any way involved in the grace +period, including any CPUs that came online or went offline during +the grace period in question. + +

The following litmus test exhibits the ordering effects of these +lock-acquisition and lock-release functions: + +

+ 1 int x, y, z;
+ 2
+ 3 void task0(void)
+ 4 {
+ 5   raw_spin_lock_rcu_node(rnp);
+ 6   WRITE_ONCE(x, 1);
+ 7   r1 = READ_ONCE(y);
+ 8   raw_spin_unlock_rcu_node(rnp);
+ 9 }
+10
+11 void task1(void)
+12 {
+13   raw_spin_lock_rcu_node(rnp);
+14   WRITE_ONCE(y, 1);
+15   r2 = READ_ONCE(z);
+16   raw_spin_unlock_rcu_node(rnp);
+17 }
+18
+19 void task2(void)
+20 {
+21   WRITE_ONCE(z, 1);
+22   smp_mb();
+23   r3 = READ_ONCE(x);
+24 }
+25
+26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
+
+ +

The WARN_ON() is evaluated at “the end of time”, +after all changes have propagated throughout the system. +Without the smp_mb__after_unlock_lock() provided by the +acquisition functions, this WARN_ON() could trigger, for example +on PowerPC. +The smp_mb__after_unlock_lock() invocations prevent this +WARN_ON() from triggering. + +

This approach must be extended to include idle CPUs, which need +RCU's grace-period memory ordering guarantee to extend to any +RCU read-side critical sections preceding and following the current +idle sojourn. +This case is handled by calls to the strongly ordered +atomic_add_return() read-modify-write atomic operation that +is invoked within rcu_dynticks_eqs_enter() at idle-entry +time and within rcu_dynticks_eqs_exit() at idle-exit time. +The grace-period kthread invokes rcu_dynticks_snap() and +rcu_dynticks_in_eqs_since() (both of which invoke +an atomic_add_return() of zero) to detect idle CPUs. + + + + + + + + +
 
Quick Quiz:
+ But what about CPUs that remain offline for the entire + grace period? +
Answer:
+ Such CPUs will be offline at the beginning of the grace period, + so the grace period won't expect quiescent states from them. + Races between grace-period start and CPU-hotplug operations + are mediated by the CPU's leaf rcu_node structure's + ->lock as described above. +
 
+ +

The approach must be extended to handle one final case, that +of waking a task blocked in synchronize_rcu(). +This task might be affinitied to a CPU that is not yet aware that +the grace period has ended, and thus might not yet be subject to +the grace period's memory ordering. +Therefore, there is an smp_mb() after the return from +wait_for_completion() in the synchronize_rcu() +code path. + + + + + + + + +
 
Quick Quiz:
+ What? Where??? + I don't see any smp_mb() after the return from + wait_for_completion()!!! +
Answer:
+ That would be because I spotted the need for that + smp_mb() during the creation of this documentation, + and it is therefore unlikely to hit mainline before v4.14. + Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and + Jonathan Cameron for asking questions that sensitized me + to the rather elaborate sequence of events that demonstrate + the need for this memory barrier. +
 
+ +

Tree RCU's grace--period memory-ordering guarantees rely most +heavily on the rcu_node structure's ->lock +field, so much so that it is necessary to abbreviate this pattern +in the diagrams in the next section. +For example, consider the rcu_prepare_for_idle() function +shown below, which is one of several functions that enforce ordering +of newly arrived RCU callbacks against future grace periods: + +

+ 1 static void rcu_prepare_for_idle(void)
+ 2 {
+ 3   bool needwake;
+ 4   struct rcu_data *rdp;
+ 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ 6   struct rcu_node *rnp;
+ 7   struct rcu_state *rsp;
+ 8   int tne;
+ 9
+10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+11       rcu_is_nocb_cpu(smp_processor_id()))
+12     return;
+13   tne = READ_ONCE(tick_nohz_active);
+14   if (tne != rdtp->tick_nohz_enabled_snap) {
+15     if (rcu_cpu_has_callbacks(NULL))
+16       invoke_rcu_core();
+17     rdtp->tick_nohz_enabled_snap = tne;
+18     return;
+19   }
+20   if (!tne)
+21     return;
+22   if (rdtp->all_lazy &&
+23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
+24     rdtp->all_lazy = false;
+25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
+26     invoke_rcu_core();
+27     return;
+28   }
+29   if (rdtp->last_accelerate == jiffies)
+30     return;
+31   rdtp->last_accelerate = jiffies;
+32   for_each_rcu_flavor(rsp) {
+33     rdp = this_cpu_ptr(rsp->rda);
+34     if (rcu_segcblist_pend_cbs(&rdp->cblist))
+35       continue;
+36     rnp = rdp->mynode;
+37     raw_spin_lock_rcu_node(rnp);
+38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+39     raw_spin_unlock_rcu_node(rnp);
+40     if (needwake)
+41       rcu_gp_kthread_wake(rsp);
+42   }
+43 }
+
+ +

But the only part of rcu_prepare_for_idle() that really +matters for this discussion are lines 37–39. +We will therefore abbreviate this function as follows: + +

rcu_node-lock.svg + +

The box represents the rcu_node structure's ->lock +critical section, with the double line on top representing the additional +smp_mb__after_unlock_lock(). + +

+Tree RCU Grace Period Memory Ordering Components

+ +

Tree RCU's grace-period memory-ordering guarantee is provided by +a number of RCU components: + +

    +
  1. Callback Registry +
  2. Grace-Period Initialization +
  3. + Self-Reported Quiescent States +
  4. Dynamic Tick Interface +
  5. CPU-Hotplug Interface +
  6. Forcing Quiescent States +
  7. Grace-Period Cleanup +
  8. Callback Invocation +
+ +

Each of the following section looks at the corresponding component +in detail. + +

Callback Registry

+ +

If RCU's grace-period guarantee is to mean anything at all, any +access that happens before a given invocation of call_rcu() +must also happen before the corresponding grace period. +The implementation of this portion of RCU's grace period guarantee +is shown in the following figure: + +

TreeRCU-callback-registry.svg + +

Because call_rcu() normally acts only on CPU-local state, +it provides no ordering guarantees, either for itself or for +phase one of the update (which again will usually be removal of +an element from an RCU-protected data structure). +It simply enqueues the rcu_head structure on a per-CPU list, +which cannot become associated with a grace period until a later +call to rcu_accelerate_cbs(), as shown in the diagram above. + +

One set of code paths shown on the left invokes +rcu_accelerate_cbs() via +note_gp_changes(), either directly from call_rcu() (if +the current CPU is inundated with queued rcu_head structures) +or more likely from an RCU_SOFTIRQ handler. +Another code path in the middle is taken only in kernels built with +CONFIG_RCU_FAST_NO_HZ=y, which invokes +rcu_accelerate_cbs() via rcu_prepare_for_idle(). +The final code path on the right is taken only in kernels built with +CONFIG_HOTPLUG_CPU=y, which invokes +rcu_accelerate_cbs() via +rcu_advance_cbs(), rcu_migrate_callbacks, +rcutree_migrate_callbacks(), and takedown_cpu(), +which in turn is invoked on a surviving CPU after the outgoing +CPU has been completely offlined. + +

There are a few other code paths within grace-period processing +that opportunistically invoke rcu_accelerate_cbs(). +However, either way, all of the CPU's recently queued rcu_head +structures are associated with a future grace-period number under +the protection of the CPU's lead rcu_node structure's +->lock. +In all cases, there is full ordering against any prior critical section +for that same rcu_node structure's ->lock, and +also full ordering against any of the current task's or CPU's prior critical +sections for any rcu_node structure's ->lock. + +

The next section will show how this ordering ensures that any +accesses prior to the call_rcu() (particularly including phase +one of the update) +happen before the start of the corresponding grace period. + + + + + + + + +
 
Quick Quiz:
+ But what about synchronize_rcu()? +
Answer:
+ The synchronize_rcu() passes call_rcu() + to wait_rcu_gp(), which invokes it. + So either way, it eventually comes down to call_rcu(). +
 
+ +

Grace-Period Initialization

+ +

Grace-period initialization is carried out by +the grace-period kernel thread, which makes several passes over the +rcu_node tree within the rcu_gp_init() function. +This means that showing the full flow of ordering through the +grace-period computation will require duplicating this tree. +If you find this confusing, please note that the state of the +rcu_node changes over time, just like Heraclitus's river. +However, to keep the rcu_node river tractable, the +grace-period kernel thread's traversals are presented in multiple +parts, starting in this section with the various phases of +grace-period initialization. + +

The first ordering-related grace-period initialization action is to +increment the rcu_state structure's ->gpnum +grace-period-number counter, as shown below: + +

TreeRCU-gp-init-1.svg + +

The actual increment is carried out using smp_store_release(), +which helps reject false-positive RCU CPU stall detection. +Note that only the root rcu_node structure is touched. + +

The first pass through the rcu_node tree updates bitmasks +based on CPUs having come online or gone offline since the start of +the previous grace period. +In the common case where the number of online CPUs for this rcu_node +structure has not transitioned to or from zero, +this pass will scan only the leaf rcu_node structures. +However, if the number of online CPUs for a given leaf rcu_node +structure has transitioned from zero, +rcu_init_new_rnp() will be invoked for the first incoming CPU. +Similarly, if the number of online CPUs for a given leaf rcu_node +structure has transitioned to zero, +rcu_cleanup_dead_rnp() will be invoked for the last outgoing CPU. +The diagram below shows the path of ordering if the leftmost +rcu_node structure onlines its first CPU and if the next +rcu_node structure has no online CPUs +(or, alternatively if the leftmost rcu_node structure offlines +its last CPU and if the next rcu_node structure has no online CPUs). + +

TreeRCU-gp-init-1.svg + +

The final rcu_gp_init() pass through the rcu_node +tree traverses breadth-first, setting each rcu_node structure's +->gpnum field to the newly incremented value from the +rcu_state structure, as shown in the following diagram. + +

TreeRCU-gp-init-1.svg + +

This change will also cause each CPU's next call to +__note_gp_changes() +to notice that a new grace period has started, as described in the next +section. +But because the grace-period kthread started the grace period at the +root (with the increment of the rcu_state structure's +->gpnum field) before setting each leaf rcu_node +structure's ->gpnum field, each CPU's observation of +the start of the grace period will happen after the actual start +of the grace period. + + + + + + + + +
 
Quick Quiz:
+ But what about the CPU that started the grace period? + Why wouldn't it see the start of the grace period right when + it started that grace period? +
Answer:
+ In some deep philosophical and overly anthromorphized + sense, yes, the CPU starting the grace period is immediately + aware of having done so. + However, if we instead assume that RCU is not self-aware, + then even the CPU starting the grace period does not really + become aware of the start of this grace period until its + first call to __note_gp_changes(). + On the other hand, this CPU potentially gets early notification + because it invokes __note_gp_changes() during its + last rcu_gp_init() pass through its leaf + rcu_node structure. +
 
+ +

+Self-Reported Quiescent States

+ +

When all entities that might block the grace period have reported +quiescent states (or as described in a later section, had quiescent +states reported on their behalf), the grace period can end. +Online non-idle CPUs report their own quiescent states, as shown +in the following diagram: + +

TreeRCU-qs.svg + +

This is for the last CPU to report a quiescent state, which signals +the end of the grace period. +Earlier quiescent states would push up the rcu_node tree +only until they encountered an rcu_node structure that +is waiting for additional quiescent states. +However, ordering is nevertheless preserved because some later quiescent +state will acquire that rcu_node structure's ->lock. + +

Any number of events can lead up to a CPU invoking +note_gp_changes (or alternatively, directly invoking +__note_gp_changes()), at which point that CPU will notice +the start of a new grace period while holding its leaf +rcu_node lock. +Therefore, all execution shown in this diagram happens after the +start of the grace period. +In addition, this CPU will consider any RCU read-side critical +section that started before the invocation of __note_gp_changes() +to have started before the grace period, and thus a critical +section that the grace period must wait on. + + + + + + + + +
 
Quick Quiz:
+ But a RCU read-side critical section might have started + after the beginning of the grace period + (the ->gpnum++ from earlier), so why should + the grace period wait on such a critical section? +
Answer:
+ It is indeed not necessary for the grace period to wait on such + a critical section. + However, it is permissible to wait on it. + And it is furthermore important to wait on it, as this + lazy approach is far more scalable than a “big bang” + all-at-once grace-period start could possibly be. +
 
+ +

If the CPU does a context switch, a quiescent state will be +noted by rcu_node_context_switch() on the left. +On the other hand, if the CPU takes a scheduler-clock interrupt +while executing in usermode, a quiescent state will be noted by +rcu_check_callbacks() on the right. +Either way, the passage through a quiescent state will be noted +in a per-CPU variable. + +

The next time an RCU_SOFTIRQ handler executes on +this CPU (for example, after the next scheduler-clock +interrupt), __rcu_process_callbacks() will invoke +rcu_check_quiescent_state(), which will notice the +recorded quiescent state, and invoke +rcu_report_qs_rdp(). +If rcu_report_qs_rdp() verifies that the quiescent state +really does apply to the current grace period, it invokes +rcu_report_rnp() which traverses up the rcu_node +tree as shown at the bottom of the diagram, clearing bits from +each rcu_node structure's ->qsmask field, +and propagating up the tree when the result is zero. + +

Note that traversal passes upwards out of a given rcu_node +structure only if the current CPU is reporting the last quiescent +state for the subtree headed by that rcu_node structure. +A key point is that if a CPU's traversal stops at a given rcu_node +structure, then there will be a later traversal by another CPU +(or perhaps the same one) that proceeds upwards +from that point, and the rcu_node ->lock +guarantees that the first CPU's quiescent state happens before the +remainder of the second CPU's traversal. +Applying this line of thought repeatedly shows that all CPUs' +quiescent states happen before the last CPU traverses through +the root rcu_node structure, the “last CPU” +being the one that clears the last bit in the root rcu_node +structure's ->qsmask field. + +

Dynamic Tick Interface

+ +

Due to energy-efficiency considerations, RCU is forbidden from +disturbing idle CPUs. +CPUs are therefore required to notify RCU when entering or leaving idle +state, which they do via fully ordered value-returning atomic operations +on a per-CPU variable. +The ordering effects are as shown below: + +

TreeRCU-dyntick.svg + +

The RCU grace-period kernel thread samples the per-CPU idleness +variable while holding the corresponding CPU's leaf rcu_node +structure's ->lock. +This means that any RCU read-side critical sections that precede the +idle period (the oval near the top of the diagram above) will happen +before the end of the current grace period. +Similarly, the beginning of the current grace period will happen before +any RCU read-side critical sections that follow the +idle period (the oval near the bottom of the diagram above). + +

Plumbing this into the full grace-period execution is described +below. + +

CPU-Hotplug Interface

+ +

RCU is also forbidden from disturbing offline CPUs, which might well +be powered off and removed from the system completely. +CPUs are therefore required to notify RCU of their comings and goings +as part of the corresponding CPU hotplug operations. +The ordering effects are shown below: + +

TreeRCU-hotplug.svg + +

Because CPU hotplug operations are much less frequent than idle transitions, +they are heavier weight, and thus acquire the CPU's leaf rcu_node +structure's ->lock and update this structure's +->qsmaskinitnext. +The RCU grace-period kernel thread samples this mask to detect CPUs +having gone offline since the beginning of this grace period. + +

Plumbing this into the full grace-period execution is described +below. + +

Forcing Quiescent States

+ +

As noted above, idle and offline CPUs cannot report their own +quiescent states, and therefore the grace-period kernel thread +must do the reporting on their behalf. +This process is called “forcing quiescent states”, it is +repeated every few jiffies, and its ordering effects are shown below: + +

TreeRCU-gp-fqs.svg + +

Each pass of quiescent state forcing is guaranteed to traverse the +leaf rcu_node structures, and if there are no new quiescent +states due to recently idled and/or offlined CPUs, then only the +leaves are traversed. +However, if there is a newly offlined CPU as illustrated on the left +or a newly idled CPU as illustrated on the right, the corresponding +quiescent state will be driven up towards the root. +As with self-reported quiescent states, the upwards driving stops +once it reaches an rcu_node structure that has quiescent +states outstanding from other CPUs. + + + + + + + + +
 
Quick Quiz:
+ The leftmost drive to root stopped before it reached + the root rcu_node structure, which means that + there are still CPUs subordinate to that structure on + which the current grace period is waiting. + Given that, how is it possible that the rightmost drive + to root ended the grace period? +
Answer:
+ Good analysis! + It is in fact impossible in the absence of bugs in RCU. + But this diagram is complex enough as it is, so simplicity + overrode accuracy. + You can think of it as poetic license, or you can think of + it as misdirection that is resolved in the + stitched-together diagram. +
 
+ +

Grace-Period Cleanup

+ +

Grace-period cleanup first scans the rcu_node tree +breadth-first setting all the ->completed fields equal +to the number of the newly completed grace period, then it sets +the rcu_state structure's ->completed field, +again to the number of the newly completed grace period. +The ordering effects are shown below: + +

TreeRCU-gp-cleanup.svg + +

As indicated by the oval at the bottom of the diagram, once +grace-period cleanup is complete, the next grace period can begin. + + + + + + + + +
 
Quick Quiz:
+ But when precisely does the grace period end? +
Answer:
+ There is no useful single point at which the grace period + can be said to end. + The earliest reasonable candidate is as soon as the last + CPU has reported its quiescent state, but it may be some + milliseconds before RCU becomes aware of this. + The latest reasonable candidate is once the rcu_state + structure's ->completed field has been updated, + but it is quite possible that some CPUs have already completed + phase two of their updates by that time. + In short, if you are going to work with RCU, you need to + learn to embrace uncertainty. +
 
+ + +

Callback Invocation

+ +

Once a given CPU's leaf rcu_node structure's +->completed field has been updated, that CPU can begin +invoking its RCU callbacks that were waiting for this grace period +to end. +These callbacks are identified by rcu_advance_cbs(), +which is usually invoked by __note_gp_changes(). +As shown in the diagram below, this invocation can be triggered by +the scheduling-clock interrupt (rcu_check_callbacks() on +the left) or by idle entry (rcu_cleanup_after_idle() on +the right, but only for kernels build with +CONFIG_RCU_FAST_NO_HZ=y). +Either way, RCU_SOFTIRQ is raised, which results in +rcu_do_batch() invoking the callbacks, which in turn +allows those callbacks to carry out (either directly or indirectly +via wakeup) the needed phase-two processing for each update. + +

TreeRCU-callback-invocation.svg + +

Please note that callback invocation can also be prompted by any +number of corner-case code paths, for example, when a CPU notes that +it has excessive numbers of callbacks queued. +In all cases, the CPU acquires its leaf rcu_node structure's +->lock before invoking callbacks, which preserves the +required ordering against the newly completed grace period. + +

However, if the callback function communicates to other CPUs, +for example, doing a wakeup, then it is that function's responsibility +to maintain ordering. +For example, if the callback function wakes up a task that runs on +some other CPU, proper ordering must in place in both the callback +function and the task being awakened. +To see why this is important, consider the top half of the +grace-period cleanup diagram. +The callback might be running on a CPU corresponding to the leftmost +leaf rcu_node structure, and awaken a task that is to run on +a CPU corresponding to the rightmost leaf rcu_node structure, +and the grace-period kernel thread might not yet have reached the +rightmost leaf. +In this case, the grace period's memory ordering might not yet have +reached that CPU, so again the callback function and the awakened +task must supply proper ordering. + +

Putting It All Together

+ +

A stitched-together diagram is +here. + +

+Legal Statement

+ +

This work represents the view of the author and does not necessarily +represent the view of IBM. + +

Linux is a registered trademark of Linus Torvalds. + +

Other company, product, and service names may be trademarks or +service marks of others. + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg new file mode 100644 index 000000000000..832408313d93 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg @@ -0,0 +1,486 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_check_callbacks() + + rcu_cleanup_after_idle() + + rcu_advance_cbs() + + + Leaf + __note_gp_changes() + + + + Phase Two + of Update + + + RCU_SOFTIRQ + rcu_do_batch() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg new file mode 100644 index 000000000000..7ac6f9269806 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg @@ -0,0 +1,655 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + rcu_prepare_for_idle() + + rcu_accelerate_cbs() + + + + + note_gp_changes() + rcu_advance_cbs() + __note_gp_changes() + + call_rcu() + + Wake up + grace-period + kernel thread + + rcu_accelerate_cbs() + + + + + takedown_cpu() + rcutree_migrate_callbacks() + rcu_migrate_callbacks() + rcu_advance_cbs() + Leaf + Leaf + Leaf + + Phase One + of Update + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg new file mode 100644 index 000000000000..423df00c4df9 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg @@ -0,0 +1,700 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + Leaf + dyntick_save_progress_counter() + rcu_implicit_dynticks_qs() + + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg new file mode 100644 index 000000000000..754f426b297a --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg @@ -0,0 +1,1126 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->completed = ->gpnum + + + + + Root + + + rcu_gp_cleanup() + + + + + + ->completed = ->gpnum + + + + + + + Leaf + ->completed = ->gpnum + + + rsp->completed = + + + + + Root + rnp->completed + + + + + + + + + + + + Leaf + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + + + + + ->completed = ->gpnum + + + + + + + Leaf + + + + + + + Leaf + ->completed = ->gpnum + + + + + + + Leaf + ->completed = ->gpnum + + + + + + + + ->completed = ->gpnum + + + Start of + Next Grace + Period + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg new file mode 100644 index 000000000000..7ddc094d7f28 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg @@ -0,0 +1,1309 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_gp_fqs() + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + force_qs_rnp() + dyntick_save_progress_counter() + + + + + + Root + ->qsmask &= ~->grpmask + + rcu_implicit_dynticks_qs() + ->qsmask &= ~->grpmask + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg new file mode 100644 index 000000000000..0161262904ec --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg @@ -0,0 +1,656 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rsp->gpnum++ + + + + + Root + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + End of + Last Grace + Period + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg new file mode 100644 index 000000000000..4d956a732685 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg @@ -0,0 +1,656 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + ->qsmaskinit + ->qsmaskinitnext + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmaskinit + + + + + + + + + rcu_init_new_rnp() or + rcu_cleanup_dead_rnp() + (optional) + + ->qsmaskinit + + + + + Root + ->qsmaskinitnext + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg new file mode 100644 index 000000000000..de6ecc51b00e --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg @@ -0,0 +1,632 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->gpnum = rsp->gpnum + + + + + Root + + + rcu_gp_init() + + + + + + ->gpnum = rsp->gpnum + + + + + + + Leaf + ->gpnum = rsp->gpnum + + + + + + + ->gpnum = rsp->gpnum + + + + + + + Leaf + + + + + + + Leaf + ->gpnum = rsp->gpnum + + + + + + + Leaf + ->gpnum = rsp->gpnum + + + + + + + + ->gpnum = rsp->gpnum + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg new file mode 100644 index 000000000000..b13b7b01bb3a --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg @@ -0,0 +1,5135 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + rcu_prepare_for_idle() + + rcu_accelerate_cbs() + + + + + note_gp_changes() + rcu_advance_cbs() + __note_gp_changes() + + call_rcu() + + Wake up + grace-period + kernel thread + + rcu_accelerate_cbs() + + + + + takedown_cpu() + rcutree_migrate_callbacks() + rcu_migrate_callbacks() + rcu_advance_cbs() + Leaf + Leaf + Leaf + + Phase One + of Update + + + rsp->gpnum++ + + + + + Root + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + End of + Last Grace + Period + + + + Grace-period + kernel thread + awakened + + + + + + + + + + + + + + Leaf + + + + + + + ->qsmaskinit + ->qsmaskinitnext + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmaskinit + + + + + + + + + rcu_init_new_rnp() or + rcu_cleanup_dead_rnp() + (optional) + + ->qsmaskinit + + + + + Root + ->qsmaskinitnext + + + + ->gpnum = rsp->gpnum + + + + + Root + + + + + + + ->gpnum = rsp->gpnum + + + + + + + Leaf + ->gpnum = rsp->gpnum + + + + + + + ->gpnum = rsp->gpnum + + + + + + + Leaf + + + + + + + Leaf + ->gpnum = rsp->gpnum + + + + + + + Leaf + ->gpnum = rsp->gpnum + + + + + + + + ->gpnum = rsp->gpnum + + + + + + + rcu_gp_fqs() + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + force_qs_rnp() + dyntick_save_progress_counter() + + + + + + Root + ->qsmask &= ~->grpmask + + rcu_implicit_dynticks_qs() + ->qsmask &= ~->grpmask + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + + + + ->qsmask &= ~->grpmask + + + + + Root + + + rcu_report_rnp() + + + + + + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + + + + + + + + + note_gp_changes() + rdp->gpnum + __note_gp_changes() + Leaf + + + + rcu_node_context_switch() + + + + rcu_check_callbacks() + + + + rcu_process_callbacks() + rcu_check_quiescent_state()) + rcu__report_qs_rdp()) + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + + + Wake up + grace-period + kernel thread + + + + rcu_report_qs_rsp() + + + Grace-period + kernel thread + awakened + + + + ->completed = ->gpnum + + + + + Root + + + rcu_gp_cleanup() + + + + + + ->completed = ->gpnum + + + + + + Leaf + ->completed = ->gpnum + + rsp->completed = + + + + + Root + rnp->completed + + + + + + + + + + + Leaf + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + + + + ->completed = ->gpnum + + + + + + + Leaf + + + + + + + Leaf + ->completed = ->gpnum + + + + + + + Leaf + ->completed = ->gpnum + + + + + + + + ->completed = ->gpnum + + + Start of + Next Grace + Period + + + + + + rcu_check_callbacks() + + rcu_cleanup_after_idle() + rcu_advance_cbs() + + + Leaf + __note_gp_changes() + + + Phase Two + of Update + + + RCU_SOFTIRQ + rcu_do_batch() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg new file mode 100644 index 000000000000..2c9310ba29ba --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg @@ -0,0 +1,775 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + dyntick_save_progress_counter() + rcu_implicit_dynticks_qs() + Leaf + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg new file mode 100644 index 000000000000..de3992f4cbe1 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg @@ -0,0 +1,1095 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + + + + + Root + + + rcu_report_rnp() + + + + + + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + + + + + + + + + + note_gp_changes() + rdp->gpnum + __note_gp_changes() + Leaf + + + + rcu_node_context_switch() + + + + rcu_check_callbacks() + + + + rcu_process_callbacks() + rcu_check_quiescent_state()) + rcu__report_qs_rdp()) + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + + + + Wake up + grace-period + kernel thread + + + + rcu_report_qs_rsp() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg new file mode 100644 index 000000000000..94c96c595aed --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg @@ -0,0 +1,229 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + + + rcu_prepare_for_idle() + + From dfa0ee48ef86f79430d2be9d1e2e1b509abb3cce Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 9 Aug 2017 10:16:29 -0700 Subject: [PATCH 02/24] documentation: Long-running irq handlers can stall RCU grace periods If a periodic interrupt's handler takes longer to execute than the period between successive interrupts, RCU's kthreads and softirq handlers can be prevented from executing, resulting in otherwise inexplicable RCU CPU stall warnings. This commit therefore calls out this possibility in Documentation/RCU/stallwarn.txt. Reported-by: Daniel Lezcano Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.txt | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 96a3d81837e1..21b8913acbdf 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -40,7 +40,9 @@ o Booting Linux using a console connection that is too slow to o Anything that prevents RCU's grace-period kthreads from running. This can result in the "All QSes seen" console-log message. This message will include information on when the kthread last - ran and how often it should be expected to run. + ran and how often it should be expected to run. It can also + result in the "rcu_.*kthread starved for" console-log message, + which will include additional debugging information. o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might happen to preempt a low-priority task in the middle of an RCU @@ -60,6 +62,14 @@ o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that CONFIG_PREEMPT_RCU case, you might see stall-warning messages. +o A periodic interrupt whose handler takes longer than the time + interval between successive pairs of interrupts. This can + prevent RCU's kthreads and softirq handlers from running. + Note that certain high-overhead debugging options, for example + the function_graph tracer, can result in interrupt handler taking + considerably longer than normal, which can in turn result in + RCU CPU stall warnings. + o A hardware or software issue shuts off the scheduler-clock interrupt on a CPU that is not in dyntick-idle mode. This problem really has happened, and seems to be most likely to From 3d916a443e97169a3d88765c4e0b07ac813f439f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 10 Aug 2017 14:33:17 -0700 Subject: [PATCH 03/24] documentation: Slow systems can stall RCU grace periods If a fast system has a worst-case grace-period duration of (say) ten seconds, then running the same workload on a system ten times as slow will get you an RCU CPU stall warning given default stall-warning timeout settings. This commit therefore adds this possibility to stallwarn.txt. Reported-by: Daniel Lezcano Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.txt | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 21b8913acbdf..238acbd94917 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -70,6 +70,12 @@ o A periodic interrupt whose handler takes longer than the time considerably longer than normal, which can in turn result in RCU CPU stall warnings. +o Testing a workload on a fast system, tuning the stall-warning + timeout down to just barely avoid RCU CPU stall warnings, and then + running the same workload with the same stall-warning timeout on a + slow system. Note that thermal throttling and on-demand governors + can cause a single system to be sometimes fast and sometimes slow! + o A hardware or software issue shuts off the scheduler-clock interrupt on a CPU that is not in dyntick-idle mode. This problem really has happened, and seems to be most likely to From d3cf5176d0b17610b7c7f1562e496ec401fe01f8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Aug 2017 12:29:22 -0700 Subject: [PATCH 04/24] documentation: Update RCU CPU stall warning messages The RCU CPU stall warnings have morphed significantly since the last update, so this commit brings the documentation up to date. Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.txt | 180 ++++++++++++++++++-------------- 1 file changed, 104 insertions(+), 76 deletions(-) diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 238acbd94917..a08f928c8557 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -171,67 +171,32 @@ Interpreting RCU's CPU Stall-Detector "Splats" For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling, it will print a message similar to the following: -INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies) + INFO: rcu_sched detected stalls on CPUs/tasks: + 2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0 + 16-...: (0 ticks this GP) idle=81c/0/0 softirq=764/764 fqs=0 + (detected by 32, t=2603 jiffies, g=7073, c=7072, q=625) -This message indicates that CPU 5 detected that it was causing a stall, -and that the stall was affecting RCU-sched. This message will normally be -followed by a stack dump of the offending CPU. On TREE_RCU kernel builds, -RCU and RCU-sched are implemented by the same underlying mechanism, -while on PREEMPT_RCU kernel builds, RCU is instead implemented -by rcu_preempt_state. - -On the other hand, if the offending CPU fails to print out a stall-warning -message quickly enough, some other CPU will print a message similar to -the following: - -INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies) - -This message indicates that CPU 2 detected that CPUs 3 and 5 were both -causing stalls, and that the stall was affecting RCU-bh. This message +This message indicates that CPU 32 detected that CPUs 2 and 16 were both +causing stalls, and that the stall was affecting RCU-sched. This message will normally be followed by stack dumps for each CPU. Please note that -PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, -and that the tasks will be indicated by PID, for example, "P3421". -It is even possible for a rcu_preempt_state stall to be caused by both -CPUs -and- tasks, in which case the offending CPUs and tasks will all -be called out in the list. +PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that +the tasks will be indicated by PID, for example, "P3421". It is even +possible for a rcu_preempt_state stall to be caused by both CPUs -and- +tasks, in which case the offending CPUs and tasks will all be called +out in the list. -Finally, if the grace period ends just as the stall warning starts -printing, there will be a spurious stall-warning message: - -INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies) - -This is rare, but does happen from time to time in real life. It is also -possible for a zero-jiffy stall to be flagged in this case, depending -on how the stall warning and the grace-period initialization happen to -interact. Please note that it is not possible to entirely eliminate this -sort of false positive without resorting to things like stop_machine(), -which is overkill for this sort of problem. - -Recent kernels will print a long form of the stall-warning message: - - INFO: rcu_preempt detected stall on CPU - 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543 - (t=65000 jiffies) - -In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed: - - INFO: rcu_preempt detected stall on CPU - 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D - (t=65000 jiffies) - -The "(64628 ticks this GP)" indicates that this CPU has taken more -than 64,000 scheduling-clock interrupts during the current stalled -grace period. If the CPU was not yet aware of the current grace -period (for example, if it was offline), then this part of the message -indicates how many grace periods behind the CPU is. +CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with +the RCU core for the past three grace periods. In contrast, CPU 16's "(0 +ticks this GP)" indicates that this CPU has not taken any scheduling-clock +interrupts during the current stalled grace period. The "idle=" portion of the message prints the dyntick-idle state. The hex number before the first "/" is the low-order 12 bits of the -dynticks counter, which will have an even-numbered value if the CPU is -in dyntick-idle mode and an odd-numbered value otherwise. The hex -number between the two "/"s is the value of the nesting, which will -be a small positive number if in the idle loop and a very large positive -number (as shown above) otherwise. +dynticks counter, which will have an even-numbered value if the CPU +is in dyntick-idle mode and an odd-numbered value otherwise. The hex +number between the two "/"s is the value of the nesting, which will be +a small non-negative number if in the idle loop (as shown above) and a +very large positive number otherwise. The "softirq=" portion of the message tracks the number of RCU softirq handlers that the stalled CPU has executed. The number before the "/" @@ -246,24 +211,72 @@ handlers are no longer able to execute on this CPU. This can happen if the stalled CPU is spinning with interrupts are disabled, or, in -rt kernels, if a high-priority process is starving RCU's softirq handler. -For CONFIG_RCU_FAST_NO_HZ kernels, the "last_accelerate:" prints the -low-order 16 bits (in hex) of the jiffies counter when this CPU last -invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked -rcu_accelerate_cbs() from rcu_prepare_for_idle(). The "nonlazy_posted:" -prints the number of non-lazy callbacks posted since the last call to -rcu_needs_cpu(). Finally, an "L" indicates that there are currently -no non-lazy callbacks ("." is printed otherwise, as shown above) and -"D" indicates that dyntick-idle processing is enabled ("." is printed -otherwise, for example, if disabled via the "nohz=" kernel boot parameter). +The "fps=" shows the number of force-quiescent-state idle/offline +detection passes that the grace-period kthread has made across this +CPU since the last time that this CPU noted the beginning of a grace +period. + +The "detected by" line indicates which CPU detected the stall (in this +case, CPU 32), how many jiffies have elapsed since the start of the +grace period (in this case 2603), the number of the last grace period +to start and to complete (7073 and 7072, respectively), and an estimate +of the total number of RCU callbacks queued across all CPUs (625 in +this case). + +In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed +for each CPU: + + 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D + +The "last_accelerate:" prints the low-order 16 bits (in hex) of the +jiffies counter when this CPU last invoked rcu_try_advance_all_cbs() +from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from +rcu_prepare_for_idle(). The "nonlazy_posted:" prints the number +of non-lazy callbacks posted since the last call to rcu_needs_cpu(). +Finally, an "L" indicates that there are currently no non-lazy callbacks +("." is printed otherwise, as shown above) and "D" indicates that +dyntick-idle processing is enabled ("." is printed otherwise, for example, +if disabled via the "nohz=" kernel boot parameter). + +If the grace period ends just as the stall warning starts printing, +there will be a spurious stall-warning message, which will include +the following: + + INFO: Stall ended before state dump start + +This is rare, but does happen from time to time in real life. It is also +possible for a zero-jiffy stall to be flagged in this case, depending +on how the stall warning and the grace-period initialization happen to +interact. Please note that it is not possible to entirely eliminate this +sort of false positive without resorting to things like stop_machine(), +which is overkill for this sort of problem. + +If all CPUs and tasks have passed through quiescent states, but the +grace period has nevertheless failed to end, the stall-warning splat +will include something like the following: + + All QSes seen, last rcu_preempt kthread activity 23807 (4297905177-4297881370), jiffies_till_next_fqs=3, root ->qsmask 0x0 + +The "23807" indicates that it has been more than 23 thousand jiffies +since the grace-period kthread ran. The "jiffies_till_next_fqs" +indicates how frequently that kthread should run, giving the number +of jiffies between force-quiescent-state scans, in this case three, +which is way less than 23807. Finally, the root rcu_node structure's +->qsmask field is printed, which will normally be zero. If the relevant grace-period kthread has been unable to run prior to -the stall warning, the following additional line is printed: +the stall warning, as was the case in the "All QSes seen" line above, +the following additional line is printed: - rcu_preempt kthread starved for 2023 jiffies! + kthread starved for 23807 jiffies! g7073 c7072 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 -Starving the grace-period kthreads of CPU time can of course result in -RCU CPU stall warnings even when all CPUs and tasks have passed through -the required quiescent states. +Starving the grace-period kthreads of CPU time can of course result +in RCU CPU stall warnings even when all CPUs and tasks have passed +through the required quiescent states. The "g" and "c" numbers flag the +number of the last grace period started and completed, respectively, +the "f" precedes the ->gp_flags command to the grace-period kthread, +the "RCU_GP_WAIT_FQS" indicates that the kthread is waiting for a short +timeout, and the "state" precedes value of the task_struct ->state field. Multiple Warnings From One Stall @@ -280,13 +293,28 @@ Stall Warnings for Expedited Grace Periods If an expedited grace period detects a stall, it will place a message like the following in dmesg: - INFO: rcu_sched detected expedited stalls on CPUs: { 1 2 6 } 26009 jiffies s: 1043 + INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 7-... } 21119 jiffies s: 73 root: 0x2/. -This indicates that CPUs 1, 2, and 6 have failed to respond to a -reschedule IPI, that the expedited grace period has been going on for -26,009 jiffies, and that the expedited grace-period sequence counter is -1043. The fact that this last value is odd indicates that an expedited -grace period is in flight. +This indicates that CPU 7 has failed to respond to a reschedule IPI. +The three periods (".") following the CPU number indicate that the CPU +is online (otherwise the first period would instead have been "O"), +that the CPU was online at the beginning of the expedited grace period +(otherwise the second period would have instead been "o"), and that +the CPU has been online at least once since boot (otherwise, the third +period would instead have been "N"). The number before the "jiffies" +indicates that the expedited grace period has been going on for 21,119 +jiffies. The number following the "s:" indicates that the expedited +grace-period sequence counter is 73. The fact that this last value is +odd indicates that an expedited grace period is in flight. The number +following "root:" is a bitmask that indicates which children of the root +rcu_node structure correspond to CPUs and/or tasks that are blocking the +current expedited grace period. If the tree had more than one level, +additional hex numbers would be printed for the states of the other +rcu_node structures in the tree. + +As with normal grace periods, PREEMPT_RCU builds can be stalled by +tasks as well as by CPUs, and that the tasks will be indicated by PID, +for example, "P3421". It is entirely possible to see stall warnings from normal and from -expedited grace periods at about the same time from the same run. +expedited grace periods at about the same time during the same run. From f1ab25a30ce81f4e9be3cb33cd9bb9fb2db64b28 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 29 Aug 2017 15:49:21 -0700 Subject: [PATCH 05/24] memory-barriers: Replace uses of "transitive" The current version of memory-barriers.txt misuses the term "transitive", so this commit replaces it with multi-copy atomic, also adding a definition of this term. Reported-by: Alan Stern Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 175 +++++++++++++++--------------- 1 file changed, 86 insertions(+), 89 deletions(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index b759a60624fd..b6882680247e 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -53,7 +53,7 @@ CONTENTS - SMP barrier pairing. - Examples of memory barrier sequences. - Read memory barriers vs load speculation. - - Transitivity + - Multicopy atomicity. (*) Explicit kernel barriers. @@ -635,6 +635,11 @@ can be used to record rare error conditions and the like, and the CPUs' naturally occurring ordering prevents such records from being lost. +Note well that the ordering provided by a data dependency is local to +the CPU containing it. See the section on "Multicopy atomicity" for +more information. + + The data dependency barrier is very important to the RCU system, for example. See rcu_assign_pointer() and rcu_dereference() in include/linux/rcupdate.h. This permits the current target of an RCU'd @@ -851,38 +856,11 @@ In short, control dependencies apply only to the stores in the then-clause and else-clause of the if-statement in question (including functions invoked by those two clauses), not to code following that if-statement. -Finally, control dependencies do -not- provide transitivity. This is -demonstrated by two related examples, with the initial values of -'x' and 'y' both being zero: - CPU 0 CPU 1 - ======================= ======================= - r1 = READ_ONCE(x); r2 = READ_ONCE(y); - if (r1 > 0) if (r2 > 0) - WRITE_ONCE(y, 1); WRITE_ONCE(x, 1); +Note well that the ordering provided by a control dependency is local +to the CPU containing it. See the section on "Multicopy atomicity" +for more information. - assert(!(r1 == 1 && r2 == 1)); - -The above two-CPU example will never trigger the assert(). However, -if control dependencies guaranteed transitivity (which they do not), -then adding the following CPU would guarantee a related assertion: - - CPU 2 - ===================== - WRITE_ONCE(x, 2); - - assert(!(r1 == 2 && r2 == 1 && x == 2)); /* FAILS!!! */ - -But because control dependencies do -not- provide transitivity, the above -assertion can fail after the combined three-CPU example completes. If you -need the three-CPU example to provide ordering, you will need smp_mb() -between the loads and stores in the CPU 0 and CPU 1 code fragments, -that is, just before or just after the "if" statements. Furthermore, -the original two-CPU example is very fragile and should be avoided. - -These two examples are the LB and WWC litmus tests from this paper: -http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this -site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html. In summary: @@ -922,8 +900,8 @@ In summary: (*) Control dependencies pair normally with other types of barriers. - (*) Control dependencies do -not- provide transitivity. If you - need transitivity, use smp_mb(). + (*) Control dependencies do -not- provide multicopy atomicity. If you + need all the CPUs to see a given store at the same time, use smp_mb(). (*) Compilers do not understand control dependencies. It is therefore your job to ensure that they do not break your code. @@ -936,13 +914,14 @@ When dealing with CPU-CPU interactions, certain types of memory barrier should always be paired. A lack of appropriate pairing is almost certainly an error. General barriers pair with each other, though they also pair with most -other types of barriers, albeit without transitivity. An acquire barrier -pairs with a release barrier, but both may also pair with other barriers, -including of course general barriers. A write barrier pairs with a data -dependency barrier, a control dependency, an acquire barrier, a release -barrier, a read barrier, or a general barrier. Similarly a read barrier, -control dependency, or a data dependency barrier pairs with a write -barrier, an acquire barrier, a release barrier, or a general barrier: +other types of barriers, albeit without multicopy atomicity. An acquire +barrier pairs with a release barrier, but both may also pair with other +barriers, including of course general barriers. A write barrier pairs +with a data dependency barrier, a control dependency, an acquire barrier, +a release barrier, a read barrier, or a general barrier. Similarly a +read barrier, control dependency, or a data dependency barrier pairs +with a write barrier, an acquire barrier, a release barrier, or a +general barrier: CPU 1 CPU 2 =============== =============== @@ -1359,64 +1338,77 @@ the speculation will be cancelled and the value reloaded: retrieved : : +-------+ -TRANSITIVITY ------------- +MULTICOPY ATOMICITY +-------------------- -Transitivity is a deeply intuitive notion about ordering that is not -always provided by real computer systems. The following example -demonstrates transitivity: +Multicopy atomicity is a deeply intuitive notion about ordering that is +not always provided by real computer systems, namely that a given store +is visible at the same time to all CPUs, or, alternatively, that all +CPUs agree on the order in which all stores took place. However, use of +full multicopy atomicity would rule out valuable hardware optimizations, +so a weaker form called ``other multicopy atomicity'' instead guarantees +that a given store is observed at the same time by all -other- CPUs. The +remainder of this document discusses this weaker form, but for brevity +will call it simply ``multicopy atomicity''. + +The following example demonstrates multicopy atomicity: CPU 1 CPU 2 CPU 3 ======================= ======================= ======================= { X = 0, Y = 0 } - STORE X=1 LOAD X STORE Y=1 - - LOAD Y LOAD X + STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) + + STORE Y=r1 LOAD X -Suppose that CPU 2's load from X returns 1 and its load from Y returns 0. -This indicates that CPU 2's load from X in some sense follows CPU 1's -store to X and that CPU 2's load from Y in some sense preceded CPU 3's -store to Y. The question is then "Can CPU 3's load from X return 0?" +Suppose that CPU 2's load from X returns 1 which it then stores to Y and +that CPU 3's load from Y returns 1. This indicates that CPU 2's load +from X in some sense follows CPU 1's store to X and that CPU 2's store +to Y in some sense preceded CPU 3's load from Y. The question is then +"Can CPU 3's load from X return 0?" -Because CPU 2's load from X in some sense came after CPU 1's store, it +Because CPU 3's load from X in some sense came after CPU 2's load, it is natural to expect that CPU 3's load from X must therefore return 1. -This expectation is an example of transitivity: if a load executing on -CPU A follows a load from the same variable executing on CPU B, then -CPU A's load must either return the same value that CPU B's load did, -or must return some later value. +This expectation is an example of multicopy atomicity: if a load executing +on CPU A follows a load from the same variable executing on CPU B, then +an understandable but incorrect expectation is that CPU A's load must +either return the same value that CPU B's load did, or must return some +later value. -In the Linux kernel, use of general memory barriers guarantees -transitivity. Therefore, in the above example, if CPU 2's load from X -returns 1 and its load from Y returns 0, then CPU 3's load from X must -also return 1. +In the Linux kernel, the above use of a general memory barrier compensates +for any lack of multicopy atomicity. Therefore, in the above example, +if CPU 2's load from X returns 1 and its load from Y returns 0, and CPU 3's +load from Y returns 1, then CPU 3's load from X must also return 1. -However, transitivity is -not- guaranteed for read or write barriers. -For example, suppose that CPU 2's general barrier in the above example -is changed to a read barrier as shown below: +However, dependencies, read barriers, and write barriers are not always +able to compensate for non-multicopy atomicity. For example, suppose +that CPU 2's general barrier is removed from the above example, leaving +only the data dependency shown below: CPU 1 CPU 2 CPU 3 ======================= ======================= ======================= { X = 0, Y = 0 } - STORE X=1 LOAD X STORE Y=1 - - LOAD Y LOAD X + STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) + + STORE Y=r1 LOAD X (reads 0) -This substitution destroys transitivity: in this example, it is perfectly -legal for CPU 2's load from X to return 1, its load from Y to return 0, -and CPU 3's load from X to return 0. +This substitution allows non-multicopy atomicity to run rampant: in +this example, it is perfectly legal for CPU 2's load from X to return 1, +CPU 3's load from Y to return 1, and its load from X to return 0. -The key point is that although CPU 2's read barrier orders its pair -of loads, it does not guarantee to order CPU 1's store. Therefore, if -this example runs on a system where CPUs 1 and 2 share a store buffer -or a level of cache, CPU 2 might have early access to CPU 1's writes. -General barriers are therefore required to ensure that all CPUs agree -on the combined order of CPU 1's and CPU 2's accesses. +The key point is that although CPU 2's data dependency orders its load +and store, it does not guarantee to order CPU 1's store. Therefore, +if this example runs on a non-multicopy-atomic system where CPUs 1 and 2 +share a store buffer or a level of cache, CPU 2 might have early access +to CPU 1's writes. A general barrier is therefore required to ensure +that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. -General barriers provide "global transitivity", so that all CPUs will -agree on the order of operations. In contrast, a chain of release-acquire -pairs provides only "local transitivity", so that only those CPUs on -the chain are guaranteed to agree on the combined order of the accesses. -For example, switching to C code in deference to Herman Hollerith: +General barriers can compensate not only for non-multicopy atomicity, +but can also generate additional ordering that can ensure that -all- +CPUs will perceive the same order of -all- operations. In contrast, a +chain of release-acquire pairs do not provide this additional ordering, +which means that only those CPUs on the chain are guaranteed to agree +on the combined order of the accesses. For example, switching to C code +in deference to the ghost of Herman Hollerith: int u, v, x, y, z; @@ -1448,9 +1440,9 @@ For example, switching to C code in deference to Herman Hollerith: r3 = READ_ONCE(u); } -Because cpu0(), cpu1(), and cpu2() participate in a local transitive -chain of smp_store_release()/smp_load_acquire() pairs, the following -outcome is prohibited: +Because cpu0(), cpu1(), and cpu2() participate in a chain of +smp_store_release()/smp_load_acquire() pairs, the following outcome +is prohibited: r0 == 1 && r1 == 1 && r2 == 1 @@ -1460,9 +1452,9 @@ outcome is prohibited: r1 == 1 && r5 == 0 -However, the transitivity of release-acquire is local to the participating -CPUs and does not apply to cpu3(). Therefore, the following outcome -is possible: +However, the ordering provided by a release-acquire chain is local +to the CPUs participating in that chain and does not apply to cpu3(), +at least aside from stores. Therefore, the following outcome is possible: r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 @@ -1490,8 +1482,8 @@ following outcome is possible: Note that this outcome can happen even on a mythical sequentially consistent system where nothing is ever reordered. -To reiterate, if your code requires global transitivity, use general -barriers throughout. +To reiterate, if your code requires full ordering of all operations, +use general barriers throughout. ======================== @@ -3101,6 +3093,9 @@ AMD64 Architecture Programmer's Manual Volume 2: System Programming Chapter 7.1: Memory-Access Ordering Chapter 7.4: Buffering and Combining Memory Writes +ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile) + Chapter B2: The AArch64 Application Level Memory Model + IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide Chapter 7.1: Locked Atomic Operations @@ -3112,6 +3107,8 @@ The SPARC Architecture Manual, Version 9 Appendix D: Formal Specification of the Memory Models Appendix J: Programming with the Memory Models +Storage in the PowerPC (Stone and Fitzgerald) + UltraSPARC Programmer Reference Manual Chapter 5: Memory Accesses and Cacheability Chapter 15: Sparc-V9 Memory Models From 0902b1f44a72558aece92f074154044861681f84 Mon Sep 17 00:00:00 2001 From: Alan Stern Date: Fri, 1 Sep 2017 07:53:34 -0700 Subject: [PATCH 06/24] memory-barriers: Rework multicopy-atomicity section Signed-off-by: Alan Stern Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 56 ++++++++++++++++--------------- 1 file changed, 29 insertions(+), 27 deletions(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index b6882680247e..7deee1441640 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1343,13 +1343,13 @@ MULTICOPY ATOMICITY Multicopy atomicity is a deeply intuitive notion about ordering that is not always provided by real computer systems, namely that a given store -is visible at the same time to all CPUs, or, alternatively, that all -CPUs agree on the order in which all stores took place. However, use of -full multicopy atomicity would rule out valuable hardware optimizations, -so a weaker form called ``other multicopy atomicity'' instead guarantees -that a given store is observed at the same time by all -other- CPUs. The -remainder of this document discusses this weaker form, but for brevity -will call it simply ``multicopy atomicity''. +becomes visible at the same time to all CPUs, or, alternatively, that all +CPUs agree on the order in which all stores become visible. However, +support of full multicopy atomicity would rule out valuable hardware +optimizations, so a weaker form called ``other multicopy atomicity'' +instead guarantees only that a given store becomes visible at the same +time to all -other- CPUs. The remainder of this document discusses this +weaker form, but for brevity will call it simply ``multicopy atomicity''. The following example demonstrates multicopy atomicity: @@ -1360,24 +1360,26 @@ The following example demonstrates multicopy atomicity: STORE Y=r1 LOAD X -Suppose that CPU 2's load from X returns 1 which it then stores to Y and -that CPU 3's load from Y returns 1. This indicates that CPU 2's load -from X in some sense follows CPU 1's store to X and that CPU 2's store -to Y in some sense preceded CPU 3's load from Y. The question is then -"Can CPU 3's load from X return 0?" +Suppose that CPU 2's load from X returns 1, which it then stores to Y, +and CPU 3's load from Y returns 1. This indicates that CPU 1's store +to X precedes CPU 2's load from X and that CPU 2's store to Y precedes +CPU 3's load from Y. In addition, the memory barriers guarantee that +CPU 2 executes its load before its store, and CPU 3 loads from Y before +it loads from X. The question is then "Can CPU 3's load from X return 0?" -Because CPU 3's load from X in some sense came after CPU 2's load, it +Because CPU 3's load from X in some sense comes after CPU 2's load, it is natural to expect that CPU 3's load from X must therefore return 1. -This expectation is an example of multicopy atomicity: if a load executing -on CPU A follows a load from the same variable executing on CPU B, then -an understandable but incorrect expectation is that CPU A's load must -either return the same value that CPU B's load did, or must return some -later value. +This expectation follows from multicopy atomicity: if a load executing +on CPU B follows a load from the same variable executing on CPU A (and +CPU A did not originally store the value which it read), then on +multicopy-atomic systems, CPU B's load must return either the same value +that CPU A's load did or some later value. However, the Linux kernel +does not require systems to be multicopy atomic. -In the Linux kernel, the above use of a general memory barrier compensates -for any lack of multicopy atomicity. Therefore, in the above example, -if CPU 2's load from X returns 1 and its load from Y returns 0, and CPU 3's -load from Y returns 1, then CPU 3's load from X must also return 1. +The use of a general memory barrier in the example above compensates +for any lack of multicopy atomicity. In the example, if CPU 2's load +from X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load +from X must indeed also return 1. However, dependencies, read barriers, and write barriers are not always able to compensate for non-multicopy atomicity. For example, suppose @@ -1396,11 +1398,11 @@ this example, it is perfectly legal for CPU 2's load from X to return 1, CPU 3's load from Y to return 1, and its load from X to return 0. The key point is that although CPU 2's data dependency orders its load -and store, it does not guarantee to order CPU 1's store. Therefore, -if this example runs on a non-multicopy-atomic system where CPUs 1 and 2 -share a store buffer or a level of cache, CPU 2 might have early access -to CPU 1's writes. A general barrier is therefore required to ensure -that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. +and store, it does not guarantee to order CPU 1's store. Thus, if this +example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a +store buffer or a level of cache, CPU 2 might have early access to CPU 1's +writes. General barriers are therefore required to ensure that all CPUs +agree on the combined order of multiple accesses. General barriers can compensate not only for non-multicopy atomicity, but can also generate additional ordering that can ensure that -all- From 135bd1a230bb69a68c9808a7d25467318900b80a Mon Sep 17 00:00:00 2001 From: Neeraj Upadhyay Date: Mon, 7 Aug 2017 11:20:10 +0530 Subject: [PATCH 07/24] rcu: Fix up pending cbs check in rcu_prepare_for_idle The pending-callbacks check in rcu_prepare_for_idle() is backwards. It should accelerate if there are pending callbacks, but the check rather uselessly accelerates only if there are no callbacks. This commit therefore inverts this check. Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling") Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney Cc: # 4.12.x --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e012b9be777e..fed95fa941e6 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1507,7 +1507,7 @@ static void rcu_prepare_for_idle(void) rdtp->last_accelerate = jiffies; for_each_rcu_flavor(rsp) { rdp = this_cpu_ptr(rsp->rda); - if (rcu_segcblist_pend_cbs(&rdp->cblist)) + if (!rcu_segcblist_pend_cbs(&rdp->cblist)) continue; rnp = rdp->mynode; raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ From c63eb17ff06dbcf73e771b9b425c531cc0a9c17b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 11 Aug 2017 12:37:07 -0700 Subject: [PATCH 08/24] rcu: Create call_rcu_tasks() kthread at boot time Currently the call_rcu_tasks() kthread is created upon first invocation of call_rcu_tasks(). This has the advantage of avoiding creation if there are never any invocations of call_rcu_tasks() and of synchronize_rcu_tasks(), but it requires an unreliable heuristic to determine when it is safe to create the kthread. For example, it is not safe to create the kthread when call_rcu_tasks() is invoked with a spinlock held, but there is no good way to detect this in !PREEMPT kernels. This commit therefore creates this kthread unconditionally at core_initcall() time. If you don't want this kthread created, then build with CONFIG_TASKS_RCU=n. Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 24 +++++------------------- 1 file changed, 5 insertions(+), 19 deletions(-) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 5033b66d2753..e9bbedbb8745 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -575,7 +575,6 @@ DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu); static int rcu_task_stall_timeout __read_mostly = RCU_TASK_STALL_TIMEOUT; module_param(rcu_task_stall_timeout, int, 0644); -static void rcu_spawn_tasks_kthread(void); static struct task_struct *rcu_tasks_kthread_ptr; /** @@ -600,7 +599,6 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func) { unsigned long flags; bool needwake; - bool havetask = READ_ONCE(rcu_tasks_kthread_ptr); rhp->next = NULL; rhp->func = func; @@ -610,11 +608,8 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func) rcu_tasks_cbs_tail = &rhp->next; raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags); /* We can't create the thread unless interrupts are enabled. */ - if ((needwake && havetask) || - (!havetask && !irqs_disabled_flags(flags))) { - rcu_spawn_tasks_kthread(); + if (needwake && READ_ONCE(rcu_tasks_kthread_ptr)) wake_up(&rcu_tasks_cbs_wq); - } } EXPORT_SYMBOL_GPL(call_rcu_tasks); @@ -853,27 +848,18 @@ static int __noreturn rcu_tasks_kthread(void *arg) } } -/* Spawn rcu_tasks_kthread() at first call to call_rcu_tasks(). */ -static void rcu_spawn_tasks_kthread(void) +/* Spawn rcu_tasks_kthread() at core_initcall() time. */ +static int __init rcu_spawn_tasks_kthread(void) { - static DEFINE_MUTEX(rcu_tasks_kthread_mutex); struct task_struct *t; - if (READ_ONCE(rcu_tasks_kthread_ptr)) { - smp_mb(); /* Ensure caller sees full kthread. */ - return; - } - mutex_lock(&rcu_tasks_kthread_mutex); - if (rcu_tasks_kthread_ptr) { - mutex_unlock(&rcu_tasks_kthread_mutex); - return; - } t = kthread_run(rcu_tasks_kthread, NULL, "rcu_tasks_kthread"); BUG_ON(IS_ERR(t)); smp_mb(); /* Ensure others see full kthread. */ WRITE_ONCE(rcu_tasks_kthread_ptr, t); - mutex_unlock(&rcu_tasks_kthread_mutex); + return 0; } +core_initcall(rcu_spawn_tasks_kthread); /* Do the srcu_read_lock() for the above synchronize_srcu(). */ void exit_tasks_rcu_start(void) From 6733bab7bc09b67668028dab562caea1b4ff3c69 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 18 Aug 2017 10:59:16 -0700 Subject: [PATCH 09/24] irq_work: Map irq_work_on_queue() to irq_work_on() in !SMP Commit 478850160636 ("irq_work: Implement remote queueing") provides irq_work_on_queue() only for SMP builds. However, providing it simplifies code that submits irq_work to lists of CPUs, eliminating the !SMP special cases. This commit therefore maps irq_work_on_queue() to irq_work_on() in !SMP builds, but validating the specified CPU. Signed-off-by: Paul E. McKenney --- include/linux/irq_work.h | 3 --- kernel/irq_work.c | 9 ++++++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index 47b9ebd4a74f..d8c9876d5da4 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -33,10 +33,7 @@ void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *)) #define DEFINE_IRQ_WORK(name, _f) struct irq_work name = { .func = (_f), } bool irq_work_queue(struct irq_work *work); - -#ifdef CONFIG_SMP bool irq_work_queue_on(struct irq_work *work, int cpu); -#endif void irq_work_tick(void); void irq_work_sync(struct irq_work *work); diff --git a/kernel/irq_work.c b/kernel/irq_work.c index bcf107ce0854..9f20f6c72579 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -56,7 +56,6 @@ void __weak arch_irq_work_raise(void) */ } -#ifdef CONFIG_SMP /* * Enqueue the irq_work @work on @cpu unless it's already pending * somewhere. @@ -68,6 +67,8 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) /* All work should have been flushed before going offline */ WARN_ON_ONCE(cpu_is_offline(cpu)); +#ifdef CONFIG_SMP + /* Arch remote IPI send/receive backend aren't NMI safe */ WARN_ON_ONCE(in_nmi()); @@ -78,10 +79,12 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) arch_send_call_function_single_ipi(cpu); +#else /* #ifdef CONFIG_SMP */ + irq_work_queue(work); +#endif /* #else #ifdef CONFIG_SMP */ + return true; } -EXPORT_SYMBOL_GPL(irq_work_queue_on); -#endif /* Enqueue the irq work @work on the current CPU */ bool irq_work_queue(struct irq_work *work) From 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 18 Sep 2017 08:54:40 -0700 Subject: [PATCH 10/24] sched: Make resched_cpu() unconditional The current implementation of synchronize_sched_expedited() incorrectly assumes that resched_cpu() is unconditional, which it is not. This means that synchronize_sched_expedited() can hang when resched_cpu()'s trylock fails as follows (analysis by Neeraj Upadhyay): o CPU1 is waiting for expedited wait to complete: sync_rcu_exp_select_cpus rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5 IPI sent to CPU5 synchronize_sched_expedited_wait ret = swait_event_timeout(rsp->expedited_wq, sync_rcu_preempt_exp_done(rnp_root), jiffies_stall); expmask = 0x20, CPU 5 in idle path (in cpuidle_enter()) o CPU5 handles IPI and fails to acquire rq lock. Handles IPI sync_sched_exp_handler resched_cpu returns while failing to try lock acquire rq->lock need_resched is not set o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to idle (schedule() is not called). o CPU 1 reports RCU stall. Given that resched_cpu() is now used only by RCU, this commit fixes the assumption by making resched_cpu() unconditional. Reported-by: Neeraj Upadhyay Suggested-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney Acked-by: Steven Rostedt (VMware) Acked-by: Peter Zijlstra (Intel) Cc: stable@vger.kernel.org --- kernel/sched/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d17c5da523a0..8fa7b6f9e19b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -505,8 +505,7 @@ void resched_cpu(int cpu) struct rq *rq = cpu_rq(cpu); unsigned long flags; - if (!raw_spin_trylock_irqsave(&rq->lock, flags)) - return; + raw_spin_lock_irqsave(&rq->lock, flags); resched_curr(rq); raw_spin_unlock_irqrestore(&rq->lock, flags); } From f79c3ad6189624c3de0ad5521610c9e22a1c33cf Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 30 Nov 2016 06:24:30 -0800 Subject: [PATCH 11/24] sched,rcu: Make cond_resched() provide RCU quiescent state There is some confusion as to which of cond_resched() or cond_resched_rcu_qs() should be added to long in-kernel loops. This commit therefore eliminates the decision by adding RCU quiescent states to cond_resched(). This commit also simplifies the code that used to interact with cond_resched_rcu_qs(), and that now interacts with cond_resched(), to reduce its overhead. This reduction is necessary to allow the heavier-weight cond_resched_rcu_qs() mechanism to be invoked everywhere that cond_resched() is invoked. Part of that reduction in overhead converts the jiffies_till_sched_qs kernel parameter to read-only at runtime, thus eliminating the need for bounds checking. Reported-by: Michal Hocko Signed-off-by: Paul E. McKenney Cc: Peter Zijlstra [ paulmck: Keep PREEMPT=n cond_resched a no-op, per Peter Zijlstra. ] --- kernel/rcu/tree.c | 25 +++++-------------------- kernel/sched/core.c | 1 + 2 files changed, 6 insertions(+), 20 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b0ad62b0e7b8..0dda57a28276 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -534,8 +534,8 @@ module_param(rcu_kick_kthreads, bool, 0644); * How long the grace period must be before we start recruiting * quiescent-state help from rcu_note_context_switch(). */ -static ulong jiffies_till_sched_qs = HZ / 20; -module_param(jiffies_till_sched_qs, ulong, 0644); +static ulong jiffies_till_sched_qs = HZ / 10; +module_param(jiffies_till_sched_qs, ulong, 0444); static bool rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp); @@ -1235,7 +1235,6 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) unsigned long jtsq; bool *rnhqp; bool *ruqp; - unsigned long rjtsc; struct rcu_node *rnp; /* @@ -1252,23 +1251,13 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) return 1; } - /* Compute and saturate jiffies_till_sched_qs. */ - jtsq = jiffies_till_sched_qs; - rjtsc = rcu_jiffies_till_stall_check(); - if (jtsq > rjtsc / 2) { - WRITE_ONCE(jiffies_till_sched_qs, rjtsc); - jtsq = rjtsc / 2; - } else if (jtsq < 1) { - WRITE_ONCE(jiffies_till_sched_qs, 1); - jtsq = 1; - } - /* * Has this CPU encountered a cond_resched_rcu_qs() since the * beginning of the grace period? For this to be the case, * the CPU has to have noticed the current grace period. This * might not be the case for nohz_full CPUs looping in the kernel. */ + jtsq = jiffies_till_sched_qs; rnp = rdp->mynode; ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu); if (time_after(jiffies, rdp->rsp->gp_start + jtsq) && @@ -1276,7 +1265,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) READ_ONCE(rdp->gpnum) == rnp->gpnum && !rdp->gpwrap) { trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("rqc")); return 1; - } else { + } else if (time_after(jiffies, rdp->rsp->gp_start + jtsq)) { /* Load rcu_qs_ctr before store to rcu_urgent_qs. */ smp_store_release(ruqp, true); } @@ -1304,10 +1293,6 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) * updates are only once every few jiffies, the probability of * lossage (and thus of slight grace-period extension) is * quite low. - * - * Note that if the jiffies_till_sched_qs boot/sysfs parameter - * is set too high, we override with half of the RCU CPU stall - * warning delay. */ rnhqp = &per_cpu(rcu_dynticks.rcu_need_heavy_qs, rdp->cpu); if (!READ_ONCE(*rnhqp) && @@ -1316,7 +1301,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) WRITE_ONCE(*rnhqp, true); /* Store rcu_need_heavy_qs before rcu_urgent_qs. */ smp_store_release(ruqp, true); - rdp->rsp->jiffies_resched += 5; /* Re-enable beating. */ + rdp->rsp->jiffies_resched += jtsq; /* Re-enable beating. */ } /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d17c5da523a0..904d3ab35c83 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4842,6 +4842,7 @@ int __sched _cond_resched(void) preempt_schedule_common(); return 1; } + rcu_all_qs(); return 0; } EXPORT_SYMBOL(_cond_resched); From 9b9500da81502738efa1b485a8835f174ff7be6d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 17 Aug 2017 17:05:59 -0700 Subject: [PATCH 12/24] rcu: Make RCU CPU stall warnings check for irq-disabled CPUs One common question upon seeing an RCU CPU stall warning is "did the stalled CPUs have interrupts disabled?" However, the current stall warnings are silent on this point. This commit therefore uses irq_work to check whether stalled CPUs still respond to IPIs, and flags this state in the RCU CPU stall warning console messages. Reported-by: Steven Rostedt Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 104 ++++++++++++++++++++++++++++++++++----- kernel/rcu/tree.h | 5 ++ kernel/rcu/tree_plugin.h | 7 ++- 3 files changed, 103 insertions(+), 13 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0dda57a28276..12838a9a128e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1206,6 +1206,22 @@ static int rcu_is_cpu_rrupt_from_idle(void) return __this_cpu_read(rcu_dynticks.dynticks_nesting) <= 1; } +/* + * We are reporting a quiescent state on behalf of some other CPU, so + * it is our responsibility to check for and handle potential overflow + * of the rcu_node ->gpnum counter with respect to the rcu_data counters. + * After all, the CPU might be in deep idle state, and thus executing no + * code whatsoever. + */ +static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp) +{ + lockdep_assert_held(&rnp->lock); + if (ULONG_CMP_LT(READ_ONCE(rdp->gpnum) + ULONG_MAX / 4, rnp->gpnum)) + WRITE_ONCE(rdp->gpwrap, true); + if (ULONG_CMP_LT(rdp->rcu_iw_gpnum + ULONG_MAX / 4, rnp->gpnum)) + rdp->rcu_iw_gpnum = rnp->gpnum + ULONG_MAX / 4; +} + /* * Snapshot the specified CPU's dynticks counter so that we can later * credit them with an implicit quiescent state. Return 1 if this CPU @@ -1216,14 +1232,33 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp) rdp->dynticks_snap = rcu_dynticks_snap(rdp->dynticks); if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) { trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti")); - if (ULONG_CMP_LT(READ_ONCE(rdp->gpnum) + ULONG_MAX / 4, - rdp->mynode->gpnum)) - WRITE_ONCE(rdp->gpwrap, true); + rcu_gpnum_ovf(rdp->mynode, rdp); return 1; } return 0; } +/* + * Handler for the irq_work request posted when a grace period has + * gone on for too long, but not yet long enough for an RCU CPU + * stall warning. Set state appropriately, but just complain if + * there is unexpected state on entry. + */ +static void rcu_iw_handler(struct irq_work *iwp) +{ + struct rcu_data *rdp; + struct rcu_node *rnp; + + rdp = container_of(iwp, struct rcu_data, rcu_iw); + rnp = rdp->mynode; + raw_spin_lock_rcu_node(rnp); + if (!WARN_ON_ONCE(!rdp->rcu_iw_pending)) { + rdp->rcu_iw_gpnum = rnp->gpnum; + rdp->rcu_iw_pending = false; + } + raw_spin_unlock_rcu_node(rnp); +} + /* * Return true if the specified CPU has passed through a quiescent * state by virtue of being in or having passed through an dynticks @@ -1235,7 +1270,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) unsigned long jtsq; bool *rnhqp; bool *ruqp; - struct rcu_node *rnp; + struct rcu_node *rnp = rdp->mynode; /* * If the CPU passed through or entered a dynticks idle phase with @@ -1248,6 +1283,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) if (rcu_dynticks_in_eqs_since(rdp->dynticks, rdp->dynticks_snap)) { trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti")); rdp->dynticks_fqs++; + rcu_gpnum_ovf(rnp, rdp); return 1; } @@ -1258,12 +1294,12 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) * might not be the case for nohz_full CPUs looping in the kernel. */ jtsq = jiffies_till_sched_qs; - rnp = rdp->mynode; ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu); if (time_after(jiffies, rdp->rsp->gp_start + jtsq) && READ_ONCE(rdp->rcu_qs_ctr_snap) != per_cpu(rcu_dynticks.rcu_qs_ctr, rdp->cpu) && READ_ONCE(rdp->gpnum) == rnp->gpnum && !rdp->gpwrap) { trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("rqc")); + rcu_gpnum_ovf(rnp, rdp); return 1; } else if (time_after(jiffies, rdp->rsp->gp_start + jtsq)) { /* Load rcu_qs_ctr before store to rcu_urgent_qs. */ @@ -1274,6 +1310,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) if (!(rdp->grpmask & rcu_rnp_online_cpus(rnp))) { trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("ofl")); rdp->offline_fqs++; + rcu_gpnum_ovf(rnp, rdp); return 1; } @@ -1305,11 +1342,22 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) } /* - * If more than halfway to RCU CPU stall-warning time, do - * a resched_cpu() to try to loosen things up a bit. + * If more than halfway to RCU CPU stall-warning time, do a + * resched_cpu() to try to loosen things up a bit. Also check to + * see if the CPU is getting hammered with interrupts, but only + * once per grace period, just to keep the IPIs down to a dull roar. */ - if (jiffies - rdp->rsp->gp_start > rcu_jiffies_till_stall_check() / 2) + if (jiffies - rdp->rsp->gp_start > rcu_jiffies_till_stall_check() / 2) { resched_cpu(rdp->cpu); + if (IS_ENABLED(CONFIG_IRQ_WORK) && + !rdp->rcu_iw_pending && rdp->rcu_iw_gpnum != rnp->gpnum && + (rnp->ffmask & rdp->grpmask)) { + init_irq_work(&rdp->rcu_iw, rcu_iw_handler); + rdp->rcu_iw_pending = true; + rdp->rcu_iw_gpnum = rnp->gpnum; + irq_work_queue_on(&rdp->rcu_iw, rdp->cpu); + } + } return 0; } @@ -1498,6 +1546,7 @@ static void print_cpu_stall(struct rcu_state *rsp) { int cpu; unsigned long flags; + struct rcu_data *rdp = this_cpu_ptr(rsp->rda); struct rcu_node *rnp = rcu_get_root(rsp); long totqlen = 0; @@ -1513,7 +1562,9 @@ static void print_cpu_stall(struct rcu_state *rsp) */ pr_err("INFO: %s self-detected stall on CPU", rsp->name); print_cpu_stall_info_begin(); + raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags); print_cpu_stall_info(rsp, smp_processor_id()); + raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags); print_cpu_stall_info_end(); for_each_possible_cpu(cpu) totqlen += rcu_segcblist_n_cbs(&per_cpu_ptr(rsp->rda, @@ -1907,6 +1958,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp, rdp->core_needs_qs = need_gp; zero_cpu_stall_ticks(rdp); WRITE_ONCE(rdp->gpwrap, false); + rcu_gpnum_ovf(rnp, rdp); } return ret; } @@ -3685,6 +3737,8 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp) rdp->cpu_no_qs.b.norm = true; rdp->rcu_qs_ctr_snap = per_cpu(rcu_dynticks.rcu_qs_ctr, cpu); rdp->core_needs_qs = false; + rdp->rcu_iw_pending = false; + rdp->rcu_iw_gpnum = rnp->gpnum - 1; trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl")); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } @@ -3722,10 +3776,24 @@ static void rcutree_affinity_setting(unsigned int cpu, int outgoing) */ int rcutree_online_cpu(unsigned int cpu) { - sync_sched_exp_online_cleanup(cpu); - rcutree_affinity_setting(cpu, -1); + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + struct rcu_state *rsp; + + for_each_rcu_flavor(rsp) { + rdp = per_cpu_ptr(rsp->rda, cpu); + rnp = rdp->mynode; + raw_spin_lock_irqsave_rcu_node(rnp, flags); + rnp->ffmask |= rdp->grpmask; + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + } if (IS_ENABLED(CONFIG_TREE_SRCU)) srcu_online_cpu(cpu); + if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) + return 0; /* Too early in boot for scheduler work. */ + sync_sched_exp_online_cleanup(cpu); + rcutree_affinity_setting(cpu, -1); return 0; } @@ -3735,6 +3803,19 @@ int rcutree_online_cpu(unsigned int cpu) */ int rcutree_offline_cpu(unsigned int cpu) { + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + struct rcu_state *rsp; + + for_each_rcu_flavor(rsp) { + rdp = per_cpu_ptr(rsp->rda, cpu); + rnp = rdp->mynode; + raw_spin_lock_irqsave_rcu_node(rnp, flags); + rnp->ffmask &= ~rdp->grpmask; + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + } + rcutree_affinity_setting(cpu, cpu); if (IS_ENABLED(CONFIG_TREE_SRCU)) srcu_offline_cpu(cpu); @@ -4183,8 +4264,7 @@ void __init rcu_init(void) for_each_online_cpu(cpu) { rcutree_prepare_cpu(cpu); rcu_cpu_starting(cpu); - if (IS_ENABLED(CONFIG_TREE_SRCU)) - srcu_online_cpu(cpu); + rcutree_online_cpu(cpu); } } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 8e1f285f0a70..46a5d1991450 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -103,6 +103,7 @@ struct rcu_node { /* Online CPUs for next expedited GP. */ /* Any CPU that has ever been online will */ /* have its bit set. */ + unsigned long ffmask; /* Fully functional CPUs. */ unsigned long grpmask; /* Mask to apply to parent qsmask. */ /* Only one bit will be set in this mask. */ int grplo; /* lowest-numbered CPU or group here. */ @@ -285,6 +286,10 @@ struct rcu_data { /* 8) RCU CPU stall data. */ unsigned int softirq_snap; /* Snapshot of softirq activity. */ + /* ->rcu_iw* fields protected by leaf rcu_node ->lock. */ + struct irq_work rcu_iw; /* Check for non-irq activity. */ + bool rcu_iw_pending; /* Is ->rcu_iw pending? */ + unsigned long rcu_iw_gpnum; /* ->gpnum associated with ->rcu_iw. */ int cpu; struct rcu_state *rsp; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e012b9be777e..14977d0470d1 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1671,6 +1671,7 @@ static void print_cpu_stall_info_begin(void) */ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu) { + unsigned long delta; char fast_no_hz[72]; struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu); struct rcu_dynticks *rdtp = rdp->dynticks; @@ -1685,11 +1686,15 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu) ticks_value = rsp->gpnum - rdp->gpnum; } print_cpu_stall_fast_no_hz(fast_no_hz, cpu); - pr_err("\t%d-%c%c%c: (%lu %s) idle=%03x/%llx/%d softirq=%u/%u fqs=%ld %s\n", + delta = rdp->mynode->gpnum - rdp->rcu_iw_gpnum; + pr_err("\t%d-%c%c%c%c: (%lu %s) idle=%03x/%llx/%d softirq=%u/%u fqs=%ld %s\n", cpu, "O."[!!cpu_online(cpu)], "o."[!!(rdp->grpmask & rdp->mynode->qsmaskinit)], "N."[!!(rdp->grpmask & rdp->mynode->qsmaskinitnext)], + !IS_ENABLED(CONFIG_IRQ_WORK) ? '?' : + rdp->rcu_iw_pending ? (int)min(delta, 9UL) + '0' : + "!."[!delta], ticks_value, ticks_title, rcu_dynticks_snap(rdtp) & 0xfff, rdtp->dynticks_nesting, rdtp->dynticks_nmi_nesting, From 83b6ca1fede773eebcdfb44f5a94eb410d48b886 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 31 Aug 2017 16:47:08 -0700 Subject: [PATCH 13/24] rcu: Turn off tracing before dumping trace Currently, RCU allows tracing to continue when it automatically does ftrace_dump() after detecting an error condition, which can result in excessively large traces and lost trace events. This commit therefore does a tracing_off() before any of these ftrace_dump() calls. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index e4b43fef89f5..b8729eb09a5d 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -220,8 +220,10 @@ do { \ static atomic_t ___rfd_beenhere = ATOMIC_INIT(0); \ \ if (!atomic_read(&___rfd_beenhere) && \ - !atomic_xchg(&___rfd_beenhere, 1)) \ + !atomic_xchg(&___rfd_beenhere, 1)) { \ + tracing_off(); \ ftrace_dump(oops_dump_mode); \ + } \ } while (0) void rcu_early_boot_tests(void); From f22ce09157239aab08eae99c678ef664f71a9097 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 1 Sep 2017 14:40:54 -0700 Subject: [PATCH 14/24] rcu: Suppress RCU CPU stall warnings while dumping trace Currently, RCU emits Suppress RCU CPU stall warnings during its automatically initiated ftrace_dump() calls after detecting an error condition, which can result in excessively excessive console output and lost trace events. This commit therefore suppresses RCU CPU stall warnings across any of these ftrace_dump() calls. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu.h | 17 +++++++++++++++++ kernel/rcu/update.c | 1 + 2 files changed, 18 insertions(+) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index b8729eb09a5d..59c471de342a 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -203,6 +203,21 @@ static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head) extern int rcu_cpu_stall_suppress; int rcu_jiffies_till_stall_check(void); +#define rcu_ftrace_dump_stall_suppress() \ +do { \ + if (!rcu_cpu_stall_suppress) \ + rcu_cpu_stall_suppress = 3; \ +} while (0) + +#define rcu_ftrace_dump_stall_unsuppress() \ +do { \ + if (rcu_cpu_stall_suppress == 3) \ + rcu_cpu_stall_suppress = 0; \ +} while (0) + +#else /* #endif #ifdef CONFIG_RCU_STALL_COMMON */ +#define rcu_ftrace_dump_stall_suppress() +#define rcu_ftrace_dump_stall_unsuppress() #endif /* #ifdef CONFIG_RCU_STALL_COMMON */ /* @@ -222,7 +237,9 @@ do { \ if (!atomic_read(&___rfd_beenhere) && \ !atomic_xchg(&___rfd_beenhere, 1)) { \ tracing_off(); \ + rcu_ftrace_dump_stall_suppress(); \ ftrace_dump(oops_dump_mode); \ + rcu_ftrace_dump_stall_unsuppress(); \ } \ } while (0) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 5033b66d2753..3dc8efb16dc7 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -494,6 +494,7 @@ EXPORT_SYMBOL_GPL(do_trace_rcu_torture_read); #endif int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */ +EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress); static int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT; module_param(rcu_cpu_stall_suppress, int, 0644); From 2b1516e55f8416acfb48d5f43d41222d180fb5a3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 18 Aug 2017 16:11:37 -0700 Subject: [PATCH 15/24] rcutorture: Add interrupt-disable capability to stall-warning tests When rcutorture sees the rcutorture.stall_cpu kernel boot parameter, it loops with preemption disabled, which does in fact normally generate an RCU CPU stall warning message. However, there are test scenarios that need the stalling CPU to have interrupts disabled. This commit therefore adds an rcutorture.stall_cpu_irqsoff kernel boot parameter that causes the stalling CPU to disable interrupts. Signed-off-by: Paul E. McKenney --- .../admin-guide/kernel-parameters.txt | 3 +++ kernel/rcu/rcutorture.c | 18 +++++++++++++----- 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 05496622b4ef..bc94c47085a5 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3539,6 +3539,9 @@ rcutorture.stall_cpu_holdoff= [KNL] Time to wait (s) after boot before inducing stall. + rcutorture.stall_cpu_irqsoff= [KNL] + Disable interrupts while stalling if set. + rcutorture.stat_interval= [KNL] Time (s) between statistics printk()s. diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 45f2ffbc1e78..0273bc0a8586 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -89,6 +89,7 @@ torture_param(int, shutdown_secs, 0, "Shutdown time (s), <= zero to disable."); torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable."); torture_param(int, stall_cpu_holdoff, 10, "Time to wait before starting stall (s)."); +torture_param(int, stall_cpu_irqsoff, 0, "Disable interrupts while stalling."); torture_param(int, stat_interval, 60, "Number of seconds between stats printk()s"); torture_param(int, stutter, 5, "Number of seconds to run/halt test"); @@ -1357,7 +1358,7 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag) "fqs_duration=%d fqs_holdoff=%d fqs_stutter=%d " "test_boost=%d/%d test_boost_interval=%d " "test_boost_duration=%d shutdown_secs=%d " - "stall_cpu=%d stall_cpu_holdoff=%d " + "stall_cpu=%d stall_cpu_holdoff=%d stall_cpu_irqsoff=%d " "n_barrier_cbs=%d " "onoff_interval=%d onoff_holdoff=%d\n", torture_type, tag, nrealreaders, nfakewriters, @@ -1365,7 +1366,7 @@ rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, const char *tag) stutter, irqreader, fqs_duration, fqs_holdoff, fqs_stutter, test_boost, cur_ops->can_boost, test_boost_interval, test_boost_duration, shutdown_secs, - stall_cpu, stall_cpu_holdoff, + stall_cpu, stall_cpu_holdoff, stall_cpu_irqsoff, n_barrier_cbs, onoff_interval, onoff_holdoff); } @@ -1430,12 +1431,19 @@ static int rcu_torture_stall(void *args) if (!kthread_should_stop()) { stop_at = get_seconds() + stall_cpu; /* RCU CPU stall is expected behavior in following code. */ - pr_alert("rcu_torture_stall start.\n"); rcu_read_lock(); - preempt_disable(); + if (stall_cpu_irqsoff) + local_irq_disable(); + else + preempt_disable(); + pr_alert("rcu_torture_stall start on CPU %d.\n", + smp_processor_id()); while (ULONG_CMP_LT(get_seconds(), stop_at)) continue; /* Induce RCU CPU stall warning. */ - preempt_enable(); + if (stall_cpu_irqsoff) + local_irq_enable(); + else + preempt_enable(); rcu_read_unlock(); pr_alert("rcu_torture_stall end.\n"); } From 0032f4e889764d22ccccb6a15742071d6f0d1f5a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 30 Aug 2017 10:40:17 -0700 Subject: [PATCH 16/24] rcutorture: Dump writer stack if stalled Right now, rcutorture warns if an rcu_torture_writer() kthread stalls, but this warning is not always all that helpful. This commit therefore makes the first such warning include a stack dump. This in turn requires that sched_show_task() be exported to GPL modules, so this commit makes that change as well. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 6 ++++++ kernel/sched/core.c | 1 + 2 files changed, 7 insertions(+) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 0273bc0a8586..362eb2f78b3c 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -51,6 +51,7 @@ #include #include #include +#include #include "rcu.h" @@ -1240,6 +1241,7 @@ rcu_torture_stats_print(void) long pipesummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 }; long batchsummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 }; static unsigned long rtcv_snap = ULONG_MAX; + static bool splatted; struct task_struct *wtp; for_each_possible_cpu(cpu) { @@ -1325,6 +1327,10 @@ rcu_torture_stats_print(void) gpnum, completed, flags, wtp == NULL ? ~0UL : wtp->state, wtp == NULL ? -1 : (int)task_cpu(wtp)); + if (!splatted && wtp) { + sched_show_task(wtp); + splatted = true; + } show_rcu_gp_kthreads(); rcu_ftrace_dump(DUMP_ALL); } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d17c5da523a0..7ae0151dcc1d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5165,6 +5165,7 @@ void sched_show_task(struct task_struct *p) show_stack(p, NULL); put_task_stack(p); } +EXPORT_SYMBOL_GPL(sched_show_task); static inline bool state_filter_match(unsigned long state_filter, struct task_struct *p) From b038c58bd258522d9cde6619d730ebdb1f7be198 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 30 Aug 2017 15:33:49 -0700 Subject: [PATCH 17/24] torture: Provide TMPDIR environment variable to specify tmpdir Both rcutorture and locktorture currently place temporary files in /tmp, in keeping with decades-long tradition. However, sometimes it is useful to specify an alternative temporary directory, for example, for space or performance reasons. This commit therefore causes the torture-test scripting to use the path specified in the TMPDIR environment variable, or to fall back to traditional /tmp if this variable is not set. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/config_override.sh | 2 +- tools/testing/selftests/rcutorture/bin/configcheck.sh | 2 +- tools/testing/selftests/rcutorture/bin/configinit.sh | 2 +- tools/testing/selftests/rcutorture/bin/kvm-build.sh | 2 +- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +- tools/testing/selftests/rcutorture/bin/kvm.sh | 4 ++-- tools/testing/selftests/rcutorture/bin/parse-build.sh | 2 +- tools/testing/selftests/rcutorture/bin/parse-torture.sh | 2 +- 8 files changed, 9 insertions(+), 9 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/config_override.sh b/tools/testing/selftests/rcutorture/bin/config_override.sh index 49fa51726ce3..ef7fcbac3d42 100755 --- a/tools/testing/selftests/rcutorture/bin/config_override.sh +++ b/tools/testing/selftests/rcutorture/bin/config_override.sh @@ -42,7 +42,7 @@ else exit 1 fi -T=/tmp/config_override.sh.$$ +T=${TMPDIR-/tmp}/config_override.sh.$$ trap 'rm -rf $T' 0 mkdir $T diff --git a/tools/testing/selftests/rcutorture/bin/configcheck.sh b/tools/testing/selftests/rcutorture/bin/configcheck.sh index 70fca318a82b..197deece7c7c 100755 --- a/tools/testing/selftests/rcutorture/bin/configcheck.sh +++ b/tools/testing/selftests/rcutorture/bin/configcheck.sh @@ -19,7 +19,7 @@ # # Authors: Paul E. McKenney -T=/tmp/abat-chk-config.sh.$$ +T=${TMPDIR-/tmp}/abat-chk-config.sh.$$ trap 'rm -rf $T' 0 mkdir $T diff --git a/tools/testing/selftests/rcutorture/bin/configinit.sh b/tools/testing/selftests/rcutorture/bin/configinit.sh index 3f81a1095206..51f66a7ce876 100755 --- a/tools/testing/selftests/rcutorture/bin/configinit.sh +++ b/tools/testing/selftests/rcutorture/bin/configinit.sh @@ -32,7 +32,7 @@ # # Authors: Paul E. McKenney -T=/tmp/configinit.sh.$$ +T=${TMPDIR-/tmp}/configinit.sh.$$ trap 'rm -rf $T' 0 mkdir $T diff --git a/tools/testing/selftests/rcutorture/bin/kvm-build.sh b/tools/testing/selftests/rcutorture/bin/kvm-build.sh index 46752c164676..fb66d0173638 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-build.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-build.sh @@ -35,7 +35,7 @@ then exit 1 fi -T=/tmp/test-linux.sh.$$ +T=${TMPDIR-/tmp}/test-linux.sh.$$ trap 'rm -rf $T' 0 mkdir $T diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 0af36a721b9c..ab14b97c942c 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -38,7 +38,7 @@ # # Authors: Paul E. McKenney -T=/tmp/kvm-test-1-run.sh.$$ +T=${TMPDIR-/tmp}/kvm-test-1-run.sh.$$ trap 'rm -rf $T' 0 mkdir $T diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index b55895fb10ed..ccd49e958fd2 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -30,7 +30,7 @@ scriptname=$0 args="$*" -T=/tmp/kvm.sh.$$ +T=${TMPDIR-/tmp}/kvm.sh.$$ trap 'rm -rf $T' 0 mkdir $T @@ -222,7 +222,7 @@ do exit 1 fi done -sort -k2nr $T/cfgcpu > $T/cfgcpu.sort +sort -k2nr $T/cfgcpu -T="$T" > $T/cfgcpu.sort # Use a greedy bin-packing algorithm, sorting the list accordingly. awk < $T/cfgcpu.sort > $T/cfgcpu.pack -v ncpus=$cpus ' diff --git a/tools/testing/selftests/rcutorture/bin/parse-build.sh b/tools/testing/selftests/rcutorture/bin/parse-build.sh index a6b57622c2e5..24fe5f822b28 100755 --- a/tools/testing/selftests/rcutorture/bin/parse-build.sh +++ b/tools/testing/selftests/rcutorture/bin/parse-build.sh @@ -28,7 +28,7 @@ F=$1 title=$2 -T=/tmp/parse-build.sh.$$ +T=${TMPDIR-/tmp}/parse-build.sh.$$ trap 'rm -rf $T' 0 mkdir $T diff --git a/tools/testing/selftests/rcutorture/bin/parse-torture.sh b/tools/testing/selftests/rcutorture/bin/parse-torture.sh index e3c5f0705696..f12c38909b00 100755 --- a/tools/testing/selftests/rcutorture/bin/parse-torture.sh +++ b/tools/testing/selftests/rcutorture/bin/parse-torture.sh @@ -27,7 +27,7 @@ # # Authors: Paul E. McKenney -T=/tmp/parse-torture.sh.$$ +T=${TMPDIR-/tmp}/parse-torture.sh.$$ file="$1" title="$2" From b88697810d7c1d102a529990f9071b0f14cfe6df Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 18 Oct 2017 08:33:44 -0700 Subject: [PATCH 18/24] rcu: Do not include rtmutex_common.h unconditionally This commit adjusts include files and provides definitions in preparation for suppressing lockdep false-positive ->boost_mtx complaints. Without this preparation, architectures not supporting rt_mutex will get build failures. Reported-by: kbuild test robot Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index fed95fa941e6..969eae45f05d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -54,6 +54,7 @@ DEFINE_PER_CPU(char, rcu_cpu_has_work); * This probably needs to be excluded from -rt builds. */ #define rt_mutex_owner(a) ({ WARN_ON_ONCE(1); NULL; }) +#define rt_mutex_futex_unlock(x) WARN_ON_ONCE(1) #endif /* #else #ifdef CONFIG_RCU_BOOST */ @@ -911,8 +912,6 @@ void exit_rcu(void) #ifdef CONFIG_RCU_BOOST -#include "../locking/rtmutex_common.h" - static void rcu_wake_cond(struct task_struct *t, int status) { /* From 02a7c234e54052101164368ff981bd72f7acdd65 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 19 Sep 2017 15:36:42 -0700 Subject: [PATCH 19/24] rcu: Suppress lockdep false-positive ->boost_mtx complaints RCU priority boosting uses rt_mutex_init_proxy_locked() to initialize an rt_mutex structure in locked state held by some other task. When that other task releases it, lockdep complains (quite accurately, but a bit uselessly) that the other task never acquired it. This complaint can suppress other, more helpful, lockdep complaints, and in any case it is a false positive. This commit therefore switches from rt_mutex_unlock() to rt_mutex_futex_unlock(), thereby avoiding the lockdep annotations. Of course, if lockdep ever learns about rt_mutex_init_proxy_locked(), addtional adjustments will be required. Suggested-by: Peter Zijlstra Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 969eae45f05d..1eaab96d1a3c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -531,7 +531,7 @@ void rcu_read_unlock_special(struct task_struct *t) /* Unboost if we were boosted. */ if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex) - rt_mutex_unlock(&rnp->boost_mtx); + rt_mutex_futex_unlock(&rnp->boost_mtx); /* * If this was the last task on the expedited lists, From c0da313e090d6e7130d0c3005245176296c24e4a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 22 Sep 2017 09:58:47 -0700 Subject: [PATCH 20/24] rcu: Add extended-quiescent-state testing advice If you add or remove calls to rcu_idle_enter(), rcu_user_enter(), rcu_irq_exit(), rcu_irq_exit_irqson(), rcu_idle_exit(), rcu_user_exit(), rcu_irq_enter(), rcu_irq_enter_irqson(), rcu_nmi_enter(), or rcu_nmi_exit(), you should run a full set of tests on a kernel built with CONFIG_RCU_EQS_DEBUG=y. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b0ad62b0e7b8..d234356d7afc 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -837,6 +837,9 @@ static void rcu_eqs_enter(bool user) * We crowbar the ->dynticks_nesting field to zero to allow for * the possibility of usermode upcalls having messed up our count * of interrupt nesting level during the prior busy period. + * + * If you add or remove a call to rcu_idle_enter(), be sure to test with + * CONFIG_RCU_EQS_DEBUG=y. */ void rcu_idle_enter(void) { @@ -852,6 +855,9 @@ void rcu_idle_enter(void) * is permitted between this call and rcu_user_exit(). This way the * CPU doesn't need to maintain the tick for RCU maintenance purposes * when the CPU runs in userspace. + * + * If you add or remove a call to rcu_user_enter(), be sure to test with + * CONFIG_RCU_EQS_DEBUG=y. */ void rcu_user_enter(void) { @@ -875,6 +881,9 @@ void rcu_user_enter(void) * Use things like work queues to work around this limitation. * * You have been warned. + * + * If you add or remove a call to rcu_irq_exit(), be sure to test with + * CONFIG_RCU_EQS_DEBUG=y. */ void rcu_irq_exit(void) { @@ -899,6 +908,9 @@ void rcu_irq_exit(void) /* * Wrapper for rcu_irq_exit() where interrupts are enabled. + * + * If you add or remove a call to rcu_irq_exit_irqson(), be sure to test + * with CONFIG_RCU_EQS_DEBUG=y. */ void rcu_irq_exit_irqson(void) { @@ -971,6 +983,9 @@ static void rcu_eqs_exit(bool user) * allow for the possibility of usermode upcalls messing up our count * of interrupt nesting level during the busy period that is just * now starting. + * + * If you add or remove a call to rcu_idle_exit(), be sure to test with + * CONFIG_RCU_EQS_DEBUG=y. */ void rcu_idle_exit(void) { @@ -987,6 +1002,9 @@ void rcu_idle_exit(void) * * Exit RCU idle mode while entering the kernel because it can * run a RCU read side critical section anytime. + * + * If you add or remove a call to rcu_user_exit(), be sure to test with + * CONFIG_RCU_EQS_DEBUG=y. */ void rcu_user_exit(void) { @@ -1012,6 +1030,9 @@ void rcu_user_exit(void) * Use things like work queues to work around this limitation. * * You have been warned. + * + * If you add or remove a call to rcu_irq_enter(), be sure to test with + * CONFIG_RCU_EQS_DEBUG=y. */ void rcu_irq_enter(void) { @@ -1037,6 +1058,9 @@ void rcu_irq_enter(void) /* * Wrapper for rcu_irq_enter() where interrupts are enabled. + * + * If you add or remove a call to rcu_irq_enter_irqson(), be sure to test + * with CONFIG_RCU_EQS_DEBUG=y. */ void rcu_irq_enter_irqson(void) { @@ -1055,6 +1079,9 @@ void rcu_irq_enter_irqson(void) * that the CPU is active. This implementation permits nested NMIs, as * long as the nesting level does not overflow an int. (You will probably * run out of stack space first.) + * + * If you add or remove a call to rcu_nmi_enter(), be sure to test + * with CONFIG_RCU_EQS_DEBUG=y. */ void rcu_nmi_enter(void) { @@ -1087,6 +1114,9 @@ void rcu_nmi_enter(void) * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting * to let the RCU grace-period handling know that the CPU is back to * being RCU-idle. + * + * If you add or remove a call to rcu_nmi_exit(), be sure to test + * with CONFIG_RCU_EQS_DEBUG=y. */ void rcu_nmi_exit(void) { From 56628a7fc84a6e3c4c84886d70ec26da3d7f2ce4 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Fri, 22 Sep 2017 17:28:06 +0200 Subject: [PATCH 21/24] rcu/segcblist: Include rcupdate.h The RT build on ARM complains about non-existing ULONG_CMP_LT. This commit therefore includes rcupdate.h into rcu_segcblist.c. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 7649fcd2c4c7..88cba7c2956c 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "rcu_segcblist.h" From d92f842bb30f52beedad63a4a850b39ca0dbc45f Mon Sep 17 00:00:00 2001 From: Scott Tsai Date: Wed, 20 Sep 2017 02:16:00 +0800 Subject: [PATCH 22/24] memory-barriers.txt: Fix typo in pairing example In the "general barrier pairing with implicit control depdendency" example, the last write by CPU 1 was meant to change variable x and not y. The example would be pretty uninteresting if no CPU ever changes x and the variable was initialized to zero. Signed-off-by: Scott Tsai Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 7deee1441640..f37375544d71 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -947,7 +947,7 @@ Or even: =============== =============================== r1 = READ_ONCE(y); - WRITE_ONCE(y, 1); if (r2 = READ_ONCE(x)) { + WRITE_ONCE(x, 1); if (r2 = READ_ONCE(x)) { WRITE_ONCE(y, 1); } From 5692fcc671ac8a10bfd0e75d52f0259fe767b56c Mon Sep 17 00:00:00 2001 From: "Guilherme G. Piccoli" Date: Thu, 21 Sep 2017 16:29:01 -0300 Subject: [PATCH 23/24] doc: Rewrite confusing statement about memory barriers The "Write (or store) memory barriers" bullet of the "Variety of memory barriers" section, calls out a sequential order of stores, which is confusing since sequential ordering is not guaranteed. This commit therefore rewords to avoid mentioning a sequence of stores to clarify the intent. Cc: Paul E. McKenney Signed-off-by: Guilherme G. Piccoli Signed-off-by: Paul E. McKenney --- Documentation/memory-barriers.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index f37375544d71..519940ec767f 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -383,8 +383,8 @@ Memory barriers come in four basic varieties: to have any effect on loads. A CPU can be viewed as committing a sequence of store operations to the - memory system as time progresses. All stores before a write barrier will - occur in the sequence _before_ all the stores after the write barrier. + memory system as time progresses. All stores _before_ a write barrier + will occur _before_ all the stores after the write barrier. [!] Note that write barriers should normally be paired with read or data dependency barriers; see the "SMP barrier pairing" subsection. From e4d0b679a8467b6d0c169642f0b5f57d8d6eacc2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 17 Sep 2017 14:30:06 -0700 Subject: [PATCH 24/24] srcu: Add parameters to SRCU docbook comments Signed-off-by: Paul E. McKenney --- include/linux/srcu.h | 1 + kernel/rcu/srcutree.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/srcu.h b/include/linux/srcu.h index 39af9bc0f653..62be8966e837 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -78,6 +78,7 @@ void synchronize_srcu(struct srcu_struct *sp); /** * srcu_read_lock_held - might we be in SRCU read-side critical section? + * @sp: The srcu_struct structure to check * * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an SRCU * read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 729a8706751d..6d5880089ff6 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -854,7 +854,7 @@ void __call_srcu(struct srcu_struct *sp, struct rcu_head *rhp, /** * call_srcu() - Queue a callback for invocation after an SRCU grace period * @sp: srcu_struct in queue the callback - * @head: structure to be used for queueing the SRCU callback. + * @rhp: structure to be used for queueing the SRCU callback. * @func: function to be invoked after the SRCU grace period * * The callback function will be invoked some time after a full SRCU