July 27, 2012 (v3.5+)
This article was contributed by Paul E. McKenney
And, as always, the preordained answers to the quick quizzes appear at the end of the article.
RCU handles boot-time CPU bringup in the same way that it handles runtime CPU-hotplug operations. This means that RCU's CPU-hotplug handling is intertwined with the way that it handles boot-up.
RCU handles boot-up in the following phases:

1. Before rcu_init() is called: all RCU APIs other than call_rcu(), rcu_barrier(), and friends may be used. Note that synchronize_rcu() and friends are all no-ops.

2. Between rcu_init() and rcu_scheduler_starting(): all RCU APIs may be invoked. However, callbacks will be queued but not invoked. Note that synchronize_rcu() and friends are still all no-ops.

3. Between rcu_scheduler_starting() and rcu_scheduler_really_started(): synchronize_rcu() and friends will hang. However, the other RCU APIs may still be used, but callbacks from call_rcu(), rcu_barrier(), and friends are still queued but not invoked.

4. After rcu_scheduler_really_started() is invoked: RCU enters full run-time functionality. At this point, in CONFIG_PREEMPT=y kernels, synchronize_rcu() stops being a no-op, and synchronize_rcu_bh() and synchronize_sched() stop being no-ops as well. On CONFIG_PREEMPT=n kernels, synchronize_rcu() also stops being a no-op at this point.
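To make these phase boundaries concrete, the following is a minimal sketch, not the kernel's actual code, of how a grace-period primitive can degenerate to a no-op during early boot. The rcu_scheduler_active flag is the one set by rcu_scheduler_starting(), while wait_for_grace_period() is a hypothetical stand-in for the runtime grace-period machinery.

extern int rcu_scheduler_active;         /* Set by rcu_scheduler_starting(). */
extern void wait_for_grace_period(void); /* Hypothetical stand-in. */

/* Sketch only: early boot has but one task, so there can be no RCU
 * readers to wait for, and blocking is not yet possible anyway. */
void sketch_synchronize_rcu(void)
{
        if (!rcu_scheduler_active)
                return;                  /* Early boot: no-op is safe. */
        wait_for_grace_period();         /* Normal runtime path. */
}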
Quick Quiz 1:
Why is it a bad idea to block waiting for a callback to be invoked
during early boot?
Answer
Quick Quiz 2:
But what if my special CPU needs more than one jiffy to
come online or go offline?
Answer
Quick Quiz 3:
Under what circumstances can synchronize_sched()
be a no-op at runtime (in other words, after boot has fully completed)?
Answer
Some of RCU's boot-time code is shared with its CPU-hotplug online code
path, primarily in the rcu_init_percpu_data()
function.
However, because RCU does not do anything special for system shutdown,
RCU's CPU-hotplug offline code stands alone.
The general operation of this code is described in the following section.
RCU tracks which CPUs are online and offline using the
->qsmaskinit
bitmasks in the rcu_node
tree, which are analogous to the ->qsmask
fields
that handle grace-period detection.
At the beginning of each grace period, the initial value for each
->qsmask
bitmask is loaded from the corresponding
->qsmaskinit
bitmask.
Each CPU-offline event clears the bit corresponding to the newly offlined
CPU in that CPU's rcu_node
structure's ->qsmaskinit
bitmask, and, if the result is zero,
propagates up the rcu_node
tree.
Similarly, each CPU-online event sets the bit corresponding to the newly
onlined CPU in that CPU's rcu_node
structure's ->qsmaskinit
bitmask, and, if the result was
previously zero,
propagates up the rcu_node
tree.
This process is quite similar to the way that quiescent state events
propagate up the rcu_node
tree.
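For concreteness, below is a minimal self-contained sketch of the offline-event propagation just described. The structure is a stand-in for rcu_node with only the fields this sketch needs; the real code, which additionally handles locking and blocked tasks, appears later in rcu_cleanup_dead_cpu().

/* Minimal stand-in for the rcu_node fields used in this sketch. */
struct sketch_rcu_node {
        unsigned long qsmaskinit;       /* Bits for online children. */
        unsigned long grpmask;          /* This node's bit in its parent. */
        struct sketch_rcu_node *parent; /* NULL at the root. */
};

/* Clear the outgoing CPU's bit in its leaf's ->qsmaskinit field,
 * propagating upward as long as the result is zero. */
static void sketch_propagate_offline(struct sketch_rcu_node *rnp,
                                     unsigned long mask)
{
        do {
                rnp->qsmaskinit &= ~mask;
                if (rnp->qsmaskinit != 0)
                        break;          /* Other children still online. */
                mask = rnp->grpmask;
                rnp = rnp->parent;
        } while (rnp != NULL);
}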
However, the CPU-hotplug process is complicated by the fact that hotplug operations are not atomic. The CPU-hotplug process invokes a series of notifiers, each of which causes the corresponding Linux-kernel subsystem to consider the CPU to be offline. Therefore, there is a significant period of time during which the CPU is neither fully online nor fully offline. RCU nevertheless needs a clear answer to the question: Is the CPU online enough that RCU grace periods must wait on it? Worse yet, newly offline CPUs take one final pass through the scheduler on their way to the idle loop. Because the scheduler uses RCU, RCU must continue paying attention to CPUs after they have marked themselves offline. RCU currently works around this with the horrible (but apparently reliable) hack of considering a CPU to be online for one jiffy after it has marked itself as offline.
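A rough sketch of the check that this workaround implies appears below. The per-CPU structure, its field, and the helper function are all hypothetical, invented here purely for illustration.

#include <linux/cpumask.h>
#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/types.h>

/* Sketch only: hypothetical per-CPU record of when each CPU last
 * marked itself offline. */
struct sketch_cpu_state {
        unsigned long went_offline_jiffies;
};
static DEFINE_PER_CPU(struct sketch_cpu_state, sketch_cpu_state);

/* Must RCU grace periods still wait on this CPU? */
static bool sketch_cpu_matters_to_rcu(int cpu)
{
        struct sketch_cpu_state *csp = &per_cpu(sketch_cpu_state, cpu);

        if (cpu_online(cpu))
                return true;
        /* A newly offlined CPU takes one final pass through the
         * scheduler, which uses RCU, so watch it for one more jiffy. */
        return time_before(jiffies, csp->went_offline_jiffies + 1);
}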
Quick Quiz 4:
This is ridiculous! How did this kludge get into the Linux kernel?
Answer
RCU's CPU-hotplug notifier function is called rcu_cpu_notify(),
and it is invoked repeatedly during each CPU-hotplug operation.
When a CPU is taken offline, this function is invoked with the
sequence of notifier actions described below.
The rcu_cpu_notify()
function is first invoked
with CPU_DOWN_PREPARE
, thereby informing RCU that the
specified CPU is to be offlined.
RCU's CPU_DOWN_PREPARE
code adjusts the affinity of
the corresponding leaf rcu_node
structure's
priority boost kthread if CONFIG_RCU_BOOST=y
,
and does nothing otherwise.
RCU always returns NOTIFY_OK
, but has the option of
returning NOTIFY_BAD
, which will cause the CPU-hotplug
operation to fail.
However, if all of the CPU_DOWN_PREPARE
notifiers
return NOTIFY_OK
, the CPU-hotplug infrastructure will
invoke all of the notifier functions (including rcu_cpu_notify()
)
again, but this time with CPU_DYING
.
The CPU_DYING
notifiers are invoked in stop_machine()
context, which means that the outgoing CPU is executing the notifier
functions with interrupts disabled, and the rest of the CPUs are spinning
with interrupts disabled.
This is clearly a very heavy-weight operation that degrades real-time
response, and you should avoid depending on its semantics.
RCU therefore does only tracing from this notifier call unless
CONFIG_RCU_FAST_NO_HZ=y, in which case it also
does CPU-local cleanup.
Once again, RCU always returns NOTIFY_OK
, but has the option of
returning NOTIFY_BAD
, which will cause the CPU-hotplug
operation to fail.
If any of the CPU_DOWN_PREPARE
or the
CPU_DYING
notifiers return NOTIFY_BAD
,
the CPU-hotplug infrastructure will invoke the notifiers again,
but this time with CPU_DOWN_FAILED
.
RCU's CPU_DOWN_FAILED
code adjusts the affinity of
the corresponding leaf rcu_node
structure's
priority boost kthread if CONFIG_RCU_BOOST=y
,
thus backing out any changes made by its CPU_DOWN_PREPARE
notifier, and does nothing otherwise.
The CPU_DOWN_FAILED notifiers are not permitted to fail.
On the other hand, if all of the CPU_DOWN_PREPARE and
CPU_DYING notifiers return NOTIFY_OK,
the CPU-hotplug infrastructure will invoke the notifiers again,
but this time with CPU_DEAD.
RCU's CPU_DEAD
code moves any remaining RCU callbacks from
the dead CPU to some other CPU (taking care to maintain their order),
clears the dead CPU's ->qsmaskinit
bits from the
rcu_node
hierarchy, reports a quiescent state for
the dead CPU, and, if all CPUs corresponding
to this CPU's rcu_node
structure are now offline,
moves any tasks in this structure's ->blkd_tasks
list to the root rcu_node
structure.
The CPU_DEAD notifiers are not permitted to fail.
Finally, the CPU-hotplug infrastructure invokes the notifiers
one last time with CPU_POST_DEAD
.
RCU takes no action at this time.
The CPU_POST_DEAD notifiers are not permitted to fail.
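For readers unfamiliar with this notifier API, below is a minimal sketch of a CPU-hotplug notifier of this era, using a couple of the notifier actions discussed above. The function and variable names are invented for illustration, and real code would need error handling.

#include <linux/cpu.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/notifier.h>

/* Sketch only: a minimal CPU-hotplug notifier.  Names are hypothetical. */
static int __cpuinit my_cpu_notify(struct notifier_block *self,
                                   unsigned long action, void *hcpu)
{
        long cpu = (long)hcpu;

        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_DOWN_PREPARE:
                /* Returning NOTIFY_BAD here would veto the operation. */
                pr_info("CPU %ld is about to go offline\n", cpu);
                break;
        case CPU_DEAD:
                pr_info("CPU %ld is gone, cleaning up its state\n", cpu);
                break;
        default:
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block my_cpu_nb __cpuinitdata = {
        .notifier_call = my_cpu_notify,
};

/* Registered at boot time via register_cpu_notifier(&my_cpu_nb). */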
The CPU-hotplug CPU-online procedure is similar.
The main differences are that: (1) Unlike CPU_DYING
,
CPU_STARTING
does not use stop_machine()
,
although it still runs with interrupts disabled on the incoming CPU, and
(2) There is no equivalent of CPU_POST_DEAD
.
RCU responds to the CPU_UP_PREPARE
notifier
by initializing the incoming CPU's rcu_data
structure
and by setting the incoming CPU's bits in the rcu_node
structure's ->qsmaskinit
fields.
In CONFIG_RCU_BOOST=y
kernels, it also spawns the
per-rcu_node
-structure priority-boost kthread.
RCU ignores the CPU_STARTING
notifier.
RCU responds to the CPU_ONLINE
notifier in
the same way as for the CPU_DOWN_FAILED
notifier
described earlier.
Similarly, RCU responds to the CPU_UP_CANCELED
notifier in the same way as for the CPU_DEAD
notifier described earlier.
Given this background, we are now ready to look at some code.
First is rcu_init_percpu_data()
, which initializes RCU's
per-CPU data structures for a CPU that is in the process of coming online
and sets its rcu_node
tree bits,
shown below:
  1 static void __cpuinit
  2 rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
  3 {
  4         unsigned long flags;
  5         unsigned long mask;
  6         struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
  7         struct rcu_node *rnp = rcu_get_root(rsp);
  8 
  9         raw_spin_lock_irqsave(&rnp->lock, flags);
 10         rdp->beenonline = 1;
 11         rdp->preemptible = preemptible;
 12         rdp->qlen_last_fqs_check = 0;
 13         rdp->n_force_qs_snap = rsp->n_force_qs;
 14         rdp->blimit = blimit;
 15         rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 16         atomic_set(&rdp->dynticks->dynticks,
 17                    (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
 18         rcu_prepare_for_idle_init(cpu);
 19         raw_spin_unlock(&rnp->lock);
 20         raw_spin_lock(&rsp->onofflock);
 21         rnp = rdp->mynode;
 22         mask = rdp->grpmask;
 23         do {
 24                 raw_spin_lock(&rnp->lock);
 25                 rnp->qsmaskinit |= mask;
 26                 mask = rnp->grpmask;
 27                 if (rnp == rdp->mynode) {
 28                         rdp->gpnum = rnp->completed;
 29                         rdp->completed = rnp->completed;
 30                         rdp->passed_quiesce = 0;
 31                         rdp->qs_pending = 0;
 32                         rdp->passed_quiesce_gpnum = rnp->gpnum - 1;
 33                         trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuonl");
 34                 }
 35                 raw_spin_unlock(&rnp->lock);
 36                 rnp = rnp->parent;
 37         } while (rnp != NULL && !(rnp->qsmaskinit & mask));
 38         raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
 39 }
Lines 9-19 initialize this CPU's rcu_data
fields,
at least those that are independent of this CPU's leaf
rcu_node
structure.
Line 9 acquires the root rcu_node
structure's
->lock
, and line 19 releases it.
Line 10 marks this CPU as having been online, which is used by
debugfs tracing to avoid dumping data structures corresponding to
nonexistent CPUs.
Line 11 records whether or not this flavor of RCU is preemptible,
which in the past has been used to choose among quiescent-state-forcing
strategies.
Lines 12-14 initialize state variables used by heuristics that
will govern handling of overly long queues of callbacks on this CPU.
Lines 15-17 reset this CPU's dyntick-idle state to not-idle,
which is needed because a CPU can go offline with its dyntick-idle
state indeterminate, courtesy of momentary (from an RCU perspective)
exits from idle in the idle loop.
Finally, line 18 initializes this CPU's RCU_FAST_NO_HZ
state.
Quick Quiz 5:
Why is the root rcu_node
structure's
->lock
required on line 9 of
rcu_init_percpu_data()
?
Answer
Lines 20-38 set this CPU's rcu_node
tree bits,
and also initialize those of this CPU's rcu_data
fields that
depend on its leaf rcu_node
structure.
Line 20 acquires this RCU flavor's ->onofflock
and line 38 releases it.
Lines 21 and 22 initialize for the loop spanning
lines 23-37, each pass through which handles one of this CPU's
ancestors in the rcu_node
tree, starting with this
CPU's leaf rcu_node
structure.
Line 24 acquires the current rcu_node
structure's
->lock
, and line 35 releases it.
Line 25 sets this CPU's bit in the current rcu_node
structure's ->qsmaskinit
field, and line 26
picks up the relevant mask for the next level up in the rcu_node
tree.
If line 27 determines that the current rcu_node
structure is this CPU's leaf, lines 28-33 initialize the
grace-period-related fields of this CPU's rcu_data
structure.
This CPU will know only about the last fully completed grace period
(lines 28 and 29),
will be noted as not having passed through a quiescent state for that
grace period (line 30), and as not needing to pass through one (line 31).
Its non-existent quiescent-state passage is attributed to the
prior grace period (line 32), in other words, rendered irrelevant.
Line 33 does event tracing.
Line 36 advances to the parent rcu_node
structure
up one level in the tree, and line 37 repeats the loop until
either (1) we reach the root rcu_node
structure or
(2) we find an rcu_node
structure for which this
CPU's bit is already set, due to a sibling CPU already being online.
The rcu_init_percpu_data()
function is invoked
from rcu_prepare_cpu()
, which is shown below.
  1 static void __cpuinit rcu_prepare_cpu(int cpu)
  2 {
  3         struct rcu_state *rsp;
  4 
  5         for_each_rcu_flavor(rsp)
  6                 rcu_init_percpu_data(cpu, rsp,
  7                                      strcmp(rsp->name, "rcu_preempt") == 0);
  8 }
Line 5 iterates across the rcu_state
structures
for each flavor of RCU (not including SRCU),
and lines 6 and 7 invoke rcu_init_percpu_data()
.
The call to strcmp()
checks whether the current flavor
of RCU is RCU-preempt, but in a way that works correctly even when
RCU-preempt is not present in the running kernel.
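The for_each_rcu_flavor() macro itself is a simple list iteration. Below is a sketch of its likely shape, assuming that each flavor's rcu_state structure is chained onto a global list named rcu_struct_flavors via a ->flavors field (both names are assumptions here).

/* Sketch of for_each_rcu_flavor(): visit the rcu_state structure of
 * every flavor of RCU (again, not including SRCU). */
#define for_each_rcu_flavor(rsp) \
        list_for_each_entry((rsp), &rcu_struct_flavors, flavors)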
Quick Quiz 6:
But nothing in RCU cares whether the current flavor is
RCU-preempt, so why bother?
Answer
The rcu_cleanup_dying_cpu()
function, which is invoked from the
CPU_DYING
notifier, is as follows:
  1 static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
  2 {
  3         RCU_TRACE(unsigned long mask);
  4         RCU_TRACE(struct rcu_data *rdp = this_cpu_ptr(rsp->rda));
  5         RCU_TRACE(struct rcu_node *rnp = rdp->mynode);
  6 
  7         RCU_TRACE(mask = rdp->grpmask);
  8         trace_rcu_grace_period(rsp->name,
  9                                rnp->gpnum + 1 - !!(rnp->qsmask & mask),
 10                                "cpuofl");
 11 }
As you can see, this function simply does tracing.
The rcu_preempt_offline_tasks()
function shown
below moves any tasks on the specified leaf rcu_node
structure to the root rcu_node
structure.
This function is to be called only for leaf rcu_node
structures whose CPUs are all offline.
After this function returns, the specified rcu_node
structure is ignored for purposes of determining when a grace period
can end.
  1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
  2                                      struct rcu_node *rnp,
  3                                      struct rcu_data *rdp)
  4 {
  5         struct list_head *lp;
  6         struct list_head *lp_root;
  7         int retval = 0;
  8         struct rcu_node *rnp_root = rcu_get_root(rsp);
  9         struct task_struct *t;
 10 
 11         if (rnp == rnp_root) {
 12                 WARN_ONCE(1, "Last CPU thought to be offlined?");
 13                 return 0;
 14         }
 15         WARN_ON_ONCE(rnp != rdp->mynode);
 16         WARN_ON_ONCE(rnp->qsmask != 0);
 17         if (rcu_preempt_blocked_readers_cgp(rnp))
 18                 retval |= RCU_OFL_TASKS_NORM_GP;
 19         if (rcu_preempted_readers_exp(rnp))
 20                 retval |= RCU_OFL_TASKS_EXP_GP;
 21         lp = &rnp->blkd_tasks;
 22         lp_root = &rnp_root->blkd_tasks;
 23         while (!list_empty(lp)) {
 24                 t = list_entry(lp->next, typeof(*t), rcu_node_entry);
 25                 raw_spin_lock(&rnp_root->lock);
 26                 list_del(&t->rcu_node_entry);
 27                 t->rcu_blocked_node = rnp_root;
 28                 list_add(&t->rcu_node_entry, lp_root);
 29                 if (&t->rcu_node_entry == rnp->gp_tasks)
 30                         rnp_root->gp_tasks = rnp->gp_tasks;
 31                 if (&t->rcu_node_entry == rnp->exp_tasks)
 32                         rnp_root->exp_tasks = rnp->exp_tasks;
 33 #ifdef CONFIG_RCU_BOOST
 34                 if (&t->rcu_node_entry == rnp->boost_tasks)
 35                         rnp_root->boost_tasks = rnp->boost_tasks;
 36 #endif /* #ifdef CONFIG_RCU_BOOST */
 37                 raw_spin_unlock(&rnp_root->lock);
 38         }
 39         rnp->gp_tasks = NULL;
 40         rnp->exp_tasks = NULL;
 41 #ifdef CONFIG_RCU_BOOST
 42         rnp->boost_tasks = NULL;
 43         raw_spin_lock(&rnp_root->lock); /* irqs already disabled */
 44         if (rnp_root->boost_tasks != NULL &&
 45             rnp_root->boost_tasks != rnp_root->gp_tasks)
 46                 rnp_root->boost_tasks = rnp_root->gp_tasks;
 47         raw_spin_unlock(&rnp_root->lock); /* irqs still disabled */
 48 #endif /* #ifdef CONFIG_RCU_BOOST */
 49         return retval;
 50 }
Line 11 checks to see if this function has been invoked
on the root rcu_node
structure, which is illegal.
Therefore, in this case, line 12 gives a warning and line 13
returns.
Line 15 complains if this function is invoked on an
rcu_node
structure other than the one corresponding to the specified CPU's
rcu_data
structure, and line 16 complains if
one of the now-offline CPUs is somehow thought to still be executing
within an RCU read-side critical section.
Quick Quiz 7:
Why is it a problem if rcu_preempt_offline_tasks()
is invoked on the root rcu_node
structure?
After all, the rcu_node
tree for a small system would
consist of only a single rcu_node
structure, so in
that case, what other rcu_node
structure could it possibly
be invoked on?
Answer
Line 17 checks to see if any tasks queued on this
rcu_node
structure are blocking the current grace
period, and, if so, line 18 flags this in the return value.
Similarly, line 19 checks to see if any tasks queued on this
rcu_node
structure are blocking the current expedited
grace period, and, if so, line 20 flags this in the return
value.
As we will see, the caller uses these flags to determine how to
propagate quiescent states up the rcu_node
tree.
Lines 21 and 22 pick up pointers to the leaf and
root rcu_node
structures' ->blkd_tasks
lists, respectively, thus initializing for the loop spanning
lines 23-38.
Each pass through this loop moves one task from the leaf to the root.
Line 24 obtains a pointer to the first task on the leaf's list.
Line 25 acquires the root's ->lock
(the caller
already held that of the leaf), and line 37 releases it.
Line 26 removes the task from the leaf's list, line 27 switches
the task's allegiance from the leaf to the root, and line 28
adds the task to the beginning of the root's list.
Line 29 checks to see if this task was the first in the leaf's list
that was blocking the current grace period, and if so, line 30
marks this task as blocking the current grace period for the root.
Line 31 checks to see if this task was the first in the leaf's list
that was blocking the current expedited grace period, and if so, line 32
marks this task as blocking the current expedited grace period for the root.
If RCU priority boosting is enabled, then line 34 checks to see if
this task is about to be boosted, and if so, line 35
marks this task as next to boost for the root.
Finally, lines 39, 40, and 42 clear out the leaf's pointers into its
->blkd_tasks
list.
Quick Quiz 8:
What if rcu_preempt_offline_tasks()
was
executing while a grace period was being initialized, so that
the leaf and root rcu_node
structures had different
ideas about what the current grace period was?
Wouldn't that cause confusion when a task blocking what the leaf
thought was the current grace period was moved to the root?
Answer
Lines 43-47 handle the possibility that the root node was doing priority boosting, but the leaf was not. In this case, it may be necessary to boost some of the tasks coming from the leaf.
Finally, line 49 returns the flags to the caller.
The rcu_cleanup_dead_cpu()
function, which is invoked from the
CPU_DEAD
notifier, is as follows:
  1 static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
  2 {
  3         unsigned long flags;
  4         unsigned long mask;
  5         int need_report = 0;
  6         struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
  7         struct rcu_node *rnp = rdp->mynode;
  8 
  9         rcu_stop_cpu_kthread(cpu);
 10         rcu_node_kthread_setaffinity(rnp, -1);
 11         raw_spin_lock_irqsave(&rsp->onofflock, flags);
 12         rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
 13         rcu_adopt_orphan_cbs(rsp);
 14         mask = rdp->grpmask;
 15         do {
 16                 raw_spin_lock(&rnp->lock);
 17                 rnp->qsmaskinit &= ~mask;
 18                 if (rnp->qsmaskinit != 0) {
 19                         if (rnp != rdp->mynode)
 20                                 raw_spin_unlock(&rnp->lock);
 21                         break;
 22                 }
 23                 if (rnp == rdp->mynode)
 24                         need_report = rcu_preempt_offline_tasks(rsp, rnp, rdp);
 25                 else
 26                         raw_spin_unlock(&rnp->lock);
 27                 mask = rnp->grpmask;
 28                 rnp = rnp->parent;
 29         } while (rnp != NULL);
 30         raw_spin_unlock(&rsp->onofflock);
 31         rnp = rdp->mynode;
 32         if (need_report & RCU_OFL_TASKS_NORM_GP)
 33                 rcu_report_unblock_qs_rnp(rnp, flags);
 34         else
 35                 raw_spin_unlock_irqrestore(&rnp->lock, flags);
 36         if (need_report & RCU_OFL_TASKS_EXP_GP)
 37                 rcu_report_exp_rnp(rsp, rnp, true);
 38         WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL,
 39                   "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n",
 40                   cpu, rdp->qlen, rdp->nxtlist);
 41 }
The rcu_cleanup_dead_cpu()
function dispositions
the now-dead CPU's RCU priority-boosting kthreads (lines 9 and 10),
dispositions the now-dead CPU's callbacks (lines 12 and 13),
and adjusts the rcu_node
tree to account for the CPU's departure
(lines 14-37).
If RCU priority boosting is enabled, line 9 invokes
rcu_stop_cpu_kthread()
, which stops the per-CPU kthread,
while line 10 invokes rcu_node_kthread_setaffinity()
, which adjusts the affinity mask for the per-rcu_node
priority-boost kthread.
On the other hand, if RCU priority boosting is not enabled,
these two functions do nothing.
Line 11 acquires the rcu_state
structure's
->onofflock
to exclude rcu_barrier()
and RCU-preempt's synchronize_rcu_expedited().
This lock is released on line 30, so that it covers both
the callback dispositioning and portions of the subsequent
rcu_node
tree adjustment.
Lines 12 and 13 transfer the dead CPU's callbacks to the
current CPU.
Line 14 starts the rcu_node
tree adjustment
by picking up the mask containing the dead CPU's bit within its
leaf rcu_node
structure's bitmasks.
Each pass through the loop spanning lines 15-29 handles one
level of the rcu_node
tree.
Line 16 acquires the current rcu_node
structure's
->lock
and line 17 clears the bit in
the rcu_node
structure's ->qsmaskinit
corresponding to the lower-level structure, which will be the
CPU's rcu_data
structure on the first pass through
the loop and will be the previous pass's rcu_node
structure on subsequent passes through the loop.
If line 18 determines that there are still bits remaining in
the current rcu_node
structure's ->qsmaskinit
field, there is no need to go any further up the tree, in which case
lines 19 and 20 release this rcu_node
structure's
->lock
(but only if this is not the leaf
rcu_node
structure), and line 21 breaks out of the loop,
still holding the leaf rcu_node
structure's ->lock
.
Otherwise, execution continues in the loop.
If line 23 determines that we are still on the leaf
rcu_node
structure, it invokes
rcu_preempt_offline_tasks()
in order to move any blocked
tasks queued on this rcu_node
structure's
->blkd_tasks
list to the root rcu_node
structure.
On the other hand, if we are on a non-leaf rcu_node
structure, line 26 releases that structure's ->lock
.
Lines 27 and 28 prepare for the next pass through the loop
by picking up the mask containing the bit corresponding to this
rcu_node
structure within the masks of its parent and
advancing to that parent, respectively.
Finally, line 29 ends the loop if we have passed up out of the
root of the tree.
Quick Quiz 9:
But given that there will always be at least one CPU online,
won't the root rcu_node
structure always have at least
one bit set in its ->qsmaskinit
field, forcing the
loop to exit at line 21 of rcu_cleanup_dead_cpu()
?
So how can execution possibly exit at the bottom of the loop?
Answer
Line 31 picks up a pointer to the dead CPU's leaf rcu_node
structure.
If line 32 indicates that the call to
rcu_preempt_offline_tasks()
on line 24
found that there were some tasks blocking the current grace period
on the leaf rcu_node
structure, then line 33
invokes rcu_report_unblock_qs_rnp()
in order to report
that this rcu_node
structure is no longer blocking
the grace period (and to release the leaf rcu_node
structure's ->lock
).
Otherwise, line 35 releases the leaf rcu_node
structure's ->lock
.
Quick Quiz 10:
But the tasks might still be blocked in their RCU
read-side critical sections, so why is it safe for line 33 of
rcu_cleanup_dead_cpu()
to report
that the leaf rcu_node
structure is no longer blocking
the current grace period?
Answer
If line 36 indicates that the call to
rcu_preempt_offline_tasks()
on line 24
found that there were some tasks blocking the current expedited
grace period on the leaf rcu_node
structure, then
line 37 invokes rcu_report_exp_rnp()
in order to report
that this rcu_node
structure is no longer blocking
the expedited grace period.
Finally, lines 38-40 give a warning if callbacks have somehow
remained queued on the dead CPU.
This whole process is orchestrated by rcu_cpu_notify()
,
which is as follows:
  1 static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
  2                                     unsigned long action, void *hcpu)
  3 {
  4         long cpu = (long)hcpu;
  5         struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
  6         struct rcu_node *rnp = rdp->mynode;
  7         struct rcu_state *rsp;
  8 
  9         trace_rcu_utilization("Start CPU hotplug");
 10         switch (action) {
 11         case CPU_UP_PREPARE:
 12         case CPU_UP_PREPARE_FROZEN:
 13                 rcu_prepare_cpu(cpu);
 14                 rcu_prepare_kthreads(cpu);
 15                 break;
 16         case CPU_ONLINE:
 17         case CPU_DOWN_FAILED:
 18                 rcu_node_kthread_setaffinity(rnp, -1);
 19                 rcu_cpu_kthread_setrt(cpu, 1);
 20                 break;
 21         case CPU_DOWN_PREPARE:
 22                 rcu_node_kthread_setaffinity(rnp, cpu);
 23                 rcu_cpu_kthread_setrt(cpu, 0);
 24                 break;
 25         case CPU_DYING:
 26         case CPU_DYING_FROZEN:
 27                 for_each_rcu_flavor(rsp)
 28                         rcu_cleanup_dying_cpu(rsp);
 29                 rcu_cleanup_after_idle(cpu);
 30                 break;
 31         case CPU_DEAD:
 32         case CPU_DEAD_FROZEN:
 33         case CPU_UP_CANCELED:
 34         case CPU_UP_CANCELED_FROZEN:
 35                 for_each_rcu_flavor(rsp)
 36                         rcu_cleanup_dead_cpu(cpu, rsp);
 37                 break;
 38         default:
 39                 break;
 40         }
 41         trace_rcu_utilization("End CPU hotplug");
 42         return NOTIFY_OK;
 43 }
Lines 9 and 41 do tracing.
Lines 13 and 14 handle the CPU_UP_PREPARE
notifier action, with the former initializing this CPU's per-CPU
data and the latter dealing with RCU priority boosting.
Lines 18 and 19 handle the CPU_ONLINE
notifier action, both dealing with RCU priority boosting.
These same two lines also handle CPU_DOWN_FAILED
; note that lines 22 and 23 (which handle
CPU_DOWN_PREPARE
) are the inverse operations of
lines 18 and 19.
Lines 27-29 handle the CPU_DYING
notifier action,
with the first two lines just doing tracing and the last line
calling into the RCU_FAST_NO_HZ
code to clean up idle state
for the outgoing CPU.
Lines 35 and 36 handle the CPU_DEAD
notifier
action, dealing with the now-dead CPU's priority-boosting kthreads,
dispositioning the now-dead CPU's callbacks, and reporting the
resulting extended quiescent state up the rcu_node
tree.
These same two lines also handle CPU_UP_CANCELED
.
Quick Quiz 11:
Why aren't rcu_cpu_notify()
's CPU_UP_CANCELED
actions the inverse of its CPU_UP_PREPARE
actions?
Answer
Finally, line 42 gives the CPU-hotplug operation RCU's authorization to proceed in all cases.
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: Why is it a bad idea to block waiting for a callback to be invoked during early boot?
Answer: Because the scheduler has not yet been initialized to the point where blocking is possible!
Quick Quiz 2: But what if my special CPU needs more than one jiffy to come online or go offline?
Answer:
You only have one jiffy.
If you need more time when going offline, move the work to a
CPU_DYING notifier, which RCU cannot interrupt.
If you need more time when coming online, then move the work to follow
your architecture-specific code that marks the CPU online in
the cpu_online_mask
.
Quick Quiz 3:
Under what circumstances can synchronize_sched()
be a no-op at runtime (in other words, after boot has fully completed)?
Answer:
Any time there is only one online CPU,
synchronize_sched()
will be a no-op.
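A sketch of how such a single-CPU check might be implemented follows; the helper name is hypothetical. The reasoning: read-side critical sections for synchronize_sched() disable preemption, so if the sole online CPU is executing the caller, no reader can possibly be in progress.

/* Sketch only: with a single online CPU, every instant is a
 * quiescent state for RCU-sched, so the grace period is free. */
static inline int sketch_blocking_is_gp(void)
{
        return num_online_cpus() <= 1;
}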
Quick Quiz 4: This is ridiculous! How did this kludge get into the Linux kernel?
Answer: I am sure that it seemed like a good idea at the time, but it is nevertheless in the process of being fixed.
Quick Quiz 5:
Why is the root rcu_node
structure's
->lock
required on line 9 of
rcu_init_percpu_data()
?
Answer:
It prevents the ->n_force_qs
field from
overflowing.
This issue will require some thought should quiescent-state forcing
ever be optimized to visit a subset of the leaf rcu_node
structures.
Quick Quiz 6: But nothing in RCU cares whether the current flavor is RCU-preempt, so why bother?
Answer: If there is still nothing that cares after a few releases, this will be removed. Until then, the question is instead “Why bother removing it?”
Quick Quiz 7:
Why is it a problem if rcu_preempt_offline_tasks()
is invoked on the root rcu_node
structure?
After all, the rcu_node
tree for a small system would
consist of only a single rcu_node
structure, so in
that case, what other rcu_node
structure could it possibly
be invoked on?
Answer:
Given that it is illegal to invoke
rcu_preempt_offline_tasks()
on any rcu_node
structure that has any corresponding online CPUs, it can only be
legal to invoke rcu_preempt_offline_tasks()
on the
root rcu_node
if all of the CPUs are offline.
But in that case, what could possibly be executing the function?
Besides, invoking this function on the root rcu_node
structure is pointless in any case, because the whole purpose of
this function is to move queued tasks to the root rcu_node
structure.
And yes, this means that on a system with only a single
rcu_node
structure, it is illegal to ever call
rcu_preempt_offline_tasks()
.
Quick Quiz 8:
What if rcu_preempt_offline_tasks()
was
executing while a grace period was being initialized, so that
the leaf and root rcu_node
structures had different
ideas about what the current grace period was?
Wouldn't that cause confusion when a task blocking what the leaf
thought was the current grace period was moved to the root?
Answer:
This cannot happen.
First, from an rcu_node
perspective, grace periods are
consecutive, with one grace period completely
finishing before the next one starts, so that the root's current grace
period will either be the same as or
one later than that of the leaf, which means that if anything is blocking
the leaf's current grace period, it must necessarily be blocking the
root's as well.
Second, grace-period initialization is carried out under the protection
of get_online_cpus()
, which blocks CPU-hotplug operations so that
rcu_preempt_offline_tasks()
cannot execute while a grace
period is initialized.
Third and last, at least for the purposes of this Quick Quiz,
if the grace period is being initialized, the leaf
rcu_node
structure must see the previous grace period as
having been completed, which would mean that the leaf rcu_node
structure cannot possibly have any tasks blocking that grace period.
Quick Quiz 9:
But given that there will always be at least one CPU online,
won't the root rcu_node
structure always have at least
one bit set in its ->qsmaskinit
field, forcing the
loop to exit at line 21 of rcu_cleanup_dead_cpu()
?
So how can execution possibly exit at the bottom of the loop?
Answer: Indeed, it cannot exit at the bottom of the loop. You have a problem with this?
Quick Quiz 10:
But the tasks might still be blocked in their RCU
read-side critical sections, so why is it safe for line 33 of
rcu_cleanup_dead_cpu()
to report
that the leaf rcu_node
structure is no longer blocking
the current grace period?
Answer:
Because those tasks have been moved to the root rcu_node
structure, which will be continuing to block the current grace period.
Quick Quiz 11:
Why aren't rcu_cpu_notify()
's CPU_UP_CANCELED
actions the inverse of its CPU_UP_PREPARE
actions?
Answer:
The reporting of the extended quiescent state by
rcu_cleanup_dead_cpu()
does act as the inverse of the
critical parts of rcu_prepare_cpu()
, but it does not
make sense to undo the initialization of the CPU's per-CPU data
structures.
So we don't!