July 27, 2012 (v3.5+)

This article was contributed by Paul E. McKenney

Introduction

  1. RCU CPU-Hotplug Overview
  2. RCU CPU-Hotplug Operation
  3. RCU CPU-Hotplug Implementation
  4. Summary

And then there are the preordained answers to the quick quizzes.

RCU CPU-Hotplug Overview

RCU handles boot-time CPU bringup in the same way that it handles runtime CPU-hotplug operations. This means that RCU's CPU-hotplug handling is intertwined with the way that it handles boot-up.

RCU handles boot-up in the following phases:

  1. Before rcu_init() is called: All RCU APIs other than call_rcu(), rcu_barrier(), and friends may be used. Note that synchronize_rcu() and friends are all no-ops.
  2. Between rcu_init() and rcu_scheduler_starting(), all RCU APIs may be invoked. However, callbacks will be queued but not invoked. Note that synchronize_rcu() and friends are still all no-ops.
  3. Between rcu_scheduler_starting() and rcu_scheduler_really_started(), synchronize_rcu() and friends will hang. However, the other RCU APIs may still be used, but callbacks from call_rcu(), rcu_barrier(), and friends are still queued but not invoked.
  4. After rcu_scheduler_really_started() is invoked, RCU enters full run-time functionality. At this point, in CONFIG_PREEMPT=y kernels, synchronize_rcu() stops being a no-op.
  5. Once the second CPU comes online, synchronize_rcu_bh() and synchronize_sched() stop being no-ops. On CONFIG_PREEMPT=n kernels, synchronize_rcu() also stops being a no-op at this point.
  6. At runtime, RCU needs to properly handle the following different CPU states:
    1. Running. This is normal operation for RCU.
    2. Idle. RCU ignores idle CPUs. Use of RCU from a CPU that has told RCU that it is idle results in a lockdep-RCU warning message (see the sketch following this list). Note that if a CPU takes an interrupt or NMI from idle, RCU considers that CPU to not be idle.
    3. Offline. RCU ignores CPUs that are fully offline. Use of RCU from a CPU that has told RCU that it is fully offline results in a lockdep-RCU warning message.
    4. Going offline. Even though RCU ignores CPUs that are offline, when a CPU is in CPU_DYING state, all the other CPUs are stalled, waiting for the CPU to go offline. Because all other CPUs are stalled, grace periods cannot complete. Therefore, RCU read-side critical sections may safely appear on CPUs in CPU_DYING state. In addition, RCU grants the outgoing CPU an additional full jiffy of legal RCU usage after it leaves CPU_DYING state, in other words, after it marks itself offline. This is required because dying CPUs usually make it into the scheduler and possibly also into the idle loop before dying completely.
    5. Coming online. CPUs coming online normally go through an abbreviated boot sequence, during which time they sometimes use RCU. RCU therefore grants each incoming CPU a full jiffy of legal RCU usage before it marks itself online.
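
As an aside, code that must use RCU from the idle loop can momentarily tell RCU that the CPU is non-idle, thereby avoiding the lockdep-RCU warning message called out in the "Idle" item above. The following minimal sketch uses the RCU_NONIDLE() macro available in this timeframe; the trace_idle_event() tracepoint is a hypothetical stand-in for any RCU-using code:

  #include <linux/rcupdate.h>

  /* Hypothetical tracepoint whose implementation uses RCU internally. */
  extern void trace_idle_event(void);

  static void example_idle_loop_hook(void)
  {
    /*
     * Calling trace_idle_event() directly from the idle loop would
     * result in a lockdep-RCU splat because RCU is ignoring this CPU.
     * RCU_NONIDLE() instead tells RCU that this CPU is non-idle for
     * the duration of the statement passed to it.
     */
    RCU_NONIDLE(trace_idle_event());
  }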

Quick Quiz 1: Why is it a bad idea to block waiting for a callback to be invoked during early boot?
Answer

Quick Quiz 2: But what if my special CPU needs more than one jiffy to come online or go offline?
Answer

Quick Quiz 3: Under what circumstances can synchronize_sched() be a no-op at runtime (in other words, after boot has fully completed)?
Answer

Some of RCU's boot-time code is shared with its CPU-hotplug online code path, primarily in the rcu_init_percpu_data() function. However, because RCU does not do anything special for system shutdown, RCU's CPU-hotplug offline code stands alone. The general operation of this code is described in the following section.

RCU CPU-Hotplug Operation

RCU tracks which CPUs are online and offline using the ->qsmaskinit bitmasks in the rcu_node tree, which are analogous to the ->qsmask fields that handle grace-period detection. At the beginning of each grace period, the initial value for each ->qsmask bitmask is loaded from the corresponding ->qsmaskinit bitmask. Each CPU-offline event clears the bit corresponding to the newly offlined CPU in that CPU's rcu_node structure's ->qsmaskinit bitmask, and, if the result is zero, propagates up the rcu_node tree. Similarly, each CPU-online event sets the bit corresponding to the newly onlined CPU in that CPU's rcu_node structure's ->qsmaskinit bitmask, and, if the result was previously zero, propagates up the rcu_node tree. This process is quite similar to the way that quiescent state events propagate up the rcu_node tree.
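
To make this propagation concrete, the following self-contained user-space model (emphatically not the kernel's actual code) applies online and offline events to the ->qsmaskinit fields of a two-level tree. The field names follow the kernel's, but the locking and the rcu_data level are omitted:

  #include <stdio.h>

  struct rcu_node {
    unsigned long qsmaskinit;  /* children having online CPUs */
    unsigned long grpmask;     /* this node's bit within its parent */
    struct rcu_node *parent;
  };

  /* CPU-online: set bits upward, stopping at a level that already had one. */
  static void model_cpu_online(struct rcu_node *rnp, unsigned long mask)
  {
    while (rnp != NULL && !(rnp->qsmaskinit & mask)) {
      rnp->qsmaskinit |= mask;
      mask = rnp->grpmask;
      rnp = rnp->parent;
    }
  }

  /* CPU-offline: clear bits upward for as long as each level empties. */
  static void model_cpu_offline(struct rcu_node *rnp, unsigned long mask)
  {
    while (rnp != NULL) {
      rnp->qsmaskinit &= ~mask;
      if (rnp->qsmaskinit != 0)
        break;  /* other CPUs remain under this node */
      mask = rnp->grpmask;
      rnp = rnp->parent;
    }
  }

  int main(void)
  {
    struct rcu_node root = { 0, 0, NULL };
    struct rcu_node leaf = { 0, 0x1, &root };  /* root's child 0 */

    model_cpu_online(&leaf, 0x1);   /* CPU 0: sets leaf and root bits */
    model_cpu_online(&leaf, 0x2);   /* CPU 1: root bit already set */
    printf("leaf=%#lx root=%#lx\n", leaf.qsmaskinit, root.qsmaskinit);
    model_cpu_offline(&leaf, 0x1);  /* CPU 0: leaf remains nonempty */
    model_cpu_offline(&leaf, 0x2);  /* CPU 1: leaf empties, root bit cleared */
    printf("leaf=%#lx root=%#lx\n", leaf.qsmaskinit, root.qsmaskinit);
    return 0;
  }

The first printf() prints leaf=0x3 root=0x1, and the second prints zero for both masks, illustrating how only the first online event and the last offline event at a given level propagate to the next level up.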

However, the CPU-hotplug process is complicated by the fact that hotplug operations are not atomic. The CPU-hotplug process invokes a series of notifiers, each of which causes the corresponding Linux-kernel subsystem to consider the CPU to be offline. Therefore, there is a significant period of time during which the CPU is neither fully online nor fully offline. RCU nevertheless needs to fully understand the CPU's state: Is the CPU online enough that RCU grace periods must wait on it? Worse yet, newly offline CPUs take one final pass through the scheduler on their way to the idle loop. Because the scheduler uses RCU, RCU must continue paying attention to CPUs after they have marked themselves offline. RCU currently works around this with the horrible (but apparently reliable) hack of considering a CPU to be online for a jiffy after it has marked itself as offline.
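
The effect of this hack is roughly as shown in the following user-space sketch, in which all names are hypothetical rather than being taken from the kernel:

  #include <stdbool.h>

  struct model_rcu_cpu {
    bool online;                /* CPU's bit set in ->qsmaskinit */
    unsigned long offline_time; /* jiffies value when marked offline */
  };

  /*
   * From RCU's viewpoint, a CPU remains "online enough" for one jiffy
   * after marking itself offline, covering its final trip through the
   * scheduler (and possibly the idle loop) on the way to being dead.
   */
  static bool model_rcu_cpu_is_online(const struct model_rcu_cpu *mcp,
                                      unsigned long now /* jiffies */)
  {
    if (mcp->online)
      return true;
    return now - mcp->offline_time <= 1;  /* the one-jiffy grace window */
  }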

Quick Quiz 4: This is ridiculous! How did this kludge get into the Linux kernel?
Answer

RCU's CPU-hotplug notifier function is called rcu_cpu_notify(), and it is invoked repeatedly during each CPU-hotplug operation. When a CPU is taken offline, this function is called as shown in the following diagram:

CPU-Offline.png

The rcu_cpu_notify() function is first invoked with CPU_DOWN_PREPARE, thereby informing RCU that the specified CPU is to be offlined. RCU's CPU_DOWN_PREPARE code adjusts the affinity of the corresponding leaf rcu_node structure's priority boost kthread if CONFIG_RCU_BOOST=y, and does nothing otherwise. RCU always returns NOTIFY_OK, but has the option of returning NOTIFY_BAD, which will cause the CPU-hotplug operation to fail.

However, if all of the CPU_DOWN_PREPARE notifiers return NOTIFY_OK, the CPU-hotplug infrastructure will invoke all of the notifier functions (including rcu_cpu_notify()) again, but this time with CPU_DYING. The CPU_DYING notifiers are invoked in stop_machine() context, which means that the outgoing CPU is executing the notifier functions with interrupts disabled, and the rest of the CPUs are spinning with interrupts disabled. This is clearly a very heavy-weight operation that degrades real-time response, and you should avoid depending on its semantics, hence the red color. RCU therefore only does tracing from this notifier call unless CONFIG_RCU_FAST_NO_HZ=y, in which case it also does CPU-local cleanup. Once again, RCU always returns NOTIFY_OK, but has the option of returning NOTIFY_BAD, which will cause the CPU-hotplug operation to fail.

If any of the CPU_DOWN_PREPARE or the CPU_DYING notifiers return NOTIFY_BAD, the CPU-hotplug infrastructure will invoke the notifiers again, but this time with CPU_DOWN_FAILED. RCU's CPU_DOWN_FAILED code adjusts the affinity of the corresponding leaf rcu_node structure's priority boost kthread if CONFIG_RCU_BOOST=y, thus backing out any changes made by its CPU_DOWN_PREPARE notifier, and does nothing otherwise. The CPU_DOWN_FAILED notifiers are not permitted to fail.

On the other hand, if all of the CPU_DOWN_PREPARE and CPU_DYING notifiers return NOTIFY_OK, the CPU-hotplug infrastructure will invoke the notifiers again, but this time with CPU_DEAD. RCU's CPU_DEAD code moves any remaining RCU callbacks from the dead CPU to some other CPU (taking care to maintain their order), clears the dead CPU's ->qsmaskinit bits from the rcu_node hierarchy, reports a quiescent state for the dead CPU, and, if all CPUs corresponding to this CPU's rcu_node structure are now offline, moves any tasks in this structure's ->blkd_tasks list to the root rcu_node structure. The CPU_DEAD notifiers are not permitted to fail.

Finally, the CPU-hotplug infrastructure invokes the notifiers one last time with CPU_POST_DEAD. RCU takes no action at this time. The CPU_POST_DEAD notifiers are not permitted to fail.
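
For reference, a subsystem hooks into the above sequence by registering a notifier function. The following sketch uses the CPU-hotplug notifier API of this era, but the my_subsys_*() functions are hypothetical placeholders:

  #include <linux/cpu.h>
  #include <linux/notifier.h>

  /* Hypothetical subsystem hooks, one per notifier action of interest. */
  static bool my_subsys_can_release(long cpu);
  static void my_subsys_local_cleanup(long cpu);
  static void my_subsys_undo_down_prepare(long cpu);
  static void my_subsys_adopt_work(long cpu);

  static int __cpuinit my_subsys_cpu_notify(struct notifier_block *self,
                                            unsigned long action, void *hcpu)
  {
    long cpu = (long)hcpu;

    switch (action & ~CPU_TASKS_FROZEN) {
    case CPU_DOWN_PREPARE:
      if (!my_subsys_can_release(cpu))
        return NOTIFY_BAD;  /* veto the offline operation */
      break;
    case CPU_DYING:
      my_subsys_local_cleanup(cpu);  /* stop_machine() context: keep short */
      break;
    case CPU_DOWN_FAILED:
      my_subsys_undo_down_prepare(cpu);  /* back out CPU_DOWN_PREPARE */
      break;
    case CPU_DEAD:
      my_subsys_adopt_work(cpu);  /* CPU is gone: migrate its work */
      break;
    default:
      break;
    }
    return NOTIFY_OK;
  }

  static struct notifier_block my_subsys_cpu_nb __cpuinitdata = {
    .notifier_call = my_subsys_cpu_notify,
  };

Such a notifier block would be registered at boot time via register_cpu_notifier(&my_subsys_cpu_nb).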

The CPU-hotplug CPU-online procedure is similar, as shown in the following figure:

CPU-Online.png

The main differences are that (1) unlike CPU_DYING, CPU_STARTING does not use stop_machine(), although the incoming CPU still runs the notifiers with interrupts disabled, and (2) there is no equivalent of CPU_POST_DEAD.

RCU responds to the CPU_UP_PREPARE notifier by initializing the incoming CPU's rcu_data structure and by setting the incoming CPU's bits in the rcu_node structure's ->qsmaskinit fields. In CONFIG_RCU_BOOST=y kernels, it also spawns the per-rcu_node-structure priority-boost kthread.

RCU ignores the CPU_STARTING notifier.

RCU responds to the CPU_ONLINE notifier in the same way as for the CPU_DOWN_FAILED notifier described earlier. Similarly, RCU responds to the CPU_UP_CANCELED notifier in the same way as for the CPU_DEAD notifier described earlier.

RCU CPU-Hotplug Implementation

Given this background, we are now ready to look at some code. First is rcu_init_percpu_data(), which initializes RCU's per-CPU data structures for a CPU that is in the process of coming online and sets its rcu_node tree bits, shown below:

  1 static void __cpuinit
  2 rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
  3 {
  4   unsigned long flags;
  5   unsigned long mask;
  6   struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
  7   struct rcu_node *rnp = rcu_get_root(rsp);
  8 
  9   raw_spin_lock_irqsave(&rnp->lock, flags);
 10   rdp->beenonline = 1;
 11   rdp->preemptible = preemptible;
 12   rdp->qlen_last_fqs_check = 0;
 13   rdp->n_force_qs_snap = rsp->n_force_qs;
 14   rdp->blimit = blimit;
 15   rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 16   atomic_set(&rdp->dynticks->dynticks,
 17        (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
 18   rcu_prepare_for_idle_init(cpu);
 19   raw_spin_unlock(&rnp->lock);
 20   raw_spin_lock(&rsp->onofflock);
 21   rnp = rdp->mynode;
 22   mask = rdp->grpmask;
 23   do {
 24     raw_spin_lock(&rnp->lock);
 25     rnp->qsmaskinit |= mask;
 26     mask = rnp->grpmask;
 27     if (rnp == rdp->mynode) {
 28       rdp->gpnum = rnp->completed;
 29       rdp->completed = rnp->completed;
 30       rdp->passed_quiesce = 0;
 31       rdp->qs_pending = 0;
 32       rdp->passed_quiesce_gpnum = rnp->gpnum - 1;
 33       trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuonl");
 34     }
 35     raw_spin_unlock(&rnp->lock);
 36     rnp = rnp->parent;
 37   } while (rnp != NULL && !(rnp->qsmaskinit & mask));
 38   raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
 39 }

Lines 9-19 initialize this CPU's rcu_data fields, at least those that are independent of this CPU's leaf rcu_node structure. Line 9 acquires the root rcu_node structure's ->lock, and line 19 releases it. Line 10 marks this CPU as having been online, which is used by debugfs tracing to avoid dumping data structures corresponding to nonexistent CPUs. Line 11 records whether or not this flavor of RCU is preemptible, which in the past has been used to choose among quiescent-state-forcing strategies. Lines 12-14 initialize state variables used by heuristics that will govern handling of overly long queues of callbacks on this CPU. Lines 15-17 reset this CPU's dyntick-idle state to not-idle, which is needed because CPUs can be taken offline in indeterminate state because of momentary exits from idle (from an RCU perspective) in the idle loop. Finally, line 18 initializes this CPU's RCU_FAST_NO_HZ state.
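
The computation on lines 16 and 17 forces the ->dynticks counter to a value with its bottom bit set, and a set bottom bit is how RCU's dyntick-idle machinery encodes "not idle" in this timeframe. The arithmetic can be checked in isolation with the following user-space program:

  #include <assert.h>
  #include <stdio.h>

  int main(void)
  {
    int samples[] = { 0, 1, 6, 7, 42 };
    int i;

    for (i = 0; i < 5; i++) {
      int old = samples[i];
      int new = (old & ~0x1) + 1;  /* the lines 16-17 computation */

      assert(new & 0x1);  /* result is always odd: CPU marked non-idle */
      printf("%d -> %d\n", old, new);
    }
    return 0;
  }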

Quick Quiz 5: Why is the root rcu_node structure's ->lock required on line 9 of rcu_init_percpu_data()?
Answer

Lines 20-38 set this CPU's rcu_node tree bits, and also initialize this CPU's rcu_data fields that depend on its leaf rcu_node structure. Line 20 acquires this RCU flavor's ->onofflock and line 38 releases it. Lines 21 and 22 initialize for the loop spanning lines 23-37, each pass through which handles one of this CPU's ancestors in the rcu_node tree, starting at this CPU's leaf rcu_node structure. Line 24 acquires the current rcu_node structure's ->lock, and line 35 releases it. Line 25 sets this CPU's bit in the current rcu_node structure's ->qsmaskinit field, and line 26 picks up the relevant mask for the next level up in the rcu_node tree.

If line 27 determines that the current rcu_node structure is this CPU's leaf, lines 28-33 initialize the grace-period-related fields of this CPU's rcu_data structure. This CPU will know only about the last fully completed grace period (lines 28 and 29), will be marked as not yet having passed through a quiescent state (line 30), and will be marked as not needing to pass through one (line 31). Any quiescent-state passage erroneously attributed to this CPU will be charged to the prior grace period (line 32), in other words, rendered irrelevant. Line 33 does event tracing.
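
For example, if the leaf rcu_node structure's ->completed field is equal to 4 and its ->gpnum field is equal to 5 (so that grace period 5 is in progress), then the incoming CPU will set both its ->gpnum and ->completed fields to 4 and its ->passed_quiesce_gpnum field to 4 (5 minus 1), so that any quiescent state erroneously attributed to this CPU can apply only to the already-completed grace period 4.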

Line 36 advances to the parent rcu_node structure up one level in the tree, and line 37 repeats the loop until either (1) we reach the root rcu_node structure or (2) we find an rcu_node structure for which this CPU's bit is already set, due to a sibling CPU already being online.

The rcu_init_percpu_data() function is invoked from rcu_prepare_cpu(), which is shown below.

  1 static void __cpuinit rcu_prepare_cpu(int cpu)
  2 {
  3   struct rcu_state *rsp;
  4 
  5   for_each_rcu_flavor(rsp)
  6     rcu_init_percpu_data(cpu, rsp,
  7                          strcmp(rsp->name, "rcu_preempt") == 0);
  8 }

Line 5 iterates across the rcu_state structures for each flavor of RCU (not including SRCU), and lines 6 and 7 invoke rcu_init_percpu_data(). The call to strcmp() checks whether the current flavor of RCU is RCU-preempt, but in a way that works correctly even if RCU-preempt is not present in the running kernel.
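
For reference, for_each_rcu_flavor() is (roughly, in this timeframe) a simple list traversal over the registered rcu_state structures:

  #define for_each_rcu_flavor(rsp) \
    list_for_each_entry((rsp), &rcu_struct_flavors, flavor)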

Quick Quiz 6: But nothing in RCU cares whether the current flavor is RCU-preempt, so why bother?
Answer

The rcu_cleanup_dying_cpu() function, which is invoked for the CPU_DYING notifier action, is as follows:

  1 static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
  2 {
  3   RCU_TRACE(unsigned long mask);
  4   RCU_TRACE(struct rcu_data *rdp = this_cpu_ptr(rsp->rda));
  5   RCU_TRACE(struct rcu_node *rnp = rdp->mynode);
  6 
  7   RCU_TRACE(mask = rdp->grpmask);
  8   trace_rcu_grace_period(rsp->name,
  9                          rnp->gpnum + 1 - !!(rnp->qsmask & mask),
 10                          "cpuofl");
 11 }

As you can see, this function simply does tracing.

The rcu_preempt_offline_tasks() function shown below moves any tasks on the specified leaf rcu_node structure to the root rcu_node structure. This function is to be called only for leaf rcu_node structures whose CPUs are all offline. After this function returns, the specified rcu_node structure is ignored for purposes of determining when a grace period can end.

  1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
  2                                      struct rcu_node *rnp,
  3                                      struct rcu_data *rdp)
  4 {
  5   struct list_head *lp;
  6   struct list_head *lp_root;
  7   int retval = 0;
  8   struct rcu_node *rnp_root = rcu_get_root(rsp);
  9   struct task_struct *t;
 10 
 11   if (rnp == rnp_root) {
 12     WARN_ONCE(1, "Last CPU thought to be offlined?");
 13     return 0;
 14   }
 15   WARN_ON_ONCE(rnp != rdp->mynode);
 16   WARN_ON_ONCE(rnp->qsmask != 0);
 17   if (rcu_preempt_blocked_readers_cgp(rnp))
 18     retval |= RCU_OFL_TASKS_NORM_GP;
 19   if (rcu_preempted_readers_exp(rnp))
 20     retval |= RCU_OFL_TASKS_EXP_GP;
 21   lp = &rnp->blkd_tasks;
 22   lp_root = &rnp_root->blkd_tasks;
 23   while (!list_empty(lp)) {
 24     t = list_entry(lp->next, typeof(*t), rcu_node_entry);
 25     raw_spin_lock(&rnp_root->lock);
 26     list_del(&t->rcu_node_entry);
 27     t->rcu_blocked_node = rnp_root;
 28     list_add(&t->rcu_node_entry, lp_root);
 29     if (&t->rcu_node_entry == rnp->gp_tasks)
 30       rnp_root->gp_tasks = rnp->gp_tasks;
 31     if (&t->rcu_node_entry == rnp->exp_tasks)
 32       rnp_root->exp_tasks = rnp->exp_tasks;
 33 #ifdef CONFIG_RCU_BOOST
 34     if (&t->rcu_node_entry == rnp->boost_tasks)
 35       rnp_root->boost_tasks = rnp->boost_tasks;
 36 #endif /* #ifdef CONFIG_RCU_BOOST */
 37     raw_spin_unlock(&rnp_root->lock);
 38   }
 39   rnp->gp_tasks = NULL;
 40   rnp->exp_tasks = NULL;
 41 #ifdef CONFIG_RCU_BOOST
 42   rnp->boost_tasks = NULL;
 43   raw_spin_lock(&rnp_root->lock); /* irqs already disabled */
 44   if (rnp_root->boost_tasks != NULL &&
 45       rnp_root->boost_tasks != rnp_root->gp_tasks)
 46     rnp_root->boost_tasks = rnp_root->gp_tasks;
 47   raw_spin_unlock(&rnp_root->lock); /* irqs still disabled */
 48 #endif /* #ifdef CONFIG_RCU_BOOST */
 49   return retval;
 50 }

Line 11 checks to see if this function has been invoked on the root rcu_node structure, which is illegal. Therefore, in this case, line 12 gives a warning and line 13 returns.

Line 15 complains if the specified rcu_node structure does not correspond to the specified CPU's rcu_data structure (in other words, if it is not that CPU's leaf rcu_node structure), and line 16 complains if one of the now-offline CPUs is somehow thought to still be executing within an RCU read-side critical section.

Quick Quiz 7: Why is it a problem if rcu_preempt_offline_tasks() is invoked on the root rcu_node structure? After all, the rcu_node tree for a small system would consist of only a single rcu_node structure, so in that case, what other rcu_node structure could it possibly be invoked on?
Answer

Line 17 checks to see if any tasks queued on this rcu_node structure are blocking the current grace period, and, if so, line 18 flags this in the return value. Similarly, line 19 checks to see if any tasks queued on this rcu_node structure are blocking the current expedited grace period, and, if so, line 20 flags this in the return value. As we will see, the caller uses these flags to determine how to propagate quiescent states up the rcu_node tree.

Lines 21 and 22 pick up pointers to the leaf and root rcu_node structures' ->blkd_tasks lists, respectively, thus initializing for the loop spanning lines 23-38. Each pass through this loop moves one task from the leaf to the root. Line 24 obtains a pointer to the first task on the leaf's list. Line 25 acquires the root's ->lock (the caller already held that of the leaf), and line 37 releases it. Line 26 removes the task from the leaf's list, line 27 switches the task's allegiance from the leaf to the root, and line 28 adds the task to the beginning of the root's list. Line 29 checks to see if this task was the first in the leaf's list that was blocking the current grace period, and if so, line 30 marks this task as blocking the current grace period for the root. Line 31 checks to see if this task was the first in the leaf's list that was blocking the current expedited grace period, and if so, line 32 marks this task as blocking the current expedited grace period for the root. If RCU priority boosting is enabled, then line 34 checks to see if this task is about to be boosted, and if so, line 35 marks this task as next to boost for the root. Finally, lines 39, 40, and 42 clear out the leaf's pointers into its now-empty ->blkd_tasks list.

Quick Quiz 8: What if rcu_preempt_offline_tasks() was executing while a grace period was being initialized, so that the leaf and root rcu_node structures had different ideas about what the current grace period was? Wouldn't that cause confusion when a task blocking what the leaf thought was the current grace period was moved to the root?
Answer

Lines 43-47 handle the possibility that the root node was doing priority boosting, but the leaf was not. In this case, it may be necessary to boost some of the tasks coming from the leaf.

Finally, line 49 returns the flags to the caller.

The rcu_cleanup_dead_cpu() function, which is invoked for the CPU_DEAD notifier action, is as follows:

  1 static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
  2 {
  3   unsigned long flags;
  4   unsigned long mask;
  5   int need_report = 0;
  6   struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
  7   struct rcu_node *rnp = rdp->mynode;
  8 
  9   rcu_stop_cpu_kthread(cpu);
 10   rcu_node_kthread_setaffinity(rnp, -1);
 11   raw_spin_lock_irqsave(&rsp->onofflock, flags);
 12   rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
 13   rcu_adopt_orphan_cbs(rsp);
 14   mask = rdp->grpmask;
 15   do {
 16     raw_spin_lock(&rnp->lock);
 17     rnp->qsmaskinit &= ~mask;
 18     if (rnp->qsmaskinit != 0) {
 19       if (rnp != rdp->mynode)
 20         raw_spin_unlock(&rnp->lock);
 21       break;
 22     }
 23     if (rnp == rdp->mynode)
 24       need_report = rcu_preempt_offline_tasks(rsp, rnp, rdp);
 25     else
 26       raw_spin_unlock(&rnp->lock);
 27     mask = rnp->grpmask;
 28     rnp = rnp->parent;
 29   } while (rnp != NULL);
 30   raw_spin_unlock(&rsp->onofflock);
 31   rnp = rdp->mynode;
 32   if (need_report & RCU_OFL_TASKS_NORM_GP)
 33     rcu_report_unblock_qs_rnp(rnp, flags);
 34   else
 35     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 36   if (need_report & RCU_OFL_TASKS_EXP_GP)
 37     rcu_report_exp_rnp(rsp, rnp, true);
 38   WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL,
 39             "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n",
 40             cpu, rdp->qlen, rdp->nxtlist);
 41 }

The rcu_cleanup_dead_cpu() function dispositions the now-dead CPU's RCU priority-boosting kthreads (lines 9 and 10), dispositions the now-dead CPU's callbacks (lines 12 and 13), and adjusts the rcu_node tree to account for the CPU's departure (lines 14-40).

If RCU priority boosting is enabled, line 9 invokes rcu_stop_cpu_kthread() which stops the per-CPU kthread, while line 10 invokes rcu_node_kthread_setaffinity() which adjusts the affinity mask for the per-rcu_node priority-boost kthread. On the other hand, if RCU priority boosting is not enabled, these two functions do nothing.

Line 11 acquires the rcu_state structure's ->onofflock to exclude rcu_barrier() and RCU-preempt synchronize_rcu_expedited(). This lock is released on line 30, so that it covers both the callback dispositioning and portions of the subsequent rcu_node tree adjustment. Lines 12 and 13 transfer the dead CPU's callbacks to the current CPU.
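
The order-preserving portion of this callback transfer amounts to a tail-pointer list splice. The following self-contained user-space sketch models the core idea; the kernel's actual rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() must in addition segregate callbacks by grace-period state and maintain the associated counts:

  #include <stdio.h>
  #include <stddef.h>

  struct model_cb {
    struct model_cb *next;
    int id;
  };

  struct model_cblist {
    struct model_cb *head;
    struct model_cb **tail;  /* points at the terminating NULL pointer */
  };

  static void model_cblist_init(struct model_cblist *l)
  {
    l->head = NULL;
    l->tail = &l->head;
  }

  static void model_cblist_enqueue(struct model_cblist *l, struct model_cb *cb)
  {
    cb->next = NULL;
    *l->tail = cb;
    l->tail = &cb->next;
  }

  /* Splice all of src's callbacks onto dst, preserving their order. */
  static void model_cblist_adopt(struct model_cblist *dst,
                                 struct model_cblist *src)
  {
    if (src->head == NULL)
      return;
    *dst->tail = src->head;
    dst->tail = src->tail;
    model_cblist_init(src);  /* dead CPU's list is now empty */
  }

  int main(void)
  {
    struct model_cblist dead, survivor;
    struct model_cb cbs[4] = { {0, 0}, {0, 1}, {0, 2}, {0, 3} };
    struct model_cb *cb;

    model_cblist_init(&dead);
    model_cblist_init(&survivor);
    model_cblist_enqueue(&survivor, &cbs[0]);
    model_cblist_enqueue(&dead, &cbs[1]);
    model_cblist_enqueue(&dead, &cbs[2]);
    model_cblist_enqueue(&survivor, &cbs[3]);

    model_cblist_adopt(&survivor, &dead);  /* dead CPU's callbacks follow */
    for (cb = survivor.head; cb; cb = cb->next)
      printf("%d ", cb->id);  /* prints: 0 3 1 2 */
    printf("\n");
    return 0;
  }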

Line 14 starts the rcu_node tree adjustment by picking up the mask with a bit set for the dead CPU within its leaf rcu_node bitmasks. Each pass through the loop spanning lines 15-29 handles one level of the rcu_node tree. Line 16 acquires the current rcu_node structure's ->lock and line 17 clears the bit in the rcu_node structure's ->qsmaskinit corresponding to the lower-level structure, which will be the CPU's rcu_data structure on the first pass through the loop and will be the previous pass's rcu_node structure on subsequent passes through the loop. If line 18 determines that there are still bits remaining in the current rcu_node structure's ->qsmaskinit field, we need not go any further up the tree, in which case lines 19 and 20 release this rcu_node structure's ->lock (but only if this is not the leaf rcu_node structure) and line 21 breaks out of the loop, holding the leaf rcu_node structure's ->lock. Otherwise, execution continues in the loop. If line 23 determines that we are still on the leaf rcu_node structure, line 24 invokes rcu_preempt_offline_tasks() in order to move any blocked tasks queued on this rcu_node structure's ->blkd_tasks list to the root rcu_node structure. On the other hand, if we are on a non-leaf rcu_node structure, line 26 releases that structure's ->lock. Lines 27 and 28 prepare for the next pass through the loop by picking up the mask containing the bit corresponding to this rcu_node structure within the masks of its parent and advancing to that parent, respectively. Finally, line 29 ends the loop if we have passed up out of the root of the tree.

Quick Quiz 9: But given that there will always be at least one CPU online, won't the root rcu_node structure always have at least one bit set in its ->qsmaskinit field, forcing the loop to exit at line 21 of rcu_cleanup_dead_cpu()? So how can execution possibly exit at the bottom of the loop?
Answer

Line 31 picks up a pointer to the dead CPU's leaf rcu_node structure. If line 32 indicates that the call to rcu_preempt_offline_tasks() on line 24 found that there were some tasks blocking the current grace period on the leaf rcu_node structure, then line 33 invokes rcu_report_unblock_qs_rnp() in order to report that this rcu_node structure is no longer blocking the grace period (and to release the leaf rcu_node structure's ->lock). Otherwise, line 35 releases the leaf rcu_node structure's ->lock.

Quick Quiz 10: But the tasks might still be blocked in their RCU read-side critical sections, so why is it safe for line 33 of rcu_cleanup_dead_cpu() to report that the leaf rcu_node structure is no longer blocking the current grace period?
Answer

If line 36 indicates that the call to rcu_preempt_offline_tasks() on line 24 found that there were some tasks blocking the current expedited grace period on the leaf rcu_node structure, then line 37 invokes rcu_report_exp_rnp() in order to report that this rcu_node structure is no longer blocking the expedited grace period. Finally, lines 38-40 give a warning if callbacks have somehow remained queued on the dead CPU.

This whole process is orchestrated by rcu_cpu_notify(), which is as follows:

  1 static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
  2                                     unsigned long action, void *hcpu)
  3 {
  4   long cpu = (long)hcpu;
  5   struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
  6   struct rcu_node *rnp = rdp->mynode;
  7   struct rcu_state *rsp;
  8 
  9   trace_rcu_utilization("Start CPU hotplug");
 10   switch (action) {
 11   case CPU_UP_PREPARE:
 12   case CPU_UP_PREPARE_FROZEN:
 13     rcu_prepare_cpu(cpu);
 14     rcu_prepare_kthreads(cpu);
 15     break;
 16   case CPU_ONLINE:
 17   case CPU_DOWN_FAILED:
 18     rcu_node_kthread_setaffinity(rnp, -1);
 19     rcu_cpu_kthread_setrt(cpu, 1);
 20     break;
 21   case CPU_DOWN_PREPARE:
 22     rcu_node_kthread_setaffinity(rnp, cpu);
 23     rcu_cpu_kthread_setrt(cpu, 0);
 24     break;
 25   case CPU_DYING:
 26   case CPU_DYING_FROZEN:
 27     for_each_rcu_flavor(rsp)
 28       rcu_cleanup_dying_cpu(rsp);
 29     rcu_cleanup_after_idle(cpu);
 30     break;
 31   case CPU_DEAD:
 32   case CPU_DEAD_FROZEN:
 33   case CPU_UP_CANCELED:
 34   case CPU_UP_CANCELED_FROZEN:
 35     for_each_rcu_flavor(rsp)
 36       rcu_cleanup_dead_cpu(cpu, rsp);
 37     break;
 38   default:
 39     break;
 40   }
 41   trace_rcu_utilization("End CPU hotplug");
 42   return NOTIFY_OK;
 43 }

Lines 9 and 41 do tracing. Lines 13 and 14 handle the CPU_UP_PREPARE notifier action, with the former initializing this CPU's per-CPU data and the latter dealing with RCU priority boosting. Lines 18 and 19 handle the CPU_ONLINE notifier action, both dealing with RCU priority boosting. These same two lines also handle CPU_DOWN_FAILED, and note that lines 22 and 23 (which handle CPU_DOWN_PREPARE) are the inverse operations of lines 18 and 19. Lines 27-29 handle the CPU_DYING notifier action, with the first two lines just doing tracing and the last line calling into the RCU_FAST_NO_HZ code to clean up idle state for the outgoing CPU. Lines 35 and 36 handle the CPU_DEAD notifier action, dealing with the now-dead CPU's priority-boosting kthreads, dispositioning the now-dead CPU's callbacks, and reporting the resulting extended quiescent state up the rcu_node tree. These same two lines also handle CPU_UP_CANCELED.

Quick Quiz 11: Why aren't rcu_cpu_notify()'s CPU_UP_CANCELED actions the inverse of its CPU_UP_PREPARE actions?
Answer

Finally, line 42 gives the CPU-hotplug operation RCU's authorization to proceed in all cases.

Summary

This article has described RCU's CPU-hotplug interactions. CPU hotplug will be undergoing some overhauls, which is likely to mean some changes to RCU. Hopefully for the better!

Acknowledgments

I am grateful to @@@ for their help in rendering this article human readable.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Why is it a bad idea to block waiting for a callback to be invoked during early boot?

Answer: Because the scheduler has not yet been initialized to the point where blocking is possible!

Back to Quick Quiz 1.

Quick Quiz 2: But what if my special CPU needs more than one jiffy to come online or go offline?

Answer: You only have one jiffy. If you need more time when going offline, move the work to a CPU_DYING notifier, which RCU cannot interrupt. If you need more time when coming online, then move the work to follow your architecture-specific code that marks the CPU online in the cpu_online_mask.

Back to Quick Quiz 2.

Quick Quiz 3: Under what circumstances can synchronize_sched() be a no-op at runtime (in other words, after boot has fully completed)?

Answer: Any time there is only one online CPU, synchronize_sched() will be a no-op.

Back to Quick Quiz 3.

Quick Quiz 4: This is ridiculous! How did this kludge get into the Linux kernel?

Answer: I am sure that it seemed like a good idea at the time, but it is nevertheless in the process of being fixed.

Back to Quick Quiz 4.

Quick Quiz 5: Why is the root rcu_node structure's ->lock required on line 9 of rcu_init_percpu_data()?

Answer: It prevents the ->n_force_qs field from overflowing. This issue will require some thought should quiescent-state forcing ever be optimized to visit a subset of the leaf rcu_node structures.

Back to Quick Quiz 5.

Quick Quiz 6: But nothing in RCU cares whether the current flavor is RCU-preempt, so why bother?

Answer: If there is still nothing that cares after a few releases, this will be removed. Until then, the question is instead “Why bother removing it?”

Back to Quick Quiz 6.

Quick Quiz 7: Why is it a problem if rcu_preempt_offline_tasks() is invoked on the root rcu_node structure? After all, the rcu_node tree for a small system would consist of only a single rcu_node structure, so in that case, what other rcu_node structure could it possibly be invoked on?

Answer: Given that it is illegal to invoke rcu_preempt_offline_tasks() on any rcu_node structure that has any corresponding online CPUs, it can only be legal to invoke rcu_preempt_offline_tasks() on the root rcu_node if all of the CPUs are offline. But in that case, what could possibly be executing the function? Besides, invoking this function on the root rcu_node structure is pointless in any case, because the whole purpose of this function is to move queued tasks to the root rcu_node structure. And yes, this means that on a system with only a single rcu_node structure, it is illegal to ever call rcu_preempt_offline_tasks().

Back to Quick Quiz 7.

Quick Quiz 8: What if rcu_preempt_offline_tasks() was executing while a grace period was being initialized, so that the leaf and root rcu_node structures had different ideas about what the current grace period was? Wouldn't that cause confusion when a task blocking what the leaf thought was the current grace period was moved to the root?

Answer: This cannot happen. First, from an rcu_node perspective, grace periods are consecutive, with one grace period completely finishing before the next one starts, so that the root's current grace period will either be the same as or one later than that of the leaf, which means that if anything is blocking the leaf's current grace period, it must necessarily be blocking the root's as well. Second, grace-period initialization is carried out under the protection of get_online_cpus(), which blocks CPU-hotplug operations so that rcu_preempt_offline_tasks() cannot execute while a grace period is being initialized. Third and last, at least for the purposes of this Quick Quiz, if the grace period is being initialized, the leaf rcu_node structure must see the previous grace period as having been completed, which would mean that the leaf rcu_node structure cannot possibly have any tasks blocking that grace period.

Back to Quick Quiz 8.

Quick Quiz 9: But given that there will always be at least one CPU online, won't the root rcu_node structure always have at least one bit set in its ->qsmaskinit field, forcing the loop to exit at line 21 of rcu_cleanup_dead_cpu()? So how can execution possibly exit at the bottom of the loop?

Answer: Indeed, it cannot exit at the bottom of the loop. You have a problem with this?

Back to Quick Quiz 9.

Quick Quiz 10: But the tasks might still be blocked in their RCU read-side critical sections, so why is it safe for line 33 of rcu_cleanup_dead_cpu() to report that the leaf rcu_node structure is no longer blocking the current grace period?

Answer: Because those tasks have been moved to the root rcu_node structure, which will be continuing to block the current grace period.

Back to Quick Quiz 10.

Quick Quiz 11: Why aren't rcu_cpu_notify()'s CPU_UP_CANCELED actions the inverse of its CPU_UP_PREPARE actions?

Answer: The reporting of the extended quiescent state by rcu_cleanup_dead_cpu() does act as the inverse of the critical parts of rcu_prepare_cpu(), but it does not make sense to undo the initialization of the CPU's per-CPU data structures. So we don't!

Back to Quick Quiz 11.