September 4, 2011

This article was contributed by Paul E. McKenney

Introduction

  1. Quiescent-State Overview
  2. RCU Quiescent-State API
  3. RCU Quiescent-State Operation
  4. RCU Quiescent-State Implementation
  5. Summary

And the end would simply not be the end without the answers to the quick quizzes.

Quiescent-State Overview

RCU updaters must wait for readers to get done with pre-existing RCU read-side critical sections, the ends of which are communicated to RCU in a more or less timely manner via quiescent states. Once each and every task in the system has been seen in a quiescent state, the system is said to have passed through an RCU grace period. Therefore, RCU updaters can wait for a grace period to elapse between (for example) removing an item from an RCU-protected data structure and freeing it.
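The grace-period idea described above can be sketched as a minimal single-threaded user-space model. This is not kernel code, and all names here are invented for illustration: each simulated CPU has a counter that it bumps at every quiescent state, and a grace period is complete once every counter has advanced past a snapshot taken at grace-period start.

```c
/* Toy user-space model of quiescent states and grace periods.
 * Single-threaded simulation; all names are invented for illustration. */
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

static unsigned long qs_count[NCPUS];    /* quiescent states seen per CPU */
static unsigned long gp_snapshot[NCPUS]; /* counts at grace-period start */

/* A CPU announces that it has passed through a quiescent state. */
static void note_quiescent_state(int cpu)
{
	qs_count[cpu]++;
}

/* Begin a grace period by snapshotting every CPU's counter. */
static void grace_period_start(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		gp_snapshot[cpu] = qs_count[cpu];
}

/* The grace period ends once every CPU has been seen in a quiescent
 * state since the snapshot was taken. */
static bool grace_period_done(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (qs_count[cpu] == gp_snapshot[cpu])
			return false;
	return true;
}
```

Real implementations avoid this sort of global polling, but the invariant is the same: an updater that waits for grace_period_done() knows that every pre-existing reader has finished.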

Detecting when RCU readers are in quiescent states is clearly critically important, and is the topic of this article. Different flavors of RCU have different quiescent states:

  1. RCU-sched's quiescent states are user-space execution, context switch, not being runnable, and execution in the idle task.
  2. RCU-bh's quiescent states are all of RCU-sched's quiescent states plus executing outside of bottom-half context. CPUs running softirq handlers or in bottom-half-disable mode (e.g., via local_bh_disable()) are inside of bottom-half context.
  3. RCU-preempt's quiescent states are anywhere not in an RCU read-side critical section.

Quick Quiz 1: Give an example of a task that is in a quiescent state for RCU-sched, but not for RCU-preempt.
Answer

Quick Quiz 2: Is a quiescent state a property of a task, a CPU, a kernel thread, or of something else?
Answer

Regardless of how you choose to think about quiescent states, an important part of RCU's operation is efficiently detecting them. Because every blocked and preempted task is in both an RCU-bh and an RCU-sched quiescent state, these two RCU flavors need concern themselves only with the tasks, interrupts, and NMIs that are actually running at a given point in time. These RCU flavors therefore focus almost entirely on the CPUs, ignoring the potentially much larger number of tasks that are not currently running.

In contrast, RCU-preempt must also track any tasks blocked within an RCU-preempt read-side critical section. This flavor of RCU nevertheless avoids time-consuming scans of the entire task list by checking each outgoing task at context-switch time, explicitly tracking those that do so within RCU read-side critical sections. This approach permits RCU-preempt to use fast CPU/task-local counters in the common case, and to use more expensive operations in the much less-common situation where a task is preempted or blocks within an RCU-preempt read-side critical section.

Given this overview, we are now ready to examine the API.

RCU Quiescent-State API

RCU's quiescent-state API is reserved for scheduler-like parts of the kernel; it is not intended for use by normal RCU users. That said, here it is:

  1. rcu_bh_qs(int cpu)
  2. rcu_check_callbacks(int cpu, int user)
  3. rcu_note_context_switch(int cpu)
  4. rcu_virt_note_context_switch(int cpu)

The rcu_bh_qs() API member is called to announce an RCU-bh quiescent state to RCU. It is invoked by the softirq scheduler between consecutive softirq handler invocations, and from rcu_check_callbacks() when it notes that the scheduling-clock interrupt occurred in a section of code where bottom halves were enabled.

Quick Quiz 3: Why aren't rcu_sched_qs() and rcu_preempt_qs() part of this API?
Answer

The rcu_check_callbacks() API member is called from within the scheduling-clock interrupt. Its purposes are as follows:

  1. Inform RCU of user-mode execution, which is an extended quiescent state for both RCU-bh and RCU-sched.
  2. Inform RCU of non-dyntick idle-loop execution, which again is a quiescent state for both RCU-bh and RCU-sched.
  3. Cause the current CPU to respond as needed to any RCU-related actions of other CPUs.

Quick Quiz 4: What RCU-related actions might the current CPU need to respond to?
Answer

The rcu_note_context_switch() API member is called from the scheduler and from run_ksoftirqd() to report a context switch. In the case of run_ksoftirqd(), this is a fake context switch, but one that is nevertheless helpful when ksoftirqd decides to process lots of softirq handlers.

The rcu_virt_note_context_switch() API member is used by KVM to inform RCU when a CPU enters KVM guest mode, which as far as RCU is concerned is equivalent to user-mode execution.

RCU Quiescent-State Operation

With the exception of RCU-preempt, the recording of RCU's quiescent states is straightforward: quiescent states are recorded as they occur, and the RCU core indicates when quiescent states are needed.

RCU-preempt is a bit more complex due to its need to track tasks that block within RCU read-side critical sections. RCU-preempt tracks quiescent states for CPUs in much the same way that RCU-sched does. However, when a CPU is context-switching away from a given task, RCU-preempt also checks to see if that task is in an RCU read-side critical section. If so, RCU-preempt queues the task on that CPU's leaf rcu_node structure's ->blkd_tasks list. The rcu_node structure maintains internal pointers into this list indicating which of the tasks are blocking the current grace period: if the CPU's RCU read-side critical section was blocking the current grace period, then the blocked task must also still be blocking that grace period. Only after RCU-preempt has done any needed task queuing does it record a quiescent state for the CPU.

Quick Quiz 5: But suppose that a CPU that has already passed through a quiescent state for the current RCU grace period resumes executing this task that is blocking the current RCU grace period. We then have a task that is blocking the current RCU grace period running on a CPU that is no longer blocking that same RCU grace period. How can that possibly make any sense?
Answer

It turns out that RCU-preempt can check a single pointer (->gp_tasks) to determine whether or not there is a task blocking the current RCU grace period on a given rcu_node structure. Furthermore, RCU-preempt can check a single bitmask (->qsmask) to determine whether or not there is a CPU associated with a given rcu_node structure that needs to pass through a quiescent state. These tricks permit RCU-preempt to efficiently check for quiescent states.

RCU Quiescent-State Implementation

The quiescent-state implementation is presented in three pieces:

  1. Momentary quiescent states.
  2. Extended quiescent states.
  3. Context-switch handling.

The implementation of the first category, momentary quiescent states, is as follows:

  1 void rcu_bh_qs(int cpu)
  2 {
  3   struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
  4 
  5   rdp->passed_quiesce_gpnum = rdp->gpnum;
  6   barrier();
  7   if (rdp->passed_quiesce == 0)
  8     trace_rcu_grace_period("rcu_bh", rdp->gpnum, "cpuqs");
  9   rdp->passed_quiesce = 1;
 10 }
 11 
 12 static void rcu_preempt_qs(int cpu)
 13 {
 14   struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
 15 
 16   rdp->passed_quiesce_gpnum = rdp->gpnum;
 17   barrier();
 18   if (rdp->passed_quiesce == 0)
 19     trace_rcu_grace_period("rcu_preempt", rdp->gpnum, "cpuqs");
 20   rdp->passed_quiesce = 1;
 21   current->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_NEED_QS;
 22 }
 23 
 24 void rcu_sched_qs(int cpu)
 25 {
 26   struct rcu_data *rdp = &per_cpu(rcu_sched_data, cpu);
 27 
 28   rdp->passed_quiesce_gpnum = rdp->gpnum;
 29   barrier();
 30   if (rdp->passed_quiesce == 0)
 31     trace_rcu_grace_period("rcu_sched", rdp->gpnum, "cpuqs");
 32   rdp->passed_quiesce = 1;
 33 }

The rcu_bh_qs() function, whose purpose is to announce an RCU-bh quiescent state to RCU, is shown on lines 1-10. Line 3 gets a pointer to this CPU's rcu_data structure for RCU-bh (hence rcu_bh_data), which means that the caller must have at least disabled preemption. Line 5 records this CPU's idea of the current grace-period number and line 6 ensures that the compiler doesn't reorder line 5 with the remainder of the function. Line 7 checks to see if this is the first quiescent state for the current grace period, and, if so, line 8 traces it. Finally, line 9 records the fact that we have encountered an RCU-bh quiescent state, which will eventually be noticed by the grace-period machinery.

Quick Quiz 6: Why is the barrier() in rcu_bh_qs() on line 6 required?
Answer

Quick Quiz 7: Why trace only the first quiescent state of a given grace period?
Answer

The rcu_preempt_qs() function, whose purpose is to announce a CPU's RCU-preempt quiescent state to RCU, is shown on lines 12-22. (In CONFIG_TREE_RCU kernels, which do not implement RCU-preempt, rcu_preempt_qs() is omitted.) As noted earlier, just because a given CPU is in an RCU-preempt quiescent state does not mean that the task that this CPU just context-switched away from is also in an RCU-preempt quiescent state—and RCU-preempt needs quiescent states from all tasks as well as all CPUs.

Quick Quiz 8: Why don't the other RCU flavors also need quiescent states from all tasks as well as all CPUs?
Answer

That aside, the rcu_preempt_qs() function is analogous to rcu_bh_qs() except for line 21, which responds to any outstanding request to finish the RCU read-side critical section.

The rcu_sched_qs() function is shown on lines 24-33. This function is called to announce an RCU-sched quiescent state to RCU. It is invoked from rcu_note_context_switch(), rcu_check_callbacks(), and, when CONFIG_RCU_FAST_NO_HZ is set, rcu_needs_cpu(). It is analogous to rcu_bh_qs(), so it will not be described separately.

The implementation of the second category, extended quiescent states, is as follows:

  1 void rcu_check_callbacks(int cpu, int user)
  2 {
  3   trace_rcu_utilization("Start scheduler-tick");
  4   if (user ||
  5       (idle_cpu(cpu) && rcu_scheduler_active &&
  6        !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
  7     rcu_sched_qs(cpu);
  8     rcu_bh_qs(cpu);
  9   } else if (!in_softirq()) {
 10     rcu_bh_qs(cpu);
 11   }
 12   rcu_preempt_check_callbacks(cpu);
 13   if (rcu_pending(cpu))
 14     invoke_rcu_core();
 15   trace_rcu_utilization("End scheduler-tick");
 16 }
 17 
 18 static void rcu_preempt_check_callbacks(int cpu)
 19 {
 20   struct task_struct *t = current;
 21 
 22   if (t->rcu_read_lock_nesting == 0) {
 23     rcu_preempt_qs(cpu);
 24     return;
 25   }
 26   if (t->rcu_read_lock_nesting > 0 &&
 27       per_cpu(rcu_preempt_data, cpu).qs_pending)
 28     t->rcu_read_unlock_special |= RCU_READ_UNLOCK_NEED_QS;
 29 }

The rcu_check_callbacks() function, which checks for user-mode execution, idle CPUs, and (for RCU-bh) execution between softirq handlers, is shown on lines 1-16.

Quick Quiz 9: But why doesn't this code check for dyntick-idle and offline CPUs, both of which are also extended quiescent states?
Answer

Lines 3 and 15 trace the beginning and end, respectively, of rcu_check_callbacks(). Lines 4-6 check for RCU-sched quiescent states as follows:

  1. Line 4 checks to see if this scheduling-clock interrupt occurred during user-space execution, which is a quiescent state for both RCU-bh and RCU-sched.
  2. The first term of line 5 checks to see if this scheduling-clock interrupt interrupted the idle task. This would normally be a quiescent state for both RCU-bh and RCU-sched, but there are some special cases where it is not a quiescent state, which are covered by the remaining terms.
  3. The second term of line 5 checks to see if this scheduling-clock interrupt occurred before the scheduler has been fully initialized during early boot. Because all early-boot execution occurs in idle-task context, the idle task is not automatically a quiescent state during this time.
  4. The first term of line 6 checks to see if this scheduling-clock interrupt occurred during the execution of a softirq handler. Because softirq handlers can contain RCU read-side critical sections, the idle task is not a quiescent state for either RCU-bh or RCU-sched during softirq processing.
  5. The second term of line 6 is similar to the first term, but for hardirq handlers rather than softirq handlers.
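The combined condition on lines 4-6 can be modeled in user space as a pure predicate. The parameters below stand in for the real kernel interfaces (idle_cpu(), rcu_scheduler_active, in_softirq(), and hardirq_count()), and the HARDIRQ_SHIFT value here is illustrative:

```c
/* User-space model of the quiescent-state condition on lines 4-6 of
 * rcu_check_callbacks().  Parameters stand in for the real kernel
 * interfaces; HARDIRQ_SHIFT's value is illustrative. */
#include <assert.h>
#include <stdbool.h>

#define HARDIRQ_SHIFT 16 /* illustrative placement of the hardirq count */

static bool sched_qs(bool user, bool idle, bool scheduler_active,
		     bool in_softirq, unsigned long hardirq_count)
{
	return user ||
	       (idle && scheduler_active && !in_softirq &&
		hardirq_count <= (1UL << HARDIRQ_SHIFT));
}
```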

Quick Quiz 10: Why isn't the second term on line 6 simply “!hardirq_count()”? Why do we need the comparison involving HARDIRQ_SHIFT?
Answer

If lines 4-6 determine that we are in an RCU-sched quiescent state, then line 7 notes an RCU-sched quiescent state and line 8 notes an RCU-bh quiescent state. (Recall that any RCU-sched quiescent state is also an RCU-bh quiescent state.) Line 9 checks to see if we are running in softirq context, in other words, if we interrupted a softirq handler or a region of code with bottom halves disabled (e.g., via local_bh_disable()). If we are not running in softirq context, line 10 records the fact that we are in an RCU-bh quiescent state.

Line 12 invokes rcu_preempt_check_callbacks() in order to check for RCU-preempt quiescent states in CONFIG_TREE_PREEMPT_RCU kernels. Line 13 checks to see if RCU needs anything from this CPU, and, if so, line 14 causes the RCU core code to run on this CPU in softirq context.

The rcu_preempt_check_callbacks() function is shown on lines 18-29 above. Again, the purpose of this function is to check for RCU-preempt quiescent states, so in CONFIG_TREE_RCU kernels, which do not implement RCU-preempt, rcu_preempt_check_callbacks() is an empty function. On the other hand, in CONFIG_TREE_PREEMPT_RCU kernels, line 20 picks up a pointer to the current task, and line 22 checks to see if this task is executing outside of any RCU-preempt read-side critical section, in which case line 23 announces a quiescent state to the RCU core code. Lines 26 and 27 check to see if we are in an RCU-preempt read-side critical section on a CPU that has not yet passed through a quiescent state for the current RCU-preempt grace period, and, if so, line 28 sets the RCU_READ_UNLOCK_NEED_QS bit so that the next outermost rcu_read_unlock() will announce a quiescent state.

The implementation of the third category, context-switch handling, is as follows:

  1 void rcu_note_context_switch(int cpu)
  2 {
  3   trace_rcu_utilization("Start context switch");
  4   rcu_sched_qs(cpu);
  5   rcu_preempt_note_context_switch(cpu);
  6   trace_rcu_utilization("End context switch");
  7 }
  8 
  9 static inline void rcu_virt_note_context_switch(int cpu)
 10 {
 11   rcu_note_context_switch(cpu);
 12 }
 13 
 14 static void rcu_preempt_note_context_switch(int cpu)
 15 {
 16   struct task_struct *t = current;
 17   unsigned long flags;
 18   struct rcu_data *rdp;
 19   struct rcu_node *rnp;
 20 
 21   if (t->rcu_read_lock_nesting > 0 &&
 22       (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {
 23     rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
 24     rnp = rdp->mynode;
 25     raw_spin_lock_irqsave(&rnp->lock, flags);
 26     t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BLOCKED;
 27     t->rcu_blocked_node = rnp;
 28     WARN_ON_ONCE((rdp->grpmask & rnp->qsmaskinit) == 0);
 29     WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
 30     if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
 31       list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
 32       rnp->gp_tasks = &t->rcu_node_entry;
 33 #ifdef CONFIG_RCU_BOOST
 34       if (rnp->boost_tasks != NULL)
 35         rnp->boost_tasks = rnp->gp_tasks;
 36 #endif
 37     } else {
 38       list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
 39       if (rnp->qsmask & rdp->grpmask)
 40         rnp->gp_tasks = &t->rcu_node_entry;
 41     }
 42     trace_rcu_preempt_task(rdp->rsp->name,
 43                t->pid,
 44                (rnp->qsmask & rdp->grpmask)
 45                ? rnp->gpnum
 46                : rnp->gpnum + 1);
 47     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 48   } else if (t->rcu_read_lock_nesting < 0 &&
 49        t->rcu_read_unlock_special) {
 50     rcu_read_unlock_special(t);
 51   }
 52   local_irq_save(flags);
 53   rcu_preempt_qs(cpu);
 54   local_irq_restore(flags);
 55 }

The rcu_note_context_switch() function is shown on lines 1-7. Lines 3 and 6 trace entry and exit, line 4 invokes rcu_sched_qs() in order to announce the RCU-sched quiescent state, and line 5 invokes rcu_preempt_note_context_switch() in order to allow RCU-preempt to do context-switch-time processing for CONFIG_TREE_PREEMPT_RCU kernels.

Quick Quiz 11: Why not also call rcu_bh_qs() from rcu_note_context_switch()? After all, any RCU-sched quiescent state is also an RCU-bh quiescent state.
Answer

The rcu_virt_note_context_switch() function, shown on lines 9-12, allows KVM to inform RCU of the beginning of guest-mode operation, allowing RCU to treat guest-OS execution in a manner similar to the way that it treats user-mode execution.

Quick Quiz 12: Suppose that the guest OS is Linux: Then the guest OS will have RCU activity. So how can RCU safely treat guest-OS execution in the same way that it treats user-mode execution?
Answer

Quick Quiz 13: Yes, rcu_virt_note_context_switch() causes RCU to note a quiescent state when a KVM guest begins executing. But what if that guest continues executing indefinitely? Wouldn't that indefinitely extend the RCU grace period, eventually resulting in out-of-memory conditions?
Answer

The rcu_preempt_note_context_switch() function on lines 14-55 handles context-switch-time processing for RCU-preempt in CONFIG_TREE_PREEMPT_RCU kernels (in CONFIG_TREE_RCU kernels, rcu_preempt_note_context_switch() is an empty function). This function has three major jobs:

  1. Enqueue tasks that block within RCU-preempt read-side critical sections (lines 21-47).
  2. Finish rcu_read_unlock_special() processing for tasks that are preempted in the middle of an outermost rcu_read_unlock() (lines 48-51).
  3. Announce an RCU-preempt quiescent state (lines 52-54).

Quick Quiz 14: So what happens to tasks that are blocking, but which happen to be in neither an RCU-preempt read-side critical section nor in the middle of an outermost rcu_read_unlock()?
Answer

The first job is the biggest and is also the one we start with. Lines 21-22 check to see if this task is blocking for the first time within an RCU-preempt read-side critical section, and, if so, lines 23-47 enqueue the task on the current CPU's leaf rcu_node structure. Line 23 picks up a pointer to the current CPU's rcu_data structure (which means that the caller must have at least disabled preemption), line 24 picks up a pointer to the current CPU's rcu_node structure, and line 25 acquires that rcu_node structure's ->lock. Line 26 sets the RCU_READ_UNLOCK_BLOCKED bit so as to cause the next outermost __rcu_read_unlock() to invoke rcu_read_unlock_special() in order to dequeue this task, and line 27 records a pointer to the rcu_node structure on which the task was enqueued.

Quick Quiz 15: What prevents the task from entering rcu_read_unlock_special() before rcu_preempt_note_context_switch() has finished enqueuing it, thereby corrupting the lists?
Answer

Quick Quiz 16: Why must rcu_preempt_note_context_switch() record the pointer to the rcu_node structure? After all, list_del_rcu() works just fine given only the list element, the list header is not required.
Answer

Line 28 complains if the current CPU is offline (if it is offline, how is it that this task is currently running on it?), and line 29 complains if this task is already queued somewhere.

We are now ready for lines 30-41 to actually queue the task. However, we first need to figure out where to enqueue it, and there are three cases that must be properly handled:

  1. This task's RCU-preempt read-side critical section blocks the current grace period, but there is already another task queued that also blocks this same grace period (lines 31-36). This task must be enqueued to precede that other task, and the ->gp_tasks pointer must be updated to point to the task that is just now context switching. If RCU priority boosting is in progress, then this new task will also need to be priority boosted.
  2. This task's RCU-preempt read-side critical section blocks the current grace period, and it is the first to do so (lines 38-40). This task must be enqueued at the head of the ->blkd_tasks list, and the ->gp_tasks pointer must be updated to reference it.
  3. This task's RCU-preempt read-side critical section does not block the current grace period, which means that it will block the next grace period should that next grace period start before this read-side critical section ends (line 38, but not lines 39-40). This task must be enqueued at the head of the ->blkd_tasks list, but no other action need be taken.

Starting with the first case above, line 30 checks to see if the current CPU is blocking the current grace period (rnp->qsmask & rdp->grpmask) and there is at least one other task on this rcu_node structure doing so (rnp->gp_tasks != NULL). If so, lines 31-36 handle this case. Line 31 adds the task to the list so that it immediately precedes the first task blocking the current grace period on this rcu_node structure's ->blkd_tasks list, and line 32 makes this rcu_node structure's ->gp_tasks pointer reference the new task. If RCU priority boosting has been configured, then line 34 checks to see if this rcu_node structure is currently being priority boosted, and, if so, line 35 makes this rcu_node structure's ->boost_tasks pointer reference the task undergoing a context switch.

Otherwise, we are in either case 2 or 3. In both of these cases, line 38 enqueues the task at the head of this rcu_node structure's ->blkd_tasks list. Line 39 checks to see if this task is blocking the current grace period. If so, we are in case 2, so line 40 makes this rcu_node structure's ->gp_tasks pointer reference this task.

Regardless of which of the three cases applies, lines 42-46 trace the fact that this task blocked, and line 47 releases this rcu_node structure's ->lock.
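The three queuing cases above can be exercised with a hedged user-space model of lines 30-41. The list primitives below mimic the kernel's circular doubly-linked lists, and struct rnp is a pared-down, invented stand-in for the rcu_node structure:

```c
/* Hedged user-space model of the three queuing cases on lines 30-41.
 * The list primitives mimic the kernel's circular doubly-linked lists;
 * struct rnp is an invented stand-in for rcu_node. */
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

/* Insert item immediately after head (kernel list_add() semantics). */
static void list_add_model(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

struct rnp {
	struct list_head blkd_tasks;  /* models ->blkd_tasks */
	struct list_head *gp_tasks;   /* models ->gp_tasks */
	unsigned long qsmask;         /* models ->qsmask */
};

static void enqueue_blocked(struct rnp *rnp, struct list_head *task,
			    unsigned long grpmask)
{
	if ((rnp->qsmask & grpmask) && rnp->gp_tasks != NULL) {
		/* Case 1: insert just before the first task blocking the GP. */
		list_add_model(task, rnp->gp_tasks->prev);
		rnp->gp_tasks = task;
	} else {
		/* Cases 2 and 3: head of ->blkd_tasks. */
		list_add_model(task, &rnp->blkd_tasks);
		if (rnp->qsmask & grpmask)
			rnp->gp_tasks = task; /* Case 2: first GP blocker. */
	}
}
```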

We have now completed our review of lines 21-47, which implement the rcu_preempt_note_context_switch() function's first job, which was to enqueue tasks that block within RCU-preempt read-side critical sections. We are therefore ready to proceed to the second job, which is finishing rcu_read_unlock_special() processing for tasks that are preempted in the middle of an outermost rcu_read_unlock(), which is covered by lines 48-51.

Line 48 checks to see if this task was in the middle of an outermost __rcu_read_unlock() (recall that __rcu_read_unlock() sets its task_struct structure's ->rcu_read_lock_nesting to INT_MIN before invoking rcu_read_unlock_special() and to zero after rcu_read_unlock_special() returns). If so, line 49 checks this task_struct structure's ->rcu_read_unlock_special field, which, if nonzero, indicates that there is some work for rcu_read_unlock_special() to do, in which case line 50 invokes rcu_read_unlock_special() to get that work done.
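This handshake can be modeled with a toy user-space sketch. Assumed simplifications: the real __rcu_read_unlock() uses compiler barriers and per-task state, none of which is modeled here, and all names are invented for illustration.

```c
/* Toy model of the outermost-unlock handshake between
 * __rcu_read_unlock() and rcu_preempt_note_context_switch(). */
#include <assert.h>
#include <limits.h>

static int nesting;         /* models t->rcu_read_lock_nesting */
static int unlock_special;  /* models t->rcu_read_unlock_special */
static int special_calls;   /* counts rcu_read_unlock_special() calls */

static void unlock_special_model(void)
{
	special_calls++;
	unlock_special = 0;
}

static void read_unlock_model(void)
{
	if (nesting != 1) {
		nesting--;
	} else {
		nesting = INT_MIN; /* flag: inside outermost-unlock cleanup */
		if (unlock_special)
			unlock_special_model();
		nesting = 0;
	}
}

/* Models the lines 48-51 check in rcu_preempt_note_context_switch():
 * a negative nesting count means the task was preempted mid-unlock. */
static void context_switch_check(void)
{
	if (nesting < 0 && unlock_special)
		unlock_special_model();
}
```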

Quick Quiz 17: Say what??? Why does rcu_preempt_note_context_switch() need to invoke rcu_read_unlock_special()?
Answer

We have now completed our review of lines 48-51, which do the rcu_preempt_note_context_switch() function's second job of dealing with tasks preempted in the middle of an outermost rcu_read_unlock(). We are now ready to look at the final job, which is announcing an RCU-preempt quiescent state. This is straightforward: Line 52 disables interrupts, line 53 calls rcu_preempt_qs() to make the announcement, and line 54 restores interrupts.

Summary

This article has described RCU's quiescent-state-handling code. Although the code itself is quite simple, especially for RCU code, its relationships with other parts of RCU and indeed with other parts of the Linux kernel can be quite subtle. Great care is required when studying this code, and most especially when modifying it.

Acknowledgments

I am grateful to Gleb Natapov and Avi Kivity for their explanation of how rcu_virt_note_context_switch() handles the quiescent-state interaction between KVM and RCU, and to Cheng Xu and Peter Zijlstra for greatly increasing the human readability of this article.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Give an example of a task that is in a quiescent state for RCU-sched, but not for RCU-preempt.

Answer: A task that is blocked or preempted while within an RCU-preempt read-side critical section.

Back to Quick Quiz 1.

Quick Quiz 2: Is a quiescent state a property of a task, a CPU, a kernel thread, or of something else?

Answer: Within the confines of the Linux kernel, the answer to this question is rather fuzzy. To see this, consider the possibilities:

  1. Quiescent state is a property of a CPU. This fails when we consider an RCU-preempt task that has been preempted within an RCU read-side critical section. This task is not in a quiescent state, but it has no CPU.
  2. Quiescent state is a property of a task. This fails when we consider interrupt and NMI handlers, which can have both RCU read-side critical sections and quiescent states independent of whatever task happens to be running.
  3. Quiescent state is a property of a kthread. This fails when we consider user threads executing in kernel mode.

This situation might be a bit frustrating to those who prefer rigorous definitions, but the fact is that the design of high-performance production quality RCU implementations depends critically on blurring the distinctions between these contexts. Alternatively, feel free to think in terms of RCU needing to separately handle each of these contexts—and boot-time execution as well!

Back to Quick Quiz 2.

Quick Quiz 3: Why aren't rcu_sched_qs() and rcu_preempt_qs() part of this API?

Answer: Because they are called from within the RCU implementation only, so they do not qualify as external-to-RCU API members.

Back to Quick Quiz 3.

Quick Quiz 4: What RCU-related actions might the current CPU need to respond to?

Answer: Other CPUs might start new grace periods, end old grace periods, or fail to pass through an RCU quiescent state in a timely manner. Each of these actions (or, in the last case, inaction) might require a response from the current CPU.

If another CPU starts a new grace period, then the current CPU must set up so that it will report passage through some later quiescent state.

If another CPU ends an old grace period, then the current CPU must advance its callbacks and check whether another grace period is required.

If another CPU fails to pass through an RCU quiescent state in a timely manner, then the current CPU needs to check for it being in dyntick-idle mode, being offline, or being in need of an IPI.

Back to Quick Quiz 4.

Quick Quiz 5: But suppose that a CPU that has already passed through a quiescent state for the current RCU grace period resumes executing this task that is blocking the current RCU grace period. We then have a task that is blocking the current RCU grace period running on a CPU that is no longer blocking that same RCU grace period. How can that possibly make any sense?

Answer: The trick is that the task remains queued on an rcu_node structure's ->blkd_tasks list. Therefore, RCU-preempt understands that it still needs to wait for the task to finish its RCU read-side critical section even if it doesn't think that it needs to wait on the CPU. And as long as RCU waits long enough, it doesn't really matter exactly what it thinks that it is waiting on.

Back to Quick Quiz 5.

Quick Quiz 6: Why is the barrier() in rcu_bh_qs() on line 6 required?

Answer: Because otherwise the compiler could reorder line 9 with line 5, which could cause a quiescent state from an old grace period to be applied to this new grace period, which could in turn cause this new grace period to end too soon. Which could result in random memory corruption, which none of us want!

Exactly how could this happen? As follows:

  1. Because the barrier() statement on line 6 was omitted, the compiler took it upon itself to perform a code-motion optimization that caused the assembly code for line 9 to appear before that of line 5.
  2. CPU 0 enters rcu_bh_qs(), executes the code for line 9 and then takes an interrupt.
  3. CPU 1, noting that the current RCU-bh grace period is very long in the tooth, invokes force_quiescent_state(), which in turn deduces that CPU 0 spent some time in dyntick-idle mode during the current grace period.
  4. CPU 1 therefore announces a quiescent state on behalf of CPU 0, which ends the current grace period.
  5. CPU 1 has some RCU-bh callbacks registered that need another RCU-bh grace period, which CPU 1 initiates.
  6. CPU 0 takes a scheduling-clock interrupt, which invokes rcu_check_callbacks(), which in turn invokes rcu_pending(), which notices that there is a new grace period.
  7. CPU 0 therefore schedules an RCU_SOFTIRQ.
  8. The RCU_SOFTIRQ fires, invoking rcu_process_callbacks(), which initiates processing that copies the new ->gpnum to CPU 0's rcu_data structure.
  9. CPU 0 returns from its interrupt back to rcu_bh_qs(), where it executes the compiler-misordered line 5, recording the new grace-period number for a quiescent state that was detected during the old grace period.
  10. Later, the grace period might end too soon, which would result in all manner of calamities.

The moral of this story is to always make very sure that you carefully track which grace period a given quiescent state corresponds to.

Back to Quick Quiz 6.

Quick Quiz 7: Why trace only the first quiescent state of a given grace period?

Answer: I tried that. Once. Then I changed it because it isn't helpful to have the trace full of irrelevant quiescent states.

Back to Quick Quiz 7.

Quick Quiz 8: Why don't the other RCU flavors also need quiescent states from all tasks as well as all CPUs?

Answer: Actually, the other RCU flavors really do need quiescent states from all tasks as well as all CPUs. It is just that all blocked tasks are by definition in extended quiescent states with respect to both RCU-bh and RCU-preempt. Therefore, these two flavors of RCU can focus their attention entirely on quiescent states for the CPUs, despite needing quiescent states from both CPUs and tasks.

Back to Quick Quiz 8.

Quick Quiz 9: But why doesn't this code check for dyntick-idle and offline CPUs, both of which are also extended quiescent states?

Answer: Dyntick-idle mode and offline state are indeed quiescent states, but it does not make sense for a CPU to check to see if it is in dyntick-idle mode or if it is offline, because such CPUs are not executing code. Therefore, because the rcu_check_callbacks() and rcu_preempt_check_callbacks() functions execute on the CPU being checked for quiescent states, these functions cannot check for dyntick-idle or offline mode. Checks for CPUs in these modes are instead carried out when forcing quiescent states, and so are discussed in a separate article.

Back to Quick Quiz 9.

Quick Quiz 10: Why isn't the second term on line 6 simply “!hardirq_count()”? Why do we need the comparison involving HARDIRQ_SHIFT?

Answer: Because the scheduling-clock interrupt is a hardware interrupt, this code is guaranteed to see hardirq nesting at least one deep. So we must be in a hardirq handler nested within another hardirq handler to disqualify this call from being an idle-task quiescent state. The check in the second term of line 6 therefore returns true if the hardirq nesting is less than or equal to 1.
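In other words, with the hardirq count stored in a bitfield of the preemption counter starting at HARDIRQ_SHIFT (the value 16 below is illustrative, not the authoritative kernel constant), the comparison is just a test for at most one active level of hardirq nesting, namely the scheduling-clock interrupt itself:

```c
/* Illustrative restatement of the line-6 test.  HARDIRQ_SHIFT's value
 * and the function names here are invented for illustration. */
#include <assert.h>

#define HARDIRQ_SHIFT 16

/* hardirq nesting level, encoded the way the preempt counter stores it */
static unsigned long hardirq_count_model(unsigned int nesting)
{
	return (unsigned long)nesting << HARDIRQ_SHIFT;
}

/* True iff at most one hardirq level is active. */
static int shallow_enough(unsigned int nesting)
{
	return hardirq_count_model(nesting) <= (1UL << HARDIRQ_SHIFT);
}
```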

Back to Quick Quiz 10.

Quick Quiz 11: Why not also call rcu_bh_qs() from rcu_note_context_switch()? After all, any RCU-sched quiescent state is also an RCU-bh quiescent state.

Answer: It would be theoretically correct to invoke rcu_bh_qs() from rcu_note_context_switch(), but it is important to keep in mind that this is on the context-switch fastpath, so we must avoid any unnecessary overhead. Given that an RCU-bh quiescent state is observed between each softirq handler and during any scheduling-clock interrupt, it currently appears that invoking rcu_bh_qs() from rcu_note_context_switch() would represent unnecessary overhead. That said, it is quite possible that some future workload will require a change in this area.

Back to Quick Quiz 11.

Quick Quiz 12: Suppose that the guest OS is Linux: Then the guest OS will have RCU activity. So how can RCU safely treat guest-OS execution in the same way that it treats user-mode execution?

Answer: Two different OSes, two different independent instances of RCU. So the host kernel's RCU can ignore the guest OS's RCU in much the same way that the host kernel's RCU can ignore any use of user-mode RCU in user-mode applications.

Back to Quick Quiz 12.

Quick Quiz 13: Yes, rcu_virt_note_context_switch() causes RCU to note a quiescent state when a KVM guest begins executing. But what if that guest continues executing indefinitely? Wouldn't that indefinitely extend the RCU grace period, eventually resulting in out-of-memory conditions?

Answer: The scheduling-clock interrupt continues ticking while a KVM guest is executing. When the scheduling-clock-interrupt handler returns to the KVM guest, KVM will once again invoke rcu_virt_note_context_switch(), which will report another quiescent state to RCU, permitting any RCU grace periods to complete in a timely manner.

Back to Quick Quiz 13.

Quick Quiz 14: So what happens to tasks that are blocking, but which happen to be in neither an RCU-preempt read-side critical section nor in the middle of an outermost rcu_read_unlock()?

Answer: Absolutely nothing. Such tasks can safely be ignored by RCU-preempt.

Back to Quick Quiz 14.

Quick Quiz 15: What prevents the task from entering rcu_read_unlock_special() before rcu_preempt_note_context_switch() has finished enqueuing it, thereby corrupting the lists?

Answer: Nothing prevents the task from entering rcu_read_unlock_special() before rcu_preempt_note_context_switch() has finished enqueuing it, but use of the rcu_node structure's ->lock field prevents the lists from being corrupted.

Back to Quick Quiz 15.

Quick Quiz 16: Why must rcu_preempt_note_context_switch() record the pointer to the rcu_node structure? After all, list_del_rcu() works just fine given only the list element, the list header is not required.

Answer: Because in this case, use of list_del_rcu() is unsafe unless the rcu_node structure's ->lock is held.

Back to Quick Quiz 16.

Quick Quiz 17: Say what??? Why does rcu_preempt_note_context_switch() need to invoke rcu_read_unlock_special()?

Answer: In principle, it does not. The task will eventually run again, and will finish executing rcu_read_unlock_special() at that time. However, if the task is blocked for too long, it will be needlessly priority boosted (recall that it has already completed its RCU read-side critical section and is in the final throes of cleanup). The call to rcu_read_unlock_special() is reasonably cheap and is quite infrequent, so it makes sense to call it to avoid the priority boosting operation—and perhaps more important, to simplify RCU's state space by eliminating the possibility of tasks blocked waiting to do final rcu_read_unlock_special() cleanup.

Back to Quick Quiz 17.