September 21, 2012 (Linux 3.6+)

This article was contributed by Paul E. McKenney

Introduction

  1. Grace-Period Overview
  2. Grace-Period Operation
  3. Grace-Period Implementation
  4. Summary

And then there are of course the customary answers to the quick quizzes.

Grace-Period Overview

RCU updaters must wait for all pre-existing RCU read-side critical sections, and the ends of RCU read-side critical sections are communicated via quiescent states. The length of time for all pre-existing RCU read-side critical sections to complete is known as a grace period, and the job of grace-period detection is therefore to efficiently process quiescent-state information, which is a special challenge on large systems.

Grace-Period Data Flow

Grace-period detection is therefore done using a combining tree, with different types of quiescent-state information being fed in at different levels, as shown in the following diagram:

TreeRCUStateMachine.png

The reason that dyntick-idle information enters via the rcu_dynticks structure is that this information is specific to each CPU, but common to all flavors of RCU, and thus is processed by the per-CPU data structure that is common to all flavors of RCU. Similarly, passage of CPUs through normal quiescent states (such as context switches) is handled by the per-CPU rcu_data structure because this information is specific both to the corresponding CPU and to the flavor of RCU. In addition, both dyntick-idle transitions and normal quiescent states can occur extremely frequently, as in many times per millisecond. It is therefore critically important that these events be handled as efficiently as possible, in other words, on a per-CPU basis.

Quick Quiz 1: Is there a type of quiescent state other than CPU hotplug events that is not handled on a per-CPU basis?
Answer

In contrast, CPU hotplug events are normally rare and incur high overhead. There is therefore no reason to avoid locking for CPU hotplug events, and so they are tracked at the leaf level of the rcu_node tree. As noted in the article on RCU data structures, quiescent states propagate up from the bottom of the tree in the above diagram, and grace periods emerge out the top.

Grace-Period Control Flow

The full grace-period detection process is extensive, so discussion of it is subdivided into four sections, as follows:

GenericRCUStateMachine.png

This article covers grace-period detection, outlined in the uppermost blue dashed box in the above diagram. This covers the portions of the grace-period-detection process that are exercised most heavily on a system that has frequent short bursts of computation on all CPUs and that is not undergoing any CPU hotplug operations. In this case, any dyntick-idle sojourns will be short enough that they will not hold up grace periods, and there will be no need for RCU priority boosting, and hopefully no need for stall detection. The other three boxes are discussed in other articles.

Grace-Period Operation

RCU's grace-period detection is a state machine that is driven primarily from RCU_SOFTIRQ, which invokes rcu_process_callbacks(). This state machine can also be invoked from __call_rcu(), which is a safety feature that prevents a blizzard of call_rcu() invocations from OOMing the system.

Quick Quiz 2: But doesn't this mean that call_rcu() is not wait free?
Answer

Data Structures, Fields, and Data Flow

Grace periods are tracked in different ways at different levels of the combining tree. At the very top, the rcu_state structure has its ->gpnum field one greater than its ->completed field if there is a grace period in progress, otherwise these two fields are equal to each other. These two fields appear in the other structures, but the instances in the rcu_state structure are the master copies.

Within the tree itself, each rcu_node structure uses the ->qsmask mask to track which of its descendants are not ready for the current grace period to complete. At the leaf rcu_node level, each bit of the ->qsmask mask corresponds to a CPU. For RCU-preempt, each rcu_node structure also maintains a ->blkd_tasks list of tasks that have incurred a context switch within their current RCU read-side critical sections. The ->gp_tasks pointer references the first of those tasks that is preventing the current grace period from completing, or is NULL if there is no such task. Therefore, once a given rcu_node structure's ->qsmask mask is zero and its ->gp_tasks pointer is NULL, then every CPU and task represented by the subtree up to and including that rcu_node structure is ready for the current grace period to end. If this rcu_node structure is at the root of the tree, that means that the grace period has ended, otherwise it means this rcu_node structure's bit in its parent's ->qsmask field may now be cleared.

Quick Quiz 3: How does a given non-root rcu_node structure know which of its parent's ->qsmask bits to clear?
Answer

Quick Quiz 4: What happens with the blkd_tasks list and the gp_tasks pointer for RCU-bh and RCU-sched?
Answer

Feeding into the bottom of the combining tree are the rcu_data structures, one per CPU. Each structure has a ->qs_pending field that indicates that a quiescent state is needed from the corresponding CPU, a ->passed_quiesce field that indicates that this CPU has in fact passed through a quiescent state, and finally a ->passed_quiesce_gpnum field that indicates what grace-period number was in effect when the CPU passed through the quiescent state.

Quick Quiz 5: Why bother with the ->passed_quiesce_gpnum field? Given that the grace period cannot end until each CPU passes through a quiescent state, the grace-period number cannot change, so what is the point of tracking it?
Answer

Finally, the rcu_dynticks structure has the ->dynticks field, which contains an even value if this CPU is in dyntick-idle mode (which is an extended quiescent state) and contains an odd value otherwise.

The primary job of the grace-period state machine is to propagate quiescent-state indications as far up the tree as possible, declaring the end of the current grace period when the whole tree has cleared.

Example Grace Period Processing: Overview

This example features a system with four CPUs, and uses a tree of rcu_node structures with one root node and a pair of leaf nodes, as shown below:

GPExample00.png

The annotations show the values of the relevant fields of each structure. The “g0c0” in each structure indicates that both the ->gpnum and ->completed fields of each structure are zero, so that RCU is idle. The “fqs:0” in the rcu_state structure indicates that the ->fqs_state field is zero, which again indicates that RCU is idle. This field controls the state machine for forcing quiescent states, which can take on the following values:

  1. RCU_GP_IDLE=0 indicates that RCU is idle.
  2. RCU_GP_INIT=1 indicates that RCU is initializing a new grace period.
  3. RCU_SAVE_DYNTICK=2 indicates the first phase of quiescent-state forcing is in progress, in which initial dyntick-idle information is collected.
  4. RCU_FORCE_QS=3 indicates that the second phase of quiescent-state forcing is in progress, in which dyntick-idle quiescent states are reported, along with offline CPUs. In addition, online CPUs that have failed to pass through a quiescent state are sent reschedule IPIs.

The remaining annotations are as follows:

  1. The “qsm..” in the rcu_node structures indicates that the ->qsmask fields are all zero; in other words, RCU is not waiting on any CPU (which is not surprising given that RCU is idle).
  2. The “b” and “g” in the rcu_node structures indicate that the ->blkd_tasks and ->gp_tasks pointers, respectively, are NULL.
  3. The “qsp:0” in the rcu_data structures indicates that the ->qs_pending fields are zero, which in turn means that RCU is not waiting for a quiescent state from any of these CPUs.
  4. The “pq” in the rcu_data structures indicates that the ->passed_quiesce fields are set to 1, which in turn means that each CPU has passed through a quiescent state since the beginning of the last grace period.
  5. The “pgc:0” in the rcu_data structures indicates that the ->passed_quiesce_gpnum fields are set to zero, which in turn means that each CPU last passed through a quiescent state during grace period 0.
  6. The “dts:0” in the rcu_data structures indicates that the ->dynticks_snap fields are set to zero, which likely means that they are at their initial values, but could also mean that each CPU needed quiescent-state forcing during grace period 0.
  7. Finally, the “dt:1” in the rcu_dynticks structures indicates that the ->dynticks fields are all equal to one, which is an odd number, in turn indicating that none of the CPUs is in dyntick-idle mode.

The fact that quiescent states and grace periods propagate through this tree non-atomically means that different CPUs might disagree about the grace-period state, as shown in the following chart:

GPchart.png

Each row of this figure corresponds to one leaf rcu_node structure, with time advancing from left to right. In the blue region in the lower left, a grace period has started, but some rcu_node structures (and therefore also their CPUs) are not yet aware of it. In other words, the rcu_state structure has been updated to reflect the new grace period, but the blue-colored rcu_node structures have not. Nevertheless, during this period, some CPUs (such as CPU 0) might already have passed through their quiescent states. This means that a CPU cannot assume that its callbacks need only wait until the end of the next grace period, because that grace period might well already have started.

In the pink region, all rcu_node structures have been updated to reflect the new grace period, but at least one CPU corresponding to each rcu_node structure has not yet managed to pass through a quiescent state. The pink rcu_node structures therefore have non-zero ->qsmask fields.

In contrast, all CPUs corresponding to the orange rcu_node structures have already passed through a quiescent state, but at least one CPU elsewhere in the system has not. In other words, the orange rcu_node structures' ->qsmask fields are zero, but there must be at least one other rcu_node structure with a non-zero ->qsmask field.

In the yellow region, the ->qsmask fields of all of the rcu_node structures are zero, which means that the grace period has completed, but the rcu_node structures' ->completed fields have not yet been updated accordingly. It is important to note that it is impossible to distinguish between the orange and yellow states by looking at any individual leaf rcu_node structure: One must instead look at either the root rcu_node structure, the rcu_state structure, or all of the leaf rcu_node structures. Interestingly enough, this yellow region is the last chance for CPUs to assume that their callbacks need only wait for the end of the next grace period, with one exception.

Quick Quiz 6: What is the exceptional case where a CPU can assume that its callbacks need only wait until the end of the next grace period, despite that CPU being aware that the prior grace period has ended?
Answer

In the green region, RCU is idle: no grace period is in progress. This sequence repeats, as can be seen in the right-hand side of the diagram.

The following example will illustrate this grace-period process in more detail:

  1. CPU 1 starts a new grace period, initializing the rcu_state structure.
  2. CPU 3 passes through a quiescent state.
  3. CPU 3 takes a scheduling-clock interrupt, but notices that its rcu_data structure indicates that it has already passed through a quiescent state for the already-completed grace period, and therefore takes no action.
  4. CPU 1 initializes the root rcu_node structure for the new grace period.
  5. CPU 1 initializes the leaf rcu_node structure for itself and CPU 0 for the new grace period.
  6. CPU 1 initializes its own rcu_data structure for the new grace period.
  7. CPU 0 enters dyntick-idle mode.
  8. CPU 1 initializes the leaf rcu_node structure for CPUs 2 and 3 for the new grace period.
  9. CPU 2 notices the new grace period and therefore initializes its own rcu_data structure.
  10. CPU 1 passes through a quiescent state.
  11. CPU 3 notices the new grace period and therefore initializes its own rcu_data structure.
  12. CPU 2 passes through a quiescent state.
  13. CPU 1 takes a scheduling-clock interrupt, notices that it has not announced its quiescent state to RCU, and therefore makes this announcement.
  14. CPU 2 takes a scheduling-clock interrupt, notices that it has not announced its quiescent state to RCU, and therefore makes this announcement.
  15. CPU 3 context switches away from task A, which implies a quiescent state. Task A is queued on CPU 3's rcu_node structure.
  16. CPU 3 takes a scheduling-clock interrupt, notices that it has not announced its quiescent state to RCU, and therefore makes this announcement. However, task A is still blocked in an RCU read-side critical section, so this announcement cannot be propagated up to the root rcu_node structure.
  17. CPU 1 context switches away from task B, which is enqueued on CPU 1's rcu_node structure. Because CPU 1 has already announced a quiescent state for this grace period, this task does not block grace period 1, but might block grace period 2 if it remains preempted long enough. Nevertheless, task B is queued onto CPU 1's rcu_node structure.
  18. CPU 1 notes that grace period 1 is getting long in the tooth, and therefore attempts to force quiescent states.
  19. CPU 0 takes an interrupt, momentarily leaving dyntick-idle mode, at least from RCU's perspective.
  20. CPU 0 returns from interrupt back to dyntick-idle mode.
  21. CPU 3 notes that quite a bit of time has passed since quiescent states were forced for this grace period, so it attempts to force quiescent states again. In doing so, it notices that CPU 0 has been in dyntick-idle mode, which is an extended quiescent state, so it announces this to RCU. Because both CPUs 0 and 1 have announced their quiescent states (or had their quiescent states announced for them), CPU 3 propagates this announcement up to the root rcu_node structure, announcing that everything represented by the left subtree has passed through a quiescent state.
  22. Task A resumes on CPU 1 and exits its RCU read-side critical section. Task A is removed from the rightmost rcu_node structure, and, because there are no more tasks on this structure and because both CPUs have passed through quiescent states, CPU 3 continues up to the root rcu_node, announcing that everything represented by the right subtree has passed through a quiescent state.
  23. Because the root rcu_node structure reflects the entire tree having passed through a quiescent state, CPU 3 ends grace period 1.
Each of these steps is described in more detail in the following sections.

Example Grace Period: rcu_state Structure

Grace periods are started in the following situations:

  1. A CPU registers a new callback in __call_rcu(), and finds a large backlog of callbacks when there is no grace period in progress.
  2. The CPU that ends a grace period has at least one callback that needs a grace period.
  3. A CPU running RCU core code in __rcu_process_callbacks() finds that it has callbacks in need of a grace period, and no grace period is in progress.
  4. The starting of a grace period was deferred due to the concurrent execution of force_quiescent_state(), which has now completed.

In our example, CPU 1 starts a new RCU grace period by invoking rcu_start_gp(), which checks to see if starting a grace period is appropriate, and, if so, increments the rcu_state structure's ->gpnum field, resulting in the state shown below:

GPExample01.png

At this point, we proceed with CPU 3's early quiescent state.

Example Grace Period: CPU 3 Early Quiescent State

When CPU 3 passes through a quiescent state, it sets its rcu_data structure's ->passed_quiesce_gpnum to ->gpnum and ->passed_quiesce to 1. But these are the values that these fields already had, so there is no effect. Which is to be expected: There is no grace period active, so there is nothing for the quiescent states to do.

Example Grace Period: CPU 3 Early Scheduling-Clock

When CPU 3 takes a scheduling-clock interrupt, it invokes __rcu_pending(), which finds that this CPU's rcu_data structure's ->qs_pending field is zero, which in turn causes it to refrain from invoking the RCU core code. Again, this is to be expected, as there is no grace period active, so there is nothing for RCU to do.

Example Grace Period: Initialize Root rcu_node Structure

CPU 1 continues through rcu_start_gp(), initializing the root rcu_node structure. This initialization includes the ->gpnum field, ->completed field (which is initialized to the same zero value that it had originally), and the ->qsmask field (where the question marks indicate that quiescent states are required from everything). Because initialization has started, the rcu_state structure's ->fqs_state field is set to “1”. This results in the following state:

GPExample02.png

At this point, we initialize the leftmost rcu_node structure.

Example Grace Period: Initialize Left rcu_node Structure

CPU 1 continues further through rcu_start_gp(), initializing the left-most rcu_node structure, which covers CPUs 0 and 1. This proceeds much as for the root rcu_node structure, with the following result:

GPExample03.png

CPU 1 has now initialized two of the three rcu_node structures, and has but one left to go. However, CPU 1 itself is covered by the rcu_node structure that was just initialized, a case that requires special handling.

Example Grace Period: Initialize CPU 1's rcu_data Structure

At this point, CPU 1 initializes its own rcu_data structure, setting the ->qs_pending field and clearing the ->passed_quiesce field, as shown below.

GPExample04.png

CPU 1 could now start announcing quiescent states against this new grace period, except for the fact that it is still busy initializing it. In general, however, CPUs can and do announce quiescent states against grace periods that are not yet fully initialized. It is important to note that other CPUs can do useful work while CPU 1 is initializing for the new grace period, for example, they might enter dyntick-idle mode.

Example Grace Period: CPU 0 Enters Dyntick-Idle Mode

CPU 0 then enters dyntick-idle mode, so that its rcu_dynticks structure's ->dynticks field is incremented from 1 to 2, as shown below:

GPExample05.png

Because CPU 0 is now in dyntick-idle mode, it no longer needs to inform the RCU core of passage through quiescent states. Instead, other CPUs will (eventually) recognize that CPU 0 is in dyntick-idle mode, and thus is in an extended quiescent state. These other CPUs will therefore announce CPU 0's quiescent states on its behalf.

But now back to CPU 1's initialization for the new RCU grace period.

Example Grace Period: Initialize Right rcu_node Structure

CPU 1 then continues initializing the new grace period in rcu_start_gp() by initializing the right-hand rcu_node structure, the one corresponding to CPUs 2 and 3. This completes initialization, so the fqs state is advanced. This proceeds as before, with the result as shown below:

GPExample06.png

The new RCU grace period is now fully initialized, so quiescent states may now be announced against it by any CPU that has initialized its own rcu_data structure, which in this example includes only CPU 1.

Example Grace Period: CPU 2 Notes New Grace Period

CPU 2 now notices that a new grace period has started by comparing its rcu_data structure's ->gpnum field to that of its leaf rcu_node structure. These two fields differ, and thus CPU 2 initializes its rcu_data structure to account for this new grace period, as shown below:

GPExample07.png

CPU 2's rcu_data structure is now set to indicate that the RCU core needs a quiescent state from CPU 2 and has not yet seen one.

Example Grace Period: CPU 1 Passes Through a Quiescent State

At this point, CPU 1 passes through a quiescent state. Because it has already initialized its rcu_data structure to reflect the new (now current) grace period, this quiescent state is applied against this grace period, as shown below:

GPExample08.png

CPU 1's rcu_data structure now shows that CPU 1 has passed through a quiescent state that applies to grace period number 1. However, this quiescent state has not yet been announced to the RCU core, so CPU 1's rcu_node structure is still unaware that CPU 1 has passed through a quiescent state.

Quick Quiz 7: Why not just immediately announce the quiescent state to the RCU core? Wouldn't that be far simpler and faster?
Answer

But before that announcement happens, CPU 3 joins the grace-period-1 party.

Example Grace Period: CPU 3 Notes Current Grace Period

At this point, CPU 3 notices that its rcu_data structure's ->gpnum field does not match that of its leaf rcu_node structure, and therefore initializes its rcu_data structure to reflect the current grace period, as shown below:

GPExample09.png

CPU 3 is now ready to record quiescent states against the current grace period. But now back to CPU 2...

Example Grace Period: CPU 2 Passes Through a Quiescent State

CPU 2 now passes through a quiescent state, and because it has initialized its rcu_data structure to reflect the current grace period, this quiescent state may be applied against the current grace period. CPU 2 therefore updates its rcu_data structure as follows:

GPExample10.png

CPU 2 will announce this quiescent state to the RCU core later. In the meantime, over to CPU 1.

Example Grace Period: CPU 1 Takes a Scheduling-Clock Interrupt

At this point, CPU 1 takes a scheduling-clock interrupt. Because CPU 1's rcu_data structure indicates that it has passed through a quiescent state for the current grace period that the RCU core does not yet know about, this quiescent state is announced to the RCU core. The ->qs_pending field of CPU 1's rcu_data structure is cleared, as is the corresponding bit in the leftmost rcu_node structure's ->qsmask field. Because the bit corresponding to CPU 0 is still set, the information does not propagate up to the root rcu_node structure. This results in the state shown below:

GPExample11.png

Quick Quiz 8: Why are the positions of the “.” and the “?” in the diagram reversed? After all, CPU 1 has announced a quiescent state to the RCU core and CPU 0 has not yet done so.
Answer

CPU 0 must report a quiescent state before any change can be propagated from the leftmost rcu_node structure up to the root.

Example Grace Period: CPU 2 Takes a Scheduling-Clock Interrupt

And now CPU 2 takes a scheduling-clock interrupt. Because CPU 2's rcu_data structure indicates that it has passed through a quiescent state for the current grace period that the RCU core does not yet know about, this quiescent state is announced to the RCU core. The ->qs_pending field of CPU 2's rcu_data structure is cleared, as is the corresponding bit in the rightmost rcu_node structure's ->qsmask field. Because the bit corresponding to CPU 3 is still set, the information does not propagate up to the root rcu_node structure. This results in the state shown below:

GPExample12.png

CPU 3 must report a quiescent state before any change can be propagated from the rightmost rcu_node structure up to the root.

Example Grace Period: CPU 3 Switches Away From Task A

CPU 3 then context switches away from task A while in an RCU read-side critical section. Because CPU 3 has not previously passed through a quiescent state during this grace period, task A is queued on the rightmost rcu_node structure, with both the ->blkd_tasks and ->gp_tasks pointers referencing it. CPU 3 also records a quiescent state in its rcu_data structure by setting the ->passed_quiesce field to 1 and the ->passed_quiesce_gpnum field to the current grace-period number. This results in the following:

GPExample13.png

Task A must resume and complete its RCU read-side critical section before the current grace period can complete.

Example Grace Period: CPU 3 Takes a Scheduling-Clock Interrupt

Now CPU 3 takes a scheduling-clock interrupt. Because CPU 3's rcu_data structure indicates that it has passed through a quiescent state for the current grace period that the RCU core does not yet know about, this quiescent state is announced to the RCU core. The ->qs_pending field of CPU 3's rcu_data structure is cleared, as is the corresponding bit in the rightmost rcu_node structure's ->qsmask field. All of this rcu_node structure's ->qsmask bits are now clear, but because task A is still blocked in an RCU read-side critical section, the information does not propagate up to the root rcu_node structure. This results in the state shown below:

GPExample14.png

CPU 0 and task A are now the only things blocking completion of the current grace period.

Example Grace Period: CPU 1 Switches Away From Task B

CPU 1 then context switches away from task B while in an RCU read-side critical section. Task B is therefore queued on the leftmost rcu_node structure, but because CPU 1 has already passed through a quiescent state during this grace period, only the ->blkd_tasks pointer references it. This results in the following:

GPExample15.png

Both CPU 0 and task A are still blocking completion of the current grace period. If it stays preempted long enough, task B will eventually block the next grace period.

Example Grace Period: CPU 1 Forces Quiescent States

CPU 1 now forces quiescent states. Because the rcu_state structure's ->fqs_state field is currently “2” (RCU_SAVE_DYNTICK), this pass snapshots each holdout CPU's dyntick-idle state, but might also carry out RCU priority boosting. The only CPU that has not yet passed through a quiescent state is CPU 0, so its rcu_dynticks structure's ->dynticks counter is copied to its rcu_data structure's ->dynticks_snap field. Finally, the rcu_state structure's ->fqs_state field is set to “3” (RCU_FORCE_QS), resulting in the following:

GPExample16.png

Both CPU 0 and task A are still blocking completion of the current grace period. However, the RCU core is one step closer to determining that CPU 0 is in dyntick-idle mode, which is a quiescent state.

Example Grace Period: CPU 0 Takes an Interrupt

CPU 0 now takes an interrupt, which causes it to exit dyntick-idle mode, at least from RCU's perspective. This CPU's rcu_dynticks structure's ->dynticks counter is therefore incremented, giving the odd (non-dyntick-idle) value of “3”, as shown below.

GPExample17.png

Still we have both CPU 0 and task A blocking completion of the current grace period.

Example Grace Period: CPU 0 Returns from Interrupt

CPU 0 now returns from interrupt, which causes it to re-enter dyntick-idle mode, again, at least from RCU's perspective. This CPU's rcu_dynticks structure's ->dynticks counter is therefore incremented once again, giving the even (dyntick-idle) value of “4”, as shown below.

GPExample18.png

CPU 0 is once again in an extended quiescent state.

Example Grace Period: CPU 3 Forces Quiescent States

Now CPU 3 notes that it has been some time since the last forcing of quiescent states and that the grace period is still in progress, and therefore forces quiescent states once more. It notes that CPU 0's rcu_dynticks structure's ->dynticks field is even, which indicates that CPU 0 is in an extended quiescent state. It therefore announces this to the RCU core on CPU 0's behalf. It might also priority-boost task A.

Quick Quiz 9: Why not make CPU 0 announce its own quiescent states? Wouldn't that simplify things by eliminating a class of race conditions?
Answer

This announcement clears all of the ->qsmask bits in the leftmost rcu_node structure, so this time state is propagated to the root rcu_node structure, as shown below.

GPExample19.png

Now only task A is blocking completion of the current grace period.

Example Grace Period: Task A resumes on CPU 1

At this point, task A resumes on CPU 1 and completes its RCU read-side critical section. It removes itself from the rightmost rcu_node structure's ->blkd_tasks list, and notes that it was the last entity on this structure blocking the current grace period. It therefore propagates state up to the root rcu_node structure, and finds that there is no longer anything blocking the current grace period. It therefore updates the rcu_state structure's ->completed field to match the ->gpnum field, and then similarly updates all of the rcu_node structures' ->completed fields, resulting in the state shown below:

GPExample20.png

The grace period has now officially completed, but none of the CPUs are yet aware of this fact. They will become aware on their next invocation of the RCU core, when they will update the ->completed field of their own rcu_data structures.

Grace-Period Implementation

This section covers the grace-period code, including starting grace periods, utility functions, RCU core processing, checking for pending RCU core work, core RCU work instigated by new callbacks, and special considerations for preemptible RCU.

Starting Grace Periods

Starting a grace period involves rcu_gp_in_progress(), cpu_needs_another_gp(), __rcu_process_gp_end(), __note_new_gpnum(), and rcu_start_gp_per_cpu(), each of which is shown below:

  1 static int rcu_gp_in_progress(struct rcu_state *rsp)
  2 {
  3   return ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum);
  4 }
  5 
  6 static int
  7 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
  8 {
  9   return *rdp->nxttail[RCU_DONE_TAIL] && !rcu_gp_in_progress(rsp);
 10 }
 11 
 12 static void __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 13 {
 14   if (rdp->completed != rnp->completed) {
 15     rdp->nxttail[RCU_DONE_TAIL] = rdp->nxttail[RCU_WAIT_TAIL];
 16     rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_READY_TAIL];
 17     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 18     rdp->completed = rnp->completed;
 19     trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuend");
 20     if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
 21       rdp->gpnum = rdp->completed;
 22     if ((rnp->qsmask & rdp->grpmask) == 0)
 23       rdp->qs_pending = 0;
 24   }
 25 }
 26 
 27 static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 28 {
 29   if (rdp->gpnum != rnp->gpnum) {
 30     rdp->gpnum = rnp->gpnum;
 31     trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpustart");
 32     if (rnp->qsmask & rdp->grpmask) {
 33       rdp->qs_pending = 1;
 34       rdp->passed_quiesce = 0;
 35     } else
 36       rdp->qs_pending = 0;
 37   }
 38 }
 39 
 40 static void rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 41 {
 42   __rcu_process_gp_end(rsp, rnp, rdp);
 43   rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 44   rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 45   __note_new_gpnum(rsp, rnp, rdp);
 46 }

The rcu_gp_in_progress() function is shown above on lines 1-4. It simply compares the specified rcu_state structure's ->completed and ->gpnum fields, returning true if they differ, in other words, if there is an RCU grace period in progress.

Quick Quiz 10: What is the purpose of the ACCESS_ONCE() calls?
Answer

The cpu_needs_another_gp() function, shown on lines 6-10, checks to see if this CPU has callbacks that are waiting for a grace period (*rdp->nxttail[RCU_DONE_TAIL]) when there is no grace period in progress. When this condition occurs, the calling CPU will normally start a new grace period.

The __rcu_process_gp_end() function is shown on lines 12-25, and is used by the CPU starting the new grace period to accelerate recognition of the completion of the old one. Line 14 checks to see if the current CPU is already aware of the new grace period by comparing that CPU's rcu_data structure's ->completed field with the same CPU's leaf rcu_node structure's ->completed field. If they differ, then this CPU is not yet aware that the old grace period has completed, an omission that it addresses by executing lines 15-23. Lines 15-17 advance RCU callbacks, and will be dealt with in another article devoted to the processing of RCU callbacks. Line 18 records the number of the last completed grace period in order to avoid executing this code twice for the same grace period. Line 19 does tracing, and lines 20 and 21 initialize the CPU's rcu_data structure's ->gpnum field in case this CPU just exited a dyntick-idle sojourn so long that grace-period number wrapped. Line 22 checks to see if the current grace period is waiting on any quiescent states from the current CPU, as might be the case if this CPU just came online. If not, line 23 sets this CPU's rcu_data structure's ->qs_pending field to zero to prevent the CPU from needlessly attempting to announce any quiescent states.

Quick Quiz 11: In __rcu_process_gp_end() in line 20, why bother comparing rdp->gpnum to rdp->completed? Why not just unconditionally set rdp->gpnum to a sane value, for example, rnp->gpnum? Or even rdp->completed?
Answer

The __note_new_gpnum() function, shown on lines 27-38, initializes this CPU for the new grace period. Line 29 checks to see if this CPU is already aware of the new grace period by comparing the rcu_data structure's ->gpnum field to that of the corresponding leaf rcu_node structure. If the CPU is indeed unaware of the new grace period, it executes lines 30-36 to carry out the needed initialization. Line 30 records the new ->gpnum in the CPU's rcu_data structure to avoid initializing twice, while line 31 carries out tracing. Line 32 checks to see if the current grace period needs a quiescent state from this CPU, and if so, lines 33 and 34 set the rcu_data structure to cause the CPU to record the next quiescent state. Otherwise, line 36 prevents the CPU from attempting to report an unneeded quiescent state.

The rcu_start_gp_per_cpu() function, shown on lines 40-46, optimizes grace-period startup for the CPU starting the new grace period. Line 42 invokes __rcu_process_gp_end() to handle the end of the prior grace period, if needed, lines 43 and 44 advance RCU callbacks, and line 45 initializes this CPU for the start of the new grace period.

The starting of grace periods is driven by rcu_start_gp(), as shown below:

  1 static void
  2 rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
  3   __releases(rcu_get_root(rsp)->lock)
  4 {
  5   struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
  6   struct rcu_node *rnp = rcu_get_root(rsp);
  7 
  8   if (!rcu_scheduler_fully_active ||
  9       !cpu_needs_another_gp(rsp, rdp)) {
 10     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 11     return;
 12   }
 13   if (rsp->fqs_active) {
 14     rsp->fqs_need_gp = 1;
 15     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 16     return;
 17   }
 18   rsp->gpnum++;
 19   trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
 20   WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
 21   rsp->fqs_state = RCU_GP_INIT;
 22   rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
 23   record_gp_stall_check_time(rsp);
 24   if (NUM_RCU_NODES == 1) {
 25     rcu_preempt_check_blocked_tasks(rnp);
 26     rnp->qsmask = rnp->qsmaskinit;
 27     rnp->gpnum = rsp->gpnum;
 28     rnp->completed = rsp->completed;
 29     rsp->fqs_state = RCU_SIGNAL_INIT;
 30     rcu_start_gp_per_cpu(rsp, rnp, rdp);
 31     rcu_preempt_boost_start_gp(rnp);
 32     trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
 33               rnp->level, rnp->grplo,
 34               rnp->grphi, rnp->qsmask);
 35     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 36     return;
 37   }
 38   raw_spin_unlock(&rnp->lock);
 39   raw_spin_lock(&rsp->onofflock);
 40   rcu_for_each_node_breadth_first(rsp, rnp) {
 41     raw_spin_lock(&rnp->lock);
 42     rcu_preempt_check_blocked_tasks(rnp);
 43     rnp->qsmask = rnp->qsmaskinit;
 44     rnp->gpnum = rsp->gpnum;
 45     rnp->completed = rsp->completed;
 46     if (rnp == rdp->mynode)
 47       rcu_start_gp_per_cpu(rsp, rnp, rdp);
 48     rcu_preempt_boost_start_gp(rnp);
 49     trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
 50               rnp->level, rnp->grplo,
 51               rnp->grphi, rnp->qsmask);
 52     raw_spin_unlock(&rnp->lock);
 53   }
 54   rnp = rcu_get_root(rsp);
 55   raw_spin_lock(&rnp->lock);
 56   rsp->fqs_state = RCU_SIGNAL_INIT;
 57   raw_spin_unlock(&rnp->lock);
 58   raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
 59 }

Line 3 is a sparse directive that states that this function must be called with the root rcu_node structure's ->lock held, and that the function releases that lock before returning. Line 8 checks to see if the scheduler has already spawned the first task and line 9 checks to see if this CPU needs a grace period. If the answer to either of these questions is “no”, lines 10 and 11 release the root rcu_node structure's ->lock and return.

Otherwise, line 13 checks to see if some other CPU is currently attempting to force quiescent states. If so, line 14 sets the rcu_state structure's ->fqs_need_gp field so that this other CPU will start a grace period on our behalf, and lines 15 and 16 release the root rcu_node structure's ->lock and return.

Line 18 increments the rcu_state structure's ->gpnum field, which officially starts the new grace period.

Quick Quiz 12: But nothing can yet apply quiescent states to this new grace period, so does it really make sense to say that it has started?
Answer

Line 19 traces the new grace period and line 20 complains if some other CPU is already starting a new grace period. Line 21 causes concurrent attempts to force quiescent states to hold off until we have fully initialized the rcu_node structures for the new grace period. Line 22 records the time (in jiffies) at which quiescent states should be forced, assuming that the new grace period does not complete first. Line 23 computes and records the time at which RCU CPU stall warnings should be printed, again assuming that the new grace period does not complete first.

Line 24 checks to see if the tree of rcu_node structures consists of only a single node, and if so, lines 25-36 initialize that single node. Line 25 carries out some RCU-preempt-specific debug checks, line 26 sets up the rcu_node structure's ->qsmask field to wait for quiescent states from all online CPUs corresponding to this rcu_node structure, and lines 27 and 28 copy the current grace-period numbers from the rcu_state structure. Line 29 re-enables forcing of quiescent states and line 30 sets up the current CPU to detect quiescent states for the current grace period. Line 31 schedules RCU priority boosting for RCU-preempt, and lines 32-34 trace the initialization of this rcu_node structure. Finally, lines 35 and 36 release the root rcu_node structure's ->lock and return.

Lines 38-58 of rcu_start_gp() are executed only on systems where there is more than one rcu_node structure. Line 38 releases the root rcu_node structure's ->lock in order to avoid deadlock when line 39 acquires the rcu_state structure's ->onofflock, which excludes changes in RCU's idea of which CPUs are online.

Quick Quiz 13: Why not change the locking hierarchy so that we could just hold the root rcu_node structure's ->lock while acquiring ->onofflock? That would get rid of the single-node special case, simplifying the code.
Answer

Quick Quiz 14: That is just plain silly! Given that CPUs are coming online and going offline anyway, what possible sense does it make for RCU to bury its head in the sand and ignore these CPU-hotplug events?
Answer

Each pass through the loop spanning lines 40-53 initializes one rcu_node structure. Within this loop, line 41 acquires the current rcu_node structure's ->lock and line 52 releases it. The intervening lines operate in the same manner as the corresponding lines did in the single-rcu_node case, except that line 46 checks to make sure that the current rcu_node structure corresponds to the current CPU before line 47 initializes this CPU's rcu_data structure for the new grace period. In addition, re-enabling quiescent-state forcing is deferred to outside of the loop.

Quick Quiz 15: Why not abstract rcu_node initialization?
Answer

Once all the rcu_node structures are initialized, lines 54 and 55 acquire the root rcu_node structure's ->lock, line 56 re-enables quiescent-state forcing, and line 57 releases the ->lock. Finally, line 58 releases the rcu_state structure's ->onofflock.
Utility Functions
The invoke_rcu_core(), rcu_process_gp_end(), note_new_gpnum(), and check_for_new_grace_period() functions are used in the combining-tree algorithm that reduces quiescent states into grace periods. These functions are shown below:
  1 static void invoke_rcu_core(void)
  2 {
  3   raise_softirq(RCU_SOFTIRQ);
  4 }
  5 
  6 static void
  7 rcu_process_gp_end(struct rcu_state *rsp, struct rcu_data *rdp)
  8 {
  9   unsigned long flags;
 10   struct rcu_node *rnp;
 11 
 12   local_irq_save(flags);
 13   rnp = rdp->mynode;
 14   if (rdp->completed == ACCESS_ONCE(rnp->completed) ||
 15       !raw_spin_trylock(&rnp->lock)) {
 16     local_irq_restore(flags);
 17     return;
 18   }
 19   __rcu_process_gp_end(rsp, rnp, rdp);
 20   raw_spin_unlock_irqrestore(&rnp->lock, flags);
 21 }
 22 
 23 static void note_new_gpnum(struct rcu_state *rsp, struct rcu_data *rdp)
 24 {
 25   unsigned long flags;
 26   struct rcu_node *rnp;
 27 
 28   local_irq_save(flags);
 29   rnp = rdp->mynode;
 30   if (rdp->gpnum == ACCESS_ONCE(rnp->gpnum) ||
 31       !raw_spin_trylock(&rnp->lock)) {
 32     local_irq_restore(flags);
 33     return;
 34   }
 35   __note_new_gpnum(rsp, rnp, rdp);
 36   raw_spin_unlock_irqrestore(&rnp->lock, flags);
 37 }
 38 
 39 static int
 40 check_for_new_grace_period(struct rcu_state *rsp, struct rcu_data *rdp)
 41 {
 42   unsigned long flags;
 43   int ret = 0;
 44 
 45   local_irq_save(flags);
 46   if (rdp->gpnum != rsp->gpnum) {
 47     note_new_gpnum(rsp, rdp);
 48     ret = 1;
 49   }
 50   local_irq_restore(flags);
 51   return ret;
 52 }

The invoke_rcu_core() function shown on lines 1-4 simply does a raise_softirq() in order to cause rcu_process_callbacks() to be invoked in a clean environment.

The rcu_process_gp_end() function shown on lines 6-21 acquires the CPU's leaf rcu_node structure's lock if it is available, and, if so, invokes __rcu_process_gp_end() with the lock held. This function handles the possibility of migration from one CPU to another by disabling irqs first and acquiring the lock second, rather than acquiring the lock and disabling irqs in a single operation. The note_new_gpnum() function shown on lines 23-37 is a similar wrapper for __note_new_gpnum().

The check_for_new_grace_period() function shown on lines 39-52 checks to see if there is a new grace period that this CPU is unaware of (line 46), and, if so, invokes note_new_gpnum() to become aware of it. Line 51 returns true iff there was a new grace period.

Combining Quiescent States to Grace Periods
This section presents the key functions that combine quiescent states to arrive at grace periods. These are presented in the order that they execute rather than the usual caller-first-callee-next order that is used otherwise. So, first rcu_report_qs_rdp(), next rcu_report_qs_rnp(), and finally rcu_report_qs_rsp().
  1 static void
  2 rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
  3 {
  4   unsigned long flags;
  5   unsigned long mask;
  6   struct rcu_node *rnp;
  7 
  8   rnp = rdp->mynode;
  9   raw_spin_lock_irqsave(&rnp->lock, flags);
 10   if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
 11     rdp->passed_quiesce = 0;
 12     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 13     return;
 14   }
 15   mask = rdp->grpmask;
 16   if ((rnp->qsmask & mask) == 0) {
 17     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 18   } else {
 19     rdp->qs_pending = 0;
 20     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 21     rcu_report_qs_rnp(mask, rsp, rnp, flags);
 22   }
 23 }

The rcu_report_qs_rdp() function, which must be invoked with preemption disabled, reports the quiescent state recorded in the CPU's rcu_data structure. Line 8 obtains a pointer to the CPU's leaf rcu_node structure, and line 9 acquires that structure's ->lock. Line 10 checks to see whether the quiescent state being announced corresponds to a now-completed grace period, and, if so, line 11 sets up the CPU to look for a new quiescent state if there is a new grace period, and lines 12 and 13 release the leaf rcu_node structure's ->lock and return.

Otherwise, execution continues at line 15, which picks up the bit corresponding to this CPU in its leaf rcu_node structure's ->qsmask field. Line 16 then checks to see if a quiescent state has already been reported for the current grace period, and, if so, line 17 releases the leaf rcu_node structure's ->lock.

Otherwise, line 19 clears this CPU's rcu_data structure's ->qs_pending field to acknowledge the announcement (thereby preventing the CPU from attempting to announce additional quiescent states for this grace period), line 20 does callback handling (which is discussed elsewhere), and line 21 invokes rcu_report_qs_rnp(), which releases the leaf rcu_node structure's ->lock, and is discussed next.

  1 static void
  2 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
  3       struct rcu_node *rnp, unsigned long flags)
  4   __releases(rnp->lock)
  5 {
  6   struct rcu_node *rnp_c;
  7 
  8   for (;;) {
  9     if (!(rnp->qsmask & mask)) {
 10       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 11       return;
 12     }
 13     rnp->qsmask &= ~mask;
 14     trace_rcu_quiescent_state_report(rsp->name, rnp->gpnum,
 15              mask, rnp->qsmask, rnp->level,
 16              rnp->grplo, rnp->grphi,
 17              !!rnp->gp_tasks);
 18     if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
 19       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 20       return;
 21     }
 22     mask = rnp->grpmask;
 23     if (rnp->parent == NULL) {
 24       break;
 25     }
 26     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 27     rnp_c = rnp;
 28     rnp = rnp->parent;
 29     raw_spin_lock_irqsave(&rnp->lock, flags);
 30     WARN_ON_ONCE(rnp_c->qsmask);
 31   }
 32   rcu_report_qs_rsp(rsp, flags);
 33 }

The rcu_report_qs_rnp() function reports the quiescent state up the rcu_node tree. Each pass through the loop spanning lines 8-31 handles one level of this tree. Line 9 checks to see if this quiescent state has already been reported, and, if so, line 10 releases the current rcu_node structure's ->lock field and line 11 returns.

Quick Quiz 16: But this is redundant with the check in rcu_report_qs_rdp()! Why not remove one of these checks?
Answer

Line 13 clears the specified bit from the current rcu_node structure's ->qsmask field and lines 14-17 trace this action. If line 18 finds that there are more quiescent states required (“rnp->qsmask != 0”) or that there are tasks blocked in RCU read-side critical sections that are blocking the current grace period, then line 19 releases the current rcu_node structure's ->lock and line 20 returns.

Otherwise, line 22 picks up the bit corresponding to the current rcu_node structure in its parent's ->qsmask field. Line 23 checks to see if the current rcu_node structure has a parent, and, if not (in other words, the current rcu_node structure is the root), line 24 exits the loop.

Otherwise, line 26 releases the current rcu_node structure's ->lock. Line 27 retains a pointer to the current rcu_node structure for diagnostic purposes, and line 28 advances up the tree to the parent rcu_node structure, whose ->lock line 29 acquires. Line 30 complains if the previous (now child) rcu_node structure is still waiting for any quiescent states.

Quick Quiz 17: But we released the previous rcu_node structure's ->lock, so why couldn't a new grace period have started? This might well result in some of the child rcu_node structure's ->qsmask bits being set, wouldn't it?
Answer

Line 32 is reached if the root rcu_node structure shows that all needed quiescent states have been reported for the current grace period, in which case rcu_report_qs_rsp() is invoked to end the grace period. This function is shown below.

  1 static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
  2   __releases(rcu_get_root(rsp)->lock)
  3 {
  4   unsigned long gp_duration;
  5   struct rcu_node *rnp = rcu_get_root(rsp);
  6   struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
  7 
  8   WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
  9   smp_mb();
 10   gp_duration = jiffies - rsp->gp_start;
 11   if (gp_duration > rsp->gp_max)
 12     rsp->gp_max = gp_duration;
 13   if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
 14     raw_spin_unlock(&rnp->lock);
 15     rcu_for_each_node_breadth_first(rsp, rnp) {
 16       raw_spin_lock(&rnp->lock);
 17       rnp->completed = rsp->gpnum;
 18       raw_spin_unlock(&rnp->lock);
 19     }
 20     rnp = rcu_get_root(rsp);
 21     raw_spin_lock(&rnp->lock);
 22   }
 23   rsp->completed = rsp->gpnum;
 24   trace_rcu_grace_period(rsp->name, rsp->completed, "end");
 25   rsp->fqs_state = RCU_GP_IDLE;
 26   rcu_start_gp(rsp, flags);
 27 }

The rcu_report_qs_rsp() function announces the full set of quiescent states to the rcu_state structure, thus ending the grace period—and possibly starting another one. Line 8 complains bitterly if there is no grace period in progress, while line 9 preserves ordering in order to ensure that all grace-period and pre-grace-period activity is seen by all CPUs to precede the assignments to the various ->completed fields that mark the end of this grace period. Lines 10-12 compute the grace period's duration and accumulate the maximum duration for tracing and diagnostic purposes. Line 13 checks to see if the current CPU needs a new grace period, and if not, lines 14-21 update the ->completed fields in all the rcu_node structures, momentarily releasing the root rcu_node structure's ->lock in order to avoid deadlock.

Quick Quiz 18: Why open-code cpu_needs_another_gp() on line 13 of rcu_report_qs_rsp()?
Answer

Line 23 updates the rcu_state structure's ->completed field, thus officially marking the end of the old grace period. Line 24 traces the end of the old grace period, line 25 sets ->fqs_state to the idle state, and finally line 26 invokes rcu_start_gp() to start a new grace period if warranted.

RCU Core Processing

The RCU core processing drives the state machine that is RCU. It is initiated from softirq context, typically when rcu_check_callbacks() notices, based on the return value of the rcu_pending() function described in the next section, that something needs to be done. Although RCU core processing is initiated from softirq, the actual grace-period initialization, quiescent-state forcing, and grace-period cleanup are run from a kthread, which is described later in this section. In the meantime, here are the RCU core functions, rcu_check_quiescent_state(), __rcu_process_callbacks(), rcu_preempt_process_callbacks(), and rcu_process_callbacks():

  1 static void
  2 rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
  3 {
  4   if (check_for_new_grace_period(rsp, rdp))
  5     return;
  6   if (!rdp->qs_pending)
  7     return;
  8   if (!rdp->passed_quiesce)
  9     return;
 10   rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesce_gpnum);
 11 }
 12 
 13 static void
 14 __rcu_process_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
 15 {
 16   unsigned long flags;
 17 
 18   WARN_ON_ONCE(rdp->beenonline == 0);
 19   if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
 20     force_quiescent_state(rsp, 1);
 21   rcu_process_gp_end(rsp, rdp);
 22   rcu_check_quiescent_state(rsp, rdp);
 23   if (cpu_needs_another_gp(rsp, rdp)) {
 24     raw_spin_lock_irqsave(&rcu_get_root(rsp)->lock, flags);
 25     rcu_start_gp(rsp, flags);
 26   }
 27   if (cpu_has_callbacks_ready_to_invoke(rdp))
 28     invoke_rcu_callbacks(rsp, rdp);
 29 }
 30 
 31 static void rcu_preempt_process_callbacks(void)
 32 {
 33   __rcu_process_callbacks(&rcu_preempt_state,
 34         &__get_cpu_var(rcu_preempt_data));
 35 }
 36 
 37 static void rcu_process_callbacks(struct softirq_action *unused)
 38 {
 39   trace_rcu_utilization("Start RCU core");
 40   __rcu_process_callbacks(&rcu_sched_state,
 41         &__get_cpu_var(rcu_sched_data));
 42   __rcu_process_callbacks(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));
 43   rcu_preempt_process_callbacks();
 44   trace_rcu_utilization("End RCU core");
 45 }

The rcu_check_quiescent_state() function shown on lines 1-11 checks to see if a quiescent state has been recorded in the specified rcu_data structure, and, if so, invokes rcu_report_qs_rdp() to announce it to the rcu_node tree. It must run on the CPU corresponding to the specified rcu_data structure. Line 4 invokes check_for_new_grace_period() to check to see if there is a new grace period that this CPU is unaware of, and if so, line 5 returns.

Quick Quiz 19: Why does line 5 of rcu_check_quiescent_state() just return? Isn't that giving up a chance to report a quiescent state?
Answer

Line 6 checks to see if a quiescent state is needed from the current CPU, and if not, line 7 returns. Line 8 then checks to see if this CPU has passed through a quiescent state, and if not, line 9 returns. Otherwise, a quiescent state is needed from this CPU and the CPU has recently passed through a quiescent state, so line 10 invokes rcu_report_qs_rdp() to report this quiescent state to the rcu_node tree.

Quick Quiz 20: Why not check rdp->passed_quiesce_gpnum right in rcu_check_quiescent_state() rather than incurring the extra function-call overhead (and added argument) passing it in to rcu_report_qs_rdp()?
Answer

The __rcu_process_callbacks() function, shown on lines 13-29, conducts one pass of the RCU core state machine for a given flavor of RCU. Line 18 complains if RCU does not believe that the current CPU is online. Line 19 checks to see if the current grace period has gone on long enough that it is now time to force quiescent states, and if so, line 20 attempts the forcing. Line 21 checks to see if this CPU's idea of the current grace period has ended, and line 22 checks to see if this CPU has passed through some quiescent states that need to be reported up the rcu_node tree. Line 23 checks to see if this CPU needs another RCU grace period and RCU is idle, in which case line 24 acquires the root rcu_node structure's ->lock and line 25 invokes rcu_start_gp() to start a new grace period.

Quick Quiz 21: But __rcu_process_callbacks() fails to release the root rcu_node structure's ->lock! Won't that result in deadlock?
Answer

Finally, line 27 checks to see if this CPU has any RCU callbacks whose grace period has ended, and, if so, line 28 calls invoke_rcu_callbacks() to invoke them.

The rcu_preempt_process_callbacks() function shown on lines 31-35 is a wrapper around __rcu_process_callbacks() that does RCU-core processing for RCU-preempt. If there is no RCU-preempt in the kernel, for example, for kernels built with CONFIG_PREEMPT=n, then rcu_preempt_process_callbacks() is an empty function.

The rcu_process_callbacks() function shown on lines 37-45 is a wrapper function that invokes __rcu_process_callbacks() for each flavor of RCU configured into the kernel. Lines 39 and 44 trace the start and end of RCU core processing, while lines 40 and 41 do RCU core processing for RCU-sched and line 42 does RCU core processing for RCU-bh. Finally, line 43 invokes the rcu_preempt_process_callbacks() function described above in order to do RCU core processing for RCU-preempt, but only if it is configured into the kernel.

@@@ rcu_gp_init() and rcu_gp_cleanup().

The main function for the grace-period kthread is rcu_gp_kthread(), shown below:

  1 static int __noreturn rcu_gp_kthread(void *arg)
  2 {
  3   int fqs_state;
  4   unsigned long j;
  5   int ret;
  6   struct rcu_state *rsp = arg;
  7   struct rcu_node *rnp = rcu_get_root(rsp);
  8 
  9   for (;;) {
 10     for (;;) {
 11       wait_event_interruptible(rsp->gp_wq,
 12              rsp->gp_flags &
 13              RCU_GP_FLAG_INIT);
 14       if ((rsp->gp_flags & RCU_GP_FLAG_INIT) &&
 15           rcu_gp_init(rsp))
 16         break;
 17       cond_resched();
 18       flush_signals(current);
 19     }
 20     fqs_state = RCU_SAVE_DYNTICK;
 21     j = jiffies_till_first_fqs;
 22     if (j > HZ) {
 23       j = HZ;
 24       jiffies_till_first_fqs = HZ;
 25     }
 26     for (;;) {
 27       rsp->jiffies_force_qs = jiffies + j;
 28       ret = wait_event_interruptible_timeout(rsp->gp_wq,
 29           (rsp->gp_flags & RCU_GP_FLAG_FQS) ||
 30           (!ACCESS_ONCE(rnp->qsmask) &&
 31            !rcu_preempt_blocked_readers_cgp(rnp)),
 32           j);
 33       if (!ACCESS_ONCE(rnp->qsmask) &&
 34           !rcu_preempt_blocked_readers_cgp(rnp))
 35         break;
 36       if (ret == 0 || (rsp->gp_flags & RCU_GP_FLAG_FQS)) {
 37         fqs_state = rcu_gp_fqs(rsp, fqs_state);
 38         cond_resched();
 39       } else {
 40         cond_resched();
 41         flush_signals(current);
 42       }
 43       j = jiffies_till_next_fqs;
 44       if (j > HZ) {
 45         j = HZ;
 46         jiffies_till_next_fqs = HZ;
 47       } else if (j < 1) {
 48         j = 1;
 49         jiffies_till_next_fqs = 1;
 50       }
 51     }
 52     rcu_gp_cleanup(rsp);
 53   }
 54 }
Checking For Pending RCU Core Work
The __rcu_pending(), rcu_preempt_pending(), and rcu_pending() functions check to see whether there is any RCU core work needed on the part of the calling CPU:
  1 static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  2 {
  3   struct rcu_node *rnp = rdp->mynode;
  4 
  5   rdp->n_rcu_pending++;
  6   check_cpu_stall(rsp, rdp);
  7   if (rcu_scheduler_fully_active &&
  8       rdp->qs_pending && !rdp->passed_quiesce) {
  9     rdp->n_rp_qs_pending++;
 10     if (!rdp->preemptible &&
 11         ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs) - 1,
 12          jiffies))
 13       set_need_resched();
 14   } else if (rdp->qs_pending && rdp->passed_quiesce) {
 15     rdp->n_rp_report_qs++;
 16     return 1;
 17   }
 18   if (cpu_has_callbacks_ready_to_invoke(rdp)) {
 19     rdp->n_rp_cb_ready++;
 20     return 1;
 21   }
 22   if (cpu_needs_another_gp(rsp, rdp)) {
 23     rdp->n_rp_cpu_needs_gp++;
 24     return 1;
 25   }
 26   if (ACCESS_ONCE(rnp->completed) != rdp->completed) {
 27     rdp->n_rp_gp_completed++;
 28     return 1;
 29   }
 30   if (ACCESS_ONCE(rnp->gpnum) != rdp->gpnum) {
 31     rdp->n_rp_gp_started++;
 32     return 1;
 33   }
 34   if (rcu_gp_in_progress(rsp) &&
 35       ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
 36     rdp->n_rp_need_fqs++;
 37     return 1;
 38   }
 39   rdp->n_rp_need_nothing++;
 40   return 0;
 41 }
 42 
 43 static int rcu_preempt_pending(int cpu)
 44 {
 45   return __rcu_pending(&rcu_preempt_state,
 46            &per_cpu(rcu_preempt_data, cpu));
 47 }
 48 
 49 static int rcu_pending(int cpu)
 50 {
 51   return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
 52          __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
 53          rcu_preempt_pending(cpu);
 54 }

The __rcu_pending() function shown on lines 1-41 determines whether the specified flavor of RCU needs any immediate work on the part of the CPU corresponding to the specified rcu_data structure. Line 5 counts the calls to __rcu_pending() for diagnostic and tracing purposes, while line 6 issues CPU stall warnings if warranted (which will be discussed in depth in another article in this series).

Lines 7 and 8 check to see if the current grace period needs a quiescent state from the current CPU. If so, line 9 counts this event and lines 10-12 check to see if this is some RCU flavor other than RCU-preempt for which this CPU is about to invoke the wrath of force_quiescent_state(), and if so, line 13 pokes the scheduler in an attempt to make a quiescent state happen. Otherwise, if the current grace period does not need a quiescent state from the current CPU, line 14 checks to see if the current CPU recently passed through a quiescent state that has not yet been reported up the rcu_node tree. In this case, line 15 counts the event and line 16 tells the caller that this CPU has core-RCU work to do.

Line 18 checks to see if this CPU has RCU callbacks whose grace period has expired, and if so, line 19 counts the event and line 20 tells the caller that this CPU has core-RCU work to do. Lines 22-25 operate similarly if there is no grace period in progress and this CPU has callbacks queued that need one, lines 26-29 operate similarly if the current CPU is not yet aware that a grace period has ended, lines 30-33 operate similarly if the current CPU is not yet aware that a grace period has started, and finally lines 34-38 operate similarly if a grace period has extended long enough that quiescent-state forcing is warranted.

Execution reaches line 39 if the RCU core needs nothing from the current CPU. This event is also counted, and line 40 informs the caller.

The rcu_preempt_pending() function, shown on lines 43-47, invokes __rcu_pending() to see if RCU-preempt core processing needs something from the current CPU. If RCU-preempt is not configured into the kernel, this function simply unconditionally returns zero. After all, that which is not there needs nothing. Usually, anyway.

Finally, the rcu_pending() function shown on lines 49-54 checks all RCU flavors, returning true if any of them require anything from the current CPU.

Core RCU Work Instigated by Callback Enqueuing

Normally, call_rcu() simply enqueues a callback and returns. However, there are some rather nasty code sequences that a user process can execute that generate very large quantities of callbacks, for example, close(open("/dev/null", O_RDONLY)) in a tight loop. Such code might well be pointless, but the kernel must nevertheless handle it gracefully. Therefore, __call_rcu() checks for this sort of condition and undertakes RCU core work as needed to avert disaster.

  1 static void
  2 __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
  3      struct rcu_state *rsp)
  4 {
  5   unsigned long flags;
  6   struct rcu_data *rdp;
  7 
  8   debug_rcu_head_queue(head);
  9   head->func = func;
 10   head->next = NULL;
 11   smp_mb();
 12   local_irq_save(flags);
 13   rdp = this_cpu_ptr(rsp->rda);
 14   *rdp->nxttail[RCU_NEXT_TAIL] = head;
 15   rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
 16   rdp->qlen++;
 17   if (__is_kfree_rcu_offset((unsigned long)func))
 18     trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func,
 19            rdp->qlen);
 20   else
 21     trace_rcu_callback(rsp->name, head, rdp->qlen);
 22   if (irqs_disabled_flags(flags)) {
 23     local_irq_restore(flags);
 24     return;
 25   }
 26   if (unlikely(rdp->qlen > rdp->qlen_last_fqs_check + qhimark)) {
 27     rcu_process_gp_end(rsp, rdp);
 28     check_for_new_grace_period(rsp, rdp);
 29     if (!rcu_gp_in_progress(rsp)) {
 30       unsigned long nestflag;
 31       struct rcu_node *rnp_root = rcu_get_root(rsp);
 32 
 33       raw_spin_lock_irqsave(&rnp_root->lock, nestflag);
 34       rcu_start_gp(rsp, nestflag);
 35     } else {
 36       rdp->blimit = LONG_MAX;
 37       if (rsp->n_force_qs == rdp->n_force_qs_snap &&
 38           *rdp->nxttail[RCU_DONE_TAIL] != head)
 39         force_quiescent_state(rsp, 0);
 40       rdp->n_force_qs_snap = rsp->n_force_qs;
 41       rdp->qlen_last_fqs_check = rdp->qlen;
 42     }
 43   } else if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
 44     force_quiescent_state(rsp, 1);
 45   local_irq_restore(flags);
}

Lines 8-21 deal with callback handling and will therefore be addressed in another article in this series. Line 22 checks to see if interrupts were disabled upon entry to __call_rcu(), in which case it might not be safe to invoke the RCU core due to potential deadlock situations. In this case, line 23 restores interrupts and line 24 returns.

Line 26 checks to see whether too many RCU callbacks (“too many” defaults to 10,000) have been enqueued since the last time __call_rcu() undertook RCU-core processing for the current CPU and RCU flavor, and if so, lines 27-42 do the core processing. Line 27 checks for the end of an old grace period, and line 28 checks for the beginning of a new grace period, but if line 29 finds that there is no grace period in progress despite the tens of thousands of callbacks queued on this CPU, then lines 30-34 start a new grace period.

Otherwise, an RCU grace period is in progress, so lines 36-42 attempt to accelerate it. Line 36 increases this CPU's rcu_data structure's ->blimit in order to avoid throttling callback invocation. If line 37 sees that quiescent states have not been forced recently and line 38 sees that there are RCU callbacks enqueued on this CPU that need another grace period (other than the callback we just now enqueued), then line 39 forces quiescent states vigorously. Lines 40 and 41 retrigger the check for forcing yet more quiescent states, just in case 10,000 additional RCU callbacks are posted soon after we return.

If this rcu_data structure's RCU callback queue is not excessively long, then line 43 checks to see if it is time to force quiescent states, and, if so, line 44 does the required forcing.

In either case, line 45 re-enables interrupts in preparation for returning to the caller.

Special Considerations for Preemptible RCU

Preemptible RCU must report a quiescent state when a task that blocked in an RCU-preempt read-side critical section completes that critical section. The rcu_preempt_blocked_readers_cgp(), rcu_report_unblock_qs_rnp(), and rcu_preempt_check_blocked_tasks() functions handle this task.

  1 static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
  2 {
  3   return rnp->gp_tasks != NULL;
  4 }
  5 
  6 static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
  7 {
  8   WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
  9   if (!list_empty(&rnp->blkd_tasks))
 10     rnp->gp_tasks = rnp->blkd_tasks.next;
 11   WARN_ON_ONCE(rnp->qsmask);
 12 }
 13 
 14 static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
 15   __releases(rnp->lock)
 16 {
 17   unsigned long mask;
 18   struct rcu_node *rnp_p;
 19 
 20   if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
 21     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 22     return;
 23   }
 24   rnp_p = rnp->parent;
 25   if (rnp_p == NULL) {
 26     rcu_report_qs_rsp(&rcu_preempt_state, flags);
 27     return;
 28   }
 29   mask = rnp->grpmask;
 30   raw_spin_unlock(&rnp->lock);
 31   raw_spin_lock(&rnp_p->lock);
 32   rcu_report_qs_rnp(mask, &rcu_preempt_state, rnp_p, flags);
 33 }

The rcu_preempt_blocked_readers_cgp() function shown on lines 1-4 checks to see if there are any RCU readers holding up the current grace period that have blocked at least once during their current RCU read-side critical section, and that have therefore been queued on the specified rcu_node structure. The actual check on line 3 is far simpler than the description: If the specified rcu_node structure's ->gp_tasks field is non-NULL, then there is at least one such reader.

The rcu_preempt_check_blocked_tasks() function shown on lines 6-12 handles special RCU-preempt processing at grace-period start. Line 8 complains if there are readers listed as blocking the new grace period despite the fact that this grace period has not really started yet. Line 9 checks to see if there are any tasks that have blocked at least once within their current RCU read-side critical sections, and if so, line 10 marks them all as blocking the new grace period. Finally, line 11 complains if any CPUs are still marked as blocking the grace period, again despite the fact that the new grace period has not really started yet.

Quick Quiz 22: How can you be sure that all the tasks that blocked at least once in their current RCU read-side critical sections need to block the new grace period?
Answer

The rcu_report_unblock_qs_rnp() function shown on lines 14-33 reports a quiescent state up the rcu_node hierarchy when the last task that blocked in its current RCU read-side critical section exits that critical section. Line 20 checks to see if any CPUs or tasks are still blocking the current grace period, and if so, line 21 restores interrupts and line 22 returns. Line 24 advances up the rcu_node hierarchy, but if the current rcu_node structure is the root, then line 26 invokes rcu_report_qs_rsp() to end the current grace period and line 27 returns.

Otherwise, there is a parent node. In this case, line 29 obtains the mask containing the bit corresponding to the current rcu_node structure in its parent's ->qsmask field. Line 30 releases the current rcu_node structure's ->lock and line 31 acquires the parent's ->lock. Finally, line 32 invokes rcu_report_qs_rnp() to propagate the quiescent state up the rcu_node hierarchy.

Quick Quiz 23: How can we be sure that the old grace period won't end before we can report the quiescent state? In other words, why don't we need to record the grace-period number and check it, as is done in rcu_report_qs_rdp()?
Answer

Summary

This article has described RCU's grace-period computation. It has focused on the simple case, ignoring interactions with dyntick-idle mode, offline CPUs, RCU priority boosting, and RCU CPU stall warnings. These topics are taken up in separate articles in this series.

Acknowledgments

I am grateful to Cheng Xu and Mingming Cao for their help in rendering this article human readable.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Is there a type of quiescent state other than CPU hotplug events that is not handled on a per-CPU basis?

Answer: Yes. RCU-preempt's outermost rcu_read_unlock() signals entry into a quiescent state, but is handled on a per-task basis rather than on a per-CPU basis. This is for performance reasons: rcu_read_unlock() can be invoked extremely frequently, and processing it on a per-CPU basis would require disabling preemption. By handling this type of quiescent state on a per-task basis, we avoid the overhead of disabling and re-enabling preemption. (We are looking to go further and to also avoid the function-call overhead currently incurred by RCU-preempt's rcu_read_lock() and rcu_read_unlock() implementations, but that is a job for another day.)

Back to Quick Quiz 1.

Quick Quiz 2: But doesn't this mean that call_rcu() is not wait free?

Answer: This safety feature does indeed cost call_rcu() any pretense of unconditional wait-freedom, but system survivability trumps academic purity any day of the week. I am after all a developer, not a researcher!

Back to Quick Quiz 2.

Quick Quiz 3: How does a given non-root rcu_node structure know which of its parent's ->qsmask bits to clear?

Answer: Each rcu_node structure has a ->grpmask mask with a single bit set that corresponds to this rcu_node structure's bit in its parent's ->qsmask field.

Back to Quick Quiz 3.

Quick Quiz 4: What happens with the blkd_tasks list and the gp_tasks pointer for RCU-bh and RCU-sched?

Answer: Nothing happens with them. In the rcu_node structures handling RCU-bh and RCU-sched, the ->blkd_tasks lists remain empty and the ->gp_tasks pointers remain NULL.

Back to Quick Quiz 4.

Quick Quiz 5: Why bother with the ->passed_quiesce_gpnum field? Given that the grace period cannot end until each CPU passes through a quiescent state, the grace-period number cannot change, so what is the point of tracking it?

Answer: Unfortunately, the grace-period number can in fact change between the time that a CPU passes through a quiescent state and the time that it gets around to announcing this to RCU. The reason for this is that dyntick-idle and CPU-offline events can cause other CPUs to announce on behalf of this CPU. If the other CPU announces before this CPU gets around to it, the grace-period number might well have changed. Therefore, the ->passed_quiesce_gpnum field must be meticulously checked in order to avoid erroneously announcing a quiescent state from some past grace period.

Back to Quick Quiz 5.

Quick Quiz 6: What is the exceptional case where a CPU can assume that its callbacks need only wait until the end of the next grace period, despite that CPU being aware that the prior grace period has ended?

Answer: The exception is the CPU that starts the next grace period.

Back to Quick Quiz 6.

Quick Quiz 7: Why not just immediately announce the quiescent state to the RCU core? Wouldn't that be far simpler and faster?

Answer: It might be simpler, but it would be very unlikely to be faster. In most workloads, there are far more quiescent states than grace periods, so it makes sense to optimize the performance of quiescent states. Announcing a quiescent state to the RCU core requires acquiring the corresponding rcu_node structure's ->lock, which is not acceptable on the context-switch fastpath. We therefore have the quiescent states interact with the rcu_data structure, and announce to the RCU core only once per grace period per CPU.

Back to Quick Quiz 7.

Quick Quiz 8: Why are the positions of the “.” and the “?” in the diagram reversed? After all, CPU 1 has announced a quiescent state to the RCU core and CPU 0 has not yet done so.

Answer: The “qsm.?” represents a bit mask, so that CPU 0 corresponds to the low-order bit. Yes, this can be confusing, but it is several thousand years too late to advocate for little-endian representation of numerical values. Besides which, little-endian representation would probably simply change the nature of the confusion, not eliminate it. In this case, the confusion would simply move from within the diagram to between the diagram and the code. Having the confusion within the diagram makes it more obvious, thus reducing the chance that people will inject bugs into RCU due to their failure to recognize that they are confused. You might as well face the fact that life is inherently confusing. The sooner you reconcile yourself to that fact, the better off you will be.

Back to Quick Quiz 8.

Quick Quiz 9: Why not make CPU 0 announce its own quiescent states? Wouldn't that simplify things by eliminating a class of race conditions?

Answer: This would indeed eliminate a class of race conditions, but it would unfortunately also sharply limit power savings. RCU therefore accepts the race conditions (which are mediated straightforwardly by the ->lock field in the rcu_node structure) in order to allow dyntick-idle CPUs to remain in deeper sleep states for longer periods of time.

Back to Quick Quiz 9.

Quick Quiz 10: What is the purpose of the ACCESS_ONCE() calls?

Answer: ACCESS_ONCE() simply returns its argument, but uses volatile casts in order to prevent the compiler from refetching its argument (as it might in cases of register pressure) or from combining successive accesses to the same variable. This is unnecessary if rcu_gp_in_progress() is invoked with the root rcu_node structure's lock held, but is required otherwise. However, the performance impact is too small to justify multiple versions of rcu_gp_in_progress(), so we instead have a single version that conservatively uses ACCESS_ONCE() even when called from code paths where this is unnecessary.

Back to Quick Quiz 10.

Quick Quiz 11: In __rcu_process_gp_end() in line 20, why bother comparing rdp->gpnum to rdp->completed? Why not just unconditionally set rdp->gpnum to a sane value, for example, rnp->gpnum? Or even rdp->completed?

Answer: Unconditionally setting rdp->gpnum to rnp->gpnum could cause the CPU to fail to initialize for the new grace period, which could result in the grace period failing to ever complete. Unconditionally setting rdp->gpnum to rdp->completed suffers from a more subtle failure mode. It turns out that there are sequences of events that can result in a given CPU becoming aware of a new grace period before realizing that the old one ended. Unconditionally updating rdp->gpnum could therefore cause the CPU to forget that it had already noted the new grace period.

Back to Quick Quiz 11.

Quick Quiz 12: But nothing can yet apply quiescent states to this new grace period, so does it really make sense to say that it has started?

Answer: You might well argue that grace periods start at any number of places, including the following, in rough order of increasing time:

  1. The time of the enqueuing of the first RCU callback requiring this new grace period. After all, if there was no such callback, there would be no need for a new grace period.
  2. Lines 8, 9, and 13 of rcu_start_gp(), as these lines made the final decision to start a new grace period.
  3. Line 18 of rcu_start_gp(), which incremented the rcu_state structure's ->gpnum field.
  4. Lines 27 and 44 of rcu_start_gp(), which communicate the new grace period down to the rcu_node structures.
  5. Function __note_new_gpnum(), which sets up a given CPU's rcu_data structure to detect quiescent states for the new grace period.

So, which is it? The only reasonable answer is “if you have to ask, you are giving up all hope of constructing a production-quality RCU implementation.” In short, it is best to construct RCU so that it simply doesn't care. More generally, avoiding caring too much about exactly when things start and stop is a good parallel-programming design principle.

Back to Quick Quiz 12.

Quick Quiz 13: Why not change the locking hierarchy so that we could just hold the root rcu_node structure's ->lock while acquiring ->onofflock? That would get rid of the single-node special case, simplifying the code.

Answer: The problem is that the CPU-hotplug code path requires us to acquire the root rcu_node structure's ->lock while holding that of a leaf rcu_node structure. We therefore absolutely must drop the root rcu_node structure's lock before acquiring that of a leaf rcu_node structure. One way to combine those two code paths while still avoiding deadlock would be to omit the single-node optimization entirely. Now that leaf rcu_node structures handle at most 16 CPUs (rather than 32 or 64), this approach might make some sense. However, the conditional is set up so that gcc should be able to sort things out at compile time, so there is no runtime penalty for the check.

Back to Quick Quiz 13.

Quick Quiz 14: That is just plain silly! Given that CPUs are coming online and going offline anyway, what possible sense does it make for RCU to bury its head in the sand and ignore these CPU-hotplug events?

Answer: It is not a matter of ignoring the events, but rather of keeping RCU's state space down to a dull roar. The CPU hotplug events will update state as soon as we release the rcu_state structure's ->onofflock at the end of rcu_start_gp(), and will apply their changes to the rcu_node tree at that time. In the meantime, holding off those changes allows a much simpler implementation of rcu_start_gp().

Back to Quick Quiz 14.

Quick Quiz 15: Why not abstract rcu_node initialization?

Answer: Because I just now noticed that it might be a good idea to do so. But are there other options?

Back to Quick Quiz 15.

Quick Quiz 16: But this is redundant with the check in rcu_report_qs_rdp()! Why not remove one of these checks?

Answer: Unfortunately, rcu_report_qs_rdp() needs to keep its check in order to properly update the rcu_data structure, and rcu_report_qs_rnp() needs to keep its check due to its being called from functions other than rcu_report_qs_rdp().

Back to Quick Quiz 16.

Quick Quiz 17: But we released the previous rcu_node structure's ->lock, so why couldn't a new grace period have started? This might well result in some of the child rcu_node structure's ->qsmask bits being set, wouldn't it?

Answer: This cannot happen because the next grace period cannot start until after the current grace period ends, and the current grace period cannot end until all the quiescent states are reported up the rcu_node tree. One such quiescent state is currently being reported by the current CPU, so until this CPU finishes, there can be no new grace period and thus no bits set in the child rcu_node structure's ->qsmask field.

Back to Quick Quiz 17.

Quick Quiz 18: Why open-code cpu_needs_another_gp() on line 13 of rcu_report_qs_rsp()?

Answer: Because cpu_needs_another_gp() fails if there is already a grace period in progress, which there currently is. The obvious way of avoiding this problem would be to move the assignment to rsp->completed on line 23 up to precede line 13, but this would allow some other CPU to be starting a new grace period while the current CPU is marking the old grace period as being completed, which is at best unclean. So the explicit check on line 13 really is necessary.

Back to Quick Quiz 18.

Quick Quiz 19: Why does line 5 of rcu_check_quiescent_state() just return? Isn't that giving up a chance to report a quiescent state?

Answer: Because the CPU just now learned of the grace period, there is no way that it can have already passed through a quiescent state for this new grace period. To see this, take a look back at the implementation of check_for_new_grace_period() and then see what would happen if the remainder of rcu_check_quiescent_state() were executed in that state.

One exception to this is RCU-preempt, which could in principle check to see if the current CPU was in an RCU read-side critical section. This is a potential future optimization, but a low-priority one.

Back to Quick Quiz 19.

Quick Quiz 20: Why not check rdp->passed_quiesce_gpnum right in rcu_check_quiescent_state() rather than incurring the extra function-call overhead (and added argument) passing it in to rcu_report_qs_rdp()?

Answer: Because we must hold this CPU's leaf rcu_node structure's ->lock in order to safely carry out the needed comparison.

Back to Quick Quiz 20.

Quick Quiz 21: But __rcu_process_callbacks() fails to release the root rcu_node structure's ->lock! Won't that result in deadlock?

Answer: No deadlocks will result because rcu_start_gp() releases that lock.

Back to Quick Quiz 21.

Quick Quiz 22: How can you be sure that all the tasks that blocked at least once in their current RCU read-side critical sections need to block the new grace period?

Answer: By definition, all RCU read-side critical sections that start before a given grace period must complete before that grace period can be allowed to complete. Therefore, all such critical sections must block the new grace period.

Back to Quick Quiz 22.

Quick Quiz 23: How can we be sure that the old grace period won't end before we can report the quiescent state? In other words, why don't we need to record the grace-period number and check it, as is done in rcu_report_qs_rdp()?

Answer: Unlike for the CPU's quiescent states handled by rcu_report_qs_rdp(), there is no possibility of some other CPU or task reporting a quiescent state on behalf of the current task. Instead, the task must remove itself from the ->blkd_tasks list and report its own quiescent state. Because this code path has verified that this task was the last thing holding up the current grace period, there is no possibility of a new grace period starting before this task completes reporting its quiescent state.

Back to Quick Quiz 23.