September 21, 2012 (Linux 3.6+)
This article was contributed by Paul E. McKenney
And then there are of course the customary answers to the quick quizzes.
RCU updaters must wait for all pre-existing RCU read-side critical sections, and the ends of RCU read-side critical sections are communicated via quiescent states. The length of time for all pre-existing RCU read-side critical sections to complete is known as a grace period, and the job of grace-period detection is therefore to efficiently process quiescent-state information, which is a special challenge on large systems.
Grace-period detection is therefore done using a combining tree, with different types of quiescent-state information being fed in at different levels, as shown in the following diagram:
The reason that dyntick-idle information enters via the
rcu_dynticks
structure is that this information is specific
to each CPU, but common to all flavors of RCU, and thus is processed
by the per-CPU data structure that is common to all flavors of RCU.
Similarly, passage of CPUs through normal quiescent states (such as
context switches) is handled by the per-CPU rcu_data
structure because this information is specific both to the corresponding
CPU and to the flavor of RCU.
In addition, both dyntick-idle transitions and normal quiescent states
can occur extremely frequently, as in many times per millisecond.
It is therefore critically important that these events be handled
as efficiently as possible, in other words, on a per-CPU basis.
Quick Quiz 1:
Is there a type of quiescent state other than CPU hotplug
events that is not handled on a per-CPU basis?
Answer
In contrast, CPU hotplug events are normally rare and incur high
overhead.
There is therefore no reason to avoid locking for CPU hotplug events,
and so they are tracked at the leaf level of the rcu_node
tree.
As noted in the article on RCU data structures, quiescent states propagate
up from the bottom of the tree in the above diagram, and grace periods
emerge out the top.
The full grace-period detection process is extensive, so discussion of it is subdivided into four sections, as follows:
This article covers grace-period detection, outlined in the uppermost blue dashed box in the above diagram. This covers the portions of the grace-period-detection process that are exercised most heavily on a system that has frequent short bursts of computation on all CPUs that is not undergoing any CPU hotplug operations. In this case, any dyntick-idle sojourns will be short enough that they will not hold up grace periods and there will be no need for RCU priority boosting—and hopefully no need for stall detection. The other three boxes are discussed in other articles.
RCU's grace-period detection is a state machine that is driven
primarily from RCU_SOFTIRQ, which invokes rcu_process_callbacks().
This state machine can also be invoked from __call_rcu(),
which is a safety feature that prevents a blizzard of call_rcu()
invocations from OOMing the system.
Quick Quiz 2:
But doesn't this mean that call_rcu() is not wait-free?
Answer
Grace periods are tracked in different ways at different levels
of the combining tree.
At the very top, the rcu_state
structure has its
->gpnum
field one greater than its
->completed
field if there is a grace period in
progress, otherwise these two fields are equal to each other.
These two fields appear in the other structures, but the instances
in the rcu_state
structure are the master copies.
Within the tree itself, each rcu_node
structure
uses the ->qsmask
mask to track which of its
descendants are not ready for the current grace period to complete.
At the leaf rcu_node
level, each bit of the
->qsmask
mask corresponds to a CPU.
For RCU-preempt, each rcu_node
structure also maintains a
->blkd_tasks
list of tasks that have incurred
a context switch within their current RCU read-side critical sections.
The ->gp_tasks
pointer references the first of those
tasks that is preventing the current grace period from completing,
or is NULL
if there is no such task.
Therefore, once a given rcu_node structure's ->qsmask
mask is zero and its ->gp_tasks pointer is NULL,
then every CPU and task represented
by the subtree up to and including that rcu_node
structure is ready for the current grace period to end.
If this rcu_node
structure is at the root of the tree,
that means that the grace period has ended, otherwise it means
this rcu_node
structure's bit in its parent's
->qsmask
field may now be cleared.
Quick Quiz 3:
How does a given non-root rcu_node
structure
know which of its parent's ->qsmask
bits to clear?
Answer
Quick Quiz 4:
What happens with the blkd_tasks
list and
the gp_tasks
pointer for RCU-bh and RCU-sched?
Answer
Feeding into the bottom of the combining tree are the
rcu_data
structures, one per CPU.
Each structure has a ->qs_pending
field that
indicates that a quiescent state is needed from the corresponding
CPU, a ->passed_quiesce
field that indicates that
this CPU has in fact passed through a quiescent state, and finally
a ->passed_quiesce_gpnum
field that indicates
what grace-period number was in effect when the CPU passed through
the quiescent state.
Quick Quiz 5:
Why bother with the ->passed_quiesce_gpnum
field?
Given that the grace period cannot end until each CPU passes through
a quiescent state, the grace-period number cannot change, so what
is the point of tracking it?
Answer
Finally, the rcu_dynticks
structure has the
->dynticks
field, which contains an even value
if this CPU is in dyntick-idle mode (which is an extended quiescent
state) and contains an odd value otherwise.
The primary job of the grace-period state machine is to propagate quiescent-state indications as far up the tree as possible, declaring the end of the current grace period when the whole tree has cleared.
This example features a system with four CPUs, and uses a tree of
rcu_node
structures with one root node and a pair of
leaf nodes, as shown below:
The annotations show the values of the relevant fields of each
structure.
The “g0c0” in each structure indicates that both the
->gpnum
and ->completed
fields of
each structure are zero, so that RCU is idle.
The “fqs:0” in the rcu_state
structure indicates
that the ->fqs_state
field is zero, which again indicates
that RCU is idle.
This field controls the state machine for forcing quiescent states,
which can take on the following values:
RCU_GP_IDLE=0
indicates that RCU is idle.
RCU_GP_INIT=1
indicates that RCU is initializing
a new grace period.
RCU_SAVE_DYNTICK=2
indicates the first phase
of quiescent-state forcing is in progress, in which
initial dyntick-idle information is collected.
RCU_FORCE_QS=3
indicates that the second phase
of quiescent state forcing is in progress, in which
dyntick-idle quiescent states are reported, along with
offline CPUs.
In addition, online CPUs that have failed to pass through
a quiescent state are sent reschedule IPIs.
The “qsm..” in the rcu_node
structures indicate
that the ->qsmask
fields are all zero, in other words,
RCU is not waiting on any CPU (which is not surprising given that
RCU is idle).
The “b” and “g” in the rcu_node
structures indicate that the ->blkd_tasks
and ->gp_tasks
pointers, respectively, are NULL.
The “qsp:0” in the rcu_data
structures
indicates that the ->qs_pending
fields are zero,
which in turn means that RCU is not waiting for a quiescent state from
any of them.
The “pq” in the rcu_data
structures
indicate that the ->passed_quiesce
fields are set
to 1, which in turn means that each CPU has passed a quiescent
state since the beginning of the last grace period.
The “pgc:0” in the rcu_data
structures
indicate that the ->passed_quiesce_gpnum
fields are
set to zero, which in turn means that each CPU last passed through
a quiescent state during grace period 0.
The “dts:0” in the rcu_data
structures
indicate that the ->dynticks_snap
fields are set
to zero, which likely means that they are at their initialized values,
but could also mean that each CPU needed quiescent-state forcing
during grace period 0.
Finally, the “dt:1” in the rcu_dynticks
structure indicates that the ->dynticks
fields
are all equal to one, which is an odd number, in turn indicating
that none of the CPUs is in dyntick-idle mode.
The fact that quiescent states and grace periods propagate through this tree non-atomically means that different CPUs might disagree about the grace-period state, as shown in the following chart:
Each row of this figure corresponds to one leaf
rcu_node
structure, with time advancing from left to right.
In the blue region in the lower left, a grace period has started, but
some rcu_node
structures (and therefore also their CPUs)
are not yet aware of it.
In other words, the rcu_state
structure has been updated
to reflect the new grace period, but the blue-colored rcu_node
structures have not.
Nevertheless, during this period, some CPUs (such as those corresponding
to CPU 0) might already have passed through their quiescent
states.
This means that a CPU cannot assume that its callbacks need only wait
until the end of the next grace period, because that grace period
might well already have started.
In the pink region, all rcu_node
structures have been updated
to reflect the new grace period, but at least one CPU corresponding to each
rcu_node
structure has not yet managed to pass through
a quiescent state.
The pink rcu_node
structures therefore have non-zero
->qsmask
fields.
In contrast, all CPUs corresponding to the orange rcu_node
structures have already passed through a quiescent state, but at
least one CPU corresponding to some other rcu_node
structure has not yet done so.
In other words, the orange rcu_node
structures'
->qsmask
fields are zero, but there must be at least
one other rcu_node
structure with a non-zero
->qsmask
field.
In the yellow region, the ->qsmask
fields of all
of the rcu_node
structures are zero, which means that
the grace period has completed, but the rcu_node
structures'
->completed
fields have not yet been updated accordingly.
It is important to note that it is impossible to distinguish between
the orange and yellow states by looking at any individual
leaf rcu_node
structure: One must instead look at either the
root rcu_node
structure, the rcu_state
structure, or all of the leaf rcu_node
structures.
Interestingly enough, this yellow region is the last chance for
CPUs to assume that their callbacks need only wait for the end
of the next grace period, with one exception.
Quick Quiz 6:
What is the exceptional case where a CPU can assume that its callbacks
need only wait until the end of the next grace period, despite that
CPU being aware that the prior grace period has ended?
Answer
In the green region, RCU is idle: no grace period is in progress. This sequence repeats, as can be seen in the right-hand side of the diagram.
The following example will illustrate this grace-period process in more detail:
rcu_state
structure.
rcu_data
structure indicates that
it has already passed through a quiescent state for the
already-completed grace period, and therefore takes no action.
rcu_node
structure for
the new grace period.
rcu_node
structure for
itself and CPU 0 for the new grace period.
rcu_data
structure for the
new grace period.
rcu_node
structure for
CPUs 2 and 3 for the new grace period.
rcu_data
structure.
rcu_data
structure.
rcu_node
structure.
rcu_node
structure.
rcu_node
structure.
Because CPU 1 has already announced a quiescent state
for this grace period, this task does not block grace
period 1, but might block grace period 2 if
it remains preempted long enough.
Nevertheless, task B is queued onto CPU 1's
rcu_node
structure.
rcu_node
structure, announcing that everything
represented by the left subtree has passed through a
quiescent state.
rcu_node
structure, and, because there are no more tasks on this
structure and because both CPUs have passed through quiescent
states, CPU 3 continues up to the root
rcu_node
, announcing that everything
represented by the right subtree has passed through a
quiescent state.
rcu_node
structure reflects
the entire tree having passed through a quiescent state,
CPU 3 ends grace period 1.
rcu_state Structure

Grace periods are started in the following situations:
__call_rcu()
,
and finds a large backlog of callbacks when there is no grace
period in progress.
__rcu_process_callbacks()
finds that it has callbacks in need of a grace period, and no
grace period is in progress.
force_quiescent_state()
, which has now
completed.
In our example, CPU 1 starts a new RCU grace period by invoking
rcu_start_gp()
, which checks to see if starting a grace
period is appropriate, and, if so, increments the rcu_state
structure's ->gpnum
field, resulting in the state shown
below:
At this point, we proceed with CPU 3's early quiescent state.
When CPU 3 passes through a quiescent state, it sets its
rcu_data
structure's ->passed_quiesce_gpnum
to ->gpnum
and ->passed_quiesce
to 1.
But these are the values that these fields already had, so there is
no effect.
Which is to be expected: There is no grace period active, so there
is nothing for the quiescent states to do.
When CPU 3 takes a scheduling-clock interrupt, it invokes
__rcu_pending()
, which finds that
this CPU's rcu_data
structure's ->qs_pending
field is zero, which in turn causes it to refrain from invoking the
RCU core code.
Again, this is to be expected, as there is no grace period active,
so there is nothing for RCU to do.
rcu_node Structure

CPU 1 continues through rcu_start_gp(),
initializing the root rcu_node structure.
This initialization includes the ->gpnum
field,
->completed
field (which is initialized to the
same zero value that it had originally), and the ->qsmask
field (where the question marks indicate that quiescent states are
required from everything).
Because initialization has started, the rcu_state
structure's ->fqs_state
field is set to “1”.
This results in the following state:
At this point, we initialize the leftmost
rcu_node
structure.
rcu_node Structure

CPU 1 continues further through rcu_start_gp(),
initializing the left-most rcu_node structure, which
covers CPUs 0 and 1.
This proceeds much as for the root rcu_node structure,
with the following result:
CPU 1 has now initialized two of the three rcu_node
structures, and has but one left to go.
However, CPU 1 is handled by the rcu_node
structure
that was just initialized, which requires special handling.
rcu_data Structure

At this point, CPU 1 initializes its own
rcu_data structure, setting the ->qs_pending
field and clearing the ->passed_quiesce field,
as shown below.
CPU 1 could now start announcing quiescent states against this new grace period, except for the fact that it is still busy initializing it. In general, however, CPUs can and do announce quiescent states against grace periods that are not yet fully initialized. It is important to note that other CPUs can do useful work while CPU 1 is initializing for the new grace period, for example, they might enter dyntick-idle mode.
CPU 0 then enters dyntick-idle mode, so that its
rcu_dynticks
structure's ->dynticks
field is incremented from 1 to 2, as shown below:
Because CPU 0 is now in dyntick-idle mode, it no longer needs to inform the RCU core of passage through quiescent states. Instead, other CPUs will (eventually) recognize that CPU 0 is in dyntick-idle mode, and thus is in an extended quiescent state. These other CPUs will therefore announce CPU 0's quiescent states on its behalf.
But now back to CPU 1's initialization for the new RCU grace period.
rcu_node Structure

CPU 1 then continues initializing the new grace period in
rcu_start_gp() by initializing the right-hand
rcu_node structure, the one corresponding to
CPUs 2 and 3.
This completes initialization, so the fqs state is advanced.
This proceeds as before, with the result as shown below:
The new RCU grace period is now fully initialized, so quiescent
states may now be announced against it by any CPU that has initialized
its own rcu_data
structure, which in this example includes
only CPU 1.
CPU 2 now notices that a new grace period has started
by comparing its rcu_data
structure's
->gpnum
field to that of its leaf rcu_node
structure.
These two fields differ, and thus CPU 2 initializes its
rcu_data
structure to account for this new grace period,
as shown below:
CPU 2's rcu_data
structure is now set to indicate
that the RCU core needs a quiescent state from CPU 2 and has not yet
seen one.
At this point, CPU 1 passes through a quiescent state.
Because it has already initialized its rcu_data
structure
to reflect the new (now current) grace period, this quiescent state is
applied against this grace period, as shown below:
CPU 1's rcu_data
structure now shows that
CPU 1 has passed through a quiescent state that applies to
grace period number 1.
However, this quiescent state has not yet been announced to the
RCU core, so CPU 1's rcu_node
structure is still
unaware that CPU 1 has passed through a quiescent state.
Quick Quiz 7:
Why not just immediately announce the quiescent state to the RCU core?
Wouldn't that be far simpler and faster?
Answer
But before that announcement happens, CPU 3 joins the grace-period-1 party.
At this point, CPU 3 notices that its rcu_data
structure's ->gpnum
field does not match that of
its leaf rcu_node
structure, and therefore initializes
its rcu_data
structure to reflect the current grace
period, as shown below:
CPU 3 is now ready to record quiescent states against the current grace period. But now back to CPU 2...
CPU 2 now passes through a quiescent state, and because it has
initialized its rcu_data
structure to reflect the current
grace period, this quiescent state may be applied against the current
grace period.
CPU 2 therefore updates its rcu_data
structure as
follows:
CPU 2 will announce this quiescent state to the RCU core later. In the meantime, over to CPU 1.
At this point, CPU 1 takes a scheduling-clock interrupt.
Because CPU 1's rcu_data
structure indicates that
it has passed through a quiescent state for the current grace period
that the RCU core does not yet know about,
this quiescent state is announced to the RCU core.
The ->qs_pending
field of CPU 1's
rcu_data
structure is cleared, as is the
corresponding bit in
the leftmost rcu_node
structure's
->qsmask
field.
Because the bit corresponding to CPU 0 is still set,
the information does not propagate up to the root
rcu_node
structure.
This results in the state shown below:
Quick Quiz 8:
Why are the positions of the “.” and the “?”
in the diagram reversed?
After all, CPU 1 has announced a quiescent state to the RCU core
and CPU 0 has not yet done so.
Answer
CPU 0 must report a quiescent state before any change can
be propagated from the leftmost rcu_node
structure up to the root.
And now CPU 2 takes a scheduling-clock interrupt.
Because CPU 2's rcu_data
structure indicates that
it has passed through a quiescent state for the current grace period
that the RCU core does not yet know about,
this quiescent state is announced to the RCU core.
The ->qs_pending
field of CPU 2's
rcu_data
structure is cleared, as is the
corresponding bit in
the rightmost rcu_node
structure's
->qsmask
field.
Because the bit corresponding to CPU 3 is still set,
the information does not propagate up to the root
rcu_node
structure.
This results in the state shown below:
CPU 3 must report a quiescent state before any change can
be propagated from the rightmost rcu_node
structure up to the root.
CPU 3 then context switches away from task A while
in an RCU read-side critical section.
Because CPU 3 has not previously passed through a quiescent state
during this grace period, task A is queued on the rightmost
rcu_node
structure, with both the ->blkd_tasks
and ->gp_tasks
pointers referencing it.
CPU 3 also records a quiescent state in its rcu_data
structure by setting the ->passed_quiesce
field to 1
and the ->passed_quiesce_gpnum
field to the
current grace-period number.
This results in the following:
Task A must resume and complete its RCU read-side critical section before the current grace period can complete.
Now CPU 3 takes a scheduling-clock interrupt.
Because CPU 3's rcu_data
structure indicates that
it has passed through a quiescent state for the current grace period
that the RCU core does not yet know about,
this quiescent state is announced to the RCU core.
The ->qs_pending
field of CPU 3's
rcu_data
structure is cleared, as is the
corresponding bit in
the rightmost rcu_node
structure's
->qsmask
field.
All of the rightmost rcu_node
structure's ->qsmask
bits are now clear, but because
task A is still blocked in an RCU read-side critical section,
the information does not propagate up to the root
rcu_node
structure.
This results in the state shown below:
CPU 0 and task A are now the only things blocking completion of the current grace period.
CPU 1 then context switches away from task B while
in an RCU read-side critical section.
Task B is therefore queued on the leftmost
rcu_node
structure, but
because CPU 1 has already passed through a quiescent state
during this grace period, only the ->blkd_tasks
pointer references it.
This results in the following:
Both CPU 0 and task A are still blocking completion of the current grace period. If it stays preempted long enough, task B will eventually block the next grace period.
CPU 1 now forces quiescent states.
Because the rcu_state
structure's
->fqs_state
field is currently “2”
(RCU_SAVE_DYNTICK
), this pass simply snapshots dyntick-idle
state, but might also carry out RCU priority boosting.
The only CPU that has not yet passed through a quiescent state is
CPU 0, so its rcu_dynticks
structure's
->dynticks
counter is copied to its
rcu_data
structure's ->dynticks_snap
field.
Finally, the rcu_state
structure's ->fqs_state
field is set to “3” (RCU_FORCE_QS
),
resulting in the following:
Both CPU 0 and task A are still blocking completion of the current grace period. However, the RCU core is one step closer to determining that CPU 0 is in dyntick-idle mode, which is a quiescent state.
CPU 0 now takes an interrupt, which causes it to exit
dyntick-idle mode, at least from RCU's perspective.
This CPU's rcu_dynticks
structure's
->dynticks
counter is therefore incremented, giving
the odd (non-dyntick-idle) value of “3”, as shown below.
Still we have both CPU 0 and task A blocking completion of the current grace period.
CPU 0 now returns from interrupt, which causes it to re-enter
dyntick-idle mode, again, at least from RCU's perspective.
This CPU's rcu_dynticks
structure's
->dynticks
counter is therefore incremented once
again, giving
the even (dyntick-idle) value of “4”, as shown below.
CPU 0 is once again in an extended quiescent state.
Now CPU 3 notes that it has been some time since the
last forcing of quiescent states and that the grace period is still
in progress, and therefore forces quiescent states once more.
It notes that CPU 0's rcu_dynticks
structure's
->dynticks
field is even, which indicates that
CPU 0 is in an extended quiescent state.
It therefore announces this to the RCU core on CPU 0's behalf.
It might also priority-boost task A.
Quick Quiz 9:
Why not make CPU 0 announce its own quiescent states?
Wouldn't that simplify things by eliminating a class of race conditions?
Answer
This announcement clears all of the ->qsmask
bits
in the leftmost rcu_node
structure, so this time
state is propagated to the root rcu_node
structure,
as shown below.
Now only task A is blocking completion of the current grace period.
At this point, task A resumes on CPU 1 and completes its
RCU read-side critical section.
It removes itself from the rightmost rcu_node
structure's
->blkd_tasks
list, and notes that it was the last entity
on this structure blocking the current grace period.
It therefore propagates state up to the root rcu_node
structure,
and finds that there is no longer anything blocking the current grace
period.
It therefore updates the rcu_state
structure's
->completed
field to match the ->gpnum
field, and then similarly updates the ->completed
fields of all of the rcu_node
structures,
resulting in the state shown below:
The grace period has now officially completed, but none of the
CPUs are yet aware of this fact.
They will become aware on their next invocation of the RCU core,
when they will update the ->completed
field of
their own rcu_data
structures.
Starting a grace period involves rcu_gp_in_progress(),
cpu_needs_another_gp(), __rcu_process_gp_end(),
__note_new_gpnum(), and rcu_start_gp_per_cpu(),
each of which is shown below:
  1 static int rcu_gp_in_progress(struct rcu_state *rsp)
  2 {
  3   return ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum);
  4 }
  5
  6 static int
  7 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
  8 {
  9   return *rdp->nxttail[RCU_DONE_TAIL] && !rcu_gp_in_progress(rsp);
 10 }
 11
 12 static void __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 13 {
 14   if (rdp->completed != rnp->completed) {
 15     rdp->nxttail[RCU_DONE_TAIL] = rdp->nxttail[RCU_WAIT_TAIL];
 16     rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_READY_TAIL];
 17     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 18     rdp->completed = rnp->completed;
 19     trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuend");
 20     if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
 21       rdp->gpnum = rdp->completed;
 22     if ((rnp->qsmask & rdp->grpmask) == 0)
 23       rdp->qs_pending = 0;
 24   }
 25 }
 26
 27 static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 28 {
 29   if (rdp->gpnum != rnp->gpnum) {
 30     rdp->gpnum = rnp->gpnum;
 31     trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpustart");
 32     if (rnp->qsmask & rdp->grpmask) {
 33       rdp->qs_pending = 1;
 34       rdp->passed_quiesce = 0;
 35     } else
 36       rdp->qs_pending = 0;
 37   }
 38 }
 39
 40 static void rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 41 {
 42   __rcu_process_gp_end(rsp, rnp, rdp);
 43   rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 44   rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 45   __note_new_gpnum(rsp, rnp, rdp);
 46 }
The rcu_gp_in_progress()
function is shown above on
lines 1-4.
It simply compares the specified rcu_state
structure's
->completed
and ->gpnum
fields,
returning true if they differ, in other words, if there is an RCU
grace period in progress.
Quick Quiz 10:
What is the purpose of the ACCESS_ONCE()
calls?
Answer
The cpu_needs_another_gp()
function,
shown on lines 6-10, checks to see
if this CPU has callbacks that are waiting for a grace period
(*rdp->nxttail[RCU_DONE_TAIL]
) when there is no grace
period in progress.
When this condition occurs, the calling CPU will normally start a new
grace period.
The __rcu_process_gp_end()
function is shown
on lines 12-25, and is used by the CPU starting the new grace
period to accelerate recognition of the completion of the old one.
Line 14 checks to see if the current CPU is already aware of
the new grace period by comparing that CPU's rcu_data
structure's ->completed
field with the same CPU's
leaf rcu_node
structure's ->completed
field.
If they differ, then this CPU is not yet aware that the old grace period
has completed, an omission that it addresses by executing lines 15-23.
Lines 15-17 advance RCU callbacks, and will be dealt with
in another article devoted to the processing of RCU callbacks.
Line 18 records the number of the last completed grace period
in order to avoid executing this code twice for the same grace period.
Line 19 does tracing, and lines 20 and 21
initialize the CPU's rcu_data
structure's
->gpnum
field in case this CPU just exited a
dyntick-idle sojourn so long that the grace-period number wrapped.
Line 22 checks to see if the current grace period
is waiting on any quiescent states from the current CPU,
as might be the case if this CPU just came online.
If not, line 23 sets this CPU's rcu_data
structure's
->qs_pending
field to zero to prevent the CPU from
needlessly attempting to announce any quiescent states.
Quick Quiz 11:
In __rcu_process_gp_end()
on line 20, why bother comparing rdp->gpnum
to rdp->completed?
Why not just unconditionally set rdp->gpnum
to a sane value, for example, rnp->gpnum?
Or even rdp->completed?
Answer
The __note_new_gpnum()
function, shown on
lines 27-38, initializes this CPU for the new grace period.
Line 29 checks to see if this CPU is already aware of the new
grace period by comparing the rcu_data
structure's
->gpnum
field to that of the corresponding leaf
rcu_node
structure.
If the CPU is indeed unaware of the new grace period, it executes
lines 30-36 to carry out the needed initialization.
Line 30 records the new ->gpnum
in the CPU's
rcu_data
structure to avoid initializing twice,
while line 31 carries out tracing.
Line 32 checks to see if the current grace period needs a
quiescent state from this CPU, and if so, lines 33 and 34
set the rcu_data
structure to cause the CPU to record
the next quiescent state.
Otherwise, line 36 prevents the CPU from attempting to report
an unneeded quiescent state.
The rcu_start_gp_per_cpu()
function, shown on
lines 40-46, optimizes grace-period startup for the CPU starting
the new grace period.
Line 42 invokes __rcu_process_gp_end()
to handle
the end of the prior grace period, if needed, lines 43 and 44
advance RCU callbacks, and line 45 initializes this CPU for
the start of the new grace period.
The starting of grace periods is driven by rcu_start_gp(),
as shown below:
  1 static void
  2 rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
  3   __releases(rcu_get_root(rsp)->lock)
  4 {
  5   struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
  6   struct rcu_node *rnp = rcu_get_root(rsp);
  7
  8   if (!rcu_scheduler_fully_active ||
  9       !cpu_needs_another_gp(rsp, rdp)) {
 10     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 11     return;
 12   }
 13   if (rsp->fqs_active) {
 14     rsp->fqs_need_gp = 1;
 15     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 16     return;
 17   }
 18   rsp->gpnum++;
 19   trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
 20   WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
 21   rsp->fqs_state = RCU_GP_INIT;
 22   rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
 23   record_gp_stall_check_time(rsp);
 24   if (NUM_RCU_NODES == 1) {
 25     rcu_preempt_check_blocked_tasks(rnp);
 26     rnp->qsmask = rnp->qsmaskinit;
 27     rnp->gpnum = rsp->gpnum;
 28     rnp->completed = rsp->completed;
 29     rsp->fqs_state = RCU_SIGNAL_INIT;
 30     rcu_start_gp_per_cpu(rsp, rnp, rdp);
 31     rcu_preempt_boost_start_gp(rnp);
 32     trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
 33                                 rnp->level, rnp->grplo,
 34                                 rnp->grphi, rnp->qsmask);
 35     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 36     return;
 37   }
 38   raw_spin_unlock(&rnp->lock);
 39   raw_spin_lock(&rsp->onofflock);
 40   rcu_for_each_node_breadth_first(rsp, rnp) {
 41     raw_spin_lock(&rnp->lock);
 42     rcu_preempt_check_blocked_tasks(rnp);
 43     rnp->qsmask = rnp->qsmaskinit;
 44     rnp->gpnum = rsp->gpnum;
 45     rnp->completed = rsp->completed;
 46     if (rnp == rdp->mynode)
 47       rcu_start_gp_per_cpu(rsp, rnp, rdp);
 48     rcu_preempt_boost_start_gp(rnp);
 49     trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
 50                                 rnp->level, rnp->grplo,
 51                                 rnp->grphi, rnp->qsmask);
 52     raw_spin_unlock(&rnp->lock);
 53   }
 54   rnp = rcu_get_root(rsp);
 55   raw_spin_lock(&rnp->lock);
 56   rsp->fqs_state = RCU_SIGNAL_INIT;
 57   raw_spin_unlock(&rnp->lock);
 58   raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
 59 }
Line 3 is a
sparse
directive that states that this function
must be called with the root rcu_node
structure's
->lock
held, and that it releases it before returning.
Line 8 checks to see if the scheduler has already spawned the
first task and line 9 checks to see if this CPU needs a grace period.
If the answer to either of these is “no”,
lines 10 and 11 release the root rcu_node
structure's
->lock
and return.
Otherwise, line 13 checks to see if some other CPU is currently
attempting to force quiescent states.
If so, line 14 sets the rcu_state
structure's
->fqs_need_gp
field so that this other CPU will
start a grace period on our behalf, and lines 15 and 16
release the root rcu_node
structure's
->lock
and return.
Line 18 increments the rcu_state
structure's
->gpnum
field, which officially starts the new
grace period.
Quick Quiz 12:
But nothing can yet apply quiescent states to this new
grace period, so does it really make sense to say that it has started?
Answer
Line 19 traces the new grace period and line 20
complains if some other CPU is already starting a new grace period.
Line 21 causes concurrent attempts to force quiescent states to
hold off until we have fully initialized the rcu_node
structures for the new grace period.
Line 22 records the time (in jiffies) at which quiescent states
should be forced, assuming that the new grace period does not complete
first.
Line 23 computes and records the time at which RCU CPU stall warnings
should be printed, again assuming that the new grace period does not
complete first.
Line 24 checks to see if the tree of rcu_node
structures consists of only a single node, and if so, lines 25-36
initialize that single node.
Line 25 carries out some RCU-preempt-specific debug checks,
line 26 sets up the rcu_node
structure's
->qsmask
field to wait for quiescent states from all
online CPUs corresponding to this rcu_node
structure,
and lines 27 and 28 update to the current grace-period
number.
Line 29 re-enables forcing of quiescent states and line 30
sets up the current CPU to detect quiescent states for the current
grace period.
Line 31 schedules RCU priority boosting for RCU-preempt,
and lines 32-34 trace the initialization of this rcu_node
structure.
Finally, lines 35 and 36
release the root rcu_node
structure's
->lock
and return.
Lines 38-58 of rcu_start_gp()
are executed only
on systems where there is more than one rcu_node
structure.
Line 38 releases the root rcu_node
structure's
->lock
in order to avoid deadlock when line 39
acquires the rcu_state
structure's ->onofflock
,
which excludes changes in RCU's idea of which CPUs are online.
Quick Quiz 13:
Why not change the locking hierarchy so that we could just
hold the root rcu_node
structure's ->lock
while acquiring ->onofflock
?
That would get rid of the single-node special case, simplifying the code.
Answer
Quick Quiz 14:
That is just plain silly!
Given that CPUs are coming online and going offline anyway,
what possible sense does it make for RCU to bury its head in the
sand and ignore these CPU-hotplug events?
Answer
Each pass through the loop spanning lines 40-53 initializes
one rcu_node
structure.
Within this loop, line 41 acquires the current rcu_node
structure's ->lock
and line 52 releases it.
The intervening lines operate in the same manner as the corresponding
lines did in the single-rcu_node
case, except that
line 46 checks to make sure that the current rcu_node
structure corresponds to the current CPU before line 47
initializes this CPU's rcu_data
structure for the new
grace period.
In addition, re-enabling quiescent-state forcing is deferred to outside
of the loop.
Quick Quiz 15:
Why not abstract rcu_node
initialization?
Answer
Once all of the rcu_node
structures have been initialized,
lines 54 and 55 acquire the root rcu_node
structure's ->lock
, line 56 re-enables quiescent-state
forcing, and line 57 releases the ->lock
.
Finally, line 58 releases the rcu_state
structure's
->onofflock
.
The invoke_rcu_core()
, rcu_process_gp_end()
,
note_new_gpnum()
, and check_for_new_grace_period()
functions are used in the combining-tree algorithm that reduces
quiescent states into grace periods.
These functions are shown below:
  1 static void invoke_rcu_core(void)
  2 {
  3   raise_softirq(RCU_SOFTIRQ);
  4 }
  5 
  6 static void
  7 rcu_process_gp_end(struct rcu_state *rsp, struct rcu_data *rdp)
  8 {
  9   unsigned long flags;
 10   struct rcu_node *rnp;
 11 
 12   local_irq_save(flags);
 13   rnp = rdp->mynode;
 14   if (rdp->completed == ACCESS_ONCE(rnp->completed) ||
 15       !raw_spin_trylock(&rnp->lock)) {
 16     local_irq_restore(flags);
 17     return;
 18   }
 19   __rcu_process_gp_end(rsp, rnp, rdp);
 20   raw_spin_unlock_irqrestore(&rnp->lock, flags);
 21 }
 22 
 23 static void note_new_gpnum(struct rcu_state *rsp, struct rcu_data *rdp)
 24 {
 25   unsigned long flags;
 26   struct rcu_node *rnp;
 27 
 28   local_irq_save(flags);
 29   rnp = rdp->mynode;
 30   if (rdp->gpnum == ACCESS_ONCE(rnp->gpnum) ||
 31       !raw_spin_trylock(&rnp->lock)) {
 32     local_irq_restore(flags);
 33     return;
 34   }
 35   __note_new_gpnum(rsp, rnp, rdp);
 36   raw_spin_unlock_irqrestore(&rnp->lock, flags);
 37 }
 38 
 39 static int
 40 check_for_new_grace_period(struct rcu_state *rsp, struct rcu_data *rdp)
 41 {
 42   unsigned long flags;
 43   int ret = 0;
 44 
 45   local_irq_save(flags);
 46   if (rdp->gpnum != rsp->gpnum) {
 47     note_new_gpnum(rsp, rdp);
 48     ret = 1;
 49   }
 50   local_irq_restore(flags);
 51   return ret;
 52 }
The invoke_rcu_core()
function shown on lines 1-4
simply does a raise_softirq()
in order to cause
rcu_process_callbacks()
to be invoked in a clean environment.
The rcu_process_gp_end()
function shown on
lines 6-21 acquires the CPU's leaf rcu_node
structure's lock if it is available, and, if so, invokes
__rcu_process_gp_end()
with the lock held.
This function handles the possibility of migration from one CPU
to another by disabling irqs first and acquiring the lock second,
rather than acquiring the lock and disabling irqs in a single operation.
The note_new_gpnum()
function shown on lines 23-37
is a similar wrapper for __note_new_gpnum()
.
The check_for_new_grace_period()
function
shown on lines 39-52 checks to see if there is a new grace
period that this CPU is unaware of (line 46), and, if so,
invokes note_new_gpnum()
to become aware of it.
Line 51 returns true iff there was a new grace period.
Quiescent states are reported up the tree in stages, first via
rcu_report_qs_rdp()
, next via
rcu_report_qs_rnp()
, and finally via
rcu_report_qs_rsp()
.
  1 static void
  2 rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
  3 {
  4   unsigned long flags;
  5   unsigned long mask;
  6   struct rcu_node *rnp;
  7 
  8   rnp = rdp->mynode;
  9   raw_spin_lock_irqsave(&rnp->lock, flags);
 10   if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
 11     rdp->passed_quiesce = 0;
 12     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 13     return;
 14   }
 15   mask = rdp->grpmask;
 16   if ((rnp->qsmask & mask) == 0) {
 17     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 18   } else {
 19     rdp->qs_pending = 0;
 20     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 21     rcu_report_qs_rnp(mask, rsp, rnp, flags);
 22   }
 23 }
The rcu_report_qs_rdp()
function reports a quiescent state to
the CPU's rcu_data
structure, and must be invoked with
preemption disabled.
Line 8 obtains a pointer to the CPU's rcu_node
structure, and line 9 acquires that structure's ->lock
.
Line 10 checks to see whether the quiescent state being announced
corresponds to a now-completed grace period, and, if so, line 11
sets up the CPU to look for a new quiescent state if there is a new grace
period, and lines 12 and 13 release the leaf rcu_node
structure's ->lock
and return.
Otherwise, execution continues at line 15, which picks up the
bit corresponding to this CPU in its leaf rcu_node
structure's ->qsmask
field.
Line 16 then checks to see if a quiescent state has already been
reported for the current grace period, and, if so, line 17
releases the leaf rcu_node
structure's ->lock
.
Otherwise, line 19 clears this CPU's rcu_data
structure's ->qs_pending
field to acknowledge the
announcement (and thus preventing the CPU from attempting to announce
additional quiescent states for this grace period),
line 20 does callback handling (which is discussed elsewhere),
and line 21 invokes rcu_report_qs_rnp()
,
which releases the leaf rcu_node
structure's
->lock
, and is discussed next.
  1 static void
  2 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
  3                   struct rcu_node *rnp, unsigned long flags)
  4   __releases(rnp->lock)
  5 {
  6   struct rcu_node *rnp_c;
  7 
  8   for (;;) {
  9     if (!(rnp->qsmask & mask)) {
 10       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 11       return;
 12     }
 13     rnp->qsmask &= ~mask;
 14     trace_rcu_quiescent_state_report(rsp->name, rnp->gpnum,
 15                                      mask, rnp->qsmask, rnp->level,
 16                                      rnp->grplo, rnp->grphi,
 17                                      !!rnp->gp_tasks);
 18     if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
 19       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 20       return;
 21     }
 22     mask = rnp->grpmask;
 23     if (rnp->parent == NULL) {
 24       break;
 25     }
 26     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 27     rnp_c = rnp;
 28     rnp = rnp->parent;
 29     raw_spin_lock_irqsave(&rnp->lock, flags);
 30     WARN_ON_ONCE(rnp_c->qsmask);
 31   }
 32   rcu_report_qs_rsp(rsp, flags);
 33 }
The rcu_report_qs_rnp()
function reports the quiescent state up
the rcu_node
tree.
Each pass through the loop spanning lines 8-31 handles one level
of this tree.
Line 9 checks to see if this quiescent state has already been
reported, and, if so, line 10 releases the current
rcu_node
structure's ->lock
field
and line 11 returns.
Quick Quiz 16:
But this is redundant with the check in
rcu_report_qs_rdp()
!
Why not remove one of these checks?
Answer
Line 13 clears the specified bit in the current
rcu_node
structure's ->qsmask
field
and lines 14-17 trace this action.
If line 18 finds that there are more quiescent states required
(“rnp->qsmask != 0”) or there are tasks blocked in RCU
read-side critical sections that are blocking the current grace period,
then line 19 releases the current
rcu_node
structure's ->lock
and line 20
returns.
Otherwise, line 22 picks up the bit corresponding to
the current rcu_node
structure in its parent's
->qsmask
field.
Line 23 checks to see if the current rcu_node
structure
has a parent, and, if not (in other words, the current rcu_node
structure is the root), line 24 exits the loop.
Otherwise, line 26 releases the current
rcu_node
structure's ->lock
.
Line 27 retains a pointer to the current rcu_node
structure for diagnostic purposes, and line 28 advances up the
tree to the parent rcu_node
structure, whose
->lock
line 29 acquires.
Line 30 complains if the previous (now child) rcu_node
structure is still waiting for any quiescent states.
Quick Quiz 17:
But we released the previous rcu_node
structure's
->lock
, so why couldn't a new grace period have started?
This might well result in some of the child rcu_node
structure's ->qsmask
bits being set, wouldn't it?
Answer
Line 32 is reached if the root rcu_node
structure shows that all needed quiescent states have been reported for
the current grace period, in which case rcu_report_qs_rsp()
is invoked to end the grace period.
This function is shown below.
  1 static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
  2   __releases(rcu_get_root(rsp)->lock)
  3 {
  4   unsigned long gp_duration;
  5   struct rcu_node *rnp = rcu_get_root(rsp);
  6   struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
  7 
  8   WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
  9   smp_mb();
 10   gp_duration = jiffies - rsp->gp_start;
 11   if (gp_duration > rsp->gp_max)
 12     rsp->gp_max = gp_duration;
 13   if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
 14     raw_spin_unlock(&rnp->lock);
 15     rcu_for_each_node_breadth_first(rsp, rnp) {
 16       raw_spin_lock(&rnp->lock);
 17       rnp->completed = rsp->gpnum;
 18       raw_spin_unlock(&rnp->lock);
 19     }
 20     rnp = rcu_get_root(rsp);
 21     raw_spin_lock(&rnp->lock);
 22   }
 23   rsp->completed = rsp->gpnum;
 24   trace_rcu_grace_period(rsp->name, rsp->completed, "end");
 25   rsp->fqs_state = RCU_GP_IDLE;
 26   rcu_start_gp(rsp, flags);
 27 }
The rcu_report_qs_rsp()
function announces the full
set of quiescent states to the rcu_state
structure, thus
ending the grace period—and possibly starting another one.
Line 8 complains bitterly if there is no grace period in
progress, while line 9 preserves ordering in order to ensure
that all grace-period and pre-grace-period activity is seen by all
CPUs to precede the assignments to the various ->completed
fields that mark the end of this grace period.
Lines 11 and 12 accumulate the maximum grace-period
duration for tracing and diagnostic purposes.
Line 13 checks to see if the current CPU needs a new grace
period, and if not, lines 14-21 update the ->completed
fields in all the rcu_node
structures, momentarily
releasing the root rcu_node
structure's ->lock
in order to avoid deadlock.
Quick Quiz 18:
Why open-code cpu_needs_another_gp()
on
line 13 of rcu_report_qs_rsp()
?
Answer
Line 23 updates the rcu_state
structure's
->completed
field, thus officially marking the end
of the old grace period.
Line 24 traces the end of the old grace period, line 25
sets ->fqs_state
to the idle state,
and finally line 26 invokes rcu_start_gp()
to start a new grace period if warranted.
The RCU core processing drives the state machine that is RCU.
It is initiated from softirq context, and is typically started when
rcu_check_callbacks()
notices, via the return value of the rcu_pending()
function (described in the next section), that something needs
to be done.
return value, which is described in the next section.
Although RCU core processing is initiated from softirq, the actual
grace-period initialization, quiescent-state forcing, and
grace-period cleanup are run from a kthread, which is described
later in this section.
In the meantime, here are the RCU core functions,
rcu_check_quiescent_state()
,
__rcu_process_callbacks()
,
rcu_preempt_process_callbacks()
, and
rcu_process_callbacks()
:
  1 static void
  2 rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
  3 {
  4   if (check_for_new_grace_period(rsp, rdp))
  5     return;
  6   if (!rdp->qs_pending)
  7     return;
  8   if (!rdp->passed_quiesce)
  9     return;
 10   rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesce_gpnum);
 11 }
 12 
 13 static void
 14 __rcu_process_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
 15 {
 16   unsigned long flags;
 17 
 18   WARN_ON_ONCE(rdp->beenonline == 0);
 19   if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
 20     force_quiescent_state(rsp, 1);
 21   rcu_process_gp_end(rsp, rdp);
 22   rcu_check_quiescent_state(rsp, rdp);
 23   if (cpu_needs_another_gp(rsp, rdp)) {
 24     raw_spin_lock_irqsave(&rcu_get_root(rsp)->lock, flags);
 25     rcu_start_gp(rsp, flags);
 26   }
 27   if (cpu_has_callbacks_ready_to_invoke(rdp))
 28     invoke_rcu_callbacks(rsp, rdp);
 29 }
 30 
 31 static void rcu_preempt_process_callbacks(void)
 32 {
 33   __rcu_process_callbacks(&rcu_preempt_state,
 34                           &__get_cpu_var(rcu_preempt_data));
 35 }
 36 
 37 static void rcu_process_callbacks(struct softirq_action *unused)
 38 {
 39   trace_rcu_utilization("Start RCU core");
 40   __rcu_process_callbacks(&rcu_sched_state,
 41                           &__get_cpu_var(rcu_sched_data));
 42   __rcu_process_callbacks(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));
 43   rcu_preempt_process_callbacks();
 44   trace_rcu_utilization("End RCU core");
 45 }
The rcu_check_quiescent_state()
function shown on
lines 1-11
checks to see if a quiescent state has been recorded in the specified
rcu_data
structure, and, if so, invokes
rcu_report_qs_rdp()
to announce it to the rcu_node
tree.
It must run on the CPU corresponding to the specified rcu_data
structure.
Line 4 invokes check_for_new_grace_period()
to check to see if there is a new grace period that this CPU is unaware
of, and if so, line 5 returns.
Quick Quiz 19:
Why does line 5 of rcu_check_quiescent_state()
just return?
Isn't that giving up a chance to report a quiescent state?
Answer
Line 6 checks to see if a quiescent state is needed from
the current CPU, and if not, line 7 returns.
Line 8 then checks to see if this CPU has passed through a
quiescent state, and if not, line 9 returns.
Otherwise, a quiescent state is needed from this CPU and the CPU
has recently passed through a quiescent state, so
line 10 invokes rcu_report_qs_rdp()
to
report this quiescent state to the rcu_node
tree.
Quick Quiz 20:
Why not check rdp->passed_quiesce_gpnum
right in rcu_check_quiescent_state()
rather
than incurring the extra function-call overhead (and added
argument) passing it in to rcu_report_qs_rdp()
?
Answer
The __rcu_process_callbacks()
function,
shown on lines 13-29, conducts one pass of the RCU core
state machine for a given flavor of RCU.
Line 18 complains if RCU does not believe that the current
CPU is online.
Line 19 checks to see if the current grace period has gone on
long enough that it is now time to force quiescent states,
and if so, line 20 attempts the forcing.
Line 21 checks to see if this CPU's idea of the current grace
period has ended, and line 22 checks to see if this CPU has
passed through some quiescent states that need to be reported
up the rcu_node
tree.
Line 23 checks to see if this CPU needs another RCU grace
period and RCU is idle, in which case line 24 acquires the
root rcu_node
structure's ->lock
and line 25 invokes rcu_start_gp()
to start
a new grace period.
Quick Quiz 21:
But __rcu_process_callbacks()
fails to
release the root rcu_node
structure's ->lock
!
Won't that result in deadlock?
Answer
Finally, line 27 checks to see if this CPU has any RCU
callbacks whose grace period has ended, and, if so, line 28
calls invoke_rcu_callbacks()
to invoke them.
The rcu_preempt_process_callbacks()
function shown
on lines 31-35 is a wrapper around __rcu_process_callbacks()
that does RCU-core processing for RCU-preempt.
If there is no RCU-preempt in the kernel, for example, for
kernels built with CONFIG_PREEMPT=n
, then
rcu_preempt_process_callbacks()
is an empty function.
The rcu_process_callbacks()
function shown on
lines 37-45 is a wrapper function that invokes
__rcu_process_callbacks()
for each flavor of RCU
configured into the kernel.
Lines 39 and 44 trace the start and end of RCU core
processing,
while lines 40 and 41 do RCU core processing for RCU-sched
and line 42 does RCU core processing for RCU-bh.
Finally, line 43 invokes the rcu_preempt_process_callbacks()
function described above in order to do RCU core processing for
RCU-preempt, but only if it is configured into the kernel.
The rcu_gp_init() function carries out grace-period initialization and
the rcu_gp_cleanup() function carries out grace-period cleanup, both
on behalf of the grace-period kthread.
The main function for the grace-period kthread is
rcu_gp_kthread()
, shown below:
  1 static int __noreturn rcu_gp_kthread(void *arg)
  2 {
  3   int fqs_state;
  4   unsigned long j;
  5   int ret;
  6   struct rcu_state *rsp = arg;
  7   struct rcu_node *rnp = rcu_get_root(rsp);
  8 
  9   for (;;) {
 10     for (;;) {
 11       wait_event_interruptible(rsp->gp_wq,
 12                                rsp->gp_flags &
 13                                RCU_GP_FLAG_INIT);
 14       if ((rsp->gp_flags & RCU_GP_FLAG_INIT) &&
 15           rcu_gp_init(rsp))
 16         break;
 17       cond_resched();
 18       flush_signals(current);
 19     }
 20     fqs_state = RCU_SAVE_DYNTICK;
 21     j = jiffies_till_first_fqs;
 22     if (j > HZ) {
 23       j = HZ;
 24       jiffies_till_first_fqs = HZ;
 25     }
 26     for (;;) {
 27       rsp->jiffies_force_qs = jiffies + j;
 28       ret = wait_event_interruptible_timeout(rsp->gp_wq,
 29           (rsp->gp_flags & RCU_GP_FLAG_FQS) ||
 30           (!ACCESS_ONCE(rnp->qsmask) &&
 31            !rcu_preempt_blocked_readers_cgp(rnp)),
 32           j);
 33       if (!ACCESS_ONCE(rnp->qsmask) &&
 34           !rcu_preempt_blocked_readers_cgp(rnp))
 35         break;
 36       if (ret == 0 || (rsp->gp_flags & RCU_GP_FLAG_FQS)) {
 37         fqs_state = rcu_gp_fqs(rsp, fqs_state);
 38         cond_resched();
 39       } else {
 40         cond_resched();
 41         flush_signals(current);
 42       }
 43       j = jiffies_till_next_fqs;
 44       if (j > HZ) {
 45         j = HZ;
 46         jiffies_till_next_fqs = HZ;
 47       } else if (j < 1) {
 48         j = 1;
 49         jiffies_till_next_fqs = 1;
 50       }
 51     }
 52     rcu_gp_cleanup(rsp);
 53   }
 54 }
The __rcu_pending()
,
rcu_preempt_pending()
, and
rcu_pending()
functions check to see whether there
is any RCU core work needed on the part of the calling CPU:
  1 static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
  2 {
  3   struct rcu_node *rnp = rdp->mynode;
  4 
  5   rdp->n_rcu_pending++;
  6   check_cpu_stall(rsp, rdp);
  7   if (rcu_scheduler_fully_active &&
  8       rdp->qs_pending && !rdp->passed_quiesce) {
  9     rdp->n_rp_qs_pending++;
 10     if (!rdp->preemptible &&
 11         ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs) - 1,
 12                      jiffies))
 13       set_need_resched();
 14   } else if (rdp->qs_pending && rdp->passed_quiesce) {
 15     rdp->n_rp_report_qs++;
 16     return 1;
 17   }
 18   if (cpu_has_callbacks_ready_to_invoke(rdp)) {
 19     rdp->n_rp_cb_ready++;
 20     return 1;
 21   }
 22   if (cpu_needs_another_gp(rsp, rdp)) {
 23     rdp->n_rp_cpu_needs_gp++;
 24     return 1;
 25   }
 26   if (ACCESS_ONCE(rnp->completed) != rdp->completed) {
 27     rdp->n_rp_gp_completed++;
 28     return 1;
 29   }
 30   if (ACCESS_ONCE(rnp->gpnum) != rdp->gpnum) {
 31     rdp->n_rp_gp_started++;
 32     return 1;
 33   }
 34   if (rcu_gp_in_progress(rsp) &&
 35       ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
 36     rdp->n_rp_need_fqs++;
 37     return 1;
 38   }
 39   rdp->n_rp_need_nothing++;
 40   return 0;
 41 }
 42 
 43 static int rcu_preempt_pending(int cpu)
 44 {
 45   return __rcu_pending(&rcu_preempt_state,
 46                        &per_cpu(rcu_preempt_data, cpu));
 47 }
 48 
 49 static int rcu_pending(int cpu)
 50 {
 51   return __rcu_pending(&rcu_sched_state, &per_cpu(rcu_sched_data, cpu)) ||
 52          __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu)) ||
 53          rcu_preempt_pending(cpu);
 54 }
The __rcu_pending()
function shown on lines 1-41
determines whether the specified flavor of RCU needs any immediate
work on the part of the CPU corresponding to the specified
rcu_data
structure.
Line 5 counts the calls to __rcu_pending()
for
diagnostic and tracing purposes, while line 6 issues CPU
stall warnings if warranted (which will be discussed in depth in
another article in this series).
Lines 7 and 8 check to see if the current grace
period needs a quiescent state from the current CPU.
If so, line 9 counts this event and lines 10-12 check to see
if this is some RCU flavor other than RCU-preempt for which this CPU
is about to invoke the wrath of force_quiescent_state(), and if so,
line 13 pokes the scheduler in an attempt to make a quiescent state
happen.
Otherwise, if the current grace period does not need a quiescent state
from the current CPU, line 14 checks to see if the current CPU
recently passed through a quiescent state that has not yet been reported
up the rcu_node
tree.
In this case, line 15 counts the event and line 16 tells
the caller that this CPU has core-RCU work to do.
Line 18 checks to see if this CPU has RCU callbacks whose grace period has expired, and if so, line 19 counts the event and line 20 tells the caller that this CPU has core-RCU work to do. Lines 22-25 operate similarly if there is no grace period in progress and this CPU has callbacks queued that need one, lines 26-29 operate similarly if the current CPU is not yet aware that a grace period has ended, lines 30-33 operate similarly if the current CPU is not yet aware that a grace period has started, and finally lines 34-38 operate similarly if a grace period has extended long enough that quiescent-state forcing is warranted.
Execution reaches line 39 if the RCU core needs nothing from the current CPU. This event is also counted, and line 40 informs the caller.
The rcu_preempt_pending()
function, shown on
lines 43-47, invokes __rcu_pending()
to see if RCU-preempt core processing needs something from the
current CPU.
If RCU-preempt is not configured into the kernel, this function
simply unconditionally returns zero.
After all, that which is not there needs nothing.
Usually, anyway.
Finally, the rcu_pending()
function shown on
lines 49-54
checks all RCU flavors, returning true if any of them require anything
from the current CPU.
Normally, call_rcu()
simply enqueues a callback and
returns.
However, there are some rather nasty code sequences that a user process
can execute that generate very large quantities of callbacks, for
example, close(open("/dev/null", O_RDONLY))
in a tight loop.
Such code might well be pointless, but the kernel must nevertheless handle
it gracefully.
Therefore, __call_rcu()
checks for this sort of condition
and undertakes RCU code work as needed to avert disaster.
  1 static void
  2 __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
  3            struct rcu_state *rsp)
  4 {
  5   unsigned long flags;
  6   struct rcu_data *rdp;
  7 
  8   debug_rcu_head_queue(head);
  9   head->func = func;
 10   head->next = NULL;
 11   smp_mb();
 12   local_irq_save(flags);
 13   rdp = this_cpu_ptr(rsp->rda);
 14   *rdp->nxttail[RCU_NEXT_TAIL] = head;
 15   rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
 16   rdp->qlen++;
 17   if (__is_kfree_rcu_offset((unsigned long)func))
 18     trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func,
 19                              rdp->qlen);
 20   else
 21     trace_rcu_callback(rsp->name, head, rdp->qlen);
 22   if (irqs_disabled_flags(flags)) {
 23     local_irq_restore(flags);
 24     return;
 25   }
 26   if (unlikely(rdp->qlen > rdp->qlen_last_fqs_check + qhimark)) {
 27     rcu_process_gp_end(rsp, rdp);
 28     check_for_new_grace_period(rsp, rdp);
 29     if (!rcu_gp_in_progress(rsp)) {
 30       unsigned long nestflag;
 31       struct rcu_node *rnp_root = rcu_get_root(rsp);
 32 
 33       raw_spin_lock_irqsave(&rnp_root->lock, nestflag);
 34       rcu_start_gp(rsp, nestflag);
 35     } else {
 36       rdp->blimit = LONG_MAX;
 37       if (rsp->n_force_qs == rdp->n_force_qs_snap &&
 38           *rdp->nxttail[RCU_DONE_TAIL] != head)
 39         force_quiescent_state(rsp, 0);
 40       rdp->n_force_qs_snap = rsp->n_force_qs;
 41       rdp->qlen_last_fqs_check = rdp->qlen;
 42     }
 43   } else if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
 44     force_quiescent_state(rsp, 1);
 45   local_irq_restore(flags);
 46 }
Lines 8-21 deal with callback handling and will therefore
be addressed in another article in this series.
Line 22 checks to see if interrupts were disabled upon entry
to __call_rcu()
, in which case it might not be safe
to invoke the RCU core due to potential deadlock situations.
In this case, line 23 restores interrupts and line 24
returns.
Line 26 checks to see whether too many RCU callbacks
(“too many” defaults to 10,000) have been
enqueued since the last time __call_rcu()
undertook RCU-core
processing for the current CPU and RCU flavor, and if so,
lines 27-42 do the core processing.
Line 27 checks for the end of an old grace period, and
line 28 checks for the beginning of a new grace period, but
if line 29 finds that there is no grace period in progress
despite there being tens of thousands of callbacks being queued
on this CPU, then lines 30-34 start a new grace period.
Otherwise, an RCU grace period is in progress, so lines 36-42
attempt to accelerate it.
Line 36 increases this CPU's rcu_data
structure's
->blimit
in order to avoid throttling callback
invocation.
If line 37 sees that quiescent states have not been forced
recently and if there are RCU callbacks enqueued on this CPU
that need another grace period (other than the callback we just
now enqueued), then line 39 forces quiescent states vigorously.
Lines 40-41 retrigger the check for forcing yet more
quiescent states, just in case 10,000 additional RCU callbacks
are posted soon after we return.
If this rcu_data
structure's RCU callback queue
is not excessively long, then line 43 checks to see if it
is time to force quiescent states, and, if so, line 44
does the required forcing.
In either case, line 45 re-enables interrupts in preparation for returning to the caller.
Preemptible RCU must report a quiescent state when a task that
blocked in an RCU-preempt read-side critical section completes that
critical section.
The rcu_preempt_blocked_readers_cgp()
,
rcu_report_unblock_qs_rnp()
, and
rcu_preempt_check_blocked_tasks()
functions handle this task.
  1 static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
  2 {
  3   return rnp->gp_tasks != NULL;
  4 }
  5 
  6 static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
  7 {
  8   WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp));
  9   if (!list_empty(&rnp->blkd_tasks))
 10     rnp->gp_tasks = rnp->blkd_tasks.next;
 11   WARN_ON_ONCE(rnp->qsmask);
 12 }
 13 
 14 static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
 15   __releases(rnp->lock)
 16 {
 17   unsigned long mask;
 18   struct rcu_node *rnp_p;
 19 
 20   if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
 21     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 22     return;
 23   }
 24   rnp_p = rnp->parent;
 25   if (rnp_p == NULL) {
 26     rcu_report_qs_rsp(&rcu_preempt_state, flags);
 27     return;
 28   }
 29   mask = rnp->grpmask;
 30   raw_spin_unlock(&rnp->lock);
 31   raw_spin_lock(&rnp_p->lock);
 32   rcu_report_qs_rnp(mask, &rcu_preempt_state, rnp_p, flags);
 33 }
The rcu_preempt_blocked_readers_cgp()
function
shown on lines 1-4 checks to see if there are any RCU
readers holding up the current grace period that have blocked at
least once during their current RCU read-side critical section,
and that have therefore been queued on the specified rcu_node
structure.
The actual check on line 3 is far simpler than the description:
If the specified rcu_node
structure's
->gp_tasks
field is non-NULL
, then there
is at least one such reader.
The rcu_preempt_check_blocked_tasks()
function
shown on lines 6-12 handles special RCU-preempt processing
at grace-period start.
Line 8 complains if there are readers listed as blocking the new grace
period despite the fact that this grace period has not really started
yet.
Line 9 checks to see if there are any tasks that have blocked
at least once within their current RCU read-side critical sections,
and if so, line 10 marks them all as blocking the new grace
period.
Finally, line 11 complains if anything else is marked as blocking
the new grace period despite initialization having not really started yet.
Quick Quiz 22:
How can you be sure that all the tasks that blocked at least
once in their current RCU read-side critical sections need to block
the new grace period?
Answer
The rcu_report_unblock_qs_rnp()
function
shown on lines 14-33 reports a quiescent state up the
rcu_node
hierarchy when the last task that blocked in
its current RCU read-side critical section exits that critical
section.
Line 20 checks to see if any CPUs or tasks are still blocking
the current grace period, and if so, line 21 releases the
rcu_node structure's ->lock (restoring interrupts)
and line 22 returns.
Line 24 advances up the rcu_node
hierarchy, but
if the current rcu_node
structure is the root,
then line 26 invokes rcu_report_qs_rsp()
to end
the current grace period and line 27 returns.
Otherwise, there is a parent node.
In this case, line 29 obtains the mask containing the bit
corresponding to the current rcu_node
structure
in its parent's ->qsmask
field.
Line 30 releases the current rcu_node
structure's
->lock
and line 31 acquires the parent's
->lock
.
Finally, line 32 invokes rcu_report_qs_rnp()
to propagate the quiescent state up the rcu_node
hierarchy.
Quick Quiz 23:
How can we be sure that the old grace period won't end
before we can report the quiescent state?
In other words, why don't we need to record the grace-period
number and check it, as is done in rcu_report_qs_rdp()
?
Answer
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: Is there a type of quiescent state other than CPU hotplug events that is not handled on a per-CPU basis?
Answer:
Yes. RCU-preempt's outermost rcu_read_unlock()
signals entry into a quiescent state, but is handled on a per-task
basis rather than on a per-CPU basis.
This is for performance reasons: rcu_read_unlock()
can
be invoked extremely frequently, and processing it on a per-CPU
basis would require disabling preemption.
By handling this type of quiescent state on a per-task basis, we
avoid the overhead of disabling and re-enabling preemption.
(We are looking to go further and to also avoid the function-call
overhead currently incurred by RCU-preempt's rcu_read_lock()
and rcu_read_unlock()
implementations, but that is
a job for another day.)
Quick Quiz 2:
But doesn't this mean that call_rcu()
is not
wait free?
Answer:
This safety feature does indeed cost call_rcu()
any
pretense of unconditional wait-freedom, but system survivability trumps
academic purity any day of the week.
I am after all a developer, not a researcher!
Quick Quiz 3:
How does a given non-root rcu_node
structure
know which of its parent's ->qsmask
bits to clear?
Answer:
Each rcu_node
structure has a
->grpmask
mask with a single bit set that
corresponds to this rcu_node
structure's
bit in its parent's ->qsmask
field.
Quick Quiz 4:
What happens with the blkd_tasks
list and
the gp_tasks
pointer for RCU-bh and RCU-sched?
Answer:
Nothing happens with them. In the rcu_node structures handling RCU-bh and RCU-sched, the ->blkd_tasks lists remain empty and the ->gp_tasks pointers remain NULL.
Quick Quiz 5:
Why bother with the ->passed_quiesce_gpnum field? Given that the grace period cannot end until each CPU passes through a quiescent state, the grace-period number cannot change, so what is the point of tracking it?
Answer:
Unfortunately, the grace-period number can in fact change
between the time that a CPU passes through a quiescent state and
the time that it gets around to announcing this to RCU.
The reason for this is that dyntick-idle and CPU-offline events can
cause other CPUs to announce on behalf of this CPU.
If the other CPU announces before this CPU gets around to it, the
grace-period number might well have changed.
Therefore, the ->passed_quiesce_gpnum field must be meticulously checked in order to avoid erroneously announcing a quiescent state from some past grace period.
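The check amounts to discarding any recorded quiescent state whose grace-period number no longer matches the current one. A minimal sketch, using a hypothetical simplified struct that mirrors only the fields discussed above:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified per-CPU state: ->gpnum is the grace period
 * this CPU believes is current, and ->passed_quiesce_gpnum records
 * which grace period the recorded quiescent state belongs to. */
struct rcu_data_sketch {
    unsigned long gpnum;
    unsigned long passed_quiesce_gpnum;
    bool passed_quiesce;
};

/* A recorded quiescent state may be reported only if it applies to
 * the current grace period; a stale one must be discarded, because
 * dyntick-idle or CPU-offline events may have let another CPU report
 * on our behalf, ending the old grace period and starting a new one. */
static bool qs_still_valid(const struct rcu_data_sketch *rdp)
{
    return rdp->passed_quiesce &&
           rdp->passed_quiesce_gpnum == rdp->gpnum;
}
```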
Quick Quiz 6: What is the exceptional case where a CPU can assume that its callbacks need only wait until the end of the next grace period, despite that CPU being aware that the prior grace period has ended?
Answer: The exception is the CPU that starts the next grace period.
Quick Quiz 7: Why not just immediately announce the quiescent state to the RCU core? Wouldn't that be far simpler and faster?
Answer:
It might be simpler, but it would be very unlikely to be faster.
In most workloads, there are far more quiescent states than grace periods,
so it makes sense to optimize the performance of quiescent states.
Announcing a quiescent state to the RCU core requires acquiring the corresponding rcu_node structure's ->lock, which is not acceptable on the context-switch fastpath. We therefore have the quiescent states interact with the rcu_data structure, and announce to the RCU core only once per grace period per CPU.
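The fast-path/slow-path split described above can be sketched as follows. This is a userspace caricature with hypothetical field names, not the kernel's code: the point is only that the per-context-switch path touches a per-CPU flag with no locking, while the lock-acquiring report runs at most once per grace period per CPU.

```c
#include <assert.h>
#include <stdbool.h>

struct percpu_qs {
    bool passed_quiesce;   /* cheap per-CPU flag, no lock needed */
    bool reported_to_core; /* heavyweight report done this GP? */
    int  core_reports;     /* counts lock-acquiring reports */
};

/* Fast path: invoked on every context switch. */
static void note_context_switch(struct percpu_qs *q)
{
    q->passed_quiesce = true;
}

/* Slow path: in the kernel this acquires the rcu_node ->lock (elided
 * here), so it is arranged to run at most once per grace period. */
static void report_qs_to_core(struct percpu_qs *q)
{
    if (q->passed_quiesce && !q->reported_to_core) {
        q->reported_to_core = true;
        q->core_reports++;
    }
}
```

Even with thousands of context switches during a grace period, the expensive report happens once, which is why it makes sense to optimize the quiescent-state side rather than the grace-period side.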
Quick Quiz 8: Why are the positions of the “.” and the “?” in the diagram reversed? After all, CPU 1 has announced a quiescent state to the RCU core and CPU 0 has not yet done so.
Answer: The “qsm.?” represents a bit mask, so that CPU 0 corresponds to the low-order bit. Yes, this can be confusing, but it is several thousand years too late to advocate for little-endian representation of numerical values. Besides which, little-endian representation would probably simply change the nature of the confusion, not eliminate it. In this case, the confusion would simply move from within the diagram to between the diagram and the code. Having the confusion within the diagram makes it more obvious, thus reducing the chance that people will inject bugs into RCU due to their failure to recognize that they are confused. You might as well face the fact that life is inherently confusing. The sooner you reconcile yourself to that fact, the better off you will be.
Quick Quiz 9: Why not make CPU 0 announce its own quiescent states? Wouldn't that simplify things by eliminating a class of race conditions?
Answer:
This would indeed eliminate a class of race conditions, but
it would unfortunately also sharply limit power savings.
RCU therefore accepts the race conditions (which are mediated straightforwardly by the ->lock field in the rcu_node structure) in order to allow dyntick-idle CPUs to remain in deeper sleep states for longer periods of time.
Quick Quiz 10:
What is the purpose of the ACCESS_ONCE() calls?
Answer:
ACCESS_ONCE() simply returns its argument, but uses volatile casts in order to prevent the compiler from refetching its argument (as it might in cases of register pressure) or from combining successive accesses to the same variable. This is unnecessary if rcu_gp_in_progress() is invoked with the root rcu_node structure's lock held, but is required otherwise. However, the performance impact is too small to justify multiple versions of rcu_gp_in_progress(), so we instead have a single version that conservatively uses ACCESS_ONCE() even when called from code paths where this is unnecessary.
Quick Quiz 11:
On line 20 of __rcu_process_gp_end(), why bother comparing rdp->gpnum to rdp->completed? Why not just unconditionally set rdp->gpnum to a sane value, for example, rnp->gpnum? Or even rdp->completed?
Answer:
Unconditionally setting rdp->gpnum to rnp->gpnum could cause the CPU to fail to initialize for the new grace period, which could result in the grace period failing to ever complete. Unconditionally setting rdp->gpnum to rdp->completed suffers from a more subtle failure mode. It turns out that there are sequences of events that can result in a given CPU becoming aware of a new grace period before realizing that the old one ended. Unconditionally updating rdp->gpnum could therefore cause the CPU to forget that it had already noted the new grace period.
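The guarded update can be sketched as follows, using a hypothetical simplified struct carrying just the two counters discussed: ->gpnum is advanced only when it has fallen behind ->completed, so a CPU that already noticed a newer grace period is left alone.

```c
#include <assert.h>

/* Hypothetical simplified per-CPU state. */
struct gp_state {
    unsigned long gpnum;     /* highest GP this CPU has noticed */
    unsigned long completed; /* highest GP this CPU knows has ended */
};

/* Sketch of the grace-period-end processing discussed above:
 * rnp_completed is the rcu_node structure's view of the last
 * completed grace period. */
static void process_gp_end(struct gp_state *rdp, unsigned long rnp_completed)
{
    if (rdp->completed != rnp_completed) {
        rdp->completed = rnp_completed;
        /* Advance ->gpnum only if it fell behind: this CPU may have
         * learned of a new grace period before learning that the old
         * one ended, and must not forget that newer grace period. */
        if (rdp->gpnum < rdp->completed)
            rdp->gpnum = rdp->completed;
    }
}
```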
Quick Quiz 12: But nothing can yet apply quiescent states to this new grace period, so does it really make sense to say that it has started?
Answer: You might well argue that grace periods start at any number of places, including the following, in rough order of increasing time:
1. rcu_start_gp(), at the lines that made the final decision to start a new grace period.
2. rcu_start_gp(), where it increments the rcu_state structure's ->gpnum field.
3. rcu_start_gp(), where it communicates the new grace period down to the rcu_node structures.
4. __note_new_gpnum(), which sets up a given CPU's rcu_data structure to detect quiescent states for the new grace period.
So, which is it? The only reasonable answer is “if you have to ask, you are giving up all hope of constructing a production-quality RCU implementation.” In short, it is best to construct RCU so that it simply doesn't care. More generally, avoiding caring too much about exactly when things start and stop is a good parallel-programming design principle.
Quick Quiz 13:
Why not change the locking hierarchy so that we could just hold the root rcu_node structure's ->lock while acquiring ->onofflock? That would get rid of the single-node special case, simplifying the code.
Answer:
The problem is that the CPU-hotplug code path requires us to acquire the root rcu_node structure's ->lock while holding that of a leaf rcu_node structure. We therefore absolutely must drop the root rcu_node structure's lock before acquiring that of a leaf rcu_node structure. One way to combine those two code paths while still avoiding deadlock would be to omit the single-node optimization entirely. Now that leaf rcu_node structures handle at most 16 CPUs (rather than 32 or 64), this approach might make some sense. However, the conditional is set up so that gcc should be able to sort things out at compile time, so there is no runtime penalty for the check.
Quick Quiz 14: That is just plain silly! Given that CPUs are coming online and going offline anyway, what possible sense does it make for RCU to bury its head in the sand and ignore these CPU-hotplug events?
Answer:
It is not a matter of ignoring the events, but rather of keeping RCU's state space down to a dull roar. The CPU-hotplug events will update state as soon as we release the rcu_state structure's ->onofflock at the end of rcu_start_gp(), and will apply their changes to the rcu_node tree at that time. In the meantime, holding off those changes allows a much simpler implementation of rcu_start_gp().
Quick Quiz 15:
Why not abstract rcu_node initialization?
Answer: Because I just now noticed that it might be a good idea to do so. But are there other options?
Quick Quiz 16:
But this is redundant with the check in rcu_report_qs_rdp()! Why not remove one of these checks?
Answer:
Unfortunately, rcu_report_qs_rdp() needs to keep its check in order to properly update the rcu_data structure, and rcu_report_qs_rnp() needs to keep its check due to its being called from functions other than rcu_report_qs_rdp().
Quick Quiz 17:
But we released the previous rcu_node structure's ->lock, so why couldn't a new grace period have started? This might well result in some of the child rcu_node structure's ->qsmask bits being set, wouldn't it?
Answer:
This cannot happen because the next grace period cannot start until after the current grace period ends, and the current grace period cannot end until all the quiescent states are reported up the rcu_node tree. One such quiescent state is currently being reported by the current CPU, so until this CPU finishes, there can be no new grace period and thus no bits set in the child rcu_node structure's ->qsmask field.
Quick Quiz 18:
Why open-code cpu_needs_another_gp() on line 13 of rcu_report_qs_rsp()?
Answer:
Because cpu_needs_another_gp() fails if there is already a grace period in progress, which there currently is. The obvious way of avoiding this problem would be to move the assignment to rsp->completed on line 23 up to precede line 13, but this would allow some other CPU to be starting a new grace period while the current CPU is marking the old grace period as being completed, which is at best unclean. So the explicit check on line 13 really is necessary.
Quick Quiz 19:
Why does line 5 of rcu_check_quiescent_state() just return? Isn't that giving up a chance to report a quiescent state?
Answer:
Because the CPU just now learned of the grace period, there is no way that it can have already passed through a quiescent state for this new grace period. To see this, take a look back at the implementation of check_for_new_grace_period() and then see what would happen if the remainder of rcu_check_quiescent_state() were executed in that state.
One exception to this is RCU-preempt, which could in principle check to see if the current CPU was in an RCU read-side critical section. This is a potential future optimization, but a low-priority one.
Quick Quiz 20:
Why not check rdp->passed_quiesce_gpnum right in rcu_check_quiescent_state() rather than incurring the extra function-call overhead (and the added argument) of passing it in to rcu_report_qs_rdp()?
Answer:
Because we must hold this CPU's leaf rcu_node structure's ->lock in order to safely carry out the needed comparison.
Quick Quiz 21:
But __rcu_process_callbacks() fails to release the root rcu_node structure's ->lock! Won't that result in deadlock?
Answer:
No deadlocks will result because rcu_start_gp() releases that lock.
Quick Quiz 22: How can you be sure that all the tasks that blocked at least once in their current RCU read-side critical sections need to block the new grace period?
Answer: By definition, all RCU read-side critical sections that start before a given grace period must complete before that grace period can be allowed to complete. Therefore, all such critical sections must block the new grace period.
Quick Quiz 23:
How can we be sure that the old grace period won't end before we can report the quiescent state? In other words, why don't we need to record the grace-period number and check it, as is done in rcu_report_qs_rdp()?
Answer:
Unlike the CPU quiescent states handled by rcu_report_qs_rdp(), there is no possibility of some other CPU or task reporting a quiescent state on behalf of the current task. Instead, the task must remove itself from the ->blkd_tasks list and report its own quiescent state. Because this code path has verified that this task was the last thing holding up the current grace period, there is no possibility of a new grace period starting before this task completes reporting its quiescent state.