August 1, 2012 (v3.5+)
This article was contributed by Paul E. McKenney
This article describes RCU's quiescent-state forcing, followed of course by the inescapable answers to the quick quizzes.
If a grace period extends too long, RCU attempts to force the detection of quiescent states. This activity is repeated periodically until the grace period completes. Each instance of quiescent-state forcing looks for the following situations:

1. CPUs that are currently in dyntick-idle mode, which may be credited with an implicit quiescent state.

2. CPUs that are currently offline, which may also be credited with an implicit quiescent state.
Quick Quiz 1:
But if the CPU is offline, why should RCU even be expecting
a quiescent state from it?
Answer
In the past, quiescent-state forcing also included sending a resched IPI to CPUs that were slow about arriving at quiescent states, but it turns out that this is usually ineffective.
The procedure used to force quiescent states is as follows:
This procedure scans the leaf rcu_node
structures,
checking the ->qsmask
bits to find CPUs that have
not yet reported quiescent states.
Each such CPU is checked to see if it is offline (but only if the grace period is at least one jiffy old), and the corresponding rcu_dynticks structure is checked to see if the CPU is in dyntick-idle mode.
If not, and if this is the first attempt to force quiescent states for
this grace period, the rcu_dynticks
structure's
->dynticks
field is copied to the rcu_data
structure's ->dynticks_snap
field.
Later attempts to force quiescent states can then compare
this ->dynticks_snap
against
the ->dynticks
field to see if the CPU passed through
dyntick-idle mode in the meantime.
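To make the even/odd convention concrete, here is a minimal sketch, using hypothetical names and omitting the real code's memory ordering and rcu_dynticks plumbing, of how such a counter behaves:

static atomic_t dynticks_sketch = ATOMIC_INIT(1); /* Odd: CPU starts non-idle. */

static void sketch_idle_enter(void)
{
	atomic_inc(&dynticks_sketch); /* Value now even: in dyntick-idle mode. */
}

static void sketch_idle_exit(void)
{
	atomic_inc(&dynticks_sketch); /* Value now odd: back to non-idle. */
}

/* Invoked from some other CPU: has the target been idle since "snap"? */
static int sketch_idle_since(unsigned int snap)
{
	unsigned int curr = (unsigned int)atomic_read(&dynticks_sketch);

	/* Even right now, or incremented at least twice since the snapshot. */
	return (curr & 0x1) == 0 || curr - snap >= 2;
}

The real code instead uses atomic_add_return(0, ...), which obtains full memory ordering along with the value, as will be seen below.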
In addition, if the grace period has extended too long (500 milliseconds
by default), quiescent-state forcing will initiate RCU priority
boosting on any tasks queued on any of the leaf or root rcu_node
structures.
There only needs to be one instance of quiescent-state forcing running
for a given RCU flavor at any given time.
The forcing of quiescent states is normally time-based, so that the
first CPU that notices that the current grace period has been running
for too long attempts to initiate forcing.
This can be accomplished using spin_trylock(), but on large systems this approach can result in excessive memory contention, which can significantly slow down the entire system.
Quiescent-state forcing is therefore initiated using a modified tournament-locking scheme, making use of a dedicated ->fqslock field in each rcu_node structure, so that these locks form the same tree as the rcu_node structures themselves.
A given CPU first checks to see if quiescent-state forcing is already ongoing, and if not, uses spin_trylock() to attempt to acquire that CPU's leaf rcu_node structure's ->fqslock.
If quiescent-state forcing was already ongoing or if it failed to
acquire the lock, the CPU goes on its way because quiescent-state
forcing is already running or soon will be.
On the other hand, if the CPU succeeds in acquiring the lock,
it moves up to the next level in the rcu_node
tree
and repeats this process.
After successfully acquiring the lock at any non-leaf level of the
tree, the CPU releases the lock acquired at the previous level.
This approach allows quiescent-state forcing to be reliably initiated while bounding the degree of memory contention.
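The following minimal sketch illustrates the resulting trylock-based funnel; node_t, request_posted(), and post_request() are hypothetical stand-ins for the rcu_node tree and the RCU_GP_FLAG_FQS handshake, the production version being the force_quiescent_state() function covered later in this article:

struct node_t {
	raw_spinlock_t fqslock;
	struct node_t *parent; /* NULL at the root. */
};

static void funnel_to_root(struct node_t *leaf)
{
	struct node_t *np, *np_old = NULL;

	for (np = leaf; np != NULL; np = np->parent) {
		bool fail = request_posted() || /* Work already requested? */
			    !raw_spin_trylock(&np->fqslock);

		if (np_old != NULL)
			raw_spin_unlock(&np_old->fqslock); /* Drop previous level. */
		if (fail)
			return; /* Someone else is, or soon will be, doing the work. */
		np_old = np;
	}
	post_request(); /* Holding the root's ->fqslock, so post the request. */
	raw_spin_unlock(&np_old->fqslock);
}

Because each CPU makes only a single trylock attempt at each level, and because only holders of a lower-level ->fqslock can attempt a given non-leaf node's lock, the number of CPUs contending for any given lock is bounded by the tree's fanout.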
With this background, we are ready to go through the code.
The dyntick_save_progress_counter()
and
rcu_implicit_dynticks_qs()
functions are invoked
on the first and subsequent invocations (respectively) of quiescent-state
forcing during a given grace period.
They both are invoked on each CPU that has not yet passed through
a quiescent state during the current grace period.
  1 static int dyntick_save_progress_counter(struct rcu_data *rdp)
  2 {
  3   rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
  4   return (rdp->dynticks_snap & 0x1) == 0;
  5 }
  6 
  7 static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
  8 {
  9   unsigned int curr;
 10   unsigned int snap;
 11 
 12   curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
 13   snap = (unsigned int)rdp->dynticks_snap;
 14 
 15   if ((curr & 0x1) == 0 || UINT_CMP_GE(curr, snap + 2)) {
 16     trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, "dti");
 17     rdp->dynticks_fqs++;
 18     return 1;
 19   }
 20   if (ULONG_CMP_GE(rdp->rsp->gp_start + 2, jiffies))
 21     return 0;
 22   barrier();
 23   if (cpu_is_offline(rdp->cpu)) {
 24     trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, "ofl");
 25     rdp->offline_fqs++;
 26     return 1;
 27   }
 28   return 0;
 29 }
The dyntick_save_progress_counter()
function spans
lines 1-5, and is invoked on the first attempt to force
quiescent states.
Line 3 takes a snapshot of the CPU's ->dynticks
counter,
and line 4 returns true
if the CPU is in
dyntick-idle mode, in other words, if the counter has an even value.
The CPU is specified by the rcu_data
structure passed in.
The rcu_implicit_dynticks_qs() function spans lines 7-29, and is invoked for each holdout CPU on the second and subsequent attempts to force quiescent states.
Again, the CPU is specified by the rcu_data
structure
passed in.
Line 12 picks up another snapshot of the dynticks-idle counter,
and line 13 picks up the snapshot from the first attempt to
force quiescent states during this grace period.
Line 15 checks to see if this CPU is currently in dyntick-idle mode (curr has an even value) or if this CPU has passed through dyntick-idle mode at least once since the first attempt to force quiescent states (curr exceeds snap by at least two).
If so, lines 16-18 do tracing, increment statistics, and
return indicating that this CPU has passed through a quiescent state
during the current grace period.
Otherwise, line 20 checks to see whether the grace period is old enough for it to be safe to interpret an offline CPU as being in a quiescent state, and if not, line 21 returns, indicating that this CPU has not yet passed through a quiescent state for this grace period. If the grace period is old enough, execution continues with line 22, which ensures that the compiler does not interchange the checks. Line 23 then checks to see if this CPU is offline, and if so, lines 24-26 do tracing, accumulate statistics, and return, indicating that this CPU has passed through a quiescent state during the current grace period. Otherwise, line 28 returns, indicating that this CPU has not yet passed through a quiescent state for this grace period.
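The UINT_CMP_GE() and ULONG_CMP_GE() macros used on lines 15 and 20 carry out wraparound-safe comparisons of unsigned counters. They are defined in include/linux/rcupdate.h roughly as follows:

/* True if a >= b, treating the values as points on a circle, so that
 * counter wrap is harmless as long as the two values are within half
 * the counter's range of each other. */
#define UINT_CMP_GE(a, b)	(UINT_MAX / 2 >= (a) - (b))
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

For example, UINT_CMP_GE(1, UINT_MAX) evaluates to true because the unsigned subtraction wraps to 2, correctly treating 1 as being two counts “after” UINT_MAX.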
Quick Quiz 2:
But if a CPU is offline, it cannot possibly be executing
RCU read-side critical sections.
So why does the grace period have to be a given age for offline
to count as a quiescent state?
Answer
The force_qs_rnp() function shown below sequences through the CPUs that have not yet reported a quiescent state for the current grace period, invoking either the dyntick_save_progress_counter() or the rcu_implicit_dynticks_qs() function for each such CPU.
  1 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
  2 {
  3   unsigned long bit;
  4   int cpu;
  5   unsigned long flags;
  6   unsigned long mask;
  7   struct rcu_node *rnp;
  8 
  9   rcu_for_each_leaf_node(rsp, rnp) {
 10     cond_resched();
 11     mask = 0;
 12     raw_spin_lock_irqsave(&rnp->lock, flags);
 13     if (!rcu_gp_in_progress(rsp)) {
 14       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 15       return;
 16     }
 17     if (rnp->qsmask == 0) {
 18       rcu_initiate_boost(rnp, flags);
 19       continue;
 20     }
 21     cpu = rnp->grplo;
 22     bit = 1;
 23     for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
 24       if ((rnp->qsmask & bit) != 0 &&
 25           f(per_cpu_ptr(rsp->rda, cpu)))
 26         mask |= bit;
 27     }
 28     if (mask != 0) {
 29       rcu_report_qs_rnp(mask, rsp, rnp, flags);
 30       continue;
 31     }
 32     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 33   }
 34   rnp = rcu_get_root(rsp);
 35   if (rnp->qsmask == 0) {
 36     raw_spin_lock_irqsave(&rnp->lock, flags);
 37     rcu_initiate_boost(rnp, flags);
 38   }
 39 }
Each pass through the loop spanning lines 9-33 handles one
leaf rcu_node
structure.
Line 10 invokes the scheduler to maintain low latencies on
CONFIG_PREEMPT=n
kernels and
line 11 sets the mask of CPUs for which to report quiescent
states to all zero.
Line 12 acquires the current leaf rcu_node structure's ->lock, which is released by either line 14 or line 32.
If line 13 determines that there is no grace period in progress,
line 14 releases ->lock
and line 15 returns.
Line 17 checks to see if there are any CPUs associated with this rcu_node structure that have not yet reported a quiescent state for this grace period, and if not, line 18 invokes rcu_initiate_boost() to do any needed RCU priority boosting and line 19 restarts the loop, advancing to the next leaf rcu_node structure.
Quick Quiz 3:
If there is no grace period in progress, why on earth are
we trying to force quiescent states in the first place?
Answer
Quick Quiz 4:
Wait a minute!
Line 19 of force_qs_rnp() continues the loop without releasing ->lock!
Won't that result in “scheduling while atomic” warnings
from the cond_resched()
on line 10?
Answer
Lines 21 and 22 initialize loop control variables to
reference the first CPU covered by this rcu_node
structure
in preparation for the loop spanning lines 23-27, each pass through
which checks one CPU for quiescent states.
Line 24 checks to see if this CPU has already reported a quiescent
state for the current grace period, and if not, line 25
invokes the specified function, which will be
dyntick_save_progress_counter()
on the first attempt
to force quiescent states for a given grace period and
rcu_implicit_dynticks_qs()
for subsequent attempts.
If the function indicates that the CPU has now passed through a
quiescent state, line 26 ORs the current CPU's bit into the mask.
Line 28 checks to see if any of the holdout CPUs for this
rcu_node
structure passed through a quiescent state,
and if so, line 29 reports these quiescent states
and line 30 restarts the loop beginning at line 9 with the
next rcu_node
structure.
Finally, line 32 releases the current rcu_node
structure's
lock in preparation for the next pass through the outer loop.
Upon exit from the outer loop, line 34 picks up a pointer
to the root rcu_node
structure.
Line 35 checks to see if there is anything other than
preempted tasks preventing the current grace period from
completing, and if not, line 36 acquires the root
rcu_node
structure's ->lock
and
line 37 invokes rcu_initiate_boost()
to
carry out any required RCU priority boosting.
Quick Quiz 5:
Why wait for all the CPUs to pass through quiescent states
before boosting tasks preempted within RCU read-side critical
sections?
Answer
Quick Quiz 6:
And how do we know that the old grace period didn't end, with
a new one taking its place?
Wouldn't that cause the call to rcu_initiate_boost()
on line 37 of force_qs_rnp()
to prematurely boost tasks for the new grace period?
Answer
The rcu_gp_fqs()
function shown below invokes
force_qs_rnp()
, passing in either
dyntick_save_progress_counter()
or
rcu_implicit_dynticks_qs()
, depending on whether this
is the first or subsequent attempt, respectively, to force quiescent
states during the current grace period.
  1 int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
  2 {
  3   unsigned long flags;
  4   int fqs_state = fqs_state_in;
  5   struct rcu_node *rnp = rcu_get_root(rsp);
  6 
  7   rsp->n_force_qs++;
  8   if (fqs_state == RCU_SAVE_DYNTICK) {
  9     force_qs_rnp(rsp, dyntick_save_progress_counter);
 10     fqs_state = RCU_FORCE_QS;
 11   } else {
 12     force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
 13   }
 14   if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 15     raw_spin_lock_irqsave(&rnp->lock, flags);
 16     rsp->gp_flags &= ~RCU_GP_FLAG_FQS;
 17     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 18   }
 19   return fqs_state;
 20 }
Line 7 updates statistics.
Line 8 checks the state to see if this is the first attempt to
force quiescent states during the current grace period, and if so,
line 9 invokes force_qs_rnp()
passing in
dyntick_save_progress_counter()
and line 10
updates the state to indicate that the first attempt to
force quiescent states for this grace period has completed.
Otherwise, line 12 invokes force_qs_rnp() passing in rcu_implicit_dynticks_qs().
Line 14 checks to see if this round of quiescent-state forcing was explicitly requested (as opposed to implicitly “requested” via the passage of time), and if so, line 15 acquires the root rcu_node structure's ->lock, line 16 clears the request bit, and line 17 releases the lock.
Either way, line 19 returns the updated state value.
Finally, the force_quiescent_state()
function
requests that quiescent-state forcing begin:
  1 static void force_quiescent_state(struct rcu_state *rsp)
  2 {
  3   unsigned long flags;
  4   bool ret;
  5   struct rcu_node *rnp;
  6   struct rcu_node *rnp_old = NULL;
  7 
  8   rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
  9   for (; rnp != NULL; rnp = rnp->parent) {
 10     ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
 11           !raw_spin_trylock(&rnp->fqslock);
 12     if (rnp_old != NULL)
 13       raw_spin_unlock(&rnp_old->fqslock);
 14     if (ret) {
 15       rsp->n_force_qs_lh++;
 16       return;
 17     }
 18     rnp_old = rnp;
 19   }
 20   raw_spin_lock_irqsave(&rnp_old->lock, flags);
 21   raw_spin_unlock(&rnp_old->fqslock);
 22   if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
 23     rsp->n_force_qs_lh++;
 24     raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 25     return;
 26   }
 27   rsp->gp_flags |= RCU_GP_FLAG_FQS;
 28   raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 29   wake_up(&rsp->gp_wq);
 30 }
Line 8 obtains a pointer to the current CPU's rcu_node
structure,
and each pass through the loop spanning lines 9-19 contends for
the modified tournament lock at each level of the rcu_node
tree.
Line 10 checks to see if a request for forcing of quiescent states has already been posted, and line 11 attempts to acquire the current rcu_node structure's ->fqslock, recording the result in local variable ret, which will be set if either the request was already posted or the attempt to acquire the lock failed; in other words, ret will be set if there is no reason to progress to the next level of the rcu_node tree.
If line 12 determines that we are no longer on the leaf level, line 13 releases the previous rcu_node structure's ->fqslock.
If line 14 sees that there is no reason to progress to the
next level of the tree, line 15 updates statistics and line 16
returns to the caller.
If execution progresses to line 20, rnp_old references the root rcu_node structure. Line 20 acquires the root rcu_node structure's ->lock and line 21 releases its ->fqslock.
If line 22 determines that there is still no quiescent-state forcing request posted, line 23 updates statistics, line 24 releases the ->lock, and line 25 returns to the caller.
Otherwise, line 27 sets the request flag, line 28 releases the root rcu_node structure's ->lock, and line 29 wakes up rcu_gp_kthread().
Quick Quiz 7:
But what invokes rcu_gp_fqs()?
Answer
This article has described RCU's quiescent-state forcing, examining how RCU detects extended quiescent states, such as those of dyntick-idle or offline CPUs, when a grace period runs for too long.
Quick Quiz 8:
But all this “forcing” doesn't really force anything!
Doesn't this do nothing but detect pre-existing quiescent states?
Answer
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: But if the CPU is offline, why should RCU even be expecting a quiescent state from it?
Answer:
Normally, the settings of the ->qsmaskinit
bits would indeed prevent RCU from expecting a quiescent state from
an offline CPU.
However, CPU-hotplug operations are not atomic, so it is possible
for a grace period that starts concurrently with a CPU-hotplug operation
to be expecting a quiescent state from an offline CPU.
Hence the forcing of quiescent states must check for offline CPUs.
Quick Quiz 2: But if a CPU is offline, it cannot possibly be executing RCU read-side critical sections. So why does the grace period have to be a given age for offline to count as a quiescent state?
Answer: The way CPU hotplug works, the outgoing CPU takes one more pass through the scheduler on its way to the idle loop. The scheduler uses RCU read-side critical sections, so RCU does have to pay attention to CPUs for a short time after they mark themselves offline. Similarly, CPUs can execute RCU read-side critical sections shortly before they mark themselves online. Both cases are handled by requiring that the grace period be at least one jiffy old before interpreting offline as a quiescent state.
Yes, this is a kludge. Yes, it is in the process of being fixed.
Quick Quiz 3: If there is no grace period in progress, why on earth are we trying to force quiescent states in the first place?
Answer: There presumably was a grace period in progress when we started forcing quiescent states. However, it is quite possible that the last CPU subsequently independently reported a quiescent state, which could cause the grace period to end before we finished forcing quiescent states.
Quick Quiz 4: Wait a minute! Line 19 of force_qs_rnp() continues the loop without releasing ->lock! Won't that result in “scheduling while atomic” warnings from the cond_resched() on line 10?
Answer:
Nope!
The rcu_initiate_boost()
function releases the lock and
restores interrupt state.
Quick Quiz 5: Why wait for all the CPUs to pass through quiescent states before boosting tasks preempted within RCU read-side critical sections?
Answer:
Boosting is expensive, so we delay it until those tasks
are the only thing on this rcu_node
structure blocking
the current grace period.
The hope is that the delay will enable at least some of them to find
their way out of their respective RCU read-side critical sections unassisted.
Quick Quiz 6: And how do we know that the old grace period didn't end, with a new one taking its place? Wouldn't that cause the call to rcu_initiate_boost() on line 37 of force_qs_rnp() to prematurely boost tasks for the new grace period?
Answer: Nope! The quiescent-state forcing is executing in the context of the same kthread that will eventually initiate the next grace period. Therefore, although the old grace period can end asynchronously, a new one cannot possibly start until we finish forcing quiescent states for the old one.
Quick Quiz 7: But what invokes rcu_gp_fqs()?
Answer: It is invoked by rcu_gp_kthread(), which is discussed elsewhere.
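For illustration, the following greatly simplified sketch suggests how rcu_gp_kthread() alternates between waiting and forcing; grace_period_done() and jiffies_till_fqs are illustrative stand-ins rather than the actual kernel code:

/* Hypothetical fragment of the grace-period kthread's loop. */
fqs_state = RCU_SAVE_DYNTICK; /* First attempt snapshots ->dynticks. */
while (!grace_period_done(rsp)) {
	/* Sleep, but wake early if force_quiescent_state() posts a request. */
	wait_event_interruptible_timeout(rsp->gp_wq,
			ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS,
			jiffies_till_fqs);
	if (grace_period_done(rsp))
		break;
	fqs_state = rcu_gp_fqs(rsp, fqs_state);
}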
Quick Quiz 8: But all this “forcing” doesn't really force anything! Doesn't this do nothing but detect pre-existing quiescent states?
Answer: This is true for dyntick-idle and offline-CPU detection. However, for kernels built with CONFIG_RCU_BOOST=y, the calls to rcu_initiate_boost() really do force quiescent states by priority-boosting tasks preempted within RCU read-side critical sections.
All that aside, the name is historical from back when
force_quiescent_state()
attempted to use resched IPIs
to force context switches on reluctant CPUs.