September 26, 2011
This article was contributed by Paul E. McKenney
And what kind of RCU documentation would this be without the answers to the quick quizzes?
The normal RCU grace periods prioritize low overhead over latency. In fact, the longer the latency, the greater the number of RCU updates that will be served by a single RCU grace period, so that the overhead of detecting this single RCU grace period may be amortized over a larger number of updates.
In contrast, for an expedited RCU grace period, speed is of the essence. The expedited RCU implementation therefore takes extreme measures, for example, sending IPIs to all CPUs, in order to reduce grace-period latency. These measures are described in more detail in the next section.
The basic operation of the RCU-sched and RCU-bh expedited grace periods is straightforward. The idea is to use the stop-CPU subsystem, which uses high-priority kthreads to interrupt processing on each online CPU. This forces every CPU to undergo a context switch, which results in a grace period for both RCU-sched and RCU-bh.
The actual implementation uses additional optimizations, for example, to allow multiple concurrent requests to be satisfied by a single stop-CPU operation. These optimizations will be described in the implementation section that steps through the actual code.
The stop-CPU approach outlined above does not suffice for RCU-preempt
due to the possibility of tasks preempted in RCU-preempt read-side critical
sections.
However, the RCU-preempt expedited grace period still uses the stop-CPU
subsystem (by invoking synchronize_sched_expedited())
to force a context switch on every CPU.
This means that every task still within an RCU-preempt read-side critical
section that it entered prior to the expedited grace period will now be
enqueued on some rcu_node structure's ->blkd_tasks list.
When one of these tasks exits its RCU read-side critical
section, it removes itself from the ->blkd_tasks
list
that it is queued on.
Therefore, once all of these tasks have removed themselves, the
expedited grace period will be complete.
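For concreteness, the sketch below shows a hypothetical RCU-preempt reader; struct foo, foo_head, and read_foo() are illustrative names rather than kernel code. If this task is preempted between rcu_read_lock() and rcu_read_unlock(), it is queued on its rcu_node structure's ->blkd_tasks list, and any expedited grace period that begins after the rcu_read_lock() must wait for the matching rcu_read_unlock():

  struct foo {
    int data;
  };
  static struct foo __rcu *foo_head;

  static int read_foo(void)
  {
    struct foo *fp;
    int val = -1;

    rcu_read_lock();    /* enter RCU-preempt read-side critical section */
    fp = rcu_dereference(foo_head);
    if (fp)
      val = fp->data;
    rcu_read_unlock();  /* exit; if preempted, dequeue from ->blkd_tasks */
    return val;
  }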
Of course, a task might remain preempted in its RCU read-side critical section for an extended period of time. Therefore, in kernel configurations supporting RCU priority boosting, all tasks blocking a given RCU-preempt expedited grace period are subjected to RCU priority boosting.
Regardless of whether or not priority boosting is available,
the end of the expedited grace period is detected using the
rcu_node
tree as a combining tree, in a manner similar
to that of the normal grace periods, but with no per-CPU component.
This process is illustrated by the following sequence of events,
each accompanied below by a diagram and commentary:
The initial state of the expedited grace period combining tree is as follows:
The rcu_data
and rcu_dynticks
structures are omitted from this diagram because they play no part
in the expedited grace period.
However, their absence requires us to keep track of the fact that the
leftmost rcu_node
structure covers CPUs 0 and 1
while the rightmost rcu_node
structure covers
CPUs 2 and 3.
The “em..” in each rcu_node
structure
represents the state of the ->expmask
field, each bit
of which indicates whether or not there are tasks in the corresponding
subtree of the rcu_node
tree blocking an expedited
grace period, with “?” indicating that there are and
“.” indicating that there are not.
The “b” represents the ->blkd_tasks
field
that heads the list of tasks blocked within an RCU read-side critical
section,
and the “e” field represents the
->exp_tasks
field that indicates which blocked tasks
are blocking the current expedited grace period.
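A much-simplified sketch of these fields appears below; the real rcu_node structure is defined in kernel/rcutree.h and uses struct list_head rather than the bare pointers shown here, so treat this as illustration only:

  struct blkd_task;                 /* stand-in for the kernel's task_struct */

  struct sketch_rcu_node {
    unsigned long expmask;          /* "em": one bit per child subtree (or per
                                       CPU at the leaves) that might still be
                                       blocking the expedited grace period */
    struct blkd_task *blkd_tasks;   /* "b": tasks preempted within an RCU
                                       read-side critical section */
    struct blkd_task *exp_tasks;    /* "e": first task on ->blkd_tasks blocking
                                       the current expedited grace period, or
                                       NULL if there is none */
    struct sketch_rcu_node *parent; /* combining-tree linkage */
  };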
When task A blocks within an RCU read-side critical section
while running on CPU 0, it is queued on the leftmost
rcu_node
structure, as shown below:
Because there is no expedited grace period in progress, task A is not blocking any expedited grace period.
When tasks B and C enter their RCU read-side critical sections, there is no immediate visible change in expedited grace period state. However, as we will see, the fact that these tasks are in RCU read-side critical sections will influence any future expedited grace period.
CPU 0's initiation of an expedited grace period will be illustrated
in several steps.
The first step is an invocation of synchronize_sched_expedited(),
which forces each CPU to undergo a context switch,
thus forcing each task in an RCU read-side critical section to enqueue itself
on one of the rcu_node structures as shown below:
Quick Quiz 1:
Why put task C at the head of the list?
That is just backwards!!!
Answer
CPU 0 now initializes the ->expmask
fields
of all non-leaf rcu_node
structures, resulting in the
following state:
Quick Quiz 2:
Why don't the leaf rcu_node
structures also
have their ->expmask
fields initialized?
Answer
Next, CPU 0 scans all the rcu_node
structures,
pointing the ->exp_tasks
field to the head of any non-empty
->blkd_tasks
lists, as shown below.
As can be seen from this diagram, tasks A, B, and C are blocking the expedited grace period.
Quick Quiz 3:
Given that it was not running at the time that the
expedited grace period started, why is task A blocking
the expedited grace period?
Answer
Because all of the leaf rcu_node
structures have
at least one task blocking the expedited grace period, the root
rcu_node
structure's two ->expmask
bits both remain set.
CPU 0 now blocks waiting for the expedited grace period to complete.
If a task D blocks within an RCU read-side critical section
while running on CPU 2, it will be queued on the rightmost
rcu_node structure, as shown below:
Because task D started its RCU read-side critical section after the expedited grace period started, it does not block the expedited grace period.
When task B exits its RCU read-side critical section, it will
remove itself from its ->blkd_tasks
list.
Because that list then has no more tasks blocking the current
expedited grace period, the corresponding bit in the root
rcu_node
structure's ->expmask
field
is cleared, as shown below:
Now tasks A and C are the only ones left blocking the expedited grace period.
When task A exits its RCU read-side critical section, it will
also remove itself from its ->blkd_tasks
list.
However, this list still contains another task blocking the
current expedited grace period, so no further action is taken.
The state is then as shown below:
Now only task C blocks the expedited grace period.
When task C exits its RCU read-side critical section, it too
will remove itself from its ->blkd_tasks
list.
Because that list then has no more tasks blocking the current
expedited grace period, the corresponding bit in the root
rcu_node
structure's ->expmask
field
is cleared, as shown below:
The expedited grace period has now completed.
Given this background, we are now ready to take a look at the code.
The RCU-sched and RCU-bh implementations, being handled by the same code, are described in the following section. The RCU-preempt implementation is described in the section after that.
The RCU-sched and RCU-bh implementations are provided by the
synchronize_sched_expedited_cpu_stop()
and
synchronize_sched_expedited()
functions shown below:
 1 static atomic_t sync_sched_expedited_started = ATOMIC_INIT(0);
 2 static atomic_t sync_sched_expedited_done = ATOMIC_INIT(0);
 3
 4 static int synchronize_sched_expedited_cpu_stop(void *data)
 5 {
 6   smp_mb();
 7   return 0;
 8 }
 9
10 void synchronize_sched_expedited(void)
11 {
12   int firstsnap, s, snap, trycount = 0;
13
14   firstsnap = snap = atomic_inc_return(&sync_sched_expedited_started);
15   get_online_cpus();
16   while (try_stop_cpus(cpu_online_mask,
17                        synchronize_sched_expedited_cpu_stop,
18                        NULL) == -EAGAIN) {
19     put_online_cpus();
20     if (trycount++ < 10)
21       udelay(trycount * num_online_cpus());
22     else {
23       synchronize_sched();
24       return;
25     }
26     s = atomic_read(&sync_sched_expedited_done);
27     if (UINT_CMP_GE((unsigned)s, (unsigned)firstsnap)) {
28       smp_mb();
29       return;
30     }
31     get_online_cpus();
32     snap = atomic_read(&sync_sched_expedited_started);
33     smp_mb();
34   }
35   do {
36     s = atomic_read(&sync_sched_expedited_done);
37     if (UINT_CMP_GE((unsigned)s, (unsigned)snap)) {
38       smp_mb();
39       break;
40     }
41   } while (atomic_cmpxchg(&sync_sched_expedited_done, s, snap) != s);
42   put_online_cpus();
43 }
The sync_sched_expedited_started
and
sync_sched_expedited_done
variables on lines 1
and 2 respectively act somewhat like a ticket lock.
These are used to allow a given synchronize_sched_expedited()
call to determine whether it can rely on some concurrent call to
synchronize_sched_expedited()
having done the work for it.
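The way this pair of counters lets one caller piggyback on another's work can be seen in the following userspace sketch, which substitutes C11 atomics for the kernel's atomic_t API and a cmp_ge() helper for UINT_CMP_GE(); all of the names are illustrative:

  #include <limits.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  static atomic_uint started;  /* incremented when a caller begins */
  static atomic_uint done;     /* advanced when a grace period completes */

  /* Wrap-safe unsigned "a >= b", in the spirit of UINT_CMP_GE(). */
  static bool cmp_ge(unsigned int a, unsigned int b)
  {
    return a - b < UINT_MAX / 2;
  }

  static void expedited_sketch(void)
  {
    unsigned int snap = atomic_fetch_add(&started, 1) + 1;
    unsigned int d;

    if (cmp_ge(atomic_load(&done), snap))
      return;  /* a grace period starting after our snapshot already finished */

    /* ... the expensive stop-CPUs work would go here ... */

    /* Advance done to snap (never backwards) so that concurrent callers
       whose snapshots we cover can return immediately. */
    d = atomic_load(&done);
    while (!cmp_ge(d, snap) &&
           !atomic_compare_exchange_weak(&done, &d, snap))
      ;  /* on failure, d has been reloaded from done */
  }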
The synchronize_sched_expedited_cpu_stop()
function
shown on lines 4-8 is
invoked in stop-CPU context, and simply does a memory barrier.
Quick Quiz 4:
Given that the scheduler has done a full context switch
in order to allow the stop-CPU context to start running, why
bother with the memory barrier?
Answer
The synchronize_sched_expedited()
function shown on lines 10-43 implements RCU-sched expedited grace periods,
which also serve as RCU-bh expedited grace periods.
Line 14 atomically increments the
sync_sched_expedited_started
counter, returning the new
value, which will be used later to determine if some other task
did our work for us.
Line 15 holds off CPU-hotplug operations for the duration of
the expedited grace period.
Each pass through the loop spanning lines 16-34 attempts
to do a stop-CPUs operation, which is used in this case solely for
the fact that it forces a context switch on every CPU, thus forcing
both an RCU-sched and an RCU-bh grace period.
Lines 16-18 invoke try_stop_cpus()
, exiting the
loop if this succeeds.
Otherwise, the loop body is executed.
Line 19 re-enables CPU-hotplug operations.
Line 20 increments the number of attempts, and if there have not been
too many, line 21 delays to avoid memory contention that might otherwise
occur in the case of multiple concurrent calls to
synchronize_sched_expedited()
.
Otherwise, we have spent too much time trying to expedite a grace
period, so line 23 simply invokes synchronize_sched()
and
line 24 returns.
Line 26 reads the sync_sched_expedited_done
counter and
line 27 checks to see if some concurrent execution of
synchronize_sched_expedited()
ran after we started,
in which case our work has been done for us, so line 28 executes
a memory barrier to ensure that the caller's later actions happen
after the expedited grace period, and line 29 returns.
We get to line 31 when we need to invoke try_stop_cpus()
once again.
Line 31 holds off CPU-hotplug operations, line 32
gets a new snapshot of sync_sched_expedited_started, and
line 33 ensures that the snapshot happens before the
try_stop_cpus()
call that will be executed on the next pass
through the loop.
If try_stop_cpus()
ever succeeds, we exit the while loop,
thus starting the atomic_cmpxchg()
loop spanning lines 35-41.
Line 36 picks up the current value of
sync_sched_expedited_done
, and then line 37
checks to see if this counter has already passed our most recent
snapshot of sync_sched_expedited_started
, and, if so,
line 38 executes a memory barrier to ensure that the caller's
subsequent actions are seen by all to occur after the expedited
grace period, and line 39 exits the loop.
Otherwise, line 41 updates sync_sched_expedited_done
to our last snapshot of sync_sched_expedited_started
,
but only if no other CPU has updated it in the meantime.
Upon exit from the atomic_cmpxchg()
loop,
line 42 re-enables CPU-hotplug operations.
Quick Quiz 5:
Wouldn't it be a lot simpler to call stop_cpus()
instead of dealing with failure from try_stop_cpus()
?
Answer
The RCU-preempt implementation is built on top of the
synchronize_sched_expedited()
implementation described
in the previous section, using the
sync_rcu_preempt_exp_done()
,
rcu_report_exp_rnp()
, and
sync_rcu_preempt_exp_init()
functions shown below,
along with the synchronize_rcu_expedited()
function
discussed later.
 1 static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
 2 {
 3   return !rcu_preempted_readers_exp(rnp) &&
 4          ACCESS_ONCE(rnp->expmask) == 0;
 5 }
 6
 7 static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
 8 {
 9   unsigned long flags;
10   unsigned long mask;
11
12   raw_spin_lock_irqsave(&rnp->lock, flags);
13   for (;;) {
14     if (!sync_rcu_preempt_exp_done(rnp)) {
15       raw_spin_unlock_irqrestore(&rnp->lock, flags);
16       break;
17     }
18     if (rnp->parent == NULL) {
19       raw_spin_unlock_irqrestore(&rnp->lock, flags);
20       wake_up(&sync_rcu_preempt_exp_wq);
21       break;
22     }
23     mask = rnp->grpmask;
24     raw_spin_unlock(&rnp->lock);
25     rnp = rnp->parent;
26     raw_spin_lock(&rnp->lock);
27     rnp->expmask &= ~mask;
28   }
29 }
30
31 static void
32 sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
33 {
34   unsigned long flags;
35   int must_wait = 0;
36
37   raw_spin_lock_irqsave(&rnp->lock, flags);
38   if (list_empty(&rnp->blkd_tasks))
39     raw_spin_unlock_irqrestore(&rnp->lock, flags);
40   else {
41     rnp->exp_tasks = rnp->blkd_tasks.next;
42     rcu_initiate_boost(rnp, flags);
43     must_wait = 1;
44   }
45   if (!must_wait)
46     rcu_report_exp_rnp(rsp, rnp);
47 }
The sync_rcu_preempt_exp_done()
function
shown on lines 1-5 checks to see if all of the specified
rcu_node
structure's readers blocking
the current RCU-preempt expedited grace period have exited their
RCU read-side critical sections.
Line 3 checks to see whether all readers blocking the current
expedited grace period queued directly on this rcu_node
structure have finished, while line 4 carries out the same
check for all rcu_node
structures subordinate to
the one specified.
Quick Quiz 6:
How could any task possibly be queued on other than a leaf
rcu_node
structure?
Answer
The rcu_report_exp_rnp()
function shown on
lines 7-29 propagates exits from RCU read-side critical sections
up the rcu_node
tree.
Line 12 acquires the specified rcu_node
structure's
->lock
.
Each pass through the loop spanning lines 13-28 handles one level
of the rcu_node
tree.
Line 14 checks to see whether all tasks associated with the current
rcu_node
structure or one of its descendants have completed their RCU read-side
critical sections,
and if not, line 15 releases the rcu_node
structure's
->lock
and line 16 exits the loop.
Otherwise, line 18 checks to see if this rcu_node
has a parent, and if not, line 19
releases the rcu_node
structure's ->lock
,
line 20 wakes up the task that initiated the expedited grace
period, and line 21 exits the loop.
Otherwise, it is necessary to propagate up the tree.
Line 23 records the current rcu_node
structure's
bit position in its parent's ->expmask
field and
line 24 releases the current rcu_node
structure's
->lock.
Line 25 moves up to the parent and line 26 acquires its
->lock
.
Finally, line 27 clears the child rcu_node
structure's
bit in the parent's ->expmask
, followed by another
pass through the loop.
Quick Quiz 7:
If control reaches line 19 of
rcu_report_exp_rnp()
, how do we know that the
expedited grace period really is completed?
Answer
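The propagation that rcu_report_exp_rnp() performs can be boiled down to the following simplified sketch, which omits locking, wakeups, and the interaction with ->exp_tasks; the names are illustrative rather than the kernel's:

  struct node {
    struct node *parent;    /* NULL at the root */
    unsigned long grpmask;  /* this node's bit in parent->expmask */
    unsigned long expmask;  /* children (or CPUs) still being waited on */
    int nblocked;           /* tasks queued here blocking the expedited GP */
  };

  static int node_done(struct node *np)
  {
    return np->nblocked == 0 && np->expmask == 0;
  }

  /* Called when the last blocking reader queued on np departs. */
  static void report_up(struct node *np)
  {
    while (node_done(np)) {
      if (np->parent == NULL)
        break;               /* root is clean: the expedited grace period is
                                complete, so wake up the initiating task */
      np->parent->expmask &= ~np->grpmask;
      np = np->parent;
    }
  }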
The sync_rcu_preempt_exp_init()
function shown on
lines 31-47 initializes the specified rcu_node
structure for a new expedited grace period.
Line 37 acquires the rcu_node
structure's
->lock
.
Line 38 then checks to see if there are any tasks blocked on this
rcu_node
structure, and if not, line 39 releases the
->lock
.
Otherwise, line 41 points the rcu_node
structure's
->exp_tasks
pointer to the first blocked task in the
list, line 42 initiates RCU priority boosting for kernels
supporting this notion, and line 43 records the fact that the
expedited grace period will have to wait on this rcu_node
structure.
Either way, line 45 checks to see whether we need to wait on this
rcu_node
structure, and if not, line 46
reports that fact up the rcu_node
tree.
Quick Quiz 8:
What happens if there are no tasks blocking the current
expedited grace period?
Won't that result in the wake_up()
happening before the
initiating task blocks, in turn resulting in a hang?
Answer
Now on to the synchronize_rcu_expedited()
function, along with its data variables, all shown below:
 1 static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
 2 static long sync_rcu_preempt_exp_count;
 3 static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
 4
 5 void synchronize_rcu_expedited(void)
 6 {
 7   unsigned long flags;
 8   struct rcu_node *rnp;
 9   struct rcu_state *rsp = &rcu_preempt_state;
10   long snap;
11   int trycount = 0;
12
13   smp_mb();
14   snap = ACCESS_ONCE(sync_rcu_preempt_exp_count) + 1;
15   smp_mb();
16   while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
17     if (trycount++ < 10)
18       udelay(trycount * num_online_cpus());
19     else {
20       synchronize_rcu();
21       return;
22     }
23     if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
24       goto mb_ret;
25   }
26   if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
27     goto unlock_mb_ret;
28   synchronize_sched_expedited();
29   raw_spin_lock_irqsave(&rsp->onofflock, flags);
30   rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
31     raw_spin_lock(&rnp->lock);
32     rnp->expmask = rnp->qsmaskinit;
33     raw_spin_unlock(&rnp->lock);
34   }
35   rcu_for_each_leaf_node(rsp, rnp)
36     sync_rcu_preempt_exp_init(rsp, rnp);
37   if (NUM_RCU_NODES > 1)
38     sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
39   raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
40   rnp = rcu_get_root(rsp);
41   wait_event(sync_rcu_preempt_exp_wq,
42              sync_rcu_preempt_exp_done(rnp));
43   smp_mb();
44   ACCESS_ONCE(sync_rcu_preempt_exp_count)++;
45 unlock_mb_ret:
46   mutex_unlock(&sync_rcu_preempt_exp_mutex);
47 mb_ret:
48   smp_mb();
49 }
Line 1 shows the sync_rcu_preempt_exp_wq
wait queue on which synchronize_rcu_expedited()
blocks when needed,
line 2 shows the sync_rcu_preempt_exp_count
counter that enables concurrent calls to
synchronize_rcu_expedited()
to share a single
expedited grace period, and
line 3 defines the sync_rcu_preempt_exp_mutex
used for mutual exclusion.
Lines 13-15 take a snapshot of
sync_rcu_preempt_exp_count
so that we can later determine if
someone else did our work for us.
Each pass through the loop spanning lines 16-25 makes an attempt
to acquire sync_rcu_preempt_exp_mutex.
If this attempt fails, line 17 checks to see if the number of tries
has been excessive, and, if not, line 18 delays for a short time;
otherwise, line 20 waits for a full normal grace period and line 21
returns.
Line 23 checks to see if a full expedited grace period has elapsed since we started, and if so, line 24 goes to clean up and exit, piggybacking on this other expedited grace period.
Quick Quiz 9:
But this check cannot succeed unless
sync_rcu_preempt_exp_count
has been incremented twice
since we first sampled it on line 14 of
synchronize_rcu_expedited
.
Since each expedited grace period increments this counter only
once, this means that two expedited grace periods have completed
during this interval.
So why shouldn't the comparison on line 23 be for
greater-than-or-equal rather than strictly greater-than?
Answer
Once the loop exits, execution reaches line 26 with
the sync_rcu_preempt_exp_mutex
mutex held.
Line 26 performs the same check as did line 23, and
if some other expedited grace period started after we did and
has already completed, then line 27 goes to clean up and exit.
Line 28 invokes synchronize_sched_expedited(),
which has the side effect of forcing each task currently within an
RCU-preempt read-side critical section to be enqueued on one of
the leaf rcu_node
structures' ->blkd_tasks
lists.
Once this is complete, it is only necessary to wait for each
queued task to dequeue itself.
Quick Quiz 10:
Suppose that there is a continual stream of new tasks
blocking within RCU-preempt read-side critical sections.
Won't that prevent the expedited grace period from ever completing?
Answer
Line 29 acquires the rcu_state
structure's
->onofflock
, holding off changes to RCU's idea of
which CPUs are online until line 39, where this lock is released.
Lines 30-34 set up all of the non-leaf rcu_node
structures to wait for all queued tasks to complete by setting
each ->expmask
field to the corresponding
->qsmaskinit
field under the protection of the
corresponding ->lock
.
Lines 35-38 invoke sync_rcu_preempt_exp_init()
on each leaf rcu_node structure and (if the rcu_node tree has more
than one node) on the root rcu_node structure, which
records which portions of the rcu_node
tree contain
queued tasks that block the current expedited grace period.
Line 40 obtains a pointer to the root rcu_node
structure so that lines 41 and 42 can wait for all queued
tasks to exit their RCU read-side critical sections.
Line 43 executes a memory barrier to ensure that the expedited
grace-period computations are seen to precede incrementing of
sync_rcu_preempt_exp_count
on line 44.
Line 46 releases sync_rcu_preempt_exp_mutex
and line 48 executes a memory barrier to ensure that
accesses to sync_rcu_preempt_exp_count
are seen to
happen before any actions that the caller might take after
return from synchronize_rcu_expedited()
.
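To put synchronize_rcu_expedited() in context, a hypothetical updater might use it as follows, reusing the illustrative struct foo and foo_head from the reader sketch earlier in this article; update_foo() and foo_lock are likewise made-up names:

  static DEFINE_SPINLOCK(foo_lock);

  static void update_foo(struct foo *new_fp)
  {
    struct foo *old_fp;

    spin_lock(&foo_lock);
    old_fp = rcu_dereference_protected(foo_head,
                                       lockdep_is_held(&foo_lock));
    rcu_assign_pointer(foo_head, new_fp);
    spin_unlock(&foo_lock);
    synchronize_rcu_expedited();  /* quickly wait for pre-existing readers */
    kfree(old_fp);                /* now safe to free the old version */
  }

The only difference from a conventional RCU updater is the use of synchronize_rcu_expedited() in place of synchronize_rcu(), trading extra CPU overhead (and possibly IPIs) for much lower grace-period latency.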
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: Why put task C at the head of the list? That is just backwards!!!
Answer: The reason will become apparent later in this example when task D blocks.
Quick Quiz 2:
Why don't the leaf rcu_node
structures also
have their ->expmask
fields initialized?
Answer:
Because there cannot be any tasks queued below the leaf
rcu_node
structures, so there is no need for the leaf
rcu_node
structures to track anything.
The reason for this asymmetry is that we use the same
rcu_node
tree that is used by the normal grace periods,
which do need the leaf rcu_node
structures to track
per-CPU status.
Quick Quiz 3: Given that it was not running at the time that the expedited grace period started, why is task A blocking the expedited grace period?
Answer: Because it entered an RCU read-side critical section before the expedited grace period started, and it remains in that critical section. Therefore it must by definition block the expedited grace period, as well as any later non-expedited grace period, for that matter.
Quick Quiz 4: Given that the scheduler has done a full context switch in order to allow the stop-CPU context to start running, why bother with the memory barrier?
Answer: Pure paranoia. Just in case someone comes up with a hyper-optimized code path through the scheduler...
Quick Quiz 5:
Wouldn't it be a lot simpler to call stop_cpus()
instead of dealing with failure from try_stop_cpus()
?
Answer:
Ah, but the alternative is massive contention on the
stop_cpus_mutex
that is unconditionally acquired by
stop_cpus()
.
Such contention would be a very bad idea on systems with large
numbers of CPUs.
In addition, using stop_cpus()
would prevent a single
stop-CPU operation from benefiting an arbitrarily large number of
concurrent synchronize_sched_expedited()
invocations.
Quick Quiz 6:
How could any task possibly be queued on other than a leaf
rcu_node
structure?
Answer:
If a task is queued on a given leaf rcu_node
structure, but then all CPUs corresponding to that rcu_node
structure go offline, that task will be moved to the root
rcu_node
structure.
Quick Quiz 7:
If control reaches line 19 of
rcu_report_exp_rnp()
, how do we know that the
expedited grace period really is completed?
Answer:
We reach line 19 if we are at the root rcu_node
and if there are no tasks blocking the current expedited grace period
on this or any subordinate rcu_node
.
This means that there are no longer any tasks blocking the current
expedited grace period, so it is by definition done.
Quick Quiz 8:
What happens if there are no tasks blocking the current
expedited grace period?
Won't that result in the wake_up()
happening before the
initiating task blocks, in turn resulting in a hang?
Answer:
The wake_up()
might well happen before the
initiating task blocks, but this cannot result in a hang.
The race conditions are resolved by use of wait_event()
.
Quick Quiz 9:
But this check cannot succeed unless
sync_rcu_preempt_exp_count
has been incremented twice
since we first sampled it on line 14 of
synchronize_rcu_expedited
.
Since each expedited grace period increments this counter only
once, this means that two expedited grace periods have completed
during this interval.
So why shouldn't the comparison on line 23 be for
greater-than-or-equal rather than strictly greater-than?
Answer: Suppose that the comparison was greater-than-or-equal. Then the following sequence of events could occur:

1. Task B enters an RCU read-side critical section and is preempted.
2. Task A invokes synchronize_rcu_expedited().
3. Task A's invocation of synchronize_rcu_expedited()
sees sync_rcu_preempt_exp_count
equal to (say) 5.
It therefore sets local variable snap
to 6.
4. Some other task's invocation of synchronize_rcu_expedited(),
which began before Task B entered its critical section and therefore
was under no obligation to wait for Task B, completes, incrementing
sync_rcu_preempt_exp_count.
5. Task A acquires sync_rcu_preempt_exp_mutex
and sees that the value of sync_rcu_preempt_exp_count
is now 6.
It therefore immediately exits synchronize_rcu_expedited()
despite Task B still being in a pre-existing
RCU read-side critical section.

In contrast, the strictly greater-than comparison requires that Task A
see sync_rcu_preempt_exp_count
change twice, which guarantees that a full expedited grace period
will have completed.
Quick Quiz 10: Suppose that there is a continual stream of new tasks blocking within RCU-preempt read-side critical sections. Won't that prevent the expedited grace period from ever completing?
Answer: No, because the expedited grace period only waits on tasks that are already enqueued. It does not wait on tasks that enqueue themselves later.