May 12, 2013 (Linux 3.9+)
This article was contributed by Paul E. McKenney
And then there are of course the ineludible answers to the quick quizzes.
Each entity waiting for an RCU callback to complete has an RCU
callback (rcu_head
structure) queued.
These callbacks are queued on the per-CPU rcu_data
structure's
->nxtlist
field in time order, so that the oldest
callbacks are at the head of the list and new callbacks are added at the tail.
There is an array of pointers into this list, and these pointers segment
the list based on the state of the callbacks in each segment with respect
to RCU grace periods: Callbacks in the first segment have had their
grace period complete and are ready to be invoked, those in the second
segment are waiting for the current grace period to complete, those
in the third segment will wait for the next grace period to complete,
and those in the fourth and last segment have not yet been associated
with a specific grace period, and thus might wait for the next grace
period or some later grace period.
Callbacks are advanced from one segment to another by updating the elements in the pointer array, as will be demonstrated in the next section.
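To make this segmented layout concrete, the following user-space sketch models such a list. The names (struct cb, struct cb_list, and so on) are invented for illustration and are not the kernel's declarations, but the shape is the same: a single head pointer like ->nxtlist, an array of tail pointers like ->nxttail[] (each referencing the ->next field, or the head pointer itself, that ends its segment), and a parallel array of grace-period numbers like ->nxtcompleted.

  #include <stddef.h>

  struct cb {
    struct cb *next;
    void (*func)(struct cb *);
  };

  /* Segment indexes, mirroring the kernel's names. */
  enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };

  /* Simplified model of the per-CPU callback list. */
  struct cb_list {
    struct cb *head;            /* oldest callback, like ->nxtlist */
    struct cb **tail[NSEGS];    /* segment boundaries, like ->nxttail[] */
    unsigned long gpnum[NSEGS]; /* like ->nxtcompleted[]; meaningful only
                                 * for WAIT_TAIL and NEXT_READY_TAIL */
  };

  void cb_list_init(struct cb_list *l)
  {
    int i;

    l->head = NULL;
    for (i = 0; i < NSEGS; i++)
      l->tail[i] = &l->head;    /* all four segments empty */
  }

  /* Enqueue a new callback into the not-yet-assigned ("Next") segment. */
  void cb_enqueue(struct cb_list *l, struct cb *cb, void (*func)(struct cb *))
  {
    cb->func = func;
    cb->next = NULL;
    *l->tail[NEXT_TAIL] = cb;
    l->tail[NEXT_TAIL] = &cb->next;
  }

Note that an empty segment is simply one whose tail pointer equals the previous segment's tail pointer (or the head pointer itself), so "moving" callbacks between segments never touches the callbacks themselves.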
This section shows how a given CPU's callbacks are handled through the course of a series of grace periods.
Initially, each CPU's RCU callback list is empty, as shown below:
The column of empty boxes will be used later in this example to
hold the RCU grace-period numbers that will be in effect when the corresponding
groups of callbacks are ready to invoke; these numbers are stored in the
->nxtcompleted
array.
However, these numbers apply only to the RCU_WAIT_TAIL
and RCU_NEXT_READY_TAIL
groups.
The RCU_DONE_TAIL
group is already ready to invoke on the one
hand, and the RCU_NEXT_TAIL
group has not yet been associated
with a grace-period number on the other.
An invocation of call_rcu()
would enqueue an RCU
callback, but would not yet associate it with a specific
grace period, resulting in the state shown below:
A second invocation of call_rcu()
would enqueue another
RCU callback, but would still not associate either with
a specific grace period, resulting in the state shown below:
If this same CPU were to start a grace period, it would see that all the callbacks on its list were enqueued prior to the start of the new grace period, which would allow both of them to be handled by this new grace period, with grace period number 2. This results in the state shown below:
The enqueuing of a third callback would result in the following state, with CBs 1 and 2 waiting for the current grace period and CB 3 not yet being assigned to a specific grace period:
When this CPU reports a quiescent state up the rcu_node
tree, it knows that the next grace period cannot possibly have started.
It is therefore safe to associate CB 3 with the next grace period
(which is grace period 3), as follows:
When a fourth callback is enqueued, it is not possible to associate it with the next grace period: This CPU has already announced its quiescent state, so the current grace period could end (and a new one start) at any time. This fourth callback must therefore be left unassociated with any specific grace period, as shown below:
Quick Quiz 1:
Suppose we (incorrectly) associated CB 4 with the next
grace period number 3.
Exactly how could problems result?
Answer
When the current CPU notices that grace period number 2 has ended, it will advance its callbacks, resulting in the following state:
Quick Quiz 2:
But we assigned CB 4 to the next grace period.
Why is this safe?
Answer
When a fifth callback is registered, we finally have all four segments of the list non-empty:
CBs 1 and 2 are now invoked, resulting in the following state:
The completion of the current grace period and two additional grace periods would result in all three remaining callbacks being invoked.
Another way of looking at the flow of callbacks through the system is via a state diagram that shows the callbacks moving from one segment of the callback list to another:
Callbacks enter the queue in the “Next” state,
which corresponds to the RCU_NEXT_TAIL
segment of the list.
If the CPU detects a change in the ->completed
value
or if this CPU either starts a grace period or reports a quiescent state,
all callbacks in the “Next” state advance to the
“Next-Ready” state, which corresponds to the
RCU_NEXT_READY_TAIL
segment of the list.
This same event will advance callbacks to the “Wait” state
(RCU_WAIT_TAIL
segment) and to the “Done” state
(RCU_DONE_TAIL
segment).
When a CPU goes offline, callbacks move from the “Next-Ready” and “Wait” states back to the “Next” state. Callbacks in the “Done” state remain there.
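As a rough model of this state machine (again a sketch with invented names, not kernel code), the per-callback transitions can be written as follows; in the kernel, of course, the transitions are applied to whole segments at once by adjusting the ->nxttail[] pointers.

  enum cb_state { CB_NEXT, CB_NEXT_READY, CB_WAIT, CB_DONE };

  /* One grace-period event (this CPU starting a grace period, reporting a
   * quiescent state, or noticing a new ->completed value) advances each
   * callback by one state. */
  enum cb_state cb_advance(enum cb_state s)
  {
    switch (s) {
    case CB_NEXT:       return CB_NEXT_READY;
    case CB_NEXT_READY: return CB_WAIT;
    case CB_WAIT:       return CB_DONE;
    default:            return CB_DONE;  /* "Done" is terminal */
    }
  }

  /* When the CPU goes offline, callbacks that are not yet done fall back
   * to the "Next" state; "Done" callbacks stay put. */
  enum cb_state cb_offline(enum cb_state s)
  {
    return s == CB_DONE ? CB_DONE : CB_NEXT;
  }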
Quick Quiz 3:
Why not map everything across?
Answer
Two additional complications are posed by rcu_barrier()
and by CPU-hotplug operations:

1. The rcu_barrier()
function waits for all previously
registered RCU callbacks to be invoked.
It does this by registering a callback on each CPU, then waiting
for all of those callbacks to be invoked.
For this to work, the callbacks registered on a given CPU must
be maintained in order.

2. When a CPU goes offline, its callbacks are moved to an online CPU,
and any rcu_barrier()
callback might well
move with them, so they must remain in order.

Note also that on kernels built with CONFIG_RCU_NOCB_CPU=y, a callback
posted on a no-callbacks CPU is not queued on
->nxtlist
at all, but rather on a separate nocb_head
list, where it is dequeued, processed, and invoked by a separate kthread
dedicated to this purpose.
Given this background, it is time to look at the code itself.
This section covers the code, starting with callback registration,
continuing with grace-period callback processing, then callback
invocation, then rcu_barrier()
handling, and
finally the handling of callbacks orphaned by CPU-hotplug operations.
The __call_rcu()
function registers new callbacks,
although it is normally invoked through one of the
call_rcu()
, call_rcu_bh()
, or
call_rcu_sched()
wrapper functions.
 1 static void
 2 __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 3            struct rcu_state *rsp, bool lazy)
 4 {
 5   unsigned long flags;
 6   struct rcu_data *rdp;
 7
 8   WARN_ON_ONCE((unsigned long)head & 0x3);
 9   debug_rcu_head_queue(head);
10   head->func = func;
11   head->next = NULL;
12   local_irq_save(flags);
13   rdp = this_cpu_ptr(rsp->rda);
14   if (unlikely(rdp->nxttail[RCU_NEXT_TAIL] == NULL)) {
15     WARN_ON_ONCE(1);
16     local_irq_restore(flags);
17     return;
18   }
19   ACCESS_ONCE(rdp->qlen)++;
20   if (lazy)
21     rdp->qlen_lazy++;
22   else
23     rcu_idle_count_callbacks_posted();
24   smp_mb();
25   *rdp->nxttail[RCU_NEXT_TAIL] = head;
26   rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
27
28   if (__is_kfree_rcu_offset((unsigned long)func))
29     trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func,
30                              rdp->qlen_lazy, rdp->qlen);
31   else
32     trace_rcu_callback(rsp->name, head, rdp->qlen_lazy, rdp->qlen);
33   __call_rcu_core(rsp, rdp, head, flags);
34   local_irq_restore(flags);
35 }
Line 8 verifies that the RCU callback is properly aligned,
which will be important when the pointer's low-order bits are
used to mark “lazy” callbacks.
Line 9 informs the debug-objects subsystem that the callback
is being queued, and lines 10 and 11 initialize the
rcu_head
structure.
Line 12 disables interrupts and line 34 restores them.
Line 13 obtains a pointer to this CPU's rcu_data
structure.
If line 14 sees that callback registration has been disabled due to
this CPU being offline,
lines 15-17 issue a warning, restore interrupts, and return to the
caller, respectively.
Line 19 increments the count of callbacks for this
rcu_data
structure.
If line 20 sees that this is a lazy callback
(e.g., one registered via kfree_rcu()
),
line 21 increments the count of lazy callbacks for this
rcu_data
structure,
otherwise line 23 informs RCU_FAST_NO_HZ
of a new
non-lazy callback registered on this CPU.
Line 24 ensures that the callback counts are updated before
the new callback is queued (for the benefit of _rcu_barrier()
),
and it also ensures that any prior updates to
RCU-protected data structures carried out by this CPU
are seen by all CPUs as happening
prior to any subsequent grace-period processing.
Lines 25 and 26 enqueue the rcu_head
structure to the tail of this CPU's callback list.
Note that the new callback is not yet associated with any specific
RCU grace period.
Lines 28-32 trace the new callback, with lines 29
and 30 tracing the kfree_rcu()
case and line 32 tracing
the default invoke-a-function case.
Finally, line 33 invokes __call_rcu_core()
to handle
any required special grace-period processing.
Quick Quiz 4:
Given that it just registered a new RCU callback, why wouldn't
__call_rcu()
always need to initiate
grace-period processing?
Answer
Callbacks are advanced as grace periods progress by the
cpu_has_callbacks_ready_to_invoke()
,
rcu_report_qs_rdp()
, and
__rcu_process_gp_end()
functions.
 1 static int
 2 cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
 3 {
 4   return &rdp->nxtlist != rdp->nxttail[RCU_DONE_TAIL];
 5 }
 6
 7 static void
 8 rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
 9 {
10   unsigned long flags;
11   unsigned long mask;
12   struct rcu_node *rnp;
13
14   rnp = rdp->mynode;
15   raw_spin_lock_irqsave(&rnp->lock, flags);
16   if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
17     rdp->passed_quiesce = 0;
18     raw_spin_unlock_irqrestore(&rnp->lock, flags);
19     return;
20   }
21   mask = rdp->grpmask;
22   if ((rnp->qsmask & mask) == 0) {
23     raw_spin_unlock_irqrestore(&rnp->lock, flags);
24   } else {
25     rdp->qs_pending = 0;
26     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
27     rcu_report_qs_rnp(mask, rsp, rnp, flags);
28   }
29 }
30
31 static void
32 __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
33 {
34   if (rdp->completed != rnp->completed) {
35     rdp->nxttail[RCU_DONE_TAIL] = rdp->nxttail[RCU_WAIT_TAIL];
36     rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_READY_TAIL];
37     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
38     rdp->completed = rnp->completed;
39     trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuend");
40     if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
41       rdp->gpnum = rdp->completed;
42     if ((rnp->qsmask & rdp->grpmask) == 0)
43       rdp->qs_pending = 0;
44   }
45 }
The cpu_has_callbacks_ready_to_invoke()
function
is shown on lines 1-5.
It simply checks to see whether the RCU_DONE_TAIL
pointer
references the callback list header (the ->nxtlist
field).
If not, then there are callbacks in the first segment of the list,
and these callbacks are ready to invoke.
Quick Quiz 5:
Hey!!!
The cpu_has_callbacks_ready_to_invoke()
function does not
actually advance callbacks.
What is the deal here?
Answer
The rcu_report_qs_rdp()
function shown on
lines 7-29 is mostly dealt with elsewhere.
The key line for callback advancement is line 26,
which is invoked if the current CPU has passed through a quiescent
state that counts against the current grace period (line 16)
when the RCU core still needs a quiescent state from this CPU
(line 22).
It is therefore safe for the CPU to associate all its remaining
unassociated callbacks with the next grace period, which line 26
does.
Quick Quiz 6:
But what happens if some other CPU reports a quiescent
state on behalf of this CPU, thus causing the current grace
period to end, and possibly causing this CPU to report a
quiescent state against the wrong CPU?
And invalidating the callback-advancement optimization, for that matter?
Answer
The __rcu_process_gp_end()
function on lines 31-45
advances callbacks when the CPU detects the end of a grace period.
As noted elsewhere, line 34 checks to see if the current CPU is
not yet aware that the grace period has ended, and, if so,
line 35 advances the RCU_DONE_TAIL
pointer to
mark all callbacks waiting for the just-completed grace period
as done,
line 36 advances the RCU_WAIT_TAIL
pointer to mark
all callbacks associated with the next (possibly already started)
grace period to wait for this same grace period,
and finally line 37 advances RCU_NEXT_READY_TAIL
to associate all remaining callbacks with the grace period after
that.
The rest of this function is discussed elsewhere.
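The pointer shuffle on lines 35-37 deserves emphasis: no callback is moved, only the segment boundaries are. In terms of the simplified cb_list sketch introduced earlier (invented names, not kernel code), the end-of-grace-period advance looks like this:

  struct cb { struct cb *next; void (*func)(struct cb *); };
  enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };
  struct cb_list { struct cb *head; struct cb **tail[NSEGS]; };

  /* Model of the ->nxttail[] shuffle on lines 35-37 of __rcu_process_gp_end():
   * callbacks that were waiting are now done, callbacks that were ready for
   * the next grace period are now waiting, and so on. */
  void cb_advance_on_gp_end(struct cb_list *l)
  {
    l->tail[DONE_TAIL]       = l->tail[WAIT_TAIL];
    l->tail[WAIT_TAIL]       = l->tail[NEXT_READY_TAIL];
    l->tail[NEXT_READY_TAIL] = l->tail[NEXT_TAIL];
  }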
The invoke_rcu_callbacks()
,
__rcu_reclaim()
,
rcu_do_batch()
, and
rcu_preempt_do_callbacks()
functions invoke RCU callbacks
whose grace periods have ended.
 1 static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
 2 {
 3   if (unlikely(!ACCESS_ONCE(rcu_scheduler_fully_active)))
 4     return;
 5   if (likely(!rsp->boost)) {
 6     rcu_do_batch(rsp, rdp);
 7     return;
 8   }
 9   invoke_rcu_callbacks_kthread();
10 }
11
12 static inline bool __rcu_reclaim(char *rn, struct rcu_head *head)
13 {
14   unsigned long offset = (unsigned long)head->func;
15
16   if (__is_kfree_rcu_offset(offset)) {
17     RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset));
18     kfree((void *)head - offset);
19     return 1;
20   } else {
21     RCU_TRACE(trace_rcu_invoke_callback(rn, head));
22     head->func(head);
23     return 0;
24   }
25 }
26
27 static void rcu_preempt_do_callbacks(void)
28 {
29   rcu_do_batch(&rcu_preempt_state, &__get_cpu_var(rcu_preempt_data));
30 }
The invoke_rcu_callbacks()
function is shown on
lines 1-10, and it causes the callback-invocation function
rcu_do_batch()
to run.
This must be done differently depending on the context.
Line 3 checks to see if the scheduler has spawned the first non-idle
task, and if not, line 4 returns.
Line 5 checks to see if this flavor of RCU supports priority boosting,
and if not, line 6 invokes rcu_do_batch()
directly
and line 7 returns.
Otherwise, this flavor of RCU does support priority boosting, which
means that callback invocation must be done from a kthread, so
line 9 invokes invoke_rcu_callbacks_kthread()
to wake that kthread up.
The __rcu_reclaim()
function shown on lines 12-25
invokes the specified callback based on its type.
Line 16 invokes __is_kfree_rcu_offset()
to determine
whether the callback was registered by kfree_rcu()
, and if so,
line 17 traces this fact, line 18 invokes kfree()
on the callback, and line 19 returns 1 to indicate that this was
a lazy callback.
Otherwise, line 21 traces the fact that we are invoking a normal
callback, line 22 invokes it, and line 23 returns 0 to
indicate that this was a non-lazy callback.
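The kfree((void *)head - offset) on line 18 relies on a trick: kfree_rcu() does not pass a real function pointer, but instead passes the offset of the rcu_head within its enclosing structure, cast to a function-pointer type. Any "function pointer" whose numeric value is sufficiently small is therefore known to be such an offset rather than a real function address (which is what __is_kfree_rcu_offset() checks), and subtracting that offset from the rcu_head's address recovers the start of the enclosing object so that it can be handed to kfree(). The following user-space model (invented names, plain pointer arithmetic) illustrates the recovery step:

  #include <stddef.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct rcu_head_model {
    struct rcu_head_model *next;
    void (*func)(struct rcu_head_model *);
  };

  struct foo {
    int a;
    int b;
    struct rcu_head_model rh;   /* embedded callback structure */
  };

  int main(void)
  {
    struct foo *p = malloc(sizeof(*p));
    unsigned long offset = offsetof(struct foo, rh);
    struct rcu_head_model *head = &p->rh;

    /* kfree_rcu() would smuggle "offset" in place of a callback function;
     * at invocation time the enclosing object is recovered by subtracting
     * that offset from the rcu_head's address, just as on line 18 of
     * __rcu_reclaim(). */
    struct foo *recovered = (struct foo *)((char *)head - offset);

    printf("recovered == p? %s\n", recovered == p ? "yes" : "no");
    free(recovered);
    return 0;
  }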
Quick Quiz 7:
Why do the tracing before invoking the callback?
Answer
The rcu_preempt_do_callbacks()
function invokes
rcu_do_batch()
on the RCU-preempt flavor of RCU.
If RCU-preempt is not configured in the kernel, for example, for
CONFIG_PREEMPT=n
kernel builds, then
rcu_preempt_do_callbacks()
is an empty function.
 1 static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp)
 2 {
 3   unsigned long flags;
 4   struct rcu_head *next, *list, **tail;
 5   int bl, count, count_lazy;
 6
 7   if (!cpu_has_callbacks_ready_to_invoke(rdp)) {
 8     trace_rcu_batch_start(rsp->name, rdp->qlen_lazy, rdp->qlen, 0);
 9     trace_rcu_batch_end(rsp->name, 0, !!ACCESS_ONCE(rdp->nxtlist),
10                         need_resched(), is_idle_task(current),
11                         rcu_is_callbacks_kthread());
12     return;
13   }
14   local_irq_save(flags);
15   WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
16   bl = rdp->blimit;
17   trace_rcu_batch_start(rsp->name, rdp->qlen_lazy, rdp->qlen, bl);
18   list = rdp->nxtlist;
19   rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL];
20   *rdp->nxttail[RCU_DONE_TAIL] = NULL;
21   tail = rdp->nxttail[RCU_DONE_TAIL];
22   for (count = RCU_NEXT_SIZE - 1; count >= 0; count--)
23     if (rdp->nxttail[count] == rdp->nxttail[RCU_DONE_TAIL])
24       rdp->nxttail[count] = &rdp->nxtlist;
25   local_irq_restore(flags);
26   count = count_lazy = 0;
27   while (list) {
28     next = list->next;
29     prefetch(next);
30     debug_rcu_head_unqueue(list);
31     if (__rcu_reclaim(rsp->name, list))
32       count_lazy++;
33     list = next;
34     if (++count >= bl &&
35         (need_resched() ||
36          (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
37       break;
38   }
39
40   local_irq_save(flags);
41   trace_rcu_batch_end(rsp->name, count, !!list, need_resched(),
42                       is_idle_task(current),
43                       rcu_is_callbacks_kthread());
44   if (list != NULL) {
45     *tail = rdp->nxtlist;
46     rdp->nxtlist = list;
47     for (count = 0; count < RCU_NEXT_SIZE; count++)
48       if (&rdp->nxtlist == rdp->nxttail[count])
49         rdp->nxttail[count] = tail;
50       else
51         break;
52   }
53   smp_mb();
54   rdp->qlen_lazy -= count_lazy;
55   rdp->qlen -= count;
56   rdp->n_cbs_invoked += count;
57   if (rdp->blimit == LONG_MAX && rdp->qlen <= qlowmark)
58     rdp->blimit = blimit;
59   if (rdp->qlen == 0 && rdp->qlen_last_fqs_check != 0) {
60     rdp->qlen_last_fqs_check = 0;
61     rdp->n_force_qs_snap = rsp->n_force_qs;
62   } else if (rdp->qlen < rdp->qlen_last_fqs_check - qhimark)
63     rdp->qlen_last_fqs_check = rdp->qlen;
64   WARN_ON_ONCE((rdp->nxtlist == NULL) != (rdp->qlen == 0));
65   local_irq_restore(flags);
66   if (cpu_has_callbacks_ready_to_invoke(rdp))
67     invoke_rcu_core();
68 }
The rcu_do_batch()
function gathers up the RCU callbacks
whose grace periods have ended and invokes __rcu_reclaim()
on each of them.
This straightforward job is complicated by the need to avoid
high-latency bursts of callback processing where possible,
while still avoiding out-of-memory conditions even if high latencies must
be incurred to do so.
Line 7 checks to see if there are callbacks whose grace period has ended, and if not, lines 8-11 trace an empty burst of callback invocation and line 12 returns.
Otherwise, execution continues at line 14, which disables
interrupts so that the callback lists can be manipulated safely.
Line 15 complains if rcu_do_batch()
finds itself
running on an offline CPU.
Line 16 picks up the current callback-invocation batch limit,
line 17 traces the start of callback invocation,
and lines 18-21 extract the callbacks ready for invocation onto
a local list named, strangely enough, list
.
Lines 22-24 adjust any of the callback-list pointers that were
pointing at the last ready-to-invoke callback so that they instead
reference the list header.
Line 25 restores interrupts, as the function will be invoking
callbacks from its local list.
Line 26 initializes both count
and count_lazy
.
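In terms of the cb_list sketch used earlier (invented names, not kernel code), the extraction carried out by lines 18-24 can be modeled as follows: detach the ready-to-invoke segment onto a local list, then point any tail pointers that coincided with the old "done" boundary back at the list header.

  struct cb { struct cb *next; void (*func)(struct cb *); };
  enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };
  struct cb_list { struct cb *head; struct cb **tail[NSEGS]; };

  /* Detach the ready-to-invoke callbacks onto a local list, as lines 18-21
   * of rcu_do_batch() do, and return the head of that local list.  The old
   * DONE boundary is returned through *donetailp so that the caller can
   * later requeue any leftovers.  The caller must ensure that the DONE
   * segment is non-empty, as rcu_do_batch() does on line 7. */
  struct cb *cb_extract_done(struct cb_list *l, struct cb ***donetailp)
  {
    struct cb *list = l->head;
    int i;

    l->head = *l->tail[DONE_TAIL];  /* rest of the list becomes the head */
    *l->tail[DONE_TAIL] = NULL;     /* NULL-terminate the local list */
    *donetailp = l->tail[DONE_TAIL];

    /* Any segment that ended at the old DONE boundary is now empty, so
     * point its tail back at the list header (lines 22-24). */
    for (i = NSEGS - 1; i >= 0; i--)
      if (l->tail[i] == *donetailp)
        l->tail[i] = &l->head;

    return list;
  }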
Each pass through the loop spanning lines 27-38
invokes one callback.
Line 28 obtains a pointer to the following callback,
line 29 does a possibly-useless prefetch operation,
line 30 tells the debug-objects subsystem that the callback
has been invoked,
and line 31 calls __rcu_reclaim()
to invoke the
callback.
If the callback was lazy, line 32 counts it.
Line 33 advances to the next callback, and lines 34-36 check
to see if we need to exit the loop due to having exceeded the batch
limit when some non-idle task wants to use this CPU, and if so, line 37
exits the loop.
Quick Quiz 8:
Why is the call to debug_rcu_head_unqueue()
before the callback invocation?
Won't it miss some bugs that way?
After all, some other CPU might (incorrectly) invoke call_rcu()
on this same callback just after this CPU finishes the
debug_rcu_head_unqueue()
.
Answer
Quick Quiz 9:
What would happen if the current CPU went offline while
rcu_do_batch()
was invoking that CPU's callbacks?
Answer
Line 40 disables interrupts once more and lines 41-43
trace the end of this batch of callback invocation.
Line 44 checks to see if there are callbacks left on the
local list, and, if so, lines 45 and 46 requeue them
and lines 47-51 adjust any of the callback pointers that were
referencing the list header to instead reference the ->next
pointer of the last callback that was re-inserted from the local list.
Line 53 ensures that the callback list has been fully adjusted
before the statistics are updated (again for the benefit of
_rcu_barrier()
), and
lines 54-56 adjust the statistics.
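The requeuing performed by lines 44-52 is the mirror image of the earlier extraction: the leftover callbacks are the oldest ones, so they are spliced back onto the head of the list, and any tail pointers still aimed at the list header are redirected to the end of the re-inserted portion. A model in terms of the same cb_list sketch (invented names, not kernel code):

  struct cb { struct cb *next; void (*func)(struct cb *); };
  enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };
  struct cb_list { struct cb *head; struct cb **tail[NSEGS]; };

  /* Splice not-yet-invoked callbacks back onto the head of the list, as
   * lines 44-52 of rcu_do_batch() do.  "list" is the first leftover
   * callback and "tail" points to the ->next field of the last callback
   * on the local list (the value saved before extraction). */
  void cb_requeue_leftovers(struct cb_list *l, struct cb *list, struct cb **tail)
  {
    int i;

    *tail = l->head;   /* append the existing list after the leftovers */
    l->head = list;    /* leftovers go back to the front */

    /* Segments that were empty (tail pointing at the header) now end at
     * the last re-inserted callback; the first non-empty segment and all
     * later ones are unaffected, so stop there (lines 47-51). */
    for (i = 0; i < NSEGS; i++) {
      if (l->tail[i] == &l->head)
        l->tail[i] = tail;
      else
        break;
    }
  }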
Quick Quiz 10:
How can we be sure that removing callbacks and then reinserting
them won't mix the order of some of the callbacks, thus invalidating
the rcu_barrier()
function's assumptions?
Answer
Line 57 checks to see if callbacks have drained to below
the low-water mark, and if so, line 58 resets the batch limit
back down to the original latency-friendly value.
Line 59 checks to see if the callbacks have drained completely
after __call_rcu()
took action to deal with excessive
numbers of callbacks, and if so, lines 60 and 61 reset
the __call_rcu()
state.
Otherwise, line 62 checks to see if a large number of callbacks
has drained since the last check, and if so, resets the snapshot
to the current (lower) number of callbacks, thus allowing a similar
increase to once again trigger emergency callback-reduction action.
Line 64 complains if this CPU is supposed to be offline but still
has callbacks.
Line 65 re-enables interrupts, and if line 66 determines that
there are more callbacks to be invoked, line 67 arranges for them
to be invoked at a later time.
The rcu_barrier()
function waits for all prior RCU callbacks to be invoked.
This is important when unloading modules that use call_rcu()
,
because otherwise one of the module's RCU callbacks might be invoked after
the module has been unloaded.
This callback would be fatally disappointed to find that its callback
function was no longer loaded in the kernel.
The rcu_barrier()
function is used to avoid this
problem.
The module first takes steps to ensure that it will not invoke
call_rcu()
anymore, then invokes rcu_barrier()
.
Once rcu_barrier()
returns, the module is guaranteed that
there will be no more RCU callbacks invoking its functions, and may
thus safely unload itself.
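For example, a module's exit function might follow the pattern sketched below. The example_unregister_hooks() function is hypothetical and stands in for whatever module-specific step guarantees that no new call_rcu() invocations will be made on the module's behalf.

  #include <linux/module.h>
  #include <linux/rcupdate.h>

  /* Hypothetical: unregister whatever notifiers, hooks, or files cause
   * this module to post RCU callbacks. */
  static void example_unregister_hooks(void)
  {
    /* module-specific teardown */
  }

  static void __exit example_exit(void)
  {
    example_unregister_hooks();  /* step 1: no new call_rcu() invocations */
    rcu_barrier();               /* step 2: wait for outstanding callbacks */
    /* step 3: at this point no RCU callback can reference this module,
     * so it is safe for the module's text and data to be unloaded. */
  }
  module_exit(example_exit);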
The underlying implementation of rcu_barrier()
is provided by the rcu_barrier_callback()
,
rcu_barrier_func()
, and
_rcu_barrier()
functions.
We first discuss rcu_barrier_callback()
and
rcu_barrier_func()
along with the data that they use:
 1 static void rcu_barrier_callback(struct rcu_head *rhp)
 2 {
 3   struct rcu_data *rdp = container_of(rhp, struct rcu_data, barrier_head);
 4   struct rcu_state *rsp = rdp->rsp;
 5
 6   if (atomic_dec_and_test(&rsp->barrier_cpu_count)) {
 7     _rcu_barrier_trace(rsp, "LastCB", -1, rsp->n_barrier_done);
 8     complete(&rsp->barrier_completion);
 9   } else {
10     _rcu_barrier_trace(rsp, "CB", -1, rsp->n_barrier_done);
11   }
12 }
13
14 static void rcu_barrier_func(void *type)
15 {
16   struct rcu_state *rsp = type;
17   struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
18
19   _rcu_barrier_trace(rsp, "IRQ", -1, rsp->n_barrier_done);
20   atomic_inc(&rsp->barrier_cpu_count);
21   rsp->call(&rdp->barrier_head, rcu_barrier_callback);
22 }
The ->barrier_head
field of each CPU's rcu_data
structure holds the callback that is posted to
each CPU that currently has callbacks queued.
A count of the number of CPUs yet to respond is maintained
in ->barrier_cpu_count
in the rcu_state
structure.
The ->barrier_mutex
field (also in the
rcu_state
structure) permits only one rcu_barrier()
operation at a time per RCU flavor,
the ->barrier_completion
field (yet again in the
rcu_state
structure)
is used to signal _rcu_barrier()
once all
pre-existing callbacks have been invoked, and finally
the n_barrier_done
field (still in the rcu_state
structure) allows concurrent barrier operations to piggyback off of each
others' work.
The rcu_barrier_callback()
function shown
on lines 1-12 is the
callback function used by the per-CPU callbacks registered by
_rcu_barrier()
.
Line 6 atomically decrements the
->barrier_cpu_count
, and if the result is zero
(in other words, this is the last of the callbacks),
line 8 wakes up _rcu_barrier()
.
In either case, line 7 or 10 does tracing.
The rcu_barrier_func()
function shown on lines 14-22
executes on each callback-bearing CPU
and registers an RCU callback of the specified flavor on that CPU.
Line 19 does tracing,
line 20 increments the count of outstanding callbacks,
and line 21 registers the desired flavor of RCU callback.
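Taken together, these two functions implement a simple counting protocol, sketched below in user-space form with invented names (and ignoring the real code's grace periods and cross-CPU function calls): the count starts at one, is incremented once per posted callback, and is decremented once per invoked callback plus once by the initiator, with only the final transition to zero signaling completion. Starting at one is what prevents the count from reaching zero before all of the callbacks have been posted, a point revisited in Quick Quiz 13.

  #include <stdatomic.h>
  #include <stdio.h>

  #define NCPUS 4

  static atomic_int barrier_count;
  static int completed;   /* stands in for the struct completion */

  /* Model of rcu_barrier_callback(): one invocation per posted callback. */
  static void barrier_callback(void)
  {
    if (atomic_fetch_sub(&barrier_count, 1) == 1)
      completed = 1;   /* last decrement: signal the waiter */
  }

  int main(void)
  {
    int cpu;

    /* Model of _rcu_barrier(): start at 1 so that the count cannot hit
     * zero until the initiator's own final decrement. */
    atomic_store(&barrier_count, 1);

    for (cpu = 0; cpu < NCPUS; cpu++)
      atomic_fetch_add(&barrier_count, 1);  /* rcu_barrier_func() */

    /* In the kernel the callbacks run later, after a grace period;
     * here we simply invoke them in a loop. */
    for (cpu = 0; cpu < NCPUS; cpu++)
      barrier_callback();

    /* The initiator drops its own reference last. */
    barrier_callback();

    printf("completed = %d, count = %d\n",
           completed, atomic_load(&barrier_count));
    return 0;
  }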
Quick Quiz 11:
What happens if this task is migrated between the
time that rcu_barrier_func()
executes line 17
(where it gets a pointer to the per-CPU rcu_head
structure)
and the time it executes line 21 (where it registers the callback)?
Answer
 1 static void _rcu_barrier(struct rcu_state *rsp)
 2 {
 3   int cpu;
 4   struct rcu_data *rdp;
 5   unsigned long snap = ACCESS_ONCE(rsp->n_barrier_done);
 6   unsigned long snap_done;
 7
 8   _rcu_barrier_trace(rsp, "Begin", -1, snap);
 9   mutex_lock(&rsp->barrier_mutex);
10   smp_mb();
11   snap_done = ACCESS_ONCE(rsp->n_barrier_done);
12   _rcu_barrier_trace(rsp, "Check", -1, snap_done);
13   if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) {
14     _rcu_barrier_trace(rsp, "EarlyExit", -1, snap_done);
15     smp_mb();
16     mutex_unlock(&rsp->barrier_mutex);
17     return;
18   }
19   ACCESS_ONCE(rsp->n_barrier_done)++;
20   WARN_ON_ONCE((rsp->n_barrier_done & 0x1) != 1);
21   _rcu_barrier_trace(rsp, "Inc1", -1, rsp->n_barrier_done);
22   smp_mb();
23   init_completion(&rsp->barrier_completion);
24   atomic_set(&rsp->barrier_cpu_count, 1);
25   get_online_cpus();
26   for_each_online_cpu(cpu) {
27     rdp = per_cpu_ptr(rsp->rda, cpu);
28     if (ACCESS_ONCE(rdp->qlen)) {
29       _rcu_barrier_trace(rsp, "OnlineQ", cpu,
30                          rsp->n_barrier_done);
31       smp_call_function_single(cpu, rcu_barrier_func, rsp, 1);
32     } else {
33       _rcu_barrier_trace(rsp, "OnlineNQ", cpu,
34                          rsp->n_barrier_done);
35     }
36   }
37   put_online_cpus();
38   if (atomic_dec_and_test(&rsp->barrier_cpu_count))
39     complete(&rsp->barrier_completion);
40   smp_mb();
41   ACCESS_ONCE(rsp->n_barrier_done)++;
42   WARN_ON_ONCE((rsp->n_barrier_done & 0x1) != 0);
43   _rcu_barrier_trace(rsp, "Inc2", -1, rsp->n_barrier_done);
44   wait_for_completion(&rsp->barrier_completion);
45   mutex_unlock(&rsp->barrier_mutex);
46 }
The _rcu_barrier()
function shown above
causes rcu_barrier_func()
to be executed on each online
CPU that has callbacks and then waits for all of the resulting callbacks
to be invoked.
Line 8 does tracing and line 9 acquires the mutex.
Quick Quiz 12:
What prevents massive lock contention on
rcu_barrier_mutex
if there are lots of concurrent
calls to the same flavor of rcu_barrier()
?
Answer
Line 10 ensures that any actions taken by this CPU
prior to the call to _rcu_barrier()
are seen to happen
before any checking of callback queue lengths.
Line 11 takes a snapshot of the n_barrier_done
field to enable piggybacking off of other concurrent _rcu_barrier()
instances,
and line 12 does tracing.
Line 13 checks to see if our lock acquisition took us through
an entire concurrent _rcu_barrier()
duration, with the
requirement being that we waited through two successive even values
of the counter.
(So we round the old snapshot up to an even value, and if the
new value is at least two greater than that, someone else did our work
for us.)
If some other _rcu_barrier()
did indeed do our work for
us, then line 15 ensures that any subsequent work is seen by all
CPUs to come after the barrier operation, line 16 releases the
mutex, and line 17 returns to the caller.
Otherwise, we really do need to execute a barrier operation starting
on line 19, which increments the ->n_barrier_done
field, which should result in an odd number (or line 20 will
complain).
Line 21 does tracing, and line 22 ensures that the preceding
increment is seen as happening before any of the following
_rcu_barrier()
machinery.
Lines 23 and 24 do initialization
and line 25 prevents any CPUs from coming online or going offline.
Each pass through the loop spanning lines 26-36 deals with
one online CPU.
Line 27 picks up a pointer to the current CPU's rcu_data
structure.
If line 28 finds that this CPU has callbacks queued, then
lines 29 and 30 do tracing and line 31 causes
rcu_barrier_func()
to be invoked on the current CPU.
Otherwise, lines 33 and 34 do tracing.
Once all online CPUs have been dealt with, line 37
allows CPU-hotplug operations to resume.
Line 38 atomically decrements ->barrier_cpu_count
,
and if the result is zero, line 39 invokes complete()
to record the fact that all CPUs' callbacks have now been invoked.
Line 40 ensures that the _rcu_barrier()
mechanism
is seen as having completed before line 41 increments
->n_barrier_done
(which must now be an even number,
or line 42 will complain).
Line 43 does tracing and line 44 waits for all callbacks
to be invoked.
Finally, line 45 releases the mutex.
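The even/odd encoding of ->n_barrier_done is easiest to see with concrete numbers. The following user-space sketch (invented names, plain arithmetic, ignoring the wraparound that ULONG_CMP_GE() handles) reproduces the test on line 13: round the snapshot up to an even value, and if the counter has since advanced by at least two beyond that, a complete barrier operation began and ended after the snapshot, so its work covers ours.

  #include <stdio.h>

  /* Reproduce the check on line 13 of _rcu_barrier():
   * ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2), ignoring wraparound. */
  static int barrier_work_already_done(unsigned long snap, unsigned long snap_done)
  {
    return snap_done >= ((snap + 1) & ~0x1UL) + 2;
  }

  int main(void)
  {
    /* snap == 2: no barrier in flight at snapshot time (even value).
     * snap_done == 4: the counter went 2 -> 3 -> 4, so a complete barrier
     * started and ended after our snapshot and did our work for us. */
    printf("%d\n", barrier_work_already_done(2, 4));  /* prints 1 */

    /* snap == 3: a barrier was already in flight (odd value), so it might
     * have missed callbacks posted before our call; we must wait for a
     * later complete barrier, that is, snap_done >= 6. */
    printf("%d\n", barrier_work_already_done(3, 4));  /* prints 0 */
    printf("%d\n", barrier_work_already_done(3, 6));  /* prints 1 */
    return 0;
  }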
Quick Quiz 13:
Why not initialize ->barrier_cpu_count
to zero,
given that we clearly have not messed with any of the CPUs yet?
Answer
Quick Quiz 14:
Suppose that an offline CPU has callbacks queued.
What should _rcu_barrier()
do about that?
Answer
When a CPU goes offline, its callbacks are moved to some online CPU.
The rcu_send_cbs_to_orphanage()
and
rcu_adopt_orphan_cbs()
functions shown
below handle this callback movement.
 1 static void
 2 rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
 3                           struct rcu_node *rnp, struct rcu_data *rdp)
 4 {
 5   if (rdp->nxtlist != NULL) {
 6     rsp->qlen_lazy += rdp->qlen_lazy;
 7     rsp->qlen += rdp->qlen;
 8     rdp->n_cbs_orphaned += rdp->qlen;
 9     rdp->qlen_lazy = 0;
10     ACCESS_ONCE(rdp->qlen) = 0;
11   }
12   if (*rdp->nxttail[RCU_DONE_TAIL] != NULL) {
13     *rsp->orphan_nxttail = *rdp->nxttail[RCU_DONE_TAIL];
14     rsp->orphan_nxttail = rdp->nxttail[RCU_NEXT_TAIL];
15     *rdp->nxttail[RCU_DONE_TAIL] = NULL;
16   }
17   if (rdp->nxtlist != NULL) {
18     *rsp->orphan_donetail = rdp->nxtlist;
19     rsp->orphan_donetail = rdp->nxttail[RCU_DONE_TAIL];
20   }
21   init_callback_list(rdp);
22 }
The rcu_send_cbs_to_orphanage()
function
moves RCU callbacks from the newly offlined CPU to the “orphanage”.
This function is called from the CPU_DEAD
notifier,
after the CPU has gone completely offline.
Line 5 checks to see if the newly offlined CPU has
callbacks, and if so, lines 6-10 adjust the counts to reflect
the movement of the callbacks to the orphanage.
A memory barrier is not required because this function's caller
must hold the ->onofflock
, which is also held
across rcu_adopt_orphan_cbs()
, which adopts the
callbacks.
Line 12 checks to see if the newly offlined CPU has
any callbacks that still need a grace period, and if so, lines 13-15
move those callbacks to the rcu_state
structure's
->orphan_nxttail
list.
The newly offlined CPU's rcu_data
structure's
->nxttail[]
array is unchanged because it is
reinitialized later.
Line 17 checks to see whether the newly offlined CPU has
any callbacks that are ready to invoke, and if so, lines 18
and 19 move those callbacks to the rcu_state
structure's
->orphan_donelist
list.
Line 21 then re-initializes the newly offlined CPU's callback list.
Quick Quiz 15:
But why do all of the outgoing CPU's not-ready-to-invoke callbacks need
to go through another full grace period?
We have all the segment pointers, so why not enqueue each segment
of the outgoing CPU's callbacks to follow the corresponding
segment of the online CPU's callbacks?
Answer
Next we look at callback adoption, which runs either in
a CPU_DEAD
notifier or is called by _rcu_barrier()
:
 1 static void rcu_adopt_orphan_cbs(struct rcu_state *rsp)
 2 {
 3   int i;
 4   struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
 5
 6   rdp->qlen_lazy += rsp->qlen_lazy;
 7   rdp->qlen += rsp->qlen;
 8   rdp->n_cbs_adopted += rsp->qlen;
 9   if (rsp->qlen_lazy != rsp->qlen)
10     rcu_idle_count_callbacks_posted();
11   rsp->qlen_lazy = 0;
12   rsp->qlen = 0;
13   if (rsp->orphan_donelist != NULL) {
14     *rsp->orphan_donetail = *rdp->nxttail[RCU_DONE_TAIL];
15     *rdp->nxttail[RCU_DONE_TAIL] = rsp->orphan_donelist;
16     for (i = RCU_NEXT_SIZE - 1; i >= RCU_DONE_TAIL; i--)
17       if (rdp->nxttail[i] == rdp->nxttail[RCU_DONE_TAIL])
18         rdp->nxttail[i] = rsp->orphan_donetail;
19     rsp->orphan_donelist = NULL;
20     rsp->orphan_donetail = &rsp->orphan_donelist;
21   }
22   if (rsp->orphan_nxtlist != NULL) {
23     *rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxtlist;
24     rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxttail;
25     rsp->orphan_nxtlist = NULL;
26     rsp->orphan_nxttail = &rsp->orphan_nxtlist;
27   }
28 }
Lines 6-12 adjust the callback counts, with line 10
noting the arrival of any non-lazy callbacks for the benefit of
RCU_FAST_NO_HZ
.
Line 13 checks to see if the rcu_state
structure has callbacks ready to invoke, and if so, lines 14-20
adopt them.
Lines 14 and 15 splice the new callbacks at the end of
this CPU's list of callbacks that are ready to invoke.
Lines 16-18 update the ->nxttail[]
pointers
for any empty segments of the list to reference the new last
ready-to-invoke callback.
Lines 19 and 20 then initialize the rcu_state
structure's ->orphan_donelist
to empty.
Line 22 checks to see if the rcu_state
structure has callbacks that need to wait for a grace period,
and if so, lines 23-26 adopt them.
Lines 23 and 24 append the new callbacks to the end
of this CPU's list, and lines 25 and 26 initialize
the rcu_state
structure's ->orphan_nxtlist
to empty.
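The fix-up on lines 16-18 is the same trick used by rcu_do_batch(): after the adopted ready-to-invoke callbacks are spliced onto the end of this CPU's “done” segment, any tail pointers that coincided with the old “done” boundary must be advanced to the new end of that segment. In terms of the cb_list sketch (invented names, not kernel code):

  struct cb { struct cb *next; void (*func)(struct cb *); };
  enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NSEGS };
  struct cb_list { struct cb *head; struct cb **tail[NSEGS]; };

  /* Append an orphaned list of ready-to-invoke callbacks (running from
   * "donelist" to the ->next field "donetail") to the end of this list's
   * DONE segment, as lines 14-18 of rcu_adopt_orphan_cbs() do. */
  void cb_adopt_done(struct cb_list *l, struct cb *donelist, struct cb **donetail)
  {
    struct cb **old_done = l->tail[DONE_TAIL];
    int i;

    *donetail = *old_done;   /* rest of this list follows the adoptees */
    *old_done = donelist;    /* splice adoptees in at the old DONE boundary */

    /* Every segment whose boundary coincided with the old DONE boundary
     * now ends at the last adopted callback instead. */
    for (i = NSEGS - 1; i >= DONE_TAIL; i--)
      if (l->tail[i] == old_done)
        l->tail[i] = donetail;
  }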
This completes the discussion of RCU callback processing, including the special handling required by rcu_barrier()
and CPU-hotplug operations.
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: Suppose we (incorrectly) associated CB 4 with the next grace period number 3. Exactly how could problems result?
Answer: The following sequence of events would lead to failure, where CPU 0 is the one corresponding to CB 4:
1. Grace period number 2 ends and grace period number 3 begins, but CPU 0 has not yet noticed either event.
2. Some other CPU enters an RCU read-side critical section and uses rcu_dereference()
to obtain a pointer to
data element A. Because this critical section started after grace period 3 did, grace period 3 is under no obligation to wait for it.
3. CPU 0 removes data element A from its enclosing data structure and invokes call_rcu()
to schedule freeing
of data element A after some later grace period. This is CB 4, which we have (incorrectly) associated with grace period 3.
4. Grace period 3 ends and CB 4 is invoked, freeing data element A while the other CPU is still referencing it.
The moral of this story is that only the CPU starting a given grace period knows when it really starts, and only the CPU ending a given grace period knows when it really ends. Other CPUs therefore need to avoid making unfounded assumptions about the beginnings and endings of grace periods.
That said, a CPU could learn the current grace-period state by
acquiring the lock on the root rcu_node
structure.
Unfortunately, this lock must be acquired quite sparingly in order to
avoid massive lock contention on large systems.
Finally, had the callback arrived before the CPU passed through its
quiescent state, the CPU would be guaranteed that the current grace
period was not yet over, and would therefore have been able to add
this new callback to the group waiting for grace period number 4.
Quick Quiz 2: But we assigned CB 4 to the next grace period. Why is this safe?
Answer: It is safe because this CPU just now noticed the new grace period. It therefore cannot possibly have announced a quiescent state yet, and the new grace period therefore cannot possibly have ended, which in turn means that the next grace period cannot possibly have started.
Quick Quiz 3: Why not map everything across?
Answer: Grace periods are not synchronized across CPUs.
Quick Quiz 4:
Given that it just registered a new RCU callback, why wouldn't
__call_rcu()
always need to initiate
grace-period processing?
Answer: Here are a couple of reasons: (1) A grace period might already be in progress, in which case it is necessary to wait until it finishes before starting a new one, and (2) It is useful to wait a bit to start a grace period even if there isn't one in progress because that results in more callbacks being handled by a given grace period, reducing the per-update overhead of grace-period detection.
Quick Quiz 5:
Hey!!!
The cpu_has_callbacks_ready_to_invoke()
function does not
actually advance callbacks.
What is the deal here?
Answer: Glad to see that you are paying attention! However, this is a miscellaneous function, and therefore most certainly does not deserve its own section. So it landed in this section.
Quick Quiz 6: But what happens if some other CPU reports a quiescent state on behalf of this CPU, thus causing the current grace period to end, and possibly causing this CPU to report a quiescent state against the wrong CPU? And invalidating the callback-advancement optimization, for that matter?
Answer:
This cannot happen because the current CPU holds its
leaf rcu_node
structure's ->lock
.
The other CPU would therefore spin on this lock until after
the current CPU released it, by which time this CPU would have
cleared its rcu_node
structure's ->qsmask
bit, so that the other CPU would be caught by the check on line 22
of rcu_report_qs_rdp()
.
Or vice versa.
Quick Quiz 7: Why do the tracing before invoking the callback?
Answer: Because the callback function might well free the callback, in which case the tracing would be a use-after-free error.
Quick Quiz 8:
Why is the call to debug_rcu_head_unqueue()
before the callback invocation?
Won't it miss some bugs that way?
After all, some other CPU might (incorrectly) invoke call_rcu()
on this same callback just after this CPU finishes the
debug_rcu_head_unqueue()
.
Answer:
There are two reasons why ordering the
debug_rcu_head_unqueue()
after the
__rcu_reclaim()
would be a very bad idea:
First, the callback function is free to immediately free or otherwise reuse its
rcu_head
structure, in which case the later debug_rcu_head_unqueue()
would be accessing memory that had already been freed or reused.
Second, it is perfectly legal for an RCU callback function to immediately invoke
call_rcu()
on its own callback,
but this suggested change would cause the debug-objects
subsystem to complain when this happened.
Quick Quiz 9:
What would happen if the current CPU went offline while
rcu_do_batch()
was invoking that CPU's callbacks?
Answer: This cannot happen because a CPU executing in softirq context cannot be placed offline.
Quick Quiz 10:
How can we be sure that removing callbacks and then reinserting
them won't mix the order of some of the callbacks, thus invalidating
the rcu_barrier()
function's assumptions?
Answer: The callbacks were removed from the head of the list and then inserted back onto the head of the list, and each CPU invokes its own callbacks. The result is just the same as if the callbacks had not been removed in the first place.
Quick Quiz 11:
What happens if this task is migrated between the
time that rcu_barrier_func()
executes line 17
(where it gets a pointer to the per-CPU rcu_head
structure)
and the time it executes line 21 (where it registers the callback)?
Answer:
This cannot happen because rcu_barrier_func()
executes in hardware interrupt context, so cannot be migrated.
Quick Quiz 12:
What prevents massive lock contention on
rcu_barrier_mutex
if there are lots of concurrent
calls to the same flavor of rcu_barrier()
?
Answer:
Absolutely nothing.
If this ever becomes a problem, then line 9 will need to become
a mutex_trylock()
in a loop, with checking on each
pass to see if someone else did our work for us.
Quick Quiz 13:
Why not initialize ->barrier_cpu_count
to zero,
given that we clearly have not messed with any of the CPUs yet?
Answer: Initializing this field to zero results in the following failure scenario:
1. The _rcu_barrier()
task invokes rcu_barrier_func()
to enqueue a callback on CPU 0.
The act of enqueuing the callback increments
->barrier_cpu_count
to one.
2. The _rcu_barrier()
task is preempted.
3. CPU 0's callback is invoked, so that rcu_barrier_callback()
decrements ->barrier_cpu_count
and finds that the result is zero.
It therefore invokes complete()
.
4. The _rcu_barrier()
task resumes execution, and invokes rcu_barrier_func()
to enqueue callbacks on the remaining CPUs.
5. When the _rcu_barrier()
task invokes wait_for_completion()
, it returns immediately
due to complete()
having already been invoked, even though the callbacks on the remaining CPUs have not yet been invoked.
Initializing ->barrier_cpu_count
to one avoids
this scenario.
Quick Quiz 14:
Suppose that an offline CPU has callbacks queued.
What should _rcu_barrier()
do about that?
Answer: Absolutely nothing, because callbacks are removed from CPUs that go offline. Therefore, offline CPUs will not have callbacks.
Quick Quiz 15: But why do all of the outgoing CPU's not-ready-to-invoke callbacks need to go through another full grace period? We have all the segment pointers, so why not enqueue each segment of the outgoing CPU's callbacks to follow the corresponding segment of the online CPU's callbacks?
Answer: That would be a bug! The two CPUs might well have different ideas about which grace period is currently in progress. If the outgoing CPU is up to date with the current grace period but the online CPU still thinks that an old grace period is in effect, then the online CPU could invoke the callbacks from the outgoing CPU too early.
That said, note that any callbacks whose grace period has completed
were placed on the rcu_state
structure's
->orphan_donelist
, so they will not need to go through
another RCU grace period.