May 12, 2013 (Linux 3.9+)

This article was contributed by Paul E. McKenney

Introduction

  1. RCU Callback-Handling Overview
  2. RCU Callback-Handling Operation
  3. RCU Callback-Handling Implementation
  4. Summary

And then there are of course the ineludible answers to the quick quizzes.

RCU Callback-Handling Overview

Each entity waiting for an RCU grace period to complete has an RCU callback (rcu_head structure) queued. These callbacks are queued on the per-CPU rcu_data structure's ->nxtlist field in arrival order, so that the oldest callbacks are at the head of the list. There is an array of pointers into this list, and these pointers segment the list based on the state of the callbacks in each segment with respect to RCU grace periods: Callbacks in the first segment have had their grace period complete and are ready to be invoked, those in the second segment are waiting for the current grace period to complete, those in the third segment will wait for the next grace period to complete, and those in the fourth and last segment have not yet been associated with a specific grace period, and thus might wait for the next grace period or some later grace period.

Callbacks are advanced from one segment to another by updating the elements in the pointer array, as will be demonstrated in the next section.
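
For reference, here is a simplified sketch of the segmented-list fields, approximately as they appear in kernel/rcutree.h (only the callback-related fields of the rcu_data structure are shown, and the comments are mine):

  #define RCU_DONE_TAIL        0  /* Callbacks whose grace period has ended. */
  #define RCU_WAIT_TAIL        1  /* Callbacks waiting for the current grace period. */
  #define RCU_NEXT_READY_TAIL  2  /* Callbacks assigned to the next grace period. */
  #define RCU_NEXT_TAIL        3  /* Callbacks not yet assigned to a grace period. */
  #define RCU_NEXT_SIZE        4

  struct rcu_data {
    /* ... */
    struct rcu_head *nxtlist;                   /* Head of the callback list. */
    struct rcu_head **nxttail[RCU_NEXT_SIZE];   /* Tail pointers for the four segments. */
    unsigned long nxtcompleted[RCU_NEXT_SIZE];  /* Grace-period numbers for the segments. */
    /* ... */
  };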

RCU Callback-Handling Operation

This section shows how a given CPU's callbacks are handled through the course of a series of grace periods.

Initially, each CPU's RCU callback list is empty, as shown below:

CBList00.png

The column of empty boxes will be used later in this example to hold the RCU grace-period number that will be in effect when the corresponding group of callbacks is ready to invoke; these numbers are stored in the ->nxtcompleted array. However, these numbers apply only to the RCU_WAIT_TAIL and RCU_NEXT_READY_TAIL groups. The RCU_DONE_TAIL group is already ready to invoke on the one hand, and the RCU_NEXT_TAIL group has not yet been associated with a grace-period number on the other.

An invocation of call_rcu() would enqueue an RCU callback, but would not yet associate it with a specific grace period, resulting in the state shown below:

CBList01.png

A second invocation of call_rcu() would enqueue another RCU callback, but would still not associate either with a specific grace period, resulting in the state shown below:

CBList02.png

If this same CPU were to start a grace period, it would see that all the callbacks on its list were enqueued prior to the start of the new grace period, which would allow both of them to be handled by this new grace period, with grace period number 2. This results in the state shown below:

CBList03.png

The enqueuing of a third callback would result in the following state, with CBs 1 and 2 waiting for the current grace period and CB 3 not yet being assigned to a specific grace period:

CBList04.png

When this CPU reports a quiescent state up the rcu_node tree, it knows that the next grace period cannot possibly have started. It is therefore safe to associate CB 3 with the next grace period (which is grace period 3), as follows:

CBList05.png

When a fourth callback is enqueued, it is not possible to associate it with the next grace period: This CPU has already announced its quiescent state, so the current grace period could end (and a new one start) at any time. This fourth callback must therefore be left unassociated with any specific grace period, as shown below:

CBList06.png

Quick Quiz 1: Suppose we (incorrectly) associated CB 4 with the next grace period number 3. Exactly how could problems result?
Answer

When the current CPU notices that grace period number 2 has ended, it will advance its callbacks, resulting in the following state:

CBList07.png

Quick Quiz 2: But we assigned CB 4 to the next grace period. Why is this safe?
Answer

When a fifth callback is registered, we finally have all four segments of the callback list non-empty:

CBList08.png

CBs 1 and 2 are now invoked, resulting in the following state:

CBList09.png

The completion of the current grace period and two additional grace periods would result in all three remaining callbacks being invoked.

Another way of looking at the flow of callbacks through the system is via a state diagram that shows the callbacks moving from one segment of the callback list to another:

CBState.png

Callbacks enter the queue in the “Next” state, which corresponds to the RCU_NEXT_TAIL segment of the list. If the CPU detects a change in the ->completed value or if this CPU either starts a grace period or reports a quiescent state, all callbacks in the “Next” state advance to the “Next-Ready” state, which corresponds to the RCU_NEXT_READY_TAIL segment of the list. This same event will advance callbacks to the “Wait” state (RCU_WAIT_TAIL segment) and to the “Done” state (RCU_DONE_TAIL segment).

When a CPU goes offline, callbacks move from the “Next-Ready” and “Wait” states back to the “Next” state. Callbacks in the “Done” state remain there.

Quick Quiz 3: Why not map everything across?
Answer

Two additional complications are posed by rcu_barrier() and by CPU-hotplug operations:

  1. The rcu_barrier() function waits for all previously registered RCU callbacks to be invoked. It does this by registering a callback on each CPU, then waiting for all of those callbacks to be invoked. For this to work, the callbacks registered on a given CPU must be maintained in order.
  2. When a CPU goes offline, it might well have RCU callbacks queued. These “orphaned” callbacks are moved to some other online CPU. Of course, an rcu_barrier() callback might well move with them, so they must remain in order.

Finally, a given CPU can be marked at either build or boot time as a no-callbacks CPU. In this case, callbacks are not queued on the CPU's ->nxtlist, but rather on a separate nocb_head list, where they are dequeued, processed, and invoked by a separate kthread dedicated to this purpose.
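
For example, on a kernel built with no-callbacks support (CONFIG_RCU_NOCB_CPU), booting with something like the following designates CPUs 1-3 as no-callbacks CPUs (the CPU list shown is purely illustrative):

  rcu_nocbs=1-3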

Given this background, it is time to look at the code itself.

RCU Callback-Handling Implementation

This section covers the code, starting with callback registration, continuing with grace-period callback processing, moving on to callback invocation and rcu_barrier() handling, and finally covering the handling of callbacks orphaned by CPU-hotplug operations.

Callback Registration

The __call_rcu() function registers new callbacks, although it is normally invoked through one of the call_rcu(), call_rcu_bh(), or call_rcu_sched() wrapper functions.

  1 static void
  2 __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
  3      struct rcu_state *rsp, bool lazy)
  4 {
  5   unsigned long flags;
  6   struct rcu_data *rdp;
  7 
  8   WARN_ON_ONCE((unsigned long)head & 0x3);
  9   debug_rcu_head_queue(head);
 10   head->func = func;
 11   head->next = NULL;
 12   local_irq_save(flags);
 13   rdp = this_cpu_ptr(rsp->rda);
 14   if (unlikely(rdp->nxttail[RCU_NEXT_TAIL] == NULL)) {
 15     WARN_ON_ONCE(1);
 16     local_irq_restore(flags);
 17     return;
 18   }
 19   ACCESS_ONCE(rdp->qlen)++;
 20   if (lazy)
 21     rdp->qlen_lazy++;
 22   else
 23     rcu_idle_count_callbacks_posted();
 24   smp_mb();
 25   *rdp->nxttail[RCU_NEXT_TAIL] = head;
 26   rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
 27 
 28   if (__is_kfree_rcu_offset((unsigned long)func))
 29     trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func,
 30            rdp->qlen_lazy, rdp->qlen);
 31   else
 32     trace_rcu_callback(rsp->name, head, rdp->qlen_lazy, rdp->qlen);
 33   __call_rcu_core(rsp, rdp, head, flags);
 34   local_irq_restore(flags);
 35 }

Line 8 verifies that the RCU callback is properly aligned, which will be important when the pointer's low-order bits are used to mark “lazy” callbacks. Line 9 informs the debug-objects subsystem that the callback is being queued, and lines 10 and 11 initialize the rcu_head structure.

Line 12 disables interrupts and line 34 restores them. Line 13 obtains a pointer to this CPU's rcu_data structure. If line 14 sees that callback registration has been disabled due to this CPU being offline, lines 15-17 issue a warning, restore interrupts, and return to the caller, respectively. Line 19 increments the count of callbacks for this rcu_data structure. If line 20 sees that this is a lazy callback (e.g., one registered via kfree_rcu()), line 21 increments the count of lazy callbacks for this rcu_data structure, otherwise line 23 informs RCU_FAST_NO_HZ of a new non-lazy callback registered on this CPU. Line 24 ensures that the callback counts are updated before the new callback is queued (for the benefit of _rcu_barrier()), and it also ensures that any prior updates to RCU-protected data structures carried out by this CPU are seen by all CPUs as happening prior to any subsequent grace-period processing. Lines 25 and 26 enqueue the rcu_head structure to the tail of this CPU's callback list. Note that the new callback is not yet associated with any specific RCU grace period. Lines 28-32 trace the new callback, with lines 29 and 30 tracing the kfree_rcu() case and line 32 tracing the default invoke-a-function case. Finally, line 33 invokes __call_rcu_core() to handle any required special grace-period processing.
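
To make the registration path concrete, the following purely illustrative example shows a hypothetical struct foo whose elements are unlinked and then freed via call_rcu(), which funnels into __call_rcu() as shown above:

  struct foo {
    struct list_head list;
    int data;
    struct rcu_head rcu;              /* Queued by __call_rcu(). */
  };

  static void foo_reclaim(struct rcu_head *rhp)
  {
    struct foo *fp = container_of(rhp, struct foo, rcu);

    kfree(fp);
  }

  static void foo_remove(struct foo *fp)
  {
    list_del_rcu(&fp->list);          /* Unlink under the caller's lock. */
    call_rcu(&fp->rcu, foo_reclaim);  /* Reclaim after a grace period. */
  }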

Quick Quiz 4: Given that it just registered a new RCU callback, why wouldn't __call_rcu() always need to initiate grace-period processing?
Answer

Grace-Period Callback Processing

Callbacks are advanced as grace periods progress by the cpu_has_callbacks_ready_to_invoke(), rcu_report_qs_rdp(), and __rcu_process_gp_end() functions.

  1 static int
  2 cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
  3 {
  4   return &rdp->nxtlist != rdp->nxttail[RCU_DONE_TAIL];
  5 }
  6 
  7 static void
  8 rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
  9 {
 10   unsigned long flags;
 11   unsigned long mask;
 12   struct rcu_node *rnp;
 13 
 14   rnp = rdp->mynode;
 15   raw_spin_lock_irqsave(&rnp->lock, flags);
 16   if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
 17     rdp->passed_quiesce = 0;
 18     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 19     return;
 20   }
 21   mask = rdp->grpmask;
 22   if ((rnp->qsmask & mask) == 0) {
 23     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 24   } else {
 25     rdp->qs_pending = 0;
 26     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 27     rcu_report_qs_rnp(mask, rsp, rnp, flags);
 28   }
 29 }
 30 
 31 static void
 32 __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp)
 33 {
 34   if (rdp->completed != rnp->completed) {
 35     rdp->nxttail[RCU_DONE_TAIL] = rdp->nxttail[RCU_WAIT_TAIL];
 36     rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_READY_TAIL];
 37     rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
 38     rdp->completed = rnp->completed;
 39     trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuend");
 40     if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
 41       rdp->gpnum = rdp->completed;
 42     if ((rnp->qsmask & rdp->grpmask) == 0)
 43       rdp->qs_pending = 0;
 44   }
 45 }

The cpu_has_callbacks_ready_to_invoke() function is shown on lines 1-5. It simply checks to see whether the RCU_DONE_TAIL pointer references the callback list header (the ->nxtlist field). If not, then there are callbacks in the first segment of the list, and these callbacks are ready to invoke.

Quick Quiz 5: Hey!!! The cpu_has_callbacks_ready_to_invoke() function does not actually advance callbacks. What is the deal here?
Answer

The rcu_report_qs_rdp() function shown on lines 7-29 is mostly dealt with elsewhere. The key line for callback advancement is line 26, which is reached if the current CPU has passed through a quiescent state that counts against the current grace period (line 16) when the RCU core still needs a quiescent state from this CPU (line 22). It is therefore safe for the CPU to associate all its remaining unassociated callbacks with the next grace period, which line 26 does.

Quick Quiz 6: But what happens if some other CPU reports a quiescent state on behalf of this CPU, thus causing the current grace period to end, and possibly causing this CPU to report a quiescent state against the wrong CPU? And invalidating the callback-advancement optimization, for that matter?
Answer

The __rcu_process_gp_end() function on lines 31-45 advances callbacks when the CPU detects the end of a grace period. As noted elsewhere, line 34 checks to see if the current CPU is not yet aware that the grace period has ended, and, if so, line 35 advances the RCU_DONE_TAIL pointer to mark all callbacks waiting for the just-completed grace period as done, line 36 advances the RCU_WAIT_TAIL pointer to mark all callbacks associated with the next (possibly already started) grace period to wait for this same grace period, and finally line 37 advances RCU_NEXT_READY_TAIL to associate all remaining callbacks with the grace period after that. The rest of this function is discussed elsewhere.

Invoking RCU Callbacks

The invoke_rcu_callbacks(), __rcu_reclaim(), rcu_do_batch(), and rcu_preempt_do_callbacks() functions invoke RCU callbacks whose grace periods have ended.

  1 static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp)
  2 {
  3   if (unlikely(!ACCESS_ONCE(rcu_scheduler_fully_active)))
  4     return;
  5   if (likely(!rsp->boost)) {
  6     rcu_do_batch(rsp, rdp);
  7     return;
  8   }
  9   invoke_rcu_callbacks_kthread();
 10 }
 11 
 12 static inline bool __rcu_reclaim(char *rn, struct rcu_head *head)
 13 {
 14   unsigned long offset = (unsigned long)head->func;
 15 
 16   if (__is_kfree_rcu_offset(offset)) {
 17     RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset));
 18     kfree((void *)head - offset);
 19     return 1;
 20   } else {
 21     RCU_TRACE(trace_rcu_invoke_callback(rn, head));
 22     head->func(head);
 23     return 0;
 24   }
 25 }
 26 
 27 static void rcu_preempt_do_callbacks(void)
 28 {
 29   rcu_do_batch(&rcu_preempt_state, &__get_cpu_var(rcu_preempt_data));
 30 }

The invoke_rcu_callbacks() function is shown on lines 1-10, and it causes the callback-invocation function rcu_do_batch() to run. This must be done differently depending on the context. Line 3 checks to see if the scheduler has spawned the first non-idle task, and if not, line 4 returns. Line 5 checks to see if this flavor of RCU supports priority boosting, and if not, line 6 invokes rcu_do_batch() directly and line 7 returns. Otherwise, this flavor of RCU does support priority boosting, which means that callback invocation must be done from a kthread, so line 9 invokes invoke_rcu_callbacks_kthread() to wake that kthread up.

The __rcu_reclaim() function shown on lines 12-25 invokes the specified callback based on its type. Line 16 invokes __is_kfree_rcu_offset() to determine whether the callback was registered by kfree_rcu(), and if so, line 17 traces this fact, line 18 invokes kfree() on the callback, and line 19 returns 1 to indicate that this was a lazy callback. Otherwise, line 21 traces the fact that we are invoking a normal callback, line 22 invokes it, and line 23 returns 0 to indicate that this was a non-lazy callback.
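
For reference, a callback becomes “lazy” when it is registered via kfree_rcu(), which records the offset of the rcu_head within the enclosing structure in place of a function pointer. The following hypothetical example illustrates the usage:

  struct bar {
    int data;
    struct rcu_head rcu;
  };

  static void bar_free(struct bar *bp)
  {
    /*
     * Stores the offset of ->rcu within struct bar in place of a
     * callback function, so that __is_kfree_rcu_offset() classifies
     * this callback as lazy and __rcu_reclaim() simply kfree()s the
     * enclosing structure.
     */
    kfree_rcu(bp, rcu);
  }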

Quick Quiz 7: Why do the tracing before invoking the callback?
Answer

The rcu_preempt_do_callbacks() function invokes rcu_do_batch() on the RCU-preempt flavor of RCU. If RCU-preempt is not configured in the kernel, for example, for CONFIG_PREEMPT=n kernel builds, then rcu_preempt_do_callbacks() is an empty function.

  1 static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp)
  2 {
  3   unsigned long flags;
  4   struct rcu_head *next, *list, **tail;
  5   int bl, count, count_lazy;
  6 
  7   if (!cpu_has_callbacks_ready_to_invoke(rdp)) {
  8     trace_rcu_batch_start(rsp->name, rdp->qlen_lazy, rdp->qlen, 0);
  9     trace_rcu_batch_end(rsp->name, 0, !!ACCESS_ONCE(rdp->nxtlist),
 10                         need_resched(), is_idle_task(current),
 11             rcu_is_callbacks_kthread());
 12     return;
 13   }
 14   local_irq_save(flags);
 15   WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
 16   bl = rdp->blimit;
 17   trace_rcu_batch_start(rsp->name, rdp->qlen_lazy, rdp->qlen, bl);
 18   list = rdp->nxtlist;
 19   rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL];
 20   *rdp->nxttail[RCU_DONE_TAIL] = NULL;
 21   tail = rdp->nxttail[RCU_DONE_TAIL];
 22   for (count = RCU_NEXT_SIZE - 1; count >= 0; count--)
 23     if (rdp->nxttail[count] == rdp->nxttail[RCU_DONE_TAIL])
 24       rdp->nxttail[count] = &rdp->nxtlist;
 25   local_irq_restore(flags);
 26   count = count_lazy = 0;
 27   while (list) {
 28     next = list->next;
 29     prefetch(next);
 30     debug_rcu_head_unqueue(list);
 31     if (__rcu_reclaim(rsp->name, list))
 32       count_lazy++;
 33     list = next;
 34     if (++count >= bl &&
 35         (need_resched() ||
 36          (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
 37       break;
 38   }
 39 
 40   local_irq_save(flags);
 41   trace_rcu_batch_end(rsp->name, count, !!list, need_resched(),
 42                       is_idle_task(current),
 43                       rcu_is_callbacks_kthread());
 44   if (list != NULL) {
 45     *tail = rdp->nxtlist;
 46     rdp->nxtlist = list;
 47     for (count = 0; count < RCU_NEXT_SIZE; count++)
 48       if (&rdp->nxtlist == rdp->nxttail[count])
 49         rdp->nxttail[count] = tail;
 50       else
 51         break;
 52   }
 53   smp_mb();
 54   rdp->qlen_lazy -= count_lazy;
 55   rdp->qlen -= count;
 56   rdp->n_cbs_invoked += count;
 57   if (rdp->blimit == LONG_MAX && rdp->qlen <= qlowmark)
 58     rdp->blimit = blimit;
 59   if (rdp->qlen == 0 && rdp->qlen_last_fqs_check != 0) {
 60     rdp->qlen_last_fqs_check = 0;
 61     rdp->n_force_qs_snap = rsp->n_force_qs;
 62   } else if (rdp->qlen < rdp->qlen_last_fqs_check - qhimark)
 63     rdp->qlen_last_fqs_check = rdp->qlen;
 64   WARN_ON_ONCE((rdp->nxtlist == NULL) != (rdp->qlen == 0));
 65   local_irq_restore(flags);
 66   if (cpu_has_callbacks_ready_to_invoke(rdp))
 67     invoke_rcu_core();
 68 }

The rcu_do_batch() function gathers up the RCU callbacks whose grace period has ended and invokes them via __rcu_reclaim(). This straightforward job is complicated by the need to avoid high-latency bursts of callback processing where possible, while still avoiding out-of-memory conditions even if high latencies must be incurred to do so.

Line 7 checks to see if there are callbacks whose grace period has ended, and if not, lines 8-11 trace an empty burst of callback invocation and line 12 returns.

Otherwise, execution continues at line 14, which disables interrupts so that the callback lists can be manipulated safely. Line 15 complains if rcu_do_batch() finds itself running on an offline CPU. Line 16 picks up the current callback-invocation batch limit, line 17 traces the start of callback invocation, and lines 18-21 extract the callbacks ready for invocation onto a local list named, strangely enough, list. Lines 22-24 adjust any of the callback-list pointers that were pointing at the last ready-to-invoke callback so that they instead reference the list header. Line 25 restores interrupts, as the function will be invoking callbacks from its local list. Line 26 initializes both count and count_lazy.

Each pass through the loop spanning lines 27-38 invokes one callback. Line 28 obtains a pointer to the following callback, line 29 does a possibly-useless prefetch operation, line 30 tells the debug-objects subsystem that the callback has been invoked, and line 31 calls __rcu_reclaim() to invoke the callback. If the callback was lazy, line 32 counts it. Line 33 advances to the next callback, and lines 34-36 check to see if we need to exit the loop due to having exceeded the batch limit when some non-idle task wants to use this CPU, and if so, line 37 exits the loop.

Quick Quiz 8: Why is the call to debug_rcu_head_unqueue() before the callback invocation? Won't it miss some bugs that way? After all, some other CPU might (incorrectly) invoke call_rcu() on this same callback just after this CPU finishes the debug_rcu_head_unqueue().
Answer

Quick Quiz 9: What would happen if the current CPU went offline while rcu_do_batch() was invoking that CPU's callbacks?
Answer

Line 40 disables interrupts once more and lines 41-43 trace the end of this batch of callback invocation. Line 44 checks to see if there are callbacks left on the local list, and, if so, lines 45 and 46 requeue them and lines 47-51 adjust any of the callback pointers that were referencing the list header to instead reference the ->next pointer of the last callback that was re-inserted from the local list. Line 53 ensures that the callback list has been fully adjusted before the statistics are updated (again for the benefit of _rcu_barrier()), and lines 54-56 adjust statistics.

Quick Quiz 10: How can we be sure that removing callbacks and then reinserting them won't mix the order of some of the callbacks, thus invalidating the rcu_barrier() function's assumptions?
Answer

Line 57 checks to see if callbacks have drained to below the low-water mark, and if so, line 58 resets the batch limit back down to the original latency-friendly value. Line 59 checks to see if the callbacks have drained completely after __call_rcu() took action to deal with excessive numbers of callbacks, and if so, lines 60 and 61 reset the __call_rcu() state. Otherwise, line 62 checks to see if a large number of callbacks has drained since the last check, and if so, resets the snapshot to the current (lower) number of callbacks, thus allowing a similar increase to once again trigger emergency callback-reduction action. Line 64 complains if this CPU is supposed to be offline but still has callbacks. Line 65 re-enables interrupts, and if line 66 determines that there are more callbacks to be invoked, line 67 arranges for them to be invoked at a later time.

Waiting For All Prior RCU Callbacks

The rcu_barrier() function waits for all prior RCU callbacks to be invoked. This is important when unloading modules that use call_rcu(), because otherwise one of the module's RCU callbacks might be invoked after the module has been unloaded. This callback would be fatally disappointed to find that its callback function was no longer loaded in the kernel.

The rcu_barrier() function is used to avoid this problem. The module first takes steps to ensure that it will not invoke call_rcu() anymore, then invokes rcu_barrier(). Once rcu_barrier() returns, the module is guaranteed that there will be no more RCU callbacks invoking its functions, and may thus safely unload itself.
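
A typical (and purely hypothetical) module-exit path therefore looks something like the following, where foo_stop_new_callbacks() stands in for whatever mechanism the module uses to prevent further call_rcu() invocations:

  static void __exit foo_exit(void)
  {
    foo_stop_new_callbacks();  /* Hypothetical: no further call_rcu() calls. */
    rcu_barrier();             /* Wait for all already-posted callbacks. */
    /* It is now safe for this module's code and data to be unloaded. */
  }
  module_exit(foo_exit);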

The underlying implementation of rcu_barrier() is provided by the rcu_barrier_callback(), rcu_barrier_func(), and _rcu_barrier() functions. We first discuss rcu_barrier_callback() and rcu_barrier_func() along with the data that they use:

  1 static void rcu_barrier_callback(struct rcu_head *rhp)
  2 {
  3   struct rcu_data *rdp = container_of(rhp, struct rcu_data, barrier_head);
  4   struct rcu_state *rsp = rdp->rsp;
  5 
  6   if (atomic_dec_and_test(&rsp->barrier_cpu_count)) {
  7     _rcu_barrier_trace(rsp, "LastCB", -1, rsp->n_barrier_done);
  8     complete(&rsp->barrier_completion);
  9   } else {
 10     _rcu_barrier_trace(rsp, "CB", -1, rsp->n_barrier_done);
 11   }
 12 }
 13 
 14 static void rcu_barrier_func(void *type)
 15 {
 16   struct rcu_state *rsp = type;
 17   struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
 18 
 19   _rcu_barrier_trace(rsp, "IRQ", -1, rsp->n_barrier_done);
 20   atomic_inc(&rsp->barrier_cpu_count);
 21   rsp->call(&rdp->barrier_head, rcu_barrier_callback);
 22 }

The per-CPU rcu_data structure's ->barrier_head field is an rcu_head structure that is posted as a callback on each CPU that currently has callbacks queued. A count of the number of CPUs yet to respond is maintained in ->barrier_cpu_count in the rcu_state structure. The ->barrier_mutex field (also in the rcu_state structure) permits only one rcu_barrier() operation at a time per RCU flavor, the ->barrier_completion field (yet again in the rcu_state structure) is used to signal _rcu_barrier() once all pre-existing callbacks have been invoked, and finally the ->n_barrier_done field (still in the rcu_state structure) allows concurrent barrier operations to piggyback off of each others' work.
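
A rough sketch of these fields, approximately as they appear in kernel/rcutree.h (comments mine), follows:

  struct rcu_state {
    /* ... */
    struct mutex barrier_mutex;            /* Serializes _rcu_barrier(). */
    atomic_t barrier_cpu_count;            /* # barrier callbacks not yet invoked. */
    struct completion barrier_completion;  /* Wakes up _rcu_barrier(). */
    unsigned long n_barrier_done;          /* Incremented at the start and end */
                                           /*  of each _rcu_barrier(). */
    /* ... */
  };

  struct rcu_data {
    /* ... */
    struct rcu_head barrier_head;          /* Posted by rcu_barrier_func(). */
    /* ... */
  };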

The rcu_barrier_callback() function shown on lines 1-12 is the callback function used by the per-CPU callbacks registered by _rcu_barrier(). Line 6 atomically decrements the ->barrier_cpu_count, and if the result is zero (in other words, this is the last of the callbacks), line 8 wakes up _rcu_barrier(). In either case, line 7 or 10 does tracing.

The rcu_barrier_func() function shown on lines 14-22 executes on each callback-bearing online CPU, registering a callback of the specified RCU flavor on that CPU. Line 19 does tracing, line 20 increments the count of outstanding callbacks, and line 21 registers the callback via the flavor-specific ->call() function.

Quick Quiz 11: What happens if this task is migrated between the time that rcu_barrier_func() executes line 17 (where it gets a pointer to the per-CPU rcu_head structure) and the time it executes line 21 (where it registers the callback)?
Answer

  1 static void _rcu_barrier(struct rcu_state *rsp)
  2 {
  3   int cpu;
  4   struct rcu_data *rdp;
  5   unsigned long snap = ACCESS_ONCE(rsp->n_barrier_done);
  6   unsigned long snap_done;
  7 
  8   _rcu_barrier_trace(rsp, "Begin", -1, snap);
  9   mutex_lock(&rsp->barrier_mutex);
 10   smp_mb();
 11   snap_done = ACCESS_ONCE(rsp->n_barrier_done);
 12   _rcu_barrier_trace(rsp, "Check", -1, snap_done);
 13   if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) {
 14     _rcu_barrier_trace(rsp, "EarlyExit", -1, snap_done);
 15     smp_mb();
 16     mutex_unlock(&rsp->barrier_mutex);
 17     return;
 18   }
 19   ACCESS_ONCE(rsp->n_barrier_done)++;
 20   WARN_ON_ONCE((rsp->n_barrier_done & 0x1) != 1);
 21   _rcu_barrier_trace(rsp, "Inc1", -1, rsp->n_barrier_done);
 22   smp_mb();
 23   init_completion(&rsp->barrier_completion);
 24   atomic_set(&rsp->barrier_cpu_count, 1);
 25   get_online_cpus();
 26   for_each_online_cpu(cpu) {
 27     rdp = per_cpu_ptr(rsp->rda, cpu);
 28     if (ACCESS_ONCE(rdp->qlen)) {
 29       _rcu_barrier_trace(rsp, "OnlineQ", cpu,
 30              rsp->n_barrier_done);
 31       smp_call_function_single(cpu, rcu_barrier_func, rsp, 1);
 32     } else {
 33       _rcu_barrier_trace(rsp, "OnlineNQ", cpu,
 34              rsp->n_barrier_done);
 35     }
 36   }
 37   put_online_cpus();
 38   if (atomic_dec_and_test(&rsp->barrier_cpu_count))
 39     complete(&rsp->barrier_completion);
 40   smp_mb();
 41   ACCESS_ONCE(rsp->n_barrier_done)++;
 42   WARN_ON_ONCE((rsp->n_barrier_done & 0x1) != 0);
 43   _rcu_barrier_trace(rsp, "Inc2", -1, rsp->n_barrier_done);
 44   wait_for_completion(&rsp->barrier_completion);
 45   mutex_unlock(&rsp->barrier_mutex);
 46 }

The _rcu_barrier() function shown above causes rcu_barrier_func() to be executed on each online CPU that has callbacks and then waits for all of the resulting callbacks to be invoked. Line 8 does tracing and line 9 acquires the mutex.

Quick Quiz 12: What prevents massive lock contention on rcu_barrier_mutex if there are lots of concurrent calls to the same flavor of rcu_barrier()?
Answer

Line 10 ensures that any actions taken by this CPU prior to the call to _rcu_barrier() are seen to happen before any checking of callback queue lengths. Line 11 takes a snapshot of the ->n_barrier_done field to enable piggybacking off of other concurrent _rcu_barrier() instances, and line 12 does tracing. Line 13 checks to see if our lock acquisition took us through an entire concurrent _rcu_barrier() duration, with the requirement being that we waited through two successive even values of the counter. (So we round the old snapshot up to an even value, and if the new value is at least two greater than that, someone else did our work for us.) If some other _rcu_barrier() did indeed do our work for us, then line 15 ensures that any subsequent work is seen by all CPUs to come after the barrier operation, line 16 releases the mutex, and line 17 returns to the caller.
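
To see how the check on line 13 plays out, suppose that the early snapshot snap was 3, meaning that some other _rcu_barrier() was already in progress when we were called. Rounding up to an even value gives 4, so the check requires snap_done to have reached at least 6. A value of 4 would mean only that the already-in-progress barrier had finished, but that barrier might have scanned some CPUs before all of the callbacks that we must wait for were posted, so its completion proves nothing. A value of 6 means that yet another barrier both started and completed after our snapshot, and because that later barrier started after all of the callbacks we must wait for were already queued, its completion guarantees that they have been invoked.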

Otherwise, we really do need to execute a barrier operation, starting with line 19, which increments the ->n_barrier_done field; the result should be an odd number (or line 20 will complain). Line 21 does tracing, and line 22 ensures that the preceding increment is seen as happening before any of the following _rcu_barrier() machinery. Lines 23 and 24 do initialization and line 25 prevents any CPUs from coming online or going offline.

Each pass through the loop spanning lines 26-36 deals with one online CPU. Line 27 picks up a pointer to the current CPU's rcu_data structure. If line 28 finds that this CPU has callbacks queued, then lines 29 and 30 do tracing and line 31 causes rcu_barrier_func() to be invoked on the current CPU. Otherwise, lines 33 and 34 do tracing.

Once all online CPUs have been dealt with, line 37 allows CPU-hotplug operations to resume. Line 38 atomically decrements ->barrier_cpu_count, and if the result is zero, line 39 invokes complete() to record the fact that all CPUs' callbacks have now been invoked. Line 40 ensures that the _rcu_barrier() mechanism is seen as having completed before line 41 increments ->n_barrier_done (which must now be an even number, or line 42 will complain). Line 43 does tracing and line 44 waits for all callbacks to be invoked. Finally, line 45 releases the mutex.

Quick Quiz 13: Why not initialize ->barrier_cpu_count to zero, given that we clearly have not messed with any of the CPUs yet?
Answer

Quick Quiz 14: Suppose that an offline CPU has callbacks queued. What should _rcu_barrier() do about that?
Answer

Handling Orphaned Callbacks

When a CPU goes offline, its callbacks are moved to some online CPU. The rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() functions shown below handle this callback movement.

  1 static void
  2 rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
  3         struct rcu_node *rnp, struct rcu_data *rdp)
  4 {
  5   if (rdp->nxtlist != NULL) {
  6     rsp->qlen_lazy += rdp->qlen_lazy;
  7     rsp->qlen += rdp->qlen;
  8     rdp->n_cbs_orphaned += rdp->qlen;
  9     rdp->qlen_lazy = 0;
 10     ACCESS_ONCE(rdp->qlen) = 0;
 11   }
 12   if (*rdp->nxttail[RCU_DONE_TAIL] != NULL) {
 13     *rsp->orphan_nxttail = *rdp->nxttail[RCU_DONE_TAIL];
 14     rsp->orphan_nxttail = rdp->nxttail[RCU_NEXT_TAIL];
 15     *rdp->nxttail[RCU_DONE_TAIL] = NULL;
 16   }
 17   if (rdp->nxtlist != NULL) {
 18     *rsp->orphan_donetail = rdp->nxtlist;
 19     rsp->orphan_donetail = rdp->nxttail[RCU_DONE_TAIL];
 20   }
 21   init_callback_list(rdp);
 22 }

The rcu_send_cbs_to_orphanage() function moves RCU callbacks from the newly offlined CPU to the “orphanage”. This function is called from the CPU_DEAD notifier, after the CPU has gone completely offline.

Line 5 checks to see if the newly offlined CPU has callbacks, and if so, lines 6-10 adjust the counts to reflect the movement of the callbacks to the orphanage. A memory barrier is not required because this function's caller must hold the ->onofflock, which is also held across rcu_adopt_orphan_cbs(), which adopts the callbacks.

Line 12 checks to see if the newly offlined CPU has any callbacks that still need a grace period, and if so, lines 13-15 move those callbacks to the rcu_state structure's ->orphan_nxtlist list. The newly offlined CPU's rcu_data structure's ->nxttail[] array is unchanged because it is reinitialized later.

Line 17 checks to see whether the newly offlined CPU has any callbacks that are ready to invoke, and if so, lines 18 and 19 move those callbacks to the rcu_state structure's ->orphan_donelist list. Line 21 then re-initializes the newly offlined CPU's callback list.

Quick Quiz 15: But why do all of the outgoing CPU's not-ready-to-invoke callbacks need to go through another full grace period? We have all the segment pointers, so why not enqueue each segment of the outgoing CPU's callbacks to follow the corresponding segment of the online CPU's callbacks?
Answer

Next we look at callback adoption, which runs either in a CPU_DEAD notifier or is called by _rcu_barrier():

  1 static void rcu_adopt_orphan_cbs(struct rcu_state *rsp)
  2 {
  3   int i;
  4   struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
  5 
  6   rdp->qlen_lazy += rsp->qlen_lazy;
  7   rdp->qlen += rsp->qlen;
  8   rdp->n_cbs_adopted += rsp->qlen;
  9   if (rsp->qlen_lazy != rsp->qlen)
 10     rcu_idle_count_callbacks_posted();
 11   rsp->qlen_lazy = 0;
 12   rsp->qlen = 0;
 13   if (rsp->orphan_donelist != NULL) {
 14     *rsp->orphan_donetail = *rdp->nxttail[RCU_DONE_TAIL];
 15     *rdp->nxttail[RCU_DONE_TAIL] = rsp->orphan_donelist;
 16     for (i = RCU_NEXT_SIZE - 1; i >= RCU_DONE_TAIL; i--)
 17       if (rdp->nxttail[i] == rdp->nxttail[RCU_DONE_TAIL])
 18         rdp->nxttail[i] = rsp->orphan_donetail;
 19     rsp->orphan_donelist = NULL;
 20     rsp->orphan_donetail = &rsp->orphan_donelist;
 21   }
 22   if (rsp->orphan_nxtlist != NULL) {
 23     *rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxtlist;
 24     rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_nxttail;
 25     rsp->orphan_nxtlist = NULL;
 26     rsp->orphan_nxttail = &rsp->orphan_nxtlist;
 27   }
 28 }

Lines 6-12 adjust the callback counts, with line 10 noting the arrival of any non-lazy callbacks for the benefit of RCU_FAST_NO_HZ.

Line 13 checks to see if the rcu_state structure has callbacks ready to invoke, and if so, lines 14-20 adopt them. Lines 14 and 15 splice the new callbacks at the end of this CPU's list of callbacks that are ready to invoke. Lines 16-18 update the ->nxttail[] pointers for any empty segments of the list to reference the new last ready-to-invoke callback. Lines 19 and 20 then initialize the rcu_state structure's ->orphan_donelist to empty.

Line 22 checks to see if the rcu_state structure has callbacks that need to wait for a grace period, and if so, lines 23-26 adopt them. Lines 23 and 24 append the new callbacks to the end of this CPU's list, and lines 25 and 26 initialize the rcu_state structure's ->orphan_nxtlist to empty.

Summary

This article has described the flow of callbacks through RCU, including their interactions with rcu_barrier() and CPU-hotplug operations.

Acknowledgments

I am grateful to @@@

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Suppose we (incorrectly) associated CB 4 with the next grace period number 3. Exactly how could problems result?

Answer: The following sequence of events would lead to failure, where CPU 0 is the one corresponding to CB 4:

  1. CPU 1 ends the current grace period and starts a new one.
  2. CPU 2 notices the new grace period and announces a quiescent state.
  3. CPU 2 enters an RCU read-side critical section and uses rcu_dereference() to obtain a pointer to data element A.
  4. CPU 0 removes data element A from its enclosing data structure and uses call_rcu() to schedule freeing of data element A after some later grace period.
  5. CPU 1 announces a quiescent state.
  6. CPU 0 notices the new grace period and announces a quiescent state.
  7. The new grace period is now complete, and CPU 0 is now free to invoke CB 4, which will free data element A. While CPU 2 is still using it.

The moral of this story is that only the CPU starting a given grace period knows when it really starts, and only the CPU ending a given grace period knows when it really ends. Other CPUs therefore need to avoid making unfounded assumptions about the beginnings and endings of grace periods.

That said, a CPU could learn the current grace-period state by acquiring the lock on the root rcu_node structure. Unfortunately, this lock must be acquired quite sparingly in order to avoid massive lock contention on large systems. Finally, had the callback arrived before the CPU passed through its quiescent state, the CPU would be guaranteed that the current grace period was not yet over, and would therefore have been able to add this new callback to the group waiting for grace period number 3.

Back to Quick Quiz 1.

Quick Quiz 2: But we assigned CB 4 to the next grace period. Why is this safe?

Answer: It is safe because this CPU just now noticed the new grace period. It therefore cannot possibly have announced a quiescent state yet, and the new grace period therefore cannot possibly have ended, which in turn means that the next grace period cannot possibly have started.

Back to Quick Quiz 2.

Quick Quiz 3: Why not map everything across?

Answer: Grace periods are not synchronized across CPUs, so the CPU adopting the callbacks might believe that an earlier grace period is still in progress. Callbacks mapped directly into its “Next-Ready” or “Wait” segments could therefore be invoked too soon. Moving them back to the “Next” state is always safe, albeit a bit slower.

Back to Quick Quiz 3.

Quick Quiz 4: Given that it just registered a new RCU callback, why wouldn't __call_rcu() always need to initiate grace-period processing?

Answer: Here are a couple of reasons: (1) A grace period might already be in progress, in which case it is necessary to wait until it finishes before starting a new one, and (2) It is useful to wait a bit to start a grace period even if there isn't one in progress because that results in more callbacks being handled by a given grace period, reducing the per-update overhead of grace-period detection.

Back to Quick Quiz 4.

Quick Quiz 5: Hey!!! The cpu_has_callbacks_ready_to_invoke() function does not actually advance callbacks. What is the deal here?

Answer: Glad to see that you are paying attention! However, this is a miscellaneous function, and therefore most certainly does not deserve its own section. So it landed in this section.

Back to Quick Quiz 5.

Quick Quiz 6: But what happens if some other CPU reports a quiescent state on behalf of this CPU, thus causing the current grace period to end, and possibly causing this CPU to report a quiescent state against the wrong CPU? And invalidating the callback-advancement optimization, for that matter?

Answer: This cannot happen because the current CPU holds its leaf rcu_node structure's ->lock. The other CPU would therefore spin on this lock until after the current CPU released it, by which time this CPU would have cleared its rcu_node structure's ->qsmask bit, so that the other CPU would be caught by the check on line 22 of rcu_report_qs_rdp().

Or vice versa.

Back to Quick Quiz 6.

Quick Quiz 7: Why do the tracing before invoking the callback?

Answer: Because the callback function might well free the callback, in which case the tracing would be a use-after-free error.

Back to Quick Quiz 7.

Quick Quiz 8: Why is the call to debug_rcu_head_unqueue() before the callback invocation? Won't it miss some bugs that way? After all, some other CPU might (incorrectly) invoke call_rcu() on this same callback just after this CPU finishes the debug_rcu_head_unqueue().

Answer: There are two reasons why ordering the debug_rcu_head_unqueue() after the __rcu_reclaim() would be a very bad idea:

  1. This would in many cases be a use-after-free error, and
  2. It is perfectly legal for an RCU callback function to invoke call_rcu() on its own callback, but this suggested change would cause the debug-objects subsystem to complain when this happened.

Back to Quick Quiz 8.

Quick Quiz 9: What would happen if the current CPU went offline while rcu_do_batch() was invoking that CPU's callbacks?

Answer: This cannot happen because a CPU executing in softirq context cannot be placed offline.

Back to Quick Quiz 9.

Quick Quiz 10: How can we be sure that removing callbacks and then reinserting them won't mix the order of some of the callbacks, thus invalidating the rcu_barrier() function's assumptions?

Answer: The callbacks were removed from the head of the list and then are inserted back onto the head of the list, and each CPU invokes its own callbacks. The result is just the same as if the callbacks had not been removed in the first place.

Back to Quick Quiz 10.

Quick Quiz 11: What happens if this task is migrated between the time that rcu_barrier_func() executes line 17 (where it gets a pointer to the per-CPU rcu_head structure) and the time it executes line 21 (where it registers the callback)?

Answer: This cannot happen because rcu_barrier_func() executes in hardware interrupt context, so cannot be migrated.

Back to Quick Quiz 11.

Quick Quiz 12: What prevents massive lock contention on rcu_barrier_mutex if there are lots of concurrent calls to the same flavor of rcu_barrier()?

Answer: Absolutely nothing. If this ever becomes a problem, then line 9 will need to become a mutex_trylock() in a loop, with checking on each pass to see if someone else did our work for us.

Back to Quick Quiz 12.

Quick Quiz 13: Why not initialize ->barrier_cpu_count to zero, given that we clearly have not messed with any of the CPUs yet?

Answer: Initializing this field to zero results in the following failure scenario:

  1. The _rcu_barrier() task invokes rcu_barrier_func() to enqueue a callback on CPU 0. The act of enqueuing the callback increments ->barrier_cpu_count to one.
  2. The _rcu_barrier() task is preempted.
  3. A grace period elapses, the callback is invoked, and rcu_barrier_callback() decrements ->barrier_cpu_count and finds that the result is zero. It therefore invokes complete().
  4. The _rcu_barrier() task resumes execution, and invokes rcu_barrier_func() to enqueue callbacks on the remaining CPUs.
  5. When the _rcu_barrier() task invokes wait_for_completion(), it returns immediately due to complete() having already been invoked.

Initializing ->barrier_cpu_count to one avoids this scenario.

Back to Quick Quiz 13.

Quick Quiz 14: Suppose that an offline CPU has callbacks queued. What should _rcu_barrier() do about that?

Answer: Absolutely nothing, because callbacks are removed from CPUs that go offline. Therefore, offline CPUs will not have callbacks.

Back to Quick Quiz 14.

Quick Quiz 15: But why do all of the outgoing CPU's not-ready-to-invoke callbacks need to go through another full grace period? We have all the segment pointers, so why not enqueue each segment of the outgoing CPU's callbacks to follow the corresponding segment of the online CPU's callbacks?

Answer: That would be a bug! The two CPUs might well have different ideas about which grace period is currently in progress. If the outgoing CPU is up to date with the current grace period but the online CPU still thinks that an old grace period is in effect, then the online CPU could invoke the callbacks from the outgoing CPU too early.

That said, note that any callbacks whose grace period has completed were placed on the rcu_state structure's ->orphan_donelist, so they will not need to go through another RCU grace period.

Back to Quick Quiz 15.