September 26, 2011

This article was contributed by Paul E. McKenney

Introduction

  1. RCU Expedited Grace-Period Overview
  2. RCU Expedited Grace-Period Operation
  3. RCU Expedited Grace-Period Implementation
  4. Summary

And what kind of RCU documentation would this be without the answers to the quick quizzes?

RCU Expedited Grace-Period Overview

The normal RCU grace periods prioritize low overhead over latency. In fact, the longer the latency, the greater the number of RCU updates that will be served by a single RCU grace period, so that the overhead of detecting this single RCU grace period may be amortized over a larger number of updates.

In contrast, for an expedited RCU grace period, speed is of the essence. The expedited RCU implementation therefore takes extreme measures, for example, sending IPIs to all CPUs, in order to reduce grace-period latency. These measures are described in more detail in the next section.
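
For example, an updater that cannot tolerate the latency of a normal grace period might use synchronize_rcu_expedited() as a drop-in replacement for synchronize_rcu(), as in the following sketch. The foo structure, global_foo pointer, and update_foo() function are hypothetical, and the update-side lock that would normally protect global_foo is omitted:

  struct foo {
    int data;
  };
  struct foo __rcu *global_foo;  /* Hypothetical RCU-protected pointer. */

  /* Hypothetical updater: publish a new structure, then free the old one. */
  void update_foo(struct foo *new_fp)
  {
    struct foo *old_fp = rcu_dereference_protected(global_foo, 1);

    rcu_assign_pointer(global_foo, new_fp);
    synchronize_rcu_expedited();  /* Low latency, but disturbs every CPU. */
    kfree(old_fp);
  }

Analogous primitives are provided for the other flavors, namely synchronize_sched_expedited() and synchronize_rcu_bh_expedited(). The reduced latency comes at the cost of significantly increased CPU overhead, so these primitives should be used with due care.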

RCU Expedited Grace-Period Operation

RCU-sched and RCU-bh Expedited Grace Periods

The basic operation of the RCU-sched and RCU-bh expedited grace periods is straightforward. The idea is to use the stop-CPU subsystem, which uses high-priority kthreads to interrupt processing on each online CPU. This forces every CPU to undergo a context switch, which results in a grace period for both RCU-sched and RCU-bh.

The actual implementation uses additional optimizations, for example, to allow multiple concurrent requests to be satisfied by a single stop-CPU operation. These optimizations will be described in the implementation section that steps through the actual code.
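
Because a context switch on every CPU is a quiescent state for both RCU-sched and RCU-bh, the RCU-bh expedited primitive can simply invoke the RCU-sched one. In kernels of this vintage, synchronize_rcu_bh_expedited() is simply a wrapper similar to the following:

  static inline void synchronize_rcu_bh_expedited(void)
  {
    synchronize_sched_expedited();
  }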

RCU-preempt Expedited Grace Periods

The stop-CPU approach outlined above does not suffice for RCU-preempt because tasks might be preempted within RCU-preempt read-side critical sections. However, the RCU-preempt expedited grace period still uses the stop-CPU subsystem (by invoking synchronize_sched_expedited()) to force a context switch on every CPU. This means that every task that started an RCU-preempt read-side critical section prior to the expedited grace period will now be enqueued on some rcu_node structure's ->blkd_tasks list.

When one of these tasks exits its RCU read-side critical section, it removes itself from the ->blkd_tasks list that it is queued on. Therefore, once all of these tasks have removed themselves, the expedited grace period will be complete.
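
The dequeueing is carried out by the outermost rcu_read_unlock() of the task in question. A greatly simplified sketch of that path follows; the real work is done by rcu_read_unlock_special(), this sketch omits locking, normal grace-period processing, and RCU priority boosting, the sketch_dequeue_blocked_reader() name is made up for this example, and the rcu_report_exp_rnp() function invoked at the end is described later in this article:

  /*
   * Simplified sketch: task t, queued on rcu_node structure rnp, leaves
   * its outermost RCU read-side critical section.
   */
  static void sketch_dequeue_blocked_reader(struct task_struct *t,
                                            struct rcu_node *rnp)
  {
    int was_blocking_exp = rnp->exp_tasks != NULL;
    struct list_head *np = t->rcu_node_entry.next;

    if (np == &rnp->blkd_tasks)
      np = NULL;  /* The departing task was last on the list. */
    if (rnp->exp_tasks == &t->rcu_node_entry)
      rnp->exp_tasks = np;  /* Advance ->exp_tasks past the departing task. */
    list_del_init(&t->rcu_node_entry);

    /* If this rcu_node structure no longer blocks the expedited GP, say so. */
    if (was_blocking_exp && rnp->exp_tasks == NULL)
      rcu_report_exp_rnp(&rcu_preempt_state, rnp);
  }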

Of course, a task might remain preempted in its RCU read-side critical section for an extended period of time. Therefore, in kernel configurations supporting RCU priority boosting, all tasks blocking a given RCU-preempt expedited grace period are subjected to RCU priority boosting.

Regardless of whether or not priority boosting is available, the end of the expedited grace period is detected using the rcu_node tree as a combining tree, in a manner similar to that of the normal grace periods, but with no per-CPU component. This process is illustrated by the following sequence of events:

  1. Task A blocks within an RCU read-side critical section while running on CPU 0.
  2. Task B enters an RCU read-side critical section while running on CPU 3.
  3. Task C enters an RCU read-side critical section while running on CPU 1.
  4. CPU 0 initiates an expedited RCU grace period.
  5. Task D enters an RCU read-side critical section while running on CPU 2, and then blocks.
  6. Task B exits its RCU read-side critical section.
  7. Task A resumes on CPU 3 and exits its RCU read-side critical section.
  8. Task C exits its RCU read-side critical section.

These events are illustrated by the corresponding sequence of diagrams, with commentary, in the following sections.

Initial State

The initial state of the expedited grace period combining tree is as follows:

EGPExample00.png

The rcu_data and rcu_dynticks structures are omitted from this diagram because they play no part in the expedited grace period. However, their absence requires us to keep track of the fact that the leftmost rcu_node structure covers CPUs 0 and 1 while the rightmost rcu_node structure covers CPUs 2 and 3.

The “em..” in each rcu_node structure represents the state of the ->expmask field, each bit of which indicates whether or not there are tasks in the corresponding subtree of the rcu_node tree blocking an expedited grace period, with “?” indicating that there are and “.” indicating that there are not. The “b” represents the ->blkd_tasks field that heads the list of tasks blocked within an RCU read-side critical section, and the “e” field represents the ->exp_tasks field that indicates which blocked tasks are blocking the current expedited grace period.
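
For reference, the rcu_node fields that figure into expedited grace periods are roughly as follows. This is a heavily abbreviated sketch; the real structure contains many additional fields used by the normal grace periods:

  struct rcu_node {
    raw_spinlock_t lock;          /* Protects the fields below. */
    unsigned long qsmaskinit;     /* Online subtrees: initial mask value. */
    unsigned long expmask;        /* Subtrees still blocking expedited GP. */
    unsigned long grpmask;        /* This node's bit in its parent's masks. */
    struct list_head blkd_tasks;  /* Tasks blocked in RCU read-side */
                                  /*  critical sections. */
    struct list_head *exp_tasks;  /* First task blocking the current */
                                  /*  expedited GP, or NULL if none. */
    struct rcu_node *parent;      /* Link toward the root rcu_node. */
  };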

Task A Blocks

When task A blocks within an RCU read-side critical section while running on CPU 0, it is queued on the leftmost rcu_node structure, as shown below:

EGPExample01.png

Because there is no expedited grace period in progress, task A is not blocking any expedited grace period.

Several Tasks Enter RCU Read-Side Critical Sections

When tasks B and C enter their RCU read-side critical sections, there is no immediate visible change in expedited grace period state. However, as we will see, the fact that these tasks are in RCU read-side critical sections will influence any future expedited grace period.

CPU 0 Initiates Expedited Grace Period

CPU 0's initiation of an expedited grace period will be illustrated in several steps. The first step is an invocation of synchronize_sched_expedited(), which forces each CPU to undergo a context switch, thus forcing each task currently running within an RCU read-side critical section to enqueue itself on one of the rcu_node structures, as shown below:

EGPExample02.png

Quick Quiz 1: Why put task C at the head of the list? That is just backwards!!!
Answer

CPU 0 now initializes the ->expmask fields of all non-leaf rcu_node structures, resulting in the following state:

EGPExample03.png

Quick Quiz 2: Why don't the leaf rcu_node structures also have their ->expmask fields initialized?
Answer

Next, CPU 0 scans all the rcu_node structures, pointing the ->exp_tasks field to the head of any non-empty ->blkd_tasks lists, as shown below.

EGPExample04.png

As can be seen from this diagram, tasks A, B, and C are blocking the expedited grace period.

Quick Quiz 3: Given that it was not running at the time that the expedited grace period started, why is task A blocking the expedited grace period?
Answer

Because all of the leaf rcu_node structures have at least one task blocking the expedited grace period, the root rcu_node structure's two ->expmask bits both remain set. CPU 0 now blocks waiting for the expedited grace period to complete.

Task D Blocks Within An RCU Read-Side Critical Section

When task D blocks within an RCU read-side critical section while running on CPU 2, it is queued on the rightmost rcu_node structure, as shown below:

EGPExample05.png

Because task D started its RCU read-side critical section after the expedited grace period started, it does not block the expedited grace period.

Task B Exits Its RCU Read-Side Critical Section

When task B exits its RCU read-side critical section, it will remove itself from its ->blkd_tasks list. Because that list then has no more tasks blocking the current expedited grace period, the corresponding bit in the root rcu_node structure's ->expmask field is cleared, as shown below:

EGPExample06.png

Now tasks A and C are the only ones left blocking the expedited grace period.

Task A Exits Its RCU Read-Side Critical Section

When task A exits its RCU read-side critical section, it will also remove itself from its ->blkd_tasks list. However, this list still contains another task blocking the current expedited grace period, so no further action is taken. The state is then as shown below:

EGPExample07.png

Now only task C blocks the expedited grace period.

Task C Exits Its RCU Read-Side Critical Section

When task C exits its RCU read-side critical section, it too will remove itself from its ->blkd_tasks list. Because that list then has no more tasks blocking the current expedited grace period, the corresponding bit in the root rcu_node structure's ->expmask field is cleared, as shown below:

EGPExample08.png

The expedited grace period has now completed.

Given this background, we are now ready to take a look at the code.

RCU Expedited Grace-Period Implementation

The RCU-sched and RCU-bh implementations, being handled by the same code, are described in the following section. The RCU-preempt implementation is described in the section after that.

RCU-sched and RCU-bh Implementation

The RCU-sched and RCU-bh implementations are provided by the synchronize_sched_expedited_cpu_stop() and synchronize_sched_expedited() functions shown below:

  1 static atomic_t sync_sched_expedited_started = ATOMIC_INIT(0);
  2 static atomic_t sync_sched_expedited_done = ATOMIC_INIT(0);
  3 
  4 static int synchronize_sched_expedited_cpu_stop(void *data)
  5 {
  6   smp_mb();
  7   return 0;
  8 }
  9 
 10 void synchronize_sched_expedited(void)
 11 {
 12   int firstsnap, s, snap, trycount = 0;
 13 
 14   firstsnap = snap = atomic_inc_return(&sync_sched_expedited_started);
 15   get_online_cpus();
 16   while (try_stop_cpus(cpu_online_mask,
 17            synchronize_sched_expedited_cpu_stop,
 18            NULL) == -EAGAIN) {
 19     put_online_cpus();
 20     if (trycount++ < 10)
 21       udelay(trycount * num_online_cpus());
 22     else {
 23       synchronize_sched();
 24       return;
 25     }
 26     s = atomic_read(&sync_sched_expedited_done);
 27     if (UINT_CMP_GE((unsigned)s, (unsigned)firstsnap)) {
 28       smp_mb();
 29       return;
 30     }
 31     get_online_cpus();
 32     snap = atomic_read(&sync_sched_expedited_started);
 33     smp_mb();
 34   }
 35   do {
 36     s = atomic_read(&sync_sched_expedited_done);
 37     if (UINT_CMP_GE((unsigned)s, (unsigned)snap)) {
 38       smp_mb();
 39       break;
 40     }
 41   } while (atomic_cmpxchg(&sync_sched_expedited_done, s, snap) != s);
 42   put_online_cpus();
 43 }

The sync_sched_expedited_started and sync_sched_expedited_done variables defined on lines 1 and 2 act somewhat like a ticket lock. They allow a given synchronize_sched_expedited() call to determine whether it can rely on a concurrent call to synchronize_sched_expedited() having done its work for it: if sync_sched_expedited_done has caught up with the snapshot of sync_sched_expedited_started taken when this call began, then some other call has completed a stop-CPUs operation that began after this call did, so no further work is needed.

The synchronize_sched_expedited_cpu_stop() function shown on lines 4-8 is invoked in stop-CPU context, and simply does a memory barrier.

Quick Quiz 4: Given that the scheduler has done a full context switch in order to allow the stop-CPU context to start running, why bother with the memory barrier?
Answer

The synchronize_sched_expedited() function shown on lines 10-43 does RCU-sched expedited grace periods, which also serves for the RCU-bh expedited grace periods. Line 14 atomically increments the sync_sched_expedited_started counter, returning the new value, which will be used later to determine if some other task did our work for us. Line 15 holds off CPU-hotplug operations for the duration of the expedited grace period.

Each pass through the loop spanning lines 16-34 attempts to do a stop-CPUs operation, which is used here solely because it forces a context switch on every CPU, thus forcing both an RCU-sched and an RCU-bh grace period. Lines 16-18 invoke try_stop_cpus(), exiting the loop if this succeeds. Otherwise, the loop body is executed. Line 19 re-enables CPU-hotplug operations. Line 20 increments the number of attempts, and if there have not been too many, line 21 delays to avoid the memory contention that might otherwise occur in the case of multiple concurrent calls to synchronize_sched_expedited(). Otherwise, we have spent too much time trying to expedite a grace period, so line 23 simply invokes synchronize_sched() and line 24 returns. Line 26 reads the sync_sched_expedited_done counter and line 27 checks to see if some concurrent execution of synchronize_sched_expedited() ran after we started, in which case our work has been done for us, so line 28 executes a memory barrier to ensure that the caller's later actions happen after the expedited grace period, and line 29 returns. We reach line 31 when we need to retry try_stop_cpus(). Line 31 holds off CPU-hotplug operations, line 32 gets a new snapshot of sync_sched_expedited_started, and line 33 ensures that this snapshot happens before the try_stop_cpus() call that will be executed on the next pass through the loop.
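
The UINT_CMP_GE() macro used on lines 27 and 37 is not a simple >= comparison: because the counters can wrap, it instead performs a modular comparison along the lines of the following definition. The subtraction is carried out in unsigned arithmetic, so if a has not yet caught up with b, the result wraps to a large value and the comparison fails:

  /* Assumed definition: does counter "a" equal or follow counter "b"? */
  #define UINT_CMP_GE(a, b)  (UINT_MAX / 2 >= (a) - (b))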

If try_stop_cpus() ever succeeds, we exit the loop, thus starting the atomic_cmpxchg() loop spanning lines 35-41. Line 36 picks up the current value of sync_sched_expedited_done, and then line 37 checks to see if this counter has already caught up with our most recent snapshot of sync_sched_expedited_started, and, if so, line 38 executes a memory barrier to ensure that the caller's subsequent actions are seen by all to occur after the expedited grace period, and line 39 exits the loop. Otherwise, line 41 updates sync_sched_expedited_done to our last snapshot of sync_sched_expedited_started, but only if no other CPU has updated it in the meantime.

Upon exit from the atomic_cmpxchg() loop, line 42 re-enables CPU-hotplug operations.

Quick Quiz 5: Wouldn't it be a lot simpler to call stop_cpus() instead of dealing with failure from try_stop_cpus()?
Answer

RCU-preempt Implementation

The RCU-preempt implementation is built on top of the synchronize_sched_expedited() implementation described in the previous section, using the sync_rcu_preempt_exp_done(), rcu_report_exp_rnp(), and sync_rcu_preempt_exp_init() functions shown below, along with the synchronize_rcu_expedited() function discussed later.

  1 static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
  2 {
  3   return !rcu_preempted_readers_exp(rnp) &&
  4          ACCESS_ONCE(rnp->expmask) == 0;
  5 }
  6 
  7 static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
  8 {
  9   unsigned long flags;
 10   unsigned long mask;
 11 
 12   raw_spin_lock_irqsave(&rnp->lock, flags);
 13   for (;;) {
 14     if (!sync_rcu_preempt_exp_done(rnp)) {
 15       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 16       break;
 17     }
 18     if (rnp->parent == NULL) {
 19       raw_spin_unlock_irqrestore(&rnp->lock, flags);
 20       wake_up(&sync_rcu_preempt_exp_wq);
 21       break;
 22     }
 23     mask = rnp->grpmask;
 24     raw_spin_unlock(&rnp->lock);
 25     rnp = rnp->parent;
 26     raw_spin_lock(&rnp->lock);
 27     rnp->expmask &= ~mask;
 28   }
 29 }
 30 
 31 static void
 32 sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
 33 {
 34   unsigned long flags;
 35   int must_wait = 0;
 36 
 37   raw_spin_lock_irqsave(&rnp->lock, flags);
 38   if (list_empty(&rnp->blkd_tasks))
 39     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 40   else {
 41     rnp->exp_tasks = rnp->blkd_tasks.next;
 42     rcu_initiate_boost(rnp, flags);
 43     must_wait = 1;
 44   }
 45   if (!must_wait)
 46     rcu_report_exp_rnp(rsp, rnp);
 47 }

The sync_rcu_preempt_exp_done() function shown on lines 1-5 checks to see if all of the specified rcu_node structure's readers blocking the current RCU-preempt expedited grace period have exited their RCU read-side critical sections. Line 3 checks to see whether all readers blocking the current expedited grace period queued directly on this rcu_node structure have finished, while line 4 carries out the same check for all rcu_node structures subordinate to the one specified.
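
The rcu_preempted_readers_exp() helper invoked on line 3 is not shown in the listing; it simply checks whether the specified rcu_node structure's ->exp_tasks pointer is non-NULL, roughly as follows:

  /* Does this rcu_node structure have readers blocking the expedited GP? */
  static int rcu_preempted_readers_exp(struct rcu_node *rnp)
  {
    return rnp->exp_tasks != NULL;
  }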

Quick Quiz 6: How could any task possibly be queued on other than a leaf rcu_node structure?
Answer

The rcu_report_exp_rnp() function shown on lines 7-29 propagates exits from RCU read-side critical sections up the rcu_node tree. Line 12 acquires the specified rcu_node structure's ->lock. Each pass through the loop spanning lines 13-28 handles one level of the rcu_node tree. Line 14 checks to see if all tasks associated with the current rcu_node structure or one of its descendants have completed, and if not, line 15 releases the rcu_node structure's ->lock and line 16 exits the loop. Otherwise, line 18 checks to see if this rcu_node structure has a parent, and if not, line 19 releases the rcu_node structure's ->lock, line 20 wakes up the task that initiated the expedited grace period, and line 21 exits the loop. Otherwise, it is necessary to propagate up the tree. Line 23 records the current rcu_node structure's bit position in its parent's ->expmask field and line 24 releases the current rcu_node structure's ->lock. Line 25 moves up to the parent and line 26 acquires its ->lock. Finally, line 27 clears the child rcu_node structure's bit in the parent's ->expmask field, followed by another pass through the loop.

Quick Quiz 7: If control reaches line 19 of rcu_report_exp_rnp(), how do we know that the expedited grace period really is completed?
Answer

The sync_rcu_preempt_exp_init() function shown on lines 31-47 initializes the specified rcu_node structure for a new expedited grace period. Line 37 acquires the rcu_node structure's ->lock. Line 38 then checks to see if there are any tasks blocked on this rcu_node structure, and if not, line 39 releases the ->lock. Otherwise, line 41 points the rcu_node structure's ->exp_tasks pointer to the first blocked task in the list, line 42 initiates RCU priority boosting for kernels supporting this notion, and line 43 records the fact that the expedited grace period will have to wait on this rcu_node structure. Either way, line 45 checks whether the expedited grace period must wait on this rcu_node structure, and, if not, line 46 reports that fact up the rcu_node tree.

Quick Quiz 8: What happens if there are no tasks blocking the current expedited grace period? Won't that result in the wake_up() happening before the initiating task blocks, in turn resulting in a hang?
Answer

Now on to the synchronize_rcu_expedited() function, along with its data variables, all shown below:

  1 static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
  2 static long sync_rcu_preempt_exp_count;
  3 static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
  4 
  5 void synchronize_rcu_expedited(void)
  6 {
  7   unsigned long flags;
  8   struct rcu_node *rnp;
  9   struct rcu_state *rsp = &rcu_preempt_state;
 10   long snap;
 11   int trycount = 0;
 12 
 13   smp_mb();
 14   snap = ACCESS_ONCE(sync_rcu_preempt_exp_count) + 1;
 15   smp_mb();
 16   while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
 17     if (trycount++ < 10)
 18       udelay(trycount * num_online_cpus());
 19     else {
 20       synchronize_rcu();
 21       return;
 22     }
 23     if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
 24       goto mb_ret;
 25   }
 26   if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
 27     goto unlock_mb_ret;
 28   synchronize_sched_expedited();
 29   raw_spin_lock_irqsave(&rsp->onofflock, flags);
 30   rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
 31     raw_spin_lock(&rnp->lock);
 32     rnp->expmask = rnp->qsmaskinit;
 33     raw_spin_unlock(&rnp->lock);
 34   }
 35   rcu_for_each_leaf_node(rsp, rnp)
 36     sync_rcu_preempt_exp_init(rsp, rnp);
 37   if (NUM_RCU_NODES > 1)
 38     sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
 39   raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
 40   rnp = rcu_get_root(rsp);
 41   wait_event(sync_rcu_preempt_exp_wq,
 42        sync_rcu_preempt_exp_done(rnp));
 43   smp_mb();
 44   ACCESS_ONCE(sync_rcu_preempt_exp_count)++;
 45 unlock_mb_ret:
 46   mutex_unlock(&sync_rcu_preempt_exp_mutex);
 47 mb_ret:
 48   smp_mb();
 49 }

Line 1 shows the sync_rcu_preempt_exp_wq wait queue on which synchronize_rcu_expedited() blocks when needed, line 2 shows the sync_rcu_preempt_exp_count counter that enables concurrent calls to synchronize_rcu_expedited() to share a single expedited grace period, and line 3 defines the sync_rcu_preempt_exp_mutex used for mutual exclusion.

Lines 13-15 take a snapshot of sync_rcu_preempt_exp_count, with line 14 doing the actual sampling and lines 13 and 15 supplying memory-barrier ordering, so that we can later determine whether someone else has done our work for us. Each pass through the loop spanning lines 16-25 makes an attempt to acquire sync_rcu_preempt_exp_mutex. If this attempt fails, line 17 checks to see whether the number of attempts has been excessive; if not, line 18 delays for a short time, and otherwise line 20 falls back to a normal synchronize_rcu() grace period and line 21 returns.

Line 23 checks to see if a full expedited grace period has elapsed since we started, and if so, line 24 branches to the cleanup-and-exit code, piggybacking on this other expedited grace period.

Quick Quiz 9: But this check cannot succeed unless sync_rcu_preempt_exp_count has been incremented twice since we first sampled it on line 14 of synchronize_rcu_expedited. Since each expedited grace period increments this counter only once, this means that two expedited grace periods have completed during this interval. So why shouldn't the comparison on line 23 be for greater-than-or-equal rather than strictly greater-than?
Answer

Once the loop exits, execution reaches line 26 with sync_rcu_preempt_exp_mutex held. Line 26 performs the same check as did line 23, and if some other expedited grace period started after we did and has already completed, then line 27 goes to clean up and exit. Line 28 invokes synchronize_sched_expedited(), which has the side effect of forcing each task currently running within an RCU-preempt read-side critical section to enqueue itself on one of the leaf rcu_node structures' ->blkd_tasks lists. Once this is complete, it is only necessary to wait for each queued task to dequeue itself.

Quick Quiz 10: Suppose that there is a continual stream of new tasks blocking within RCU-preempt read-side critical sections. Won't that prevent the expedited grace period from ever completing?
Answer

Line 29 acquires the rcu_state structure's ->onofflock, holding off changes to RCU's idea of which CPUs are online until line 39, where this lock is released. Lines 30-34 set up all of the non-leaf rcu_node structures to wait for all queued tasks to complete by setting each ->expmask field to the corresponding ->qsmaskinit field under the protection of the corresponding ->lock. Lines 35-38 invoke sync_rcu_preempt_exp_init() on each leaf rcu_node structure and, if there is more than one rcu_node structure, on the root, thereby recording which portions of the rcu_node tree contain queued tasks that block the current expedited grace period.

Line 40 obtains a pointer to the root rcu_node structure so that lines 41 and 42 can wait for all queued tasks to exit their RCU read-side critical sections. Line 43 executes a memory barrier to ensure that the expedited grace-period computations are seen to precede incrementing of sync_rcu_preempt_exp_count on line 44. Line 46 releases sync_rcu_preempt_exp_mutex and line 48 executes a memory barrier to ensure that accesses to sync_rcu_preempt_exp_count are seen to happen before any actions that the caller might take after return from synchronize_rcu_expedited().

Summary

And that is the long and the short of expedited grace periods. Alert readers who are familiar with some of the Linux kernel facilities used by the expedited primitives might have noticed some potential scalability issues on systems with extremely large numbers of CPUs. If current multicore trends continue, this issue will likely need attention at some point.

Acknowledgments

I am grateful to @@@ for their help in rendering this article human readable.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Why put task C at the head of the list? That is just backwards!!!

Answer: The reason will become apparent later in this example when task D blocks.

Back to Quick Quiz 1.

Quick Quiz 2: Why don't the leaf rcu_node structures also have their ->expmask fields initialized?

Answer: Because there cannot be any tasks queued below the leaf rcu_node structures, so there is no need for the leaf rcu_node structures to track anything. The reason for this asymmetry is that we use the same rcu_node tree that is used by the normal grace periods, which do need the leaf rcu_node structures to track per-CPU status.

Back to Quick Quiz 2.

Quick Quiz 3: Given that it was not running at the time that the expedited grace period started, why is task A blocking the expedited grace period?

Answer: Because it entered an RCU read-side critical section before the expedited grace period started, and it remains in that critical section. Therefore, it must by definition block the expedited grace period, as well as any later non-expedited grace period, for that matter.

Back to Quick Quiz 3.

Quick Quiz 4: Given that the scheduler has done a full context switch in order to allow the stop-CPU context to start running, why bother with the memory barrier?

Answer: Pure paranoia. Just in case someone comes up with a hyper-optimized code path through the scheduler...

Back to Quick Quiz 4.

Quick Quiz 5: Wouldn't it be a lot simpler to call stop_cpus() instead of dealing with failure from try_stop_cpus()?

Answer: Ah, but the alternative is massive contention on the stop_cpus_mutex that is unconditionally acquired by stop_cpus(). Such contention would be a very bad idea on systems with large numbers of CPUs. In addition, using stop_cpus() would prevent a single stop-CPU operation from benefiting an arbitrarily large number of concurrent synchronize_sched_expedited() invocations.

Back to Quick Quiz 5.

Quick Quiz 6: How could any task possibly be queued on other than a leaf rcu_node structure?

Answer: If a task is queued on a given leaf rcu_node structure, but then all CPUs corresponding to that rcu_node structure go offline, that task will be moved to the root rcu_node structure.

Back to Quick Quiz 6.

Quick Quiz 7: If control reaches line 19 of rcu_report_exp_rnp(), how do we know that the expedited grace period really is completed?

Answer: We reach line 19 if we are at the root rcu_node structure and if there are no tasks blocking the current expedited grace period on this rcu_node structure or on any subordinate rcu_node structure. This means that there are no longer any tasks blocking the current expedited grace period, so it is by definition done.

Back to Quick Quiz 7.

Quick Quiz 8: What happens if there are no tasks blocking the current expedited grace period? Won't that result in the wake_up() happening before the initiating task blocks, in turn resulting in a hang?

Answer: The wake_up() might well happen before the initiating task blocks, but this cannot result in a hang. The race conditions are resolved by use of wait_event().

Back to Quick Quiz 8.

Quick Quiz 9: But this check cannot succeed unless sync_rcu_preempt_exp_count has been incremented twice since we first sampled it on line 14 of synchronize_rcu_expedited. Since each expedited grace period increments this counter only once, this means that two expedited grace periods have completed during this interval. So why shouldn't the comparison on line 23 be for greater-than-or-equal rather than strictly greater-than?

Answer: Suppose that the comparison was greater-than-or-equal. Then the following sequence of events could occur:

  1. Task A starts an expedited grace period, which reaches line 43 of synchronize_rcu_expedited().
  2. Task B starts an RCU read-side critical section.
  3. Task C starts an expedited grace period, and on line 14 of synchronize_rcu_expedited() sees sync_rcu_preempt_exp_count equal to (say) 5. It therefore sets local variable snap to 6.
  4. Task A completes its execution of synchronize_rcu_expedited(), incrementing sync_rcu_preempt_exp_count.
  5. Task C acquires sync_rcu_preempt_exp_mutex and sees that the value of sync_rcu_preempt_exp_count is now 6. It therefore immediately exits synchronize_rcu_expedited() despite Task B still being in a pre-existing RCU read-side critical section.

Arbitrarily bad breakage then ensues. We therefore require that sync_rcu_preempt_exp_count change twice, which guarantees that a full expedited grace period will have completed.

Back to Quick Quiz 9.

Quick Quiz 10: Suppose that there is a continual stream of new tasks blocking within RCU-preempt read-side critical sections. Won't that prevent the expedited grace period from ever completing?

Answer: No, because the expedited grace period only waits on tasks that are already enqueued. It does not wait on tasks that enqueue themselves later.

Back to Quick Quiz 10.