August 21, 2011

This article was contributed by Paul E. McKenney

Introduction

  1. RCU Updater Overview
  2. RCU Updater API
  3. RCU Updater Design
  4. RCU Updater Implementation

As always, at the end come answers to the quick quizzes.

RCU Updater Overview

They also serve who stand and wait, and the service provided by RCU to updaters is to wait (standing or not) for all pre-existing readers. This fits the general RCU updater pattern, with removal from a linked list being the most familiar example (a code sketch follows the list):

  1. Remove a data element from the list, but without modifying the element itself.
  2. Wait for all pre-existing readers.
  3. Free the element previously removed, which is safe because there can no longer be any readers referencing it.
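
For concreteness, the following minimal sketch shows this pattern in kernel C. The foo structure, foo_del() function, and update-side foo_lock are hypothetical; list_del_rcu(), synchronize_rcu(), and kfree() are the usual kernel primitives:

  1 struct foo {
  2   struct list_head list;
  3   int key;
  4 };
  5 
  6 /* Caller must hold the update-side foo_lock. */
  7 void foo_del(struct foo *fp)
  8 {
  9   list_del_rcu(&fp->list); /* 1. Unlink; the element itself is unchanged. */
 10   synchronize_rcu();       /* 2. Wait for all pre-existing readers. */
 11   kfree(fp);               /* 3. Free; no reader can still reference it. */
 12 }

Readers traversing the list under rcu_read_lock() either see the element before step 1 or miss it entirely, so that by the time step 3 executes, any reader that could have obtained a reference has completed.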

RCU implements waiting for readers using callbacks, which RCU manages and eventually invokes. The job of RCU's update-side primitives is therefore to register callbacks, and, for the synchronous primitives, to wait for them to be invoked. RCU also provides the ability to wait for all pre-existing callbacks, as well as for pre-existing readers.

This article covers normal grace periods, which are implemented with an eye towards amortizing overhead over many RCU updaters.

RCU Updater API

RCU's grace-period API includes the following primitives:

  1. Synchronous: synchronize_rcu(), synchronize_rcu_bh(), and synchronize_sched().
  2. Asynchronous: call_rcu(), call_rcu_bh(), and call_rcu_sched().
  3. Deferred free: kfree_rcu().
  4. Barrier: rcu_barrier(), rcu_barrier_bh(), and rcu_barrier_sched().
  5. Expedited: synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), and synchronize_sched_expedited().

The remainder of this article looks at RCU updaters, first from a design viewpoint and then from an implementation viewpoint.

RCU Updater Design

The CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU implementations are based on RCU callbacks. This means that all non-expedited APIs, even those based on blocking, operate by registering RCU callbacks. For example, synchronize_rcu() and friends register an RCU callback that does a wakeup, then sleep waiting for the wakeup. The rcu_barrier() group of APIs goes one step further, registering an RCU callback on each CPU and then waiting for all of them to complete.

In contrast, the expedited APIs are specially crafted to reduce update-side latency. Their design and implementation is described in a separate article in this series.

Quick Quiz 1: Why not implement the expedited APIs in terms of callbacks as well?
Answer

RCU Updater Implementation

The following sections cover the implementation of the synchronous, asynchronous, deferred-free, barrier, and expedited APIs.

Synchronous Implementation

The synchronous wait-for-RCU-reader functions are synchronize_rcu(), synchronize_rcu_bh(), and synchronize_sched(), all of which are implemented in terms of the corresponding asynchronous functions, namely call_rcu(), call_rcu_bh(), and call_rcu_sched(), respectively. The time required to wait for all pre-existing RCU readers is called a grace period, and the different RCU flavors have independent grace periods. This implementation is carried out through use of the wakeme_after_rcu(), wait_rcu_gp(), and rcu_blocking_is_gp() functions, which are shown below:

  1 struct rcu_synchronize {
  2   struct rcu_head head;
  3   struct completion completion;
  4 };
  5 
  6 static void wakeme_after_rcu(struct rcu_head *head)
  7 {
  8   struct rcu_synchronize *rcu;
  9 
 10   rcu = container_of(head, struct rcu_synchronize, head);
 11   complete(&rcu->completion);
 12 }
 13 
 14 void wait_rcu_gp(call_rcu_func_t crf)
 15 {
 16   struct rcu_synchronize rcu;
 17 
 18   init_rcu_head_on_stack(&rcu.head);
 19   init_completion(&rcu.completion);
 20   crf(&rcu.head, wakeme_after_rcu);
 21   wait_for_completion(&rcu.completion);
 22   destroy_rcu_head_on_stack(&rcu.head);
 23 }
 24 
 25 static inline int rcu_blocking_is_gp(void)
 26 {
 27   return num_online_cpus() == 1;
 28 }

The rcu_synchronize structure (lines 1-4) is used to mediate wakeups. It contains an rcu_head structure for use by call_rcu(), call_rcu_bh(), and call_rcu_sched(), along with a completion structure that is used to wake up the task that is waiting on an RCU grace period.

The wakeme_after_rcu() helper function is shown on lines 6-12, and is an RCU callback function that is invoked at the end of an RCU grace period. This function is not invoked directly; it is instead passed to call_rcu(), call_rcu_bh(), or call_rcu_sched(). Given that it is an RCU callback function, it takes a pointer to an rcu_head structure as its sole argument. This rcu_head structure must be part of an rcu_synchronize structure.

This function is quite straightforward: line 10 obtains a pointer to the enclosing rcu_synchronize structure, and line 11 invokes the complete() function to wake up the task that was waiting for an RCU grace period to complete.

The wait_rcu_gp() helper function is shown on lines 14-23. It registers an RCU callback so as to wake up at the end of a subsequent grace period, using the flavor of RCU corresponding to the member of the call_rcu() family of functions passed in via the crf argument. Line 16 declares a local rcu_synchronize structure that mediates the wakeup process. Line 18 informs the debug-objects system that we are using an on-stack RCU callback (namely the rcu_head structure within the rcu_synchronize structure declared on line 16) for debugging purposes, and, similarly, line 22 informs the debug-objects system that we have finished with this callback. Line 19 initializes the completion structure within this same rcu_synchronize structure, and line 20 passes a pointer to its rcu_head structure to the specified function (for example, synchronize_rcu() passes call_rcu() to wait_rcu_gp()). Line 21 then waits for this RCU callback to be invoked.

Quick Quiz 2: wait_rcu_gp() is also used by CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU, but not for things like synchronize_rcu(). What gives?
Answer

The rcu_blocking_is_gp() function (lines 25-28) returns true if there is only one online CPU in the system. This is important for those flavors of RCU where a context switch is a quiescent state: if there is only one CPU, and that CPU is willing to block waiting for a grace period, then that willingness to block automatically constitutes a grace period, as the code below shows.

Quick Quiz 3: What if the CPU is executing within an RCU read-side critical section when it invokes rcu_blocking_is_gp()?
Answer

The following code shows how the synchronize_rcu(), synchronize_rcu_bh(), and synchronize_sched() functions are implemented in terms of wait_rcu_gp() and in terms of each other.

  1 #ifdef CONFIG_TREE_RCU
  2 
  3 static inline void synchronize_rcu(void)
  4 {
  5   synchronize_sched();
  6 }
  7 
  8 #else /* #ifdef CONFIG_TREE_RCU */
  9 
 10 void synchronize_rcu(void)
 11 {
 12   if (!rcu_scheduler_active)
 13     return;
 14   wait_rcu_gp(call_rcu);
 15 }
 16 
 17 #endif /* else #ifdef CONFIG_TREE_RCU */
 18 
 19 void synchronize_rcu_bh(void)
 20 {
 21   if (rcu_blocking_is_gp())
 22     return;
 23   wait_rcu_gp(call_rcu_bh);
 24 }
 25 
 26 void synchronize_sched(void)
 27 {
 28   if (rcu_blocking_is_gp())
 29     return;
 30   wait_rcu_gp(call_rcu_sched);
 31 }

There are two different implementations of synchronize_rcu(), one for CONFIG_TREE_RCU and the other for CONFIG_TREE_PREEMPT_RCU. The CONFIG_TREE_RCU implementation on lines 3-6 simply invokes synchronize_sched(). In contrast, the CONFIG_TREE_PREEMPT_RCU implementation on lines 10-15 actually does some real work. Line 12 checks to see if the scheduler is running, and, if not, line 13 simply does a short-circuit return. The reason for this short-circuit return is that tasks cannot be preempted before the scheduler is running, which in turn means that they cannot be preempted within RCU read-side critical sections. Therefore, if the boot-time task is willing to context switch, it cannot legally be in an RCU read-side critical section. On the other hand, if the system is booted up to the point that the scheduler is running, line 14 passes call_rcu() to wait_rcu_gp(), which has the effect of blocking until a subsequent preemptible RCU grace period completes.

Lines 19-24 show the implementation of synchronize_rcu_bh(), which waits for a subsequent RCU-bh grace period to complete. (Recall that RCU-bh's quiescent states occur in any code where bottom halves are enabled.) Line 21 checks to see if blocking constitutes a grace period, and takes a short-circuit exit if so (recall that it is illegal to block in code where bottom halves are disabled). On the other hand, if blocking does not constitute a grace period (for example, if multiple CPUs are online), then line 23 passes call_rcu_bh() to wait_rcu_gp(), which again has the effect of blocking until a subsequent RCU-bh grace period completes.

Lines 26-31 show the implementation of synchronize_sched(), which differs from that of synchronize_rcu_bh() only in passing call_rcu_sched() rather than call_rcu_bh() to wait_rcu_gp().

Having dealt with the synchronous implementation, it is now time to turn to the asynchronous implementation.

Asynchronous Implementation

The asynchronous implementation is a straightforward set of wrappers as follows:

  1 #ifdef CONFIG_TREE_RCU
  2 
  3 #define call_rcu call_rcu_sched
  4 
  5 #else /* #ifdef CONFIG_TREE_RCU */
  6 
  7 void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
  8 {
  9   __call_rcu(head, func, &rcu_preempt_state);
 10 }
 11 
 12 #endif /* else #ifdef CONFIG_TREE_RCU */
 13 
 14 void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
 15 {
 16   __call_rcu(head, func, &rcu_bh_state);
 17 }
 18 
 19 void call_rcu_sched(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
 20 {
 21   __call_rcu(head, func, &rcu_sched_state);
 22 }

Similar to the synchronous implementation, for CONFIG_TREE_RCU, call_rcu() maps to call_rcu_sched() as shown on line 3. In all other cases, the asynchronous implementation is a straightforward wrapper for __call_rcu().
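
As a usage illustration, an updater that cannot afford to block may unlink an element and leave the cleanup to the callback. The following minimal sketch reuses the hypothetical foo structure from the overview, now containing an rcu_head structure named rcu, along with a hypothetical foo_stats_fold() cleanup function:

  1 struct foo {
  2   struct list_head list;
  3   int key;
  4   struct rcu_head rcu;
  5 };
  6 
  7 static void foo_reclaim(struct rcu_head *head)
  8 {
  9   struct foo *fp = container_of(head, struct foo, rcu);
 10 
 11   foo_stats_fold(fp); /* Hypothetical cleanup beyond a simple kfree(). */
 12   kfree(fp);
 13 }
 14 
 15 /* Caller must hold the update-side foo_lock. */
 16 void foo_del_async(struct foo *fp)
 17 {
 18   list_del_rcu(&fp->list);
 19   call_rcu(&fp->rcu, foo_reclaim); /* Returns immediately. */
 20 }

Because call_rcu() never blocks, foo_del_async() may be invoked with spinlocks held or from softirq context, in contrast to the synchronous primitives.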

Deferred-Free Implementation

Many of the earlier uses of call_rcu() in the Linux kernel passed RCU callback functions that simply kfree()ed the enclosing data structure, for example:

  1 static void free_css_set_rcu(struct rcu_head *obj)
  2 {
  3   struct css_set *cg = container_of(obj, struct css_set, rcu_head);
  4   kfree(cg);
  5 }

This function was invoked as follows:

  1 call_rcu(&cg->rcu_head, free_css_set_rcu);

The kfree_rcu() function was created for this situation, allowing free_css_set_rcu() to be dispensed with entirely, and further allowing the call to call_rcu() to be replaced with the following:

  1 kfree_rcu(cg, rcu_head);

The implementation, which is shared with CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU, is as follows:

  1 static __always_inline bool __is_kfree_rcu_offset(unsigned long offset)
  2 {
  3   return offset < 4096;
  4 }
  5 
  6 static __always_inline
  7 void __kfree_rcu(struct rcu_head *head, unsigned long offset)
  8 {
  9   typedef void (*rcu_callback)(struct rcu_head *);
 10 
 11   BUILD_BUG_ON(!__builtin_constant_p(offset));
 12   BUILD_BUG_ON(!__is_kfree_rcu_offset(offset));
 13   call_rcu(head, (rcu_callback)offset);
 14 }
 15 
 16 #define kfree_rcu(ptr, rcu_head) \
 17   __kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))

Any structure passed to kfree_rcu() must contain an rcu_head structure, but the caller is not required to place the rcu_head structure at any particular offset within the enclosing RCU-protected structure. However, this rcu_head structure is the only place that RCU can safely store any information, and therefore RCU must track this structure via a pointer to its rcu_head structure, in a manner similar to the way the Linux kernel's lists are tracked using pointers to the list_head structure contained in each list element. This in turn means that the offset of the rcu_head structure within the enclosing RCU-protected data structure must be placed somewhere in that rcu_head structure.

The only field available is the rcu_head structure's ->func field, which normally contains a pointer to a function. This implementation therefore relies on the fact that a function in the Linux kernel is never loaded into the lower 4096 bytes of memory, so that we can store a number that is less than 4096 into the rcu_head structure's ->func field and distinguish it from a function pointer.
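
Given this encoding, the callback-invocation side can inspect the ->func field and either invoke it as a function or treat it as an offset. The following sketch is modeled loosely on the kernel's __rcu_reclaim() helper; the function name and details here are illustrative rather than a quotation of the actual source:

  1 static void rcu_invoke_one(struct rcu_head *head)
  2 {
  3   unsigned long offset = (unsigned long)head->func;
  4 
  5   if (__is_kfree_rcu_offset(offset))
  6     kfree((void *)head - offset); /* Recover the enclosing structure. */
  7   else
  8     head->func(head); /* Ordinary RCU callback. */
  9 }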

The __is_kfree_rcu_offset() helper function on lines 1-4 returns true if the argument is an offset and false if it is instead a pointer to a function.

The __kfree_rcu() helper function shown on lines 6-14 does most of the kfree_rcu() function's work. Line 11 emits a compiler error if the offset argument is not a compile-time constant, and line 12 emits a compiler error if the offset argument is too large for __is_kfree_rcu_offset() to distinguish from a pointer to a function. Finally, line 13 invokes call_rcu() with a pointer to the rcu_head structure and the offset cast to an RCU callback function type.

The kfree_rcu() macro itself is shown on lines 16 and 17. It takes as arguments a pointer to the enclosing RCU-protected structure and the name of the field containing that structure's rcu_head field. It then computes the pointer to the rcu_head structure and its offset within the enclosing RCU-protected structure, and passes these two quantities to __kfree_rcu().
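
For example, given the css_set usage shown earlier, the kfree_rcu(cg, rcu_head) invocation expands to roughly the following:

  1 __kfree_rcu(&((cg)->rcu_head), offsetof(typeof(*(cg)), rcu_head));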

Quick Quiz 4: Why not make a single kfree_rcu() function that does the work of both kfree_rcu() and __kfree_rcu()?
Answer

Barrier Implementation

RCU's barrier functions wait not just for all pre-existing RCU read-side critical sections, but also for all pre-existing RCU callbacks to be invoked. The mappings from the updater-visible APIs to the underlying _rcu_barrier() helper function are as follows:

  1 #ifdef CONFIG_TREE_RCU
  2 
  3 void rcu_barrier(void)
  4 {
  5   rcu_barrier_sched();
  6 }
  7 #else /* #ifdef CONFIG_TREE_RCU */
  8 
  9 void rcu_barrier(void)
 10 {
 11   _rcu_barrier(&rcu_preempt_state, call_rcu);
 12 }
 13 
 14 #endif /* else #ifdef CONFIG_TREE_RCU */
 15 
 16 void rcu_barrier_bh(void)
 17 {
 18   _rcu_barrier(&rcu_bh_state, call_rcu_bh);
 19 }
 20 EXPORT_SYMBOL_GPL(rcu_barrier_bh);
 21 
 22 void rcu_barrier_sched(void)
 23 {
 24   _rcu_barrier(&rcu_sched_state, call_rcu_sched);
 25 }

In what should by now be a familiar manner, CONFIG_TREE_RCU maps rcu_barrier() into rcu_barrier_sched(), and all other cases define trivial wrappers around the underlying _rcu_barrier() function, which is described in a separate article.
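
The classic use case for the barrier primitives is module unload: a module that queues callbacks with call_rcu() must wait for all of them to be invoked before its callback functions' code vanishes. A minimal sketch, assuming a hypothetical module whose elements are freed via the foo_reclaim() callback shown earlier:

  1 static void __exit foo_exit(void)
  2 {
  3   foo_del_all(); /* Hypothetical: unlink all elements, queuing callbacks. */
  4   rcu_barrier(); /* Wait for all queued foo_reclaim() invocations. */
  5 }

Note that synchronize_rcu() would not suffice here: it waits for pre-existing readers, but provides no guarantee that previously queued callbacks have been invoked.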

Expedited Implementation

The expedited implementation is quite straightforward:

  1 #ifdef CONFIG_TREE_RCU
  2 
  3 void synchronize_rcu_expedited(void)
  4 {
  5   synchronize_sched_expedited();
  6 }
  7 
  8 #endif
  9 
 10 static inline void synchronize_rcu_bh_expedited(void)
 11 {
 12   synchronize_sched_expedited();
 13 }

The synchronize_sched_expedited() function directly implements expedited grace periods for RCU-sched, and for CONFIG_TREE_PREEMPT_RCU, synchronize_rcu_expedited() directly implements expedited grace periods for preemptible RCU. Because any RCU-sched grace period is also an RCU-bh grace period, synchronize_rcu_bh_expedited() is a trivial wrapper for synchronize_sched_expedited().
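
As a usage note, the expedited primitives are drop-in replacements for their normal counterparts, trading increased CPU overhead for reduced grace-period latency. A hypothetical configuration-update path, where the config structure, global_config, and config_lock are all illustrative, might look as follows:

  1 void config_update(struct config *new)
  2 {
  3   struct config *old;
  4 
  5   spin_lock(&config_lock);
  6   old = rcu_dereference_protected(global_config,
  7                                   lockdep_is_held(&config_lock));
  8   rcu_assign_pointer(global_config, new);
  9   spin_unlock(&config_lock);
 10   synchronize_rcu_expedited(); /* Low latency, but disturbs all CPUs. */
 11   kfree(old);
 12 }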

Quick Quiz 5: Why is any grace period for RCU-sched also a grace period for RCU-bh?
Answer

Summary

This article has covered RCU's update-side logic, which contains the slower and more stately portions of RCU. The fastpaths are owned by RCU readers.

Acknowledgments

I owe thanks to Cheng Xu for his help in increasing the human readability of this article.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Why not implement the expedited APIs in terms of callbacks as well?

Answer: Suppose that an expedited API is called just after a grace period has started. If normal callbacks were to be used, it would be necessary to wait for the remainder of the just-started grace period and then to wait for a grace period after that. This is not what we mean by “expedited”. By handling expedited requests separately, we can avoid waiting for such additional partial grace periods.

Back to Quick Quiz 1.

Quick Quiz 2: wait_rcu_gp() is also used by CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU, but not for things like synchronize_rcu(). What gives?

Answer: This function is used by TINY_RCU for the barrier primitives. Same code, different environment, different use case. Sometimes you get lucky!

Back to Quick Quiz 2.

Quick Quiz 3: What if the CPU is executing within an RCU read-side critical section when it invokes rcu_blocking_is_gp()?

Answer: That would be illegal. The synchronous grace-period primitives may only be invoked from outside of an RCU read-side critical section.

Back to Quick Quiz 3.

Quick Quiz 4: Why not make a single kfree_rcu() function that does the work of both kfree_rcu() and __kfree_rcu()?

Answer: The separate kfree_rcu() macro is required because the C language does not permit struct fields to be passed to functions. It is possible to combine them, but only as a big ugly macro. By comparison, the current two-line macro and static inline function seem preferable.

Back to Quick Quiz 4.

Quick Quiz 5: Why is any grace period for RCU-sched also a grace period for RCU-bh?

Answer: Because RCU-sched's quiescent states are a subset of those for RCU-bh. Therefore, any set of quiescent states satisfying RCU-sched must by definition also satisfy RCU-bh.

Back to Quick Quiz 5.