August 21, 2011
This article was contributed by Paul E. McKenney
As always, at the end come answers to the quick quizzes.
They also serve who stand and wait, and the service provided by RCU to updaters is to wait (standing or not) for all pre-existing readers. This fits the general RCU updater pattern, with removal from a linked list being the most familiar example.
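A minimal sketch of this pattern follows; the foo structure is hypothetical, and updates are assumed to be serialized by a lock held by the caller:

struct foo {
        struct list_head list;
        int data;
};

/* Hypothetical updater: unlink an element, wait for pre-existing
 * readers, and only then free the element. */
void foo_del(struct foo *fp)
{
        list_del_rcu(&fp->list); /* New readers can no longer find fp. */
        synchronize_rcu();       /* Wait for pre-existing readers. */
        kfree(fp);               /* Now safe to free. */
}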
RCU implements waiting for readers using callbacks, which RCU manages and eventually invokes. The job of RCU's update-side primitives is therefore registering callbacks, and, for the synchronous primitives, waiting for them to be invoked. RCU provides the ability to wait for pre-existing callbacks as well as for pre-existing readers.
This article covers normal grace periods, which are implemented with an eye towards amortizing overhead over many RCU updaters.
RCU's grace-period API includes the following primitives:
- synchronize_rcu(), synchronize_rcu_bh(), and synchronize_sched().
- call_rcu(), call_rcu_bh(), and call_rcu_sched().
- kfree_rcu().
- rcu_barrier(), rcu_barrier_bh(), and rcu_barrier_sched().
- synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(), and synchronize_sched_expedited().
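The flavors differ in which read-side critical sections they wait for. The sketch below pairs each synchronous primitive with the readers it waits for; the reader functions themselves are hypothetical, but the read-side markers are the standard ones:

/* Illustrative readers, one per flavor (function names are hypothetical). */

void reader_rcu(void)       /* Waited for by synchronize_rcu(). */
{
        rcu_read_lock();
        /* ... access RCU-protected data via rcu_dereference() ... */
        rcu_read_unlock();
}

void reader_rcu_bh(void)    /* Waited for by synchronize_rcu_bh(). */
{
        rcu_read_lock_bh();
        /* ... */
        rcu_read_unlock_bh();
}

void reader_sched(void)     /* Waited for by synchronize_sched(). */
{
        preempt_disable();
        /* ... */
        preempt_enable();
}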
The remainder of this article looks at these RCU update-side APIs, first from a design viewpoint and then from an implementation viewpoint.
The CONFIG_TREE_RCU
and CONFIG_TREE_PREEMPT_RCU
implementations are based on RCU callbacks.
This means that all non-expedited APIs, even those based on blocking, operate
by registering RCU callbacks.
For example, synchronize_rcu() and friends register an RCU callback that does a wakeup, then sleep waiting for that wakeup.
The rcu_barrier()
group of APIs goes one step further,
registering an RCU callback on each CPU and then waiting for all of
them to complete.
In contrast, the expedited APIs are specially crafted to reduce update-side latency. Their design and implementation are described in a separate article in this series.
Quick Quiz 1: Why not implement the expedited APIs in terms of callbacks as well?
Answer
The following sections cover the implementation of the synchronous, asynchronous, deferred-free, barrier, and expedited APIs.
The synchronous wait-for-RCU-reader functions are synchronize_rcu(), synchronize_rcu_bh(), and synchronize_sched(), all of which are implemented in terms of the corresponding asynchronous functions, namely call_rcu(), call_rcu_bh(), and call_rcu_sched(), respectively.
The time required to wait for all pre-existing RCU readers is called a grace period, and different RCU flavors can have different, independent grace periods.
This implementation is carried out through use of the wakeme_after_rcu(), wait_rcu_gp(), and rcu_blocking_is_gp() functions, which are shown below:
 1 struct rcu_synchronize {
 2         struct rcu_head head;
 3         struct completion completion;
 4 };
 5
 6 static void wakeme_after_rcu(struct rcu_head *head)
 7 {
 8         struct rcu_synchronize *rcu;
 9
10         rcu = container_of(head, struct rcu_synchronize, head);
11         complete(&rcu->completion);
12 }
13
14 void wait_rcu_gp(call_rcu_func_t crf)
15 {
16         struct rcu_synchronize rcu;
17
18         init_rcu_head_on_stack(&rcu.head);
19         init_completion(&rcu.completion);
20         crf(&rcu.head, wakeme_after_rcu);
21         wait_for_completion(&rcu.completion);
22         destroy_rcu_head_on_stack(&rcu.head);
23 }
24
25 static inline int rcu_blocking_is_gp(void)
26 {
27         return num_online_cpus() == 1;
28 }
The rcu_synchronize structure is used to mediate wakeups. It contains an rcu_head structure for use by call_rcu(), call_rcu_bh(), and call_rcu_sched(), along with a completion structure that is used to wake up the task that is waiting on an RCU grace period.
The wakeme_after_rcu() helper function is shown on lines 6-12, and is an RCU callback function that is invoked at the end of an RCU grace period. This function is not invoked directly; it is instead passed to call_rcu(), call_rcu_bh(), or call_rcu_sched(). Given that it is an RCU callback function, it takes a pointer to an rcu_head structure as its sole argument, and this rcu_head structure must be part of an rcu_synchronize structure. The function itself is quite straightforward: line 10 obtains a pointer to the enclosing rcu_synchronize structure, and line 11 invokes the complete() function to wake up the task that was waiting for the RCU grace period to complete.
The wait_rcu_gp() helper function is shown on lines 14-23. It registers an RCU callback so as to be awakened at the end of a subsequent grace period, using the flavor of RCU corresponding to the member of the call_rcu() family of functions passed in via the crf argument. Line 16 declares a local rcu_synchronize structure that mediates the wakeup process. Line 18 informs the debug-objects system that we are using the on-stack RCU callback (namely the rcu_head structure within the rcu_synchronize structure declared on line 16) for debugging purposes, and, similarly, line 22 informs the debug-objects system that we have finished with this callback. Line 19 initializes the completion structure within this same rcu_synchronize structure, and line 20 passes a pointer to its rcu_head structure to the function specified by crf (for example, synchronize_rcu() passes call_rcu() to wait_rcu_gp()). Line 21 then waits for this RCU callback to be invoked.
Quick Quiz 2: But wait_rcu_gp() is also used by CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU, but not for things like synchronize_rcu(). What gives?
Answer
The rcu_blocking_is_gp() function returns true if there is only one online CPU in the system. This is important for those flavors of RCU where a context switch is a quiescent state: if there is only one CPU, and that CPU is willing to block waiting for a grace period, then that willingness to block automatically constitutes a grace period, as shown below.
Quick Quiz 3: What if the CPU is executing within an RCU read-side critical section when it invokes rcu_blocking_is_gp()?
Answer
The following code shows how the synchronize_rcu(), synchronize_rcu_bh(), and synchronize_sched() functions are implemented in terms of wait_rcu_gp() and in terms of each other.
 1 #ifdef CONFIG_TREE_RCU
 2
 3 static inline void synchronize_rcu(void)
 4 {
 5         synchronize_sched();
 6 }
 7
 8 #else /* #ifdef CONFIG_TREE_RCU */
 9
10 void synchronize_rcu(void)
11 {
12         if (!rcu_scheduler_active)
13                 return;
14         wait_rcu_gp(call_rcu);
15 }
16
17 #endif /* else #ifdef CONFIG_TREE_RCU */
18
19 void synchronize_rcu_bh(void)
20 {
21         if (rcu_blocking_is_gp())
22                 return;
23         wait_rcu_gp(call_rcu_bh);
24 }
25
26 void synchronize_sched(void)
27 {
28         if (rcu_blocking_is_gp())
29                 return;
30         wait_rcu_gp(call_rcu_sched);
31 }
There are two different implementations of synchronize_rcu(), one for CONFIG_TREE_RCU and the other for CONFIG_TREE_PREEMPT_RCU. The CONFIG_TREE_RCU implementation on lines 3-6 simply invokes synchronize_sched().
In contrast, the CONFIG_TREE_PREEMPT_RCU
implementation
on lines 10-15 actually does some real work.
Line 12 checks to see if the scheduler is running, and, if not,
line 13 simply does a short-circuit return.
The reason for this short-circuit return is that tasks cannot be
preempted before the scheduler is running, which in turn means that
they cannot be preempted within RCU read-side critical sections.
Therefore, if the boot-time task is willing to context switch, it
cannot legally be in an RCU read-side critical section.
On the other hand, if the system is booted up to the point that the
scheduler is running, line 14 passes call_rcu()
to wait_rcu_gp()
, which has the effect of blocking
until a subsequent preemptible RCU grace period completes.
Lines 19-24 show the implementation of synchronize_rcu_bh(), which waits for a subsequent RCU-bh grace period to complete.
(Recall that RCU-bh's quiescent states occur in any code where
bottom halves are enabled.)
Line 21 checks to see if blocking constitutes a grace period,
and takes a short-circuit exit if so (recall that it is illegal
to block in code where bottom halves are disabled).
On the other hand, if blocking does not constitute a grace period (for example, if multiple CPUs are online), then line 23 passes call_rcu_bh() to wait_rcu_gp(), which again has the effect of blocking until a subsequent RCU-bh grace period completes.
Lines 26-31 show the implementation of synchronize_sched(), which differs from that of synchronize_rcu_bh() only in passing call_rcu_sched() rather than call_rcu_bh() to wait_rcu_gp().
Having dealt with the synchronous implementation, it is now time to turn to the asynchronous implementation.
The asynchronous implementation is a straightforward set of wrappers as follows:
 1 #ifdef CONFIG_TREE_RCU
 2
 3 #define call_rcu call_rcu_sched
 4
 5 #else /* #ifdef CONFIG_TREE_RCU */
 6
 7 void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
 8 {
 9         __call_rcu(head, func, &rcu_preempt_state);
10 }
11
12 #endif /* else #ifdef CONFIG_TREE_RCU */
13
14 void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
15 {
16         __call_rcu(head, func, &rcu_bh_state);
17 }
18
19 void call_rcu_sched(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
20 {
21         __call_rcu(head, func, &rcu_sched_state);
22 }
Similar to the synchronous implementation, for CONFIG_TREE_RCU, call_rcu() maps to call_rcu_sched(), as shown on line 3. In all other cases, the asynchronous implementation is a straightforward wrapper for __call_rcu().
Many of the earlier uses of call_rcu() in the Linux kernel were passed RCU callback functions that simply kfree()ed the enclosing data structure, for example:
static void free_css_set_rcu(struct rcu_head *obj)
{
        struct css_set *cg = container_of(obj, struct css_set, rcu_head);
        kfree(cg);
}
This function was invoked as follows:
call_rcu(&cg->rcu_head, free_css_set_rcu);
The kfree_rcu() function was created for this situation, allowing free_css_set_rcu() to be dispensed with entirely, and further allowing the call to call_rcu() to be replaced with the following:
kfree_rcu(cg, rcu_head);
The implementation, which is shared with CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU, is as follows:
 1 static __always_inline bool __is_kfree_rcu_offset(unsigned long offset)
 2 {
 3         return offset < 4096;
 4 }
 5
 6 static __always_inline
 7 void __kfree_rcu(struct rcu_head *head, unsigned long offset)
 8 {
 9         typedef void (*rcu_callback)(struct rcu_head *);
10
11         BUILD_BUG_ON(!__builtin_constant_p(offset));
12         BUILD_BUG_ON(!__is_kfree_rcu_offset(offset));
13         call_rcu(head, (rcu_callback)offset);
14 }
15
16 #define kfree_rcu(ptr, rcu_head) \
17         __kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
Any structure passed to kfree_rcu()
must contain
an rcu_head
structure, but the caller
is not required to place the rcu_head
structure at any
particular offset within the enclosing RCU-protected structure.
However, this rcu_head
structure is the only place
that RCU can safely store any information, and therefore RCU must
track this structure via a pointer to its rcu_head
structure,
in a manner similar to the way the Linux kernel's lists are tracked
using pointers to the list_head
structure contained in
each list element.
This in turn means that the offset of the rcu_head
structure
within the enclosing RCU-protected data structure must be placed somewhere
in that rcu_head
structure.
The only field available is the rcu_head
structure's
->func
field, which normally contains a pointer to a
function.
This implementation therefore relies on the fact that a function
in the Linux kernel is never loaded into the lower 4096 bytes of memory,
so that we can store a number that is less than 4096 into the
rcu_head
structure's ->func
field
and distinguish it from a function pointer.
The __is_kfree_rcu_offset()
helper function on
lines 1-4 returns true if the argument is an offset and false
if it is instead a pointer to a function.
The __kfree_rcu()
helper function shown on
lines 6-14 does most of the kfree_rcu()
function's work.
Line 11 emits a compiler error if the offset
argument
is not a compile-time constant, and
line 12 emits a compiler error if the offset
argument
is too large for __is_kfree_rcu_offset()
to distinguish
from a pointer to a function.
Finally, line 13 invokes call_rcu()
with a pointer
to the rcu_head
structure and the offset cast to
an RCU callback function type.
The kfree_rcu()
macro itself is shown on
lines 16 and 17.
It takes as arguments a pointer to the enclosing RCU-protected structure
and the name of the field containing that structure's rcu_head
field.
It then computes the pointer to the rcu_head structure and its offset within the enclosing RCU-protected structure, and passes these two quantities to __kfree_rcu().
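On the callback-invocation side, RCU must distinguish the two cases when running callbacks. A minimal sketch of that check, modeled on the kernel's __rcu_reclaim() helper (with details elided), looks like this:

/* Sketch of callback invocation: a small ->func value is a kfree()
 * offset, anything else is a real callback function. */
static inline void rcu_reclaim_sketch(struct rcu_head *head)
{
        unsigned long offset = (unsigned long)head->func;

        if (__is_kfree_rcu_offset(offset))
                kfree((void *)head - offset); /* Back up to the enclosing structure. */
        else
                head->func(head);             /* Ordinary RCU callback. */
}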
Quick Quiz 4: Why not make a single kfree_rcu() function that does the work of both kfree_rcu() and __kfree_rcu()?
Answer
RCU's barrier functions wait not just for all pre-existing RCU read-side critical sections, but also for all pre-existing RCU callbacks to be invoked. The mappings from the updater-visible APIs to this helper function are as follows:
#ifdef CONFIG_TREE_RCU

void rcu_barrier(void)
{
        rcu_barrier_sched();
}

#else /* #ifdef CONFIG_TREE_RCU */

void rcu_barrier(void)
{
        _rcu_barrier(&rcu_preempt_state, call_rcu);
}

#endif /* else #ifdef CONFIG_TREE_RCU */

void rcu_barrier_bh(void)
{
        _rcu_barrier(&rcu_bh_state, call_rcu_bh);
}
EXPORT_SYMBOL_GPL(rcu_barrier_bh);

void rcu_barrier_sched(void)
{
        _rcu_barrier(&rcu_sched_state, call_rcu_sched);
}
In what should by now be a familiar manner, CONFIG_TREE_RCU maps rcu_barrier() into rcu_barrier_sched(), and all other cases define trivial wrappers around the underlying _rcu_barrier() function, which is described in a separate article.
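Although _rcu_barrier() itself is covered elsewhere, the register-a-callback-on-each-CPU approach mentioned earlier can be sketched conceptually as follows. This is a sketch only: the helper and variable names are illustrative, and the real implementation takes care to enqueue each callback on the corresponding CPU's callback list rather than on the current CPU's.

/* Conceptual sketch: count one callback per CPU, then wait for the last. */
static atomic_t barrier_count;
static struct completion barrier_done;
static struct rcu_head barrier_heads[NR_CPUS];

/* Each callback runs only after all earlier callbacks on its CPU. */
static void barrier_callback(struct rcu_head *unused)
{
        if (atomic_dec_and_test(&barrier_count))
                complete(&barrier_done);
}

static void rcu_barrier_sketch(call_rcu_func_t crf)
{
        int cpu;

        init_completion(&barrier_done);
        atomic_set(&barrier_count, 1); /* Avoid completion before all are queued. */
        for_each_online_cpu(cpu) {
                atomic_inc(&barrier_count);
                /* The real code queues this on CPU "cpu"'s callback list. */
                crf(&barrier_heads[cpu], barrier_callback);
        }
        if (atomic_dec_and_test(&barrier_count)) /* Drop the initial count. */
                complete(&barrier_done);
        wait_for_completion(&barrier_done);
}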
The expedited implementation is quite straightforward:
#ifdef CONFIG_TREE_RCU

void synchronize_rcu_expedited(void)
{
        synchronize_sched_expedited();
}

#endif

static inline void synchronize_rcu_bh_expedited(void)
{
        synchronize_sched_expedited();
}
The synchronize_sched_expedited() function directly implements expedited grace periods for RCU-sched, and for CONFIG_TREE_PREEMPT_RCU, synchronize_rcu_expedited() directly implements expedited grace periods for preemptible RCU. Because any RCU-sched grace period is also an RCU-bh grace period, synchronize_rcu_bh_expedited() is a trivial wrapper for synchronize_sched_expedited().
Quick Quiz 5: Why is any grace period for RCU-sched also a grace period for RCU-bh?
Answer
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: Why not implement the expedited APIs in terms of callbacks as well?
Answer: Suppose that an expedited API is called just after a grace period has started. If normal callbacks were to be used, it would be necessary to wait for the remainder of the just-started grace period and then to wait for a grace period after that. This is not what we mean by “expedited”. By handling expedited requests separately, we can avoid waiting for such additional partial grace periods.
Quick Quiz 2: But wait_rcu_gp() is also used by CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU, but not for things like synchronize_rcu(). What gives?
Answer: This function is used by TINY_RCU for the barrier primitives. Same code, different environment, different use case. Sometimes you get lucky!
Quick Quiz 3: What if the CPU is executing within an RCU read-side critical section when it invokes rcu_blocking_is_gp()?
Answer: That would be illegal. The synchronous grace-period primitives may only be invoked from outside of an RCU read-side critical section.
Quick Quiz 4: Why not make a single kfree_rcu() function that does the work of both kfree_rcu() and __kfree_rcu()?
Answer: The separate kfree_rcu() macro is required because the C language does not permit structure field names to be passed to functions. It is possible to combine the two, but only as a big ugly macro. By comparison, the current two-line macro and static inline function seem preferable.
Quick Quiz 5: Why is any grace period for RCU-sched also a grace period for RCU-bh?
Answer: Because RCU-sched's quiescent states are a subset of those for RCU-bh. Therefore, any set of quiescent states satisfying RCU-sched must by definition also satisfy RCU-bh.