October 4, 2011
This article was contributed by Paul E. McKenney
At the end we have the usual answers to the quick quizzes.
The purpose of an RCU read-side critical section is to prevent subsequent grace periods from ending, which in turn prevents RCU updaters from carrying out destructive actions that might affect the data structures that the reader traverses. It turns out that there are a number of ways of implementing this functionality, some of which are extremely fast and scalable.
One class of mechanisms, used heavily within operating-system kernels, relies on disabling preemption throughout RCU read-side critical sections. These mechanisms can be thought of as approximating RCU read-side critical sections with periods of time between scheduler context switches. Because disabling preemption is a very low-cost operation (in fact, zero cost in run-to-block or non-preemptible kernels), this class of mechanisms offers the best performance and scalability.
Unfortunately, long RCU read-side critical sections are not uncommon, and on real-time systems, the longest RCU read-side critical section sets a lower bound for scheduling latency. This is a serious problem for Linux users who need deep sub-millisecond scheduling latency. This proved to be a difficult problem that was solved incrementally over a period of years, but a series of algorithms leading to a good solution was inspired by a counter scheme suggested by Esben Nielsen in 2005. Although the original scheme was subject to starvation, it led to a production-capable implementation. This implementation had the disadvantage of featuring atomic instructions and memory barriers in the read-side fastpaths.
A couple of years later, an implementation free of atomic instructions and memory barriers on the read-side fastpaths appeared and was accepted into mainline. Although this was a great improvement over the previous version, its read-side fastpaths were still quite slow, and its scalability was quite limited. Given the inexorable rise in CPU counts for both servers and embedded systems, scalability was a significant long-term concern.
Experience with user-space RCU implementations led to a much simpler and faster in-kernel implementation, TREE_PREEMPT_RCU. It is this implementation that is described in this article.
RCU's read-side API includes the following pairs of primitives to delimit different types of RCU read-side critical section:
rcu_read_lock_bh() and rcu_read_unlock_bh() for uses subject to softirq-based denial-of-service attacks.
rcu_read_lock_sched() and rcu_read_unlock_sched() for uses where readers disable preemption and/or interrupts.
rcu_read_lock() and rcu_read_unlock() for generic use. This pair maps to rcu_read_lock_sched() and rcu_read_unlock_sched() in !CONFIG_PREEMPT kernels and to a low-latency-friendly counter-based implementation otherwise. (A usage sketch of the generic pair appears just after this list.)
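To make the generic pair concrete, here is a minimal usage sketch rather than code taken from the kernel: the foo structure, the gp pointer, and reader() are hypothetical, while rcu_read_lock(), rcu_read_unlock(), and rcu_dereference() are the actual primitives. An updater would publish a new version with rcu_assign_pointer() and defer freeing the old one with synchronize_rcu() or call_rcu().

struct foo {
	int a;
};

struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

int reader(void)
{
	struct foo *p;
	int ret = -1;

	rcu_read_lock();		/* begin RCU read-side critical section */
	p = rcu_dereference(gp);	/* fetch the RCU-protected pointer */
	if (p)
		ret = p->a;		/* p remains valid until rcu_read_unlock() */
	rcu_read_unlock();		/* allow grace periods to complete */
	return ret;
}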
The remainder of this article looks at RCU readers, first from a design viewpoint and then from an implementation viewpoint.
The designs of rcu_read_lock_bh() and rcu_read_unlock_bh() are trivial: rcu_read_lock_bh() invokes local_bh_disable() and rcu_read_unlock_bh() invokes local_bh_enable(). Similarly, rcu_read_lock_sched() invokes preempt_disable() and rcu_read_unlock_sched() invokes preempt_enable().
Quick Quiz 1: I just looked at the source code, and rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), and rcu_read_unlock_sched() actually do a lot more than you say they do. Who do you think that you are fooling?
Answer
Preemptible RCU readers are a bit more involved. The key insight is to track running RCU readers separately from RCU readers that have blocked or been preempted within their RCU read-side critical sections. This approach allows the RCU readers to simply non-atomically increment and decrement a counter in the common case, and to place the task on a queue only when a task executes a context switch from within an RCU read-side critical section. Therefore, although enqueuing the task is quite expensive, as it requires acquiring a lock, it is also rather rare. The flow of control is shown below:
When entering an RCU read-side critical section,
rcu_read_lock()
simply increments a local counter
in the task structure.
If there is no context switch during that RCU read-side critical
section, then rcu_read_unlock()
need only decrement
that same local counter, as shown down the left-hand side of
the diagram above.
However, if there are context switches during that RCU read-side
critical section, then the first such context switch enqueues the
task on the leaf rcu_node
structure corresponding to the
CPU on which the context switch occurred.
The rcu_read_unlock()
will then need to both decrement
the local counter and remove the task from the queue, as shown on
the right-hand side of the diagram.
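The following pseudo-C sketch summarizes this flow of control. It is not the kernel code (which appears later in this article), and the helpers task_is_queued(), enqueue_on_leaf_rcu_node(), and remove_from_rcu_node() are invented purely for illustration.

/* Design sketch only; helper names are hypothetical, not kernel APIs. */

void sketch_rcu_read_lock(void)
{
	current->rcu_read_lock_nesting++;	/* cheap, non-atomic fastpath */
}

void sketch_note_context_switch(void)
{
	/* First context switch inside a reader: enqueue the blocked task. */
	if (current->rcu_read_lock_nesting > 0 && !task_is_queued(current))
		enqueue_on_leaf_rcu_node(current);	/* slowpath: acquires rcu_node lock */
}

void sketch_rcu_read_unlock(void)
{
	if (--current->rcu_read_lock_nesting == 0 && task_is_queued(current))
		remove_from_rcu_node(current);		/* slowpath, but rare */
}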
Of course, the actual implementation must deal with complications that include long-running RCU read-side critical sections, RCU priority boosting, and nesting of RCU read-side critical sections, both due to nesting within the code and due to interrupts and non-maskable interrupts (NMIs).
The RCU-bh flavor of RCU is intended for use in code that runs primarily in bottom-half context, especially if that code is subject to denial-of-service (DoS) attacks. The default non-preemptible “vanilla” flavor of RCU requires that all online non-dyntick-idle CPUs enter the scheduler at least once per grace period. Unfortunately, some types of DoS attacks can overload the Linux kernel to the point that at least one of the CPUs spends all its time in softirq (also known as “bottom half”) context. This CPU will never enter a vanilla-RCU quiescent state, and therefore the vanilla-RCU grace period will never complete.
In contrast, RCU-bh only requires that each CPU either
enter the scheduler or complete at least one softirq handler.
This means that RCU-bh grace periods will not be stalled by these
types of DoS attacks.
The downside is that rcu_read_lock_bh() and rcu_read_unlock_bh() are not free, as can be seen from the following code:
 1 static inline void rcu_read_lock_bh(void)
 2 {
 3   local_bh_disable();
 4   __acquire(RCU_BH);
 5   rcu_read_acquire_bh();
 6 }
 7
 8 static inline void rcu_read_unlock_bh(void)
 9 {
10   rcu_read_release_bh();
11   __release(RCU_BH);
12   local_bh_enable();
13 }
The code for rcu_read_lock_bh()
is shown on lines 1-6.
Line 3 is the only line that generates executable code in
production kernels, and it disables bottom-half processing.
This is lightweight, doing a non-atomic addition to a per-task
counter, but is not free.
In addition, too-long RCU-bh read-side critical sections will delay
softirq processing, and thus delay networking traffic.
Nevertheless, RCU-bh is a valuable tool for short read-side critical
sections that need to interact with softirq handlers.
Line 4 never generates executable code, but is used by
sparse to detect
unbalanced rcu_read_lock_bh()
and rcu_read_unlock_bh()
pairs.
Line 5 does not generate executable code except in kernels
built with
lockdep, which has been
adapted to work with RCU.
Therefore, in production Linux-kernel builds, rcu_read_lock_bh() compiles down to local_bh_disable().
The code for rcu_read_unlock_bh()
is shown on
lines 8-13.
This is the inverse of rcu_read_lock_bh()
and proceeds
in the reverse order: Line 10 keeps lockdep informed,
line 11 is for sparse, and finally line 12 re-enables
bottom-half processing, and is the only executable line in
production kernels.
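As a hedged illustration of the intended use case, an RCU-bh reader running in softirq context might pair with an updater as follows. The counters structure, the cur pointer, and the surrounding functions are hypothetical; rcu_read_lock_bh(), rcu_dereference_bh(), rcu_assign_pointer(), and call_rcu_bh() are the actual primitives from kernels of this vintage.

/* Kernel context: <linux/rcupdate.h>, <linux/slab.h>. */

struct counters {
	unsigned long rx;
	struct rcu_head rh;
};

static struct counters __rcu *cur;	/* hypothetical RCU-bh-protected pointer */

static void free_counters_rcu(struct rcu_head *rh)
{
	kfree(container_of(rh, struct counters, rh));
}

/* Reader, typically invoked from a softirq handler. */
unsigned long read_rx(void)
{
	struct counters *c;
	unsigned long ret = 0;

	rcu_read_lock_bh();
	c = rcu_dereference_bh(cur);
	if (c)
		ret = c->rx;
	rcu_read_unlock_bh();
	return ret;
}

/* Updater: publish a new version, free the old one after an RCU-bh grace period. */
void replace_counters(struct counters *newc)
{
	struct counters *oldc = rcu_dereference_protected(cur, 1);

	rcu_assign_pointer(cur, newc);
	if (oldc)
		call_rcu_bh(&oldc->rh, free_counters_rcu);
}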
The RCU-sched flavor of RCU implements “classic RCU” in all kernel builds. Its implementation is similar to that of RCU-bh, with local_bh_disable() replaced by preempt_disable() and local_bh_enable() replaced by preempt_enable().
 1 static inline void rcu_read_lock_sched(void)
 2 {
 3   preempt_disable();
 4   __acquire(RCU_SCHED);
 5   rcu_read_acquire_sched();
 6 }
 7
 8 static inline notrace void rcu_read_lock_sched_notrace(void)
 9 {
10   preempt_disable_notrace();
11   __acquire(RCU_SCHED);
12 }
13
14 static inline void rcu_read_unlock_sched(void)
15 {
16   rcu_read_release_sched();
17   __release(RCU_SCHED);
18   preempt_enable();
19 }
20
21 static inline notrace void rcu_read_unlock_sched_notrace(void)
22 {
23   __release(RCU_SCHED);
24   preempt_enable_notrace();
25 }
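A distinctive property of RCU-sched is that any region with preemption (or interrupts) disabled acts as a reader, whether or not it uses the explicit API. The following hedged sketch shows both reader styles paired with a synchronize_sched() updater; the gptr pointer and the functions around it are hypothetical, while the RCU primitives are real.

/* Kernel context: <linux/rcupdate.h>, <linux/slab.h>. */

static int __rcu *gptr;		/* hypothetical RCU-sched-protected pointer */

int read_explicit(void)
{
	int *p, ret = 0;

	rcu_read_lock_sched();
	p = rcu_dereference_sched(gptr);
	if (p)
		ret = *p;
	rcu_read_unlock_sched();
	return ret;
}

/* Disabling preemption directly also delimits an RCU-sched reader. */
int read_implicit(void)
{
	int *p, ret = 0;

	preempt_disable();
	p = rcu_dereference_sched(gptr);
	if (p)
		ret = *p;
	preempt_enable();
	return ret;
}

void publish(int *newp)
{
	int *oldp = rcu_dereference_protected(gptr, 1);

	rcu_assign_pointer(gptr, newp);
	synchronize_sched();	/* wait for all pre-existing RCU-sched readers */
	kfree(oldp);
}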
The implementation of the vanilla rcu_read_lock() and rcu_read_unlock() depends on the kernel configuration. For CONFIG_TREE_RCU kernels, these primitives are equivalent to rcu_read_lock_sched() and rcu_read_unlock_sched(), as can be seen below:
 1 static inline void __rcu_read_lock(void)
 2 {
 3   preempt_disable();
 4 }
 5
 6 static inline void __rcu_read_unlock(void)
 7 {
 8   preempt_enable();
 9 }
10
11 static inline void rcu_read_lock(void)
12 {
13   __rcu_read_lock();
14   __acquire(RCU);
15   rcu_read_acquire();
16 }
17
18 static inline void rcu_read_unlock(void)
19 {
20   rcu_read_release();
21   __release(RCU);
22   __rcu_read_unlock();
23 }
However, for CONFIG_TREE_PREEMPT_RCU kernels, rcu_read_lock() and rcu_read_unlock() allow RCU read-side critical sections to be preempted, which requires a very different implementation of __rcu_read_lock() and __rcu_read_unlock().
 1 void __rcu_read_lock(void)
 2 {
 3   current->rcu_read_lock_nesting++;
 4   barrier();
 5 }
 6
 7 void __rcu_read_unlock(void)
 8 {
 9   struct task_struct *t = current;
10
11   if (t->rcu_read_lock_nesting != 1)
12     --t->rcu_read_lock_nesting;
13   else {
14     barrier();
15     t->rcu_read_lock_nesting = INT_MIN;
16     barrier();
17     if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
18       rcu_read_unlock_special(t);
19     barrier();
20     t->rcu_read_lock_nesting = 0;
21   }
22 #ifdef CONFIG_PROVE_LOCKING
23   {
24     int rrln = ACCESS_ONCE(t->rcu_read_lock_nesting);
25
26     WARN_ON_ONCE(rrln < 0 && rrln > INT_MIN / 2);
27   }
28 #endif /* #ifdef CONFIG_PROVE_LOCKING */
29 }
The __rcu_read_lock()
implementation is on
lines 1-5 above.
Line 3 increments the per-task ->rcu_read_lock_nesting counter and line 4 ensures that the compiler does not bleed code from the following RCU read-side critical section out before the __rcu_read_lock(). In short, __rcu_read_lock() does nothing more than increment a nesting-level counter.
The __rcu_read_unlock()
function is on lines 7-29
above.
Line 11 checks to see if this is a nested RCU read-side critical
section, and if so, line 12 decrements the per-task
->rcu_read_lock_nesting
counter.
Otherwise, lines 14-20 handle exiting from the outermost
RCU read-side critical section.
Line 14 prevents the compiler from reordering the code from the RCU read-side critical section past the invocation of __rcu_read_unlock().
Line 15 sets the per-task ->rcu_read_lock_nesting counter to a large negative number so that the context-switch code (along with any instances of __rcu_read_unlock() invoked from interrupt handlers) realizes that we are in the process of exiting a top-level RCU read-side critical section.
Line 16 prevents the compiler from moving code generated from
line 15 to follow that generated from lines 17 and 18.
Line 17 checks to see if anything unusual happened during the
just-ended RCU read-side critical section, and, if so,
line 18 invokes rcu_read_unlock_special()
to clean up after that unusual something.
Line 19 prevents the compiler from moving code generated from
lines 17 and 18 to follow that generated from line 20.
Finally, line 20 zeros the per-task
->rcu_read_lock_nesting
counter, which indicates
that we have fully exited from the RCU read-side critical section.
Regardless of the path taken through __rcu_read_unlock(), if lockdep is enabled (CONFIG_PROVE_LOCKING), then lines 23-27 check to make sure that the new value of the per-task ->rcu_read_lock_nesting variable is not a small negative number, which it might be in the case of unbalanced rcu_read_lock() and rcu_read_unlock() calls.
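To see the nesting counter in action, consider the following hedged sketch of a nested pair of readers in a CONFIG_TREE_PREEMPT_RCU kernel; the outer() and inner() functions are hypothetical, and the line numbers in the comments refer to the __rcu_read_unlock() listing above.

void inner(void)
{
	rcu_read_lock();	/* ->rcu_read_lock_nesting: 1 -> 2 */
	/* ... reader code ... */
	rcu_read_unlock();	/* 2 -> 1: nested exit, line 12 just decrements */
}

void outer(void)
{
	rcu_read_lock();	/* ->rcu_read_lock_nesting: 0 -> 1 */
	inner();
	rcu_read_unlock();	/* outermost exit: lines 14-20 run, consulting
				 * ->rcu_read_unlock_special before zeroing
				 * the counter */
}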
The next section looks at how RCU readers interact with context switches.
Because CONFIG_TREE_PREEMPT_RCU permits RCU read-side critical sections to be preempted, special processing is required when such preemption occurs. This processing is carried out by rcu_preempt_note_context_switch().
RCU uses the per-task ->rcu_read_lock_nesting variable to determine whether or not any task currently running on a CPU is running in an RCU read-side critical section.
This begs the question of how RCU is supposed to figure out whether
other tasks are within RCU read-side critical sections.
The answer to this question, as shown by the “Enqueue Task”
block of the diagram in the
RCU Reader Design section,
is that tasks that context-switch while
in RCU read-side critical sections are enqueued onto the
->blkd_tasks
list of the rcu_node
structure corresponding to the CPU that the task was running on.
The scheduler invokes the rcu_preempt_note_context_switch()
function shown below in order
to allow RCU to carry out this enqueuing operation.
 1 static void rcu_preempt_note_context_switch(int cpu)
 2 {
 3   struct task_struct *t = current;
 4   unsigned long flags;
 5   struct rcu_data *rdp;
 6   struct rcu_node *rnp;
 7
 8   if (t->rcu_read_lock_nesting > 0 &&
 9       (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {
10     rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
11     rnp = rdp->mynode;
12     raw_spin_lock_irqsave(&rnp->lock, flags);
13     t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BLOCKED;
14     t->rcu_blocked_node = rnp;
15     WARN_ON_ONCE((rdp->grpmask & rnp->qsmaskinit) == 0);
16     WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
17     if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
18       list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
19       rnp->gp_tasks = &t->rcu_node_entry;
20 #ifdef CONFIG_RCU_BOOST
21       if (rnp->boost_tasks != NULL)
22         rnp->boost_tasks = rnp->gp_tasks;
23 #endif /* #ifdef CONFIG_RCU_BOOST */
24     } else {
25       list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
26       if (rnp->qsmask & rdp->grpmask)
27         rnp->gp_tasks = &t->rcu_node_entry;
28     }
29     trace_rcu_preempt_task(rdp->rsp->name,
30                            t->pid,
31                            (rnp->qsmask & rdp->grpmask)
32                            ? rnp->gpnum
33                            : rnp->gpnum + 1);
34     raw_spin_unlock_irqrestore(&rnp->lock, flags);
35   } else if (t->rcu_read_lock_nesting < 0 &&
36              t->rcu_read_unlock_special) {
37     rcu_read_unlock_special(t);
38   }
39   local_irq_save(flags);
40   rcu_preempt_qs(cpu);
41   local_irq_restore(flags);
42 }
Lines 8-34 handle the case where there was a context
switch within an RCU read-side critical section, while lines 35-37
handle the case where there was a context switch while the outgoing
task was in the process of leaving an outermost RCU read-side
critical section.
Finally, lines 39-41 inform RCU's quiescent-state-detection
code of the context switch, which is a quiescent state for
both RCU-bh and RCU-sched.
Each of these three sections of rcu_preempt_note_context_switch()
is discussed below.
The first block of code (lines 8-34, which enqueues tasks) is the longest and is discussed first. Note that it is necessary to enqueue a task only if it context switches within an RCU read-side critical section and if that task is not already enqueued. To see this, refer to the diagram in the RCU Reader Design section and note that tasks are enqueued upon context switch within an RCU read-side critical section, but not dequeued until the end of that RCU read-side critical section. Because we don't dequeue the task when it resumes execution in its RCU read-side critical section, there is no need to requeue it if it context switches a second time within the same RCU read-side critical section.
Therefore, line 8 checks to see if the outgoing task
was executing in (but not yet in the process of exiting)
an RCU read-side critical section and line 9 checks
that this task is not already queued.
If both conditions hold, execution proceeds to lines 10-34.
Lines 10 and 11 obtain pointers to this CPU's
rcu_data
and rcu_node
structures,
respectively.
Line 12 acquires the rcu_node
structure's
->lock
and line 34 releases it.
Line 13 sets the per-task RCU_READ_UNLOCK_BLOCKED flag to inform the subsequent rcu_read_unlock() invocation that the task will need to dequeue itself, and line 14 records the rcu_node structure on which the task is enqueued.
Line 15 complains if the task is context-switching away
from a CPU that is already offline, and line 16
complains if the task is already enqueued.
Lines 17-28 enqueue the task, an operation that must be carried out differently depending on how this task's RCU read-side critical section relates to the current grace period. In the first case, the outgoing task is blocking the current grace period and is not the first task on this rcu_node structure to do so. In the second case, the outgoing task either does not block the current grace period or is the first task on this rcu_node structure to do so.
To handle the first case,
line 17 checks to see whether the current CPU still needs
to pass through a quiescent state (indicating that the outgoing
task's RCU read-side critical section is blocking the current
grace period) and whether the rcu_node
structure's
->gp_tasks
pointer is non-NULL (which indicates
that there is already a blocked task blocking the current
grace period).
If so, line 18 enqueues the outgoing task on this
rcu_node
structure's ->blkd_tasks
list to precede the most-recent already-queued task that is
blocking the current grace period.
Line 19 then points the rcu_node
structure's
->gp_tasks
pointer to reference the outgoing
task.
An example of this update is shown in the following diagram:
In this diagram, the CPU's bit is set in the ->qsmask field, and the ->gp_tasks field is non-NULL, so the outgoing task T3 is added to the queue preceding task T1, which is the task previously referenced by ->gp_tasks.
A key property preserved by this update is that the current grace period is blocked by the task referenced by ->gp_tasks and by all later tasks in the list.
Lines 21-22 are present only if RCU priority boosting is enabled.
If so, line 21 checks to see if RCU priority boosting is in
progress for this rcu_node
structure, and if so,
line 22 updates the ->boost_tasks
field to
include the outgoing task in the list to be boosted.
RCU priority boosting is described in more detail in a separate
article.
The second case from the above list is the case where the
outgoing task does not block the current grace period or is the
first task to do so.
Line 25 adds the outgoing task to the head of the
->blkd_tasks
list.
Line 26 then checks to see if RCU is waiting for a quiescent
state from the current CPU, and, if so, the ->gp_tasks
field is set to reference the outgoing task.
This latter case, where the outgoing task is the first one on this
rcu_node
structure to block the current grace period,
is shown in the diagram below:
The other case, where there are not yet any tasks blocking the current grace period and this task is not blocking it either (as indicated by the all-zero bits in the ->qsmask bitmask), is shown below:
The state of the list is the same as in the earlier diagram, except that the ->gp_tasks pointer is NULL in this last case.
Quick Quiz 2:
Suppose this is the first task to block on this
rcu_node
, but RCU priority boosting is already
underway.
Wouldn't this situation require that line 27 of
rcu_preempt_note_context_switch()
initiate RCU priority
boosting for the newly enqueued task?
Answer
Once the outgoing task is queued in whatever way that it is queued, lines 29-33 trace the task-enqueue event.
Recall that the just-discussed lines 8-34 handled
the case where a task does a context switch within an RCU
read-side critical section.
We are now ready to look at the case where a task does a context switch while in the process of exiting from an RCU read-side critical section, that is, between lines 15 and 20 of __rcu_read_unlock(), which include the call to rcu_read_unlock_special().
Line 35 checks to see whether the outgoing task was preempted while exiting its RCU read-side critical section, and, if so, line 36 checks to see if any special cleanup is required, and, again if so, line 37 invokes rcu_read_unlock_special() to do the cleanup.
The rcu_read_unlock_special() function called from line 18 of __rcu_read_unlock() cleans up after any unusual situations that might have occurred during the RCU read-side critical section.
The rcu_read_unlock_special()
function uses
a helper function named rcu_next_node_entry()
as
follows:
 1 static struct list_head *rcu_next_node_entry(struct task_struct *t,
 2                                              struct rcu_node *rnp)
 3 {
 4   struct list_head *np;
 5
 6   np = t->rcu_node_entry.next;
 7   if (np == &rnp->blkd_tasks)
 8     np = NULL;
 9   return np;
10 }
Given a pointer to a task_struct
and the rcu_node
structure that it is queued on, this function returns a pointer to
the next task on the list if there is one or NULL
if not.
Line 6 gets a pointer to the task's list_head
structure used for queuing on this list.
Line 7 checks whether this is the last task in the list, returning
NULL
if so and a pointer to the next task's
list_head
structure if not.
As noted earlier, it is sometimes necessary to carry out cleanup actions when exiting an RCU read-side critical section. These actions are:
Reporting a quiescent state to the RCU core. This action is required when the RCU_READ_UNLOCK_NEED_QS bit is set in the ->rcu_read_unlock_special field in the task structure.
Removing the task from the ->blkd_tasks list. This action is required when the RCU_READ_UNLOCK_BLOCKED bit is set in the ->rcu_read_unlock_special field in the task structure.
Deboosting the task, that is, releasing the rt_mutex used for RCU priority boosting. This action is required when the ->rcu_boosted field is set in the task structure.
These actions are carried out by rcu_read_unlock_special()
,
which is shown below:
 1 static noinline void rcu_read_unlock_special(struct task_struct *t)
 2 {
 3   int empty;
 4   int empty_exp;
 5   unsigned long flags;
 6   struct list_head *np;
 7 #ifdef CONFIG_RCU_BOOST
 8   struct rt_mutex *rbmp = NULL;
 9 #endif /* #ifdef CONFIG_RCU_BOOST */
10   struct rcu_node *rnp;
11   int special;
12
13   if (in_nmi())
14     return;
15   local_irq_save(flags);
16   special = t->rcu_read_unlock_special;
17   if (special & RCU_READ_UNLOCK_NEED_QS) {
18     rcu_preempt_qs(smp_processor_id());
19   }
20   if (in_irq() || in_serving_softirq()) {
21     local_irq_restore(flags);
22     return;
23   }
24   if (special & RCU_READ_UNLOCK_BLOCKED) {
25     t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
26     for (;;) {
27       rnp = t->rcu_blocked_node;
28       raw_spin_lock(&rnp->lock);
29       if (rnp == t->rcu_blocked_node)
30         break;
31       raw_spin_unlock(&rnp->lock);
32     }
33     empty = !rcu_preempt_blocked_readers_cgp(rnp);
34     empty_exp = !rcu_preempted_readers_exp(rnp);
35     smp_mb();
36     np = rcu_next_node_entry(t, rnp);
37     list_del_init(&t->rcu_node_entry);
38     t->rcu_blocked_node = NULL;
39     trace_rcu_unlock_preempted_task("rcu_preempt",
40                                     rnp->gpnum, t->pid);
41     if (&t->rcu_node_entry == rnp->gp_tasks)
42       rnp->gp_tasks = np;
43     if (&t->rcu_node_entry == rnp->exp_tasks)
44       rnp->exp_tasks = np;
45 #ifdef CONFIG_RCU_BOOST
46     if (&t->rcu_node_entry == rnp->boost_tasks)
47       rnp->boost_tasks = np;
48     if (t->rcu_boost_mutex) {
49       rbmp = t->rcu_boost_mutex;
50       t->rcu_boost_mutex = NULL;
51     }
52 #endif /* #ifdef CONFIG_RCU_BOOST */
53     if (!empty && !rcu_preempt_blocked_readers_cgp(rnp)) {
54       trace_rcu_quiescent_state_report("preempt_rcu",
55                                        rnp->gpnum,
56                                        0, rnp->qsmask,
57                                        rnp->level,
58                                        rnp->grplo,
59                                        rnp->grphi,
60                                        !!rnp->gp_tasks);
61       rcu_report_unblock_qs_rnp(rnp, flags);
62     } else
63       raw_spin_unlock_irqrestore(&rnp->lock, flags);
64 #ifdef CONFIG_RCU_BOOST
65     if (rbmp)
66       rt_mutex_unlock(rbmp);
67 #endif /* #ifdef CONFIG_RCU_BOOST */
68     if (!empty_exp && !rcu_preempted_readers_exp(rnp))
69       rcu_report_exp_rnp(&rcu_preempt_state, rnp);
70   } else {
71     local_irq_restore(flags);
72   }
73 }
RCU read-side critical sections can be used in NMI handlers, but NMI handlers cannot block nor can they be interrupted. Therefore, lines 13 and 14 check for running in NMI context, and return if so.
Line 15 disables interrupts and line 16 takes a snapshot of the
->rcu_read_unlock_special
field of the task structure.
Line 17 checks for the RCU_READ_UNLOCK_NEED_QS
flag (which indicates that the RCU core has been waiting awhile for
a quiescent state from this CPU), and if set, line 18 invokes
rcu_preempt_qs()
to initiate reporting of a quiescent state
to the RCU core.
Quick Quiz 3: Is it really safe to report a quiescent state from rcu_read_unlock_special()?
Answer
Quick Quiz 4:
Suppose that this call to rcu_read_unlock()
is within an RCU-bh read-side critical section.
Wouldn't we need to avoid reporting a quiescent state in that case
in order to avoid messing up the enclosing RCU-bh read-side critical
section?
Answer
Line 20 checks for interrupt or softirq contexts, in which case line 21 restores interrupts and line 22 returns. This prevents uselessly checking for having blocked, which is not possible within these contexts.
Line 24 checks for the RCU_READ_UNLOCK_BLOCKED
flag (which indicates that there was at least one context switch
during the just-ended RCU read-side critical section).
If so, lines 25-69 dequeue the task and deboost the task if required,
otherwise, line 71 restores interrupts.
Looking in more detail at lines 25-69, line 25 clears the RCU_READ_UNLOCK_BLOCKED bit in the ->rcu_read_unlock_special field in the task structure.
Lines 26-31 acquire the corresponding rcu_node structure's ->lock. This lock acquisition requires a loop because CPU-hotplug events can cause the list of tasks to move from the leaf rcu_node to the root rcu_node.
If we attempt to acquire the lock concurrently with such an event,
we might find ourselves holding the leaf lock with the task now queued
at the root.
But only one such move is possible, so at most two passes through this
loop are required.
Lines 33 and 34 record whether there are currently any tasks blocking the current normal and expedited grace periods, respectively. Line 35 executes a memory barrier to ensure that the just-ended RCU read-side critical section is seen by all CPUs to precede any operations following an expedited grace period that happens to take the fastpath.
Line 36 finds the task following the current one on
the ->blkd_tasks
list, or NULL
if the
current task is last on the list.
Lines 37 and 38 then remove the current task from the list (and lines 39-40 trace the removal), after
which any of this rcu_node
structure's pointers
referencing this task must be advanced to the next task (or set
to NULL
, as the case may be).
Lines 41-42 advance ->gp_tasks
,
lines 43-44 advance ->exp_tasks
, and
lines 46-47 advance ->boost_tasks
(for kernels built with CONFIG_RCU_BOOST
).
CONFIG_RCU_BOOST kernels also take a snapshot of the ->rcu_boost_mutex field of the task structure before NULLing it out on lines 48-50.
Lines 53-63 check to see if this task was the last
one queued on this rcu_node
structure that was
blocking the current grace period.
If so, lines 54-60 trace this transition and line 61
reports the quiescent state to the RCU core (which has the side
effect of releasing this rcu_node
structure's
->lock
and restoring interrupts).
Otherwise, line 63 releases this rcu_node
structure's
->lock
and restores interrupts.
CONFIG_RCU_BOOST kernels also check on line 65 to see if priority boosting occurred, releasing the rt_mutex on line 66 if so. The act of releasing the rt_mutex ends the priority inheritance.
Line 68 checks to see if this task was the last one
on this rcu_node
that was blocking the current
expedited grace period, and if so, line 69 reports this to
the RCU core.
Quick Quiz 5:
Line 69 of rcu_read_unlock_special()
invokes
rcu_report_exp_rnp()
after the rcu_node
structure's ->lock
has been released and interrupts
have been re-enabled.
So how can you be sure that the decisions leading to this point
are still valid?
Answer
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: I just looked at the source code, and rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), and rcu_read_unlock_sched() actually do a lot more than you say they do. Who do you think you are fooling?
Answer:
Look again, but more carefully this time.
The additional lines matter only to sparse or to kernel hackers who
choose to build the kernel with CONFIG_PROVE_RCU
for testing and debugging purposes.
Quick Quiz 2:
Suppose this is the first task to block on this
rcu_node
, but RCU priority boosting is already
underway.
Wouldn't this situation require that line 27 of
rcu_preempt_note_context_switch()
initiate RCU priority
boosting for the newly enqueued task?
Answer:
No.
Each rcu_node
structure manages RCU priority boosting.
The next invocation of force_quiescent_state()
will
initiate RCU priority boosting if appropriate.
Quick Quiz 3: Is it really safe to report a quiescent state from rcu_read_unlock_special()?
Answer: Yes, we just exited the outermost RCU read-side critical section, and are therefore by definition in a quiescent state.
Quick Quiz 4:
Suppose that this call to rcu_read_unlock()
is within an RCU-bh read-side critical section.
Wouldn't we need to avoid reporting a quiescent state in that case
in order to avoid messing up the enclosing RCU-bh read-side critical
section?
Answer: No. The RCU-preempt, RCU-bh, and RCU-sched read-side critical sections and quiescent states are independent. Reporting a quiescent state in one flavor of RCU does not affect any of the other flavors. (That said, any quiescent state for RCU-sched is also a quiescent state for RCU-bh, but the handling of quiescent states is still carried out independently for the different flavors of RCU.)
Quick Quiz 5:
Line 69 of rcu_read_unlock_special()
invokes
rcu_report_exp_rnp()
after the rcu_node
structure's ->lock
has been released and interrupts
have been re-enabled.
So how can you be sure that the decisions leading to this point
are still valid?
Answer: The decision is guarded by sync_rcu_preempt_exp_mutex, which is held across the grace period by synchronize_rcu_expedited().
This mutex cannot be released until all rcu_node
structures have been accounted for, including the current
rcu_node
structure.
So the fact that this function has not yet called
rcu_report_exp_rnp()
guarantees that the decision
criteria in lines 68-69 are stable.