September 4, 2011
This article was contributed by Paul E. McKenney
And the end would simply not be the end without the answers to the quick quizzes.
RCU updaters must wait for readers to get done with pre-existing RCU read-side critical sections, the ends of which are communicated to RCU in a more or less timely manner via quiescent states. Once each and every task in the system has been seen in a quiescent state, the system is said to have passed through an RCU grace period. Therefore, RCU updaters can wait for a grace period to elapse between (for example) removing an item from an RCU-protected data structure and freeing it.
Detecting when RCU readers are in quiescent states is clearly critically important, and is the topic of this article. Different flavors of RCU have different quiescent states:
For example, RCU-bh's quiescent states are regions of code in which
bottom halves are enabled; code regions that have executed
local_bh_disable()
(but that have not yet executed the matching
local_bh_enable()
) are
inside of bottom-half context.
Quick Quiz 1:
Give an example of a task that is in a quiescent state for RCU-sched,
but not for RCU-preempt.
Answer
Quick Quiz 2:
Is a quiescent state a property of a task, a CPU, a kernel thread, or
of something else?
Answer
Regardless of how you choose to think about quiescent states, an important part of RCU's operation is efficiently detecting them. Because every blocked and preempted task is in both an RCU-bh and an RCU-sched quiescent state, these two RCU flavors need concern themselves only with the tasks, interrupts, and NMIs that are actually running at a given point in time. These RCU flavors therefore focus almost entirely on the CPUs, ignoring the potentially much larger number of tasks that are not currently running.
In contrast, RCU-preempt must also track any tasks blocked within an RCU-preempt read-side critical section. This flavor of RCU nevertheless avoids time-consuming scans of the entire task list by checking each outgoing task at context-switch time, explicitly tracking those that are switched out while within RCU read-side critical sections. This approach permits RCU-preempt to use fast CPU/task-local counters in the common case, and to use more expensive operations in the much less-common situation where a task is preempted or blocks within an RCU-preempt read-side critical section.
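To make this design concrete, here is a minimal userspace sketch (not kernel code; names such as model_note_context_switch() are invented for illustration) of the idea that a context-switch hook queues only those outgoing tasks that are inside read-side critical sections, and that a grace period must then wait both for all CPUs and for the queued tasks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4

struct model_task {
	int rcu_nesting;          /* >0 while inside a read-side critical section */
	struct model_task *next;  /* link on the blocked-tasks list */
};

static bool cpu_passed_qs[NCPUS];
static struct model_task *blkd_tasks;  /* tasks blocked in critical sections */

/* Called when task 't' is switched out on CPU 'cpu'. */
static void model_note_context_switch(int cpu, struct model_task *t)
{
	if (t->rcu_nesting > 0) {  /* less-common slow path: track the task */
		t->next = blkd_tasks;
		blkd_tasks = t;
	}
	cpu_passed_qs[cpu] = true; /* the CPU itself is now quiescent */
}

/* The grace period ends only when every CPU has passed through a
 * quiescent state AND no queued task remains in a critical section. */
static bool model_gp_done(void)
{
	for (int i = 0; i < NCPUS; i++)
		if (!cpu_passed_qs[i])
			return false;
	return blkd_tasks == NULL;
}
```

This is only a single-threaded model of the bookkeeping, of course; the real implementation must also handle concurrency, which is what the rcu_node locking described later is for.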
Given this overview, we are now ready to examine the API.
RCU's quiescent-state API is reserved for scheduler-like parts of the kernel; it is not intended for use by normal RCU users. That said, here it is:
rcu_bh_qs(int cpu)
rcu_check_callbacks(int cpu, int user)
rcu_note_context_switch(int cpu)
rcu_virt_note_context_switch(int cpu)
The rcu_bh_qs()
API member is called to announce an
RCU-bh quiescent state to RCU.
It is invoked by the softirq
scheduler between consecutive softirq handler invocations,
and from rcu_check_callbacks()
when it notes that
the scheduling-clock interrupt occurred in a section of code
where bottom halves were enabled.
Quick Quiz 3:
Why aren't rcu_sched_qs()
and rcu_preempt_qs()
part of this API?
Answer
The rcu_check_callbacks()
API member is called
from within the scheduling-clock interrupt.
Its purposes are to report any quiescent states implied by the context that the interrupt arrived in (user-mode execution, the idle loop, and, for RCU-bh, code outside of softirq context), and to check whether RCU needs anything from the current CPU, invoking the RCU core if so.
Quick Quiz 4:
What RCU-related actions might the current CPU need to respond to?
Answer
The rcu_note_context_switch()
API member is called
from the scheduler and from run_ksoftirqd()
to report
a context switch.
In the case of run_ksoftirqd()
, this
is a fake context switch, but is nevertheless helpful in cases when
ksoftirqd decides to process lots of softirq handlers.
The rcu_virt_note_context_switch()
API member
is used by KVM to inform RCU when a CPU enters KVM guest mode,
which as far as RCU is concerned is equivalent to user-mode
execution.
With the exception of RCU-preempt, the recording of RCU's quiescent states is straightforward: quiescent states are recorded as they occur, and the RCU core indicates when quiescent states are needed.
RCU-preempt is a bit more complex due to its need to
track tasks that block within RCU read-side critical sections.
RCU-preempt tracks quiescent states for CPUs in much the same
way that RCU-sched does.
However, when a CPU is context-switching away from a given task,
RCU-preempt also checks to see if that task is in an RCU read-side
critical section.
If so, RCU-preempt queues the task on that CPU's leaf
rcu_node
structure's ->blkd_tasks
list.
The rcu_node
structure maintains internal pointers into
this list indicating which of the tasks are blocking the current
grace period: if the CPU's RCU read-side critical section was blocking
the current grace period, then the blocked task must also still be
blocking that grace period.
Only after RCU-preempt has done any needed task queuing does it record
a quiescent state for the CPU.
Quick Quiz 5:
But suppose that a CPU that has already passed through a quiescent
state for the current RCU grace period resumes executing this
task that is blocking the current RCU grace period.
We then have a task that is blocking the current RCU grace period
running on a CPU that is no longer blocking that same RCU grace period.
How can that possibly make any sense?
Answer
It turns out that RCU-preempt can check a single pointer
(->gp_tasks
) to determine
whether or not there is a task blocking the current RCU grace period on
a given rcu_node
structure.
Furthermore, RCU-preempt can check a single bitmask
(->qsmask
) to determine whether or
not there is a CPU associated with a given rcu_node
structure
that needs to pass through a quiescent state.
These tricks permit RCU-preempt to efficiently check for quiescent states.
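A hedged sketch of these two O(1) checks, using a toy structure whose field names merely mirror the kernel's rcu_node:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model only: one bit per CPU plus one pointer suffice to
 * answer "is anything still blocking the current grace period here?" */
struct toy_rcu_node {
	unsigned long qsmask;  /* one bit per CPU still owing a quiescent state */
	void *gp_tasks;        /* first task blocking the current grace period */
};

/* The current grace period is blocked on this node if any CPU bit is
 * still set or if any queued task is still blocking it. */
static bool node_blocks_gp(struct toy_rcu_node *rnp)
{
	return rnp->qsmask != 0 || rnp->gp_tasks != NULL;
}
```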
The quiescent-state implementation is presented in three pieces: momentary quiescent states, extended quiescent states, and context-switch handling.
The implementation of the first category, momentary quiescent states, is as follows:
  1 void rcu_bh_qs(int cpu)
  2 {
  3   struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
  4
  5   rdp->passed_quiesce_gpnum = rdp->gpnum;
  6   barrier();
  7   if (rdp->passed_quiesce == 0)
  8     trace_rcu_grace_period("rcu_bh", rdp->gpnum, "cpuqs");
  9   rdp->passed_quiesce = 1;
 10 }
 11
 12 static void rcu_preempt_qs(int cpu)
 13 {
 14   struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
 15
 16   rdp->passed_quiesce_gpnum = rdp->gpnum;
 17   barrier();
 18   if (rdp->passed_quiesce == 0)
 19     trace_rcu_grace_period("rcu_preempt", rdp->gpnum, "cpuqs");
 20   rdp->passed_quiesce = 1;
 21   current->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_NEED_QS;
 22 }
 23
 24 void rcu_sched_qs(int cpu)
 25 {
 26   struct rcu_data *rdp = &per_cpu(rcu_sched_data, cpu);
 27
 28   rdp->passed_quiesce_gpnum = rdp->gpnum;
 29   barrier();
 30   if (rdp->passed_quiesce == 0)
 31     trace_rcu_grace_period("rcu_sched", rdp->gpnum, "cpuqs");
 32   rdp->passed_quiesce = 1;
 33 }
The rcu_bh_qs()
function, whose purpose is to announce an
RCU-bh quiescent state to RCU, is shown on lines 1-10.
Line 3 gets a pointer to this CPU's rcu_data
structure for RCU-bh (hence rcu_bh_data
), which means
that the caller must have at least disabled preemption.
Line 5 records this CPU's idea of the current grace-period
number and line 6 ensures that the compiler doesn't reorder
line 5 with the remainder of the function.
Line 7 checks to see if this is the first quiescent state for
the current grace period, and, if so, line 8 traces it.
Finally, line 9 records the fact that we have encountered
an RCU-bh quiescent state, which will eventually be noticed by the
grace-period machinery.
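The gpnum-then-flag store protocol above can be sketched in userspace as follows (toy types with invented names; a GCC-style compiler barrier stands in for the kernel's barrier()):

```c
#include <assert.h>

/* Keep the two stores in program order: record *which* grace period the
 * quiescent state belongs to before announcing that it occurred. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

struct toy_rcu_data {
	unsigned long gpnum;                /* current grace-period number */
	unsigned long passed_quiesce_gpnum; /* GP the quiescent state is for */
	int passed_quiesce;                 /* quiescent state seen? */
};

static void toy_record_qs(struct toy_rcu_data *rdp)
{
	rdp->passed_quiesce_gpnum = rdp->gpnum;
	compiler_barrier();  /* forbid reordering the gpnum store after the flag */
	rdp->passed_quiesce = 1;
}
```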
Quick Quiz 6:
Why is the barrier()
in rcu_bh_qs()
on line 6 required?
Answer
Quick Quiz 7:
Why trace only the first quiescent state of a given grace period?
Answer
The rcu_preempt_qs()
function, whose purpose is
to announce a CPU's RCU-preempt quiescent state to RCU, is shown on
lines 12-22.
(In CONFIG_TREE_RCU
kernels, which do not implement
RCU-preempt, rcu_preempt_qs()
is omitted.)
As noted earlier, just because a given CPU is in an RCU-preempt
quiescent state does not mean that the task that this CPU just
context-switched away from is also in an RCU-preempt quiescent
state—and RCU-preempt needs quiescent states from all tasks
as well as all CPUs.
Quick Quiz 8:
Why don't the other RCU flavors also need quiescent states
from all tasks as well as all CPUs?
Answer
That aside, the rcu_preempt_qs()
function
is analogous to rcu_bh_qs()
except for line 21,
which responds to any outstanding request to finish the RCU
read-side critical section.
The rcu_sched_qs()
function is shown on
lines 24-33.
This function is called to announce
an RCU-sched quiescent state to RCU.
It is invoked from rcu_note_context_switch()
,
rcu_check_callbacks()
, and, when CONFIG_RCU_FAST_NO_HZ
is set, rcu_needs_cpu()
.
It is analogous to rcu_bh_qs()
, so it will not be described
separately.
The implementation of the second category, extended quiescent states, is as follows:
  1 void rcu_check_callbacks(int cpu, int user)
  2 {
  3   trace_rcu_utilization("Start scheduler-tick");
  4   if (user ||
  5       (idle_cpu(cpu) && rcu_scheduler_active &&
  6        !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
  7     rcu_sched_qs(cpu);
  8     rcu_bh_qs(cpu);
  9   } else if (!in_softirq()) {
 10     rcu_bh_qs(cpu);
 11   }
 12   rcu_preempt_check_callbacks(cpu);
 13   if (rcu_pending(cpu))
 14     invoke_rcu_core();
 15   trace_rcu_utilization("End scheduler-tick");
 16 }
 17
 18 static void rcu_preempt_check_callbacks(int cpu)
 19 {
 20   struct task_struct *t = current;
 21
 22   if (t->rcu_read_lock_nesting == 0) {
 23     rcu_preempt_qs(cpu);
 24     return;
 25   }
 26   if (t->rcu_read_lock_nesting > 0 &&
 27       per_cpu(rcu_preempt_data, cpu).qs_pending)
 28     t->rcu_read_unlock_special |= RCU_READ_UNLOCK_NEED_QS;
 29 }
The rcu_check_callbacks()
function, which checks
for user-mode execution, idle CPUs, and (for RCU-bh) execution between
softirq handlers, is shown on lines 1-16.
Quick Quiz 9:
But why doesn't this code check for dyntick-idle and offline CPUs,
both of which are also extended quiescent states?
Answer
Lines 3 and 15 trace the beginning and end, respectively,
of rcu_check_callbacks()
.
Lines 4-6 check for RCU-sched quiescent states: the scheduling-clock interrupt must have arrived either from user-mode execution (line 4) or from the idle loop (line 5), with the scheduler active, outside of softirq context, and not nested within another hardirq handler (line 6).
Quick Quiz 10:
Why isn't the second term on line 6 simply
“!hardirq_count()
”?
Why do we need the comparison involving HARDIRQ_SHIFT
?
Answer
If lines 4-6 determine that we are in an RCU-sched
quiescent state, then line 7 notes an RCU-sched quiescent
state and line 8 notes an RCU-bh quiescent state.
(Recall that any RCU-sched quiescent state is also an RCU-bh
quiescent state.)
Line 9 checks to see if we are running in softirq context,
in other words, if we interrupted a softirq handler or a region
of code with bottom halves disabled
(e.g., via local_bh_disable()
).
If we are not running in softirq context, line 10 records
the fact that we are in an RCU-bh
quiescent state.
Line 12 invokes rcu_preempt_check_callbacks()
in
order to check for RCU-preempt quiescent states in
CONFIG_TREE_PREEMPT_RCU
kernels.
Line 13 checks to see if RCU needs anything from this CPU,
and, if so, line 14 causes the RCU core code to run on this
CPU in softirq context.
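The decision logic of lines 4-11 can be restated as a small userspace function (a toy model with invented names; the boolean inputs stand in for checks such as idle_cpu() and in_softirq()):

```c
#include <assert.h>
#include <stdbool.h>

/* Which quiescent states does a given interrupted context imply? */
struct tick_ctx {
	bool user;        /* interrupt arrived in user mode */
	bool idle;        /* interrupt arrived in the (non-nested) idle loop */
	bool in_softirq;  /* interrupt arrived in bottom-half context */
};

static void toy_check_tick(const struct tick_ctx *c,
			   bool *sched_qs, bool *bh_qs)
{
	*sched_qs = *bh_qs = false;
	if (c->user || c->idle) {
		*sched_qs = true;  /* user/idle: RCU-sched quiescent state... */
		*bh_qs = true;     /* ...which implies an RCU-bh one as well */
	} else if (!c->in_softirq) {
		*bh_qs = true;     /* outside bottom halves: RCU-bh only */
	}
}
```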
The rcu_preempt_check_callbacks()
function is shown
on lines 18-29 above.
Again, the purpose of this function is to check for RCU-preempt
quiescent states, so in CONFIG_TREE_RCU
kernels,
which do not implement RCU-preempt, rcu_preempt_check_callbacks()
is an empty function.
On the other hand, in CONFIG_TREE_PREEMPT_RCU
kernels,
line 20 picks up a pointer to the current task,
and line 22 checks to see if this task is executing outside of
any RCU-preempt read-side critical section, and, if so,
line 23 announces a quiescent state to the
RCU core code.
Lines 26 and 27 check to see if we are in an RCU-preempt read-side
critical section on a CPU that has not yet passed through a quiescent
state for the current RCU-preempt grace period, and, if so,
line 28 sets the RCU_READ_UNLOCK_NEED_QS
bit
so that the next outermost rcu_read_unlock()
will
announce a quiescent state.
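This deferred-quiescent-state handshake can be sketched as follows (userspace toy model; toy_read_unlock() is an invented stand-in for the outermost rcu_read_unlock() processing):

```c
#include <assert.h>
#include <stdbool.h>

/* The tick sets a "need QS" flag on a task that is in a read-side
 * critical section; the outermost unlock then reports the quiescent
 * state on the task's behalf. */
struct toy_task {
	int rcu_nesting;
	bool need_qs;
};

static bool qs_reported;

static void toy_read_unlock(struct toy_task *t)
{
	if (--t->rcu_nesting == 0 && t->need_qs) {
		t->need_qs = false;
		qs_reported = true;  /* outermost unlock announces the QS */
	}
}
```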
The implementation of the third category, context-switch handling, is as follows:
  1 void rcu_note_context_switch(int cpu)
  2 {
  3   trace_rcu_utilization("Start context switch");
  4   rcu_sched_qs(cpu);
  5   rcu_preempt_note_context_switch(cpu);
  6   trace_rcu_utilization("End context switch");
  7 }
  8
  9 static inline void rcu_virt_note_context_switch(int cpu)
 10 {
 11   rcu_note_context_switch(cpu);
 12 }
 13
 14 static void rcu_preempt_note_context_switch(int cpu)
 15 {
 16   struct task_struct *t = current;
 17   unsigned long flags;
 18   struct rcu_data *rdp;
 19   struct rcu_node *rnp;
 20
 21   if (t->rcu_read_lock_nesting > 0 &&
 22       (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) {
 23     rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu);
 24     rnp = rdp->mynode;
 25     raw_spin_lock_irqsave(&rnp->lock, flags);
 26     t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BLOCKED;
 27     t->rcu_blocked_node = rnp;
 28     WARN_ON_ONCE((rdp->grpmask & rnp->qsmaskinit) == 0);
 29     WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
 30     if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
 31       list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
 32       rnp->gp_tasks = &t->rcu_node_entry;
 33 #ifdef CONFIG_RCU_BOOST
 34       if (rnp->boost_tasks != NULL)
 35         rnp->boost_tasks = rnp->gp_tasks;
 36 #endif
 37     } else {
 38       list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
 39       if (rnp->qsmask & rdp->grpmask)
 40         rnp->gp_tasks = &t->rcu_node_entry;
 41     }
 42     trace_rcu_preempt_task(rdp->rsp->name,
 43                            t->pid,
 44                            (rnp->qsmask & rdp->grpmask)
 45                            ? rnp->gpnum
 46                            : rnp->gpnum + 1);
 47     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 48   } else if (t->rcu_read_lock_nesting < 0 &&
 49              t->rcu_read_unlock_special) {
 50     rcu_read_unlock_special(t);
 51   }
 52   local_irq_save(flags);
 53   rcu_preempt_qs(cpu);
 54   local_irq_restore(flags);
 55 }
The rcu_note_context_switch()
function is shown
on lines 1-7.
Lines 3 and 6 trace entry and exit, line 4
invokes rcu_sched_qs()
in order to announce the
RCU-sched quiescent state, and line 5 invokes
rcu_preempt_note_context_switch()
in order to
allow RCU-preempt to do context-switch-time processing for
CONFIG_TREE_PREEMPT_RCU
kernels.
Quick Quiz 11:
Why not also call rcu_bh_qs()
from rcu_note_context_switch()
?
After all, any RCU-sched quiescent state is also an RCU-bh
quiescent state.
Answer
The rcu_virt_note_context_switch()
function,
shown on lines 9-12, allows KVM to inform RCU of the beginning
of guest-mode operation, allowing RCU to treat guest-OS execution
in a manner similar to the way that it treats user-mode execution.
Quick Quiz 12:
Suppose that the guest OS is Linux: Then the guest OS will have
RCU activity.
So how can RCU safely treat guest-OS execution in the same way
that it treats user-mode execution?
Answer
Quick Quiz 13:
Yes, rcu_virt_note_context_switch()
causes RCU
to note a quiescent state when a KVM guest begins executing.
But what if that guest continues executing indefinitely?
Wouldn't that indefinitely extend the RCU grace period, eventually
resulting in out-of-memory conditions?
Answer
The rcu_preempt_note_context_switch()
function
on lines 14-55 handles context-switch-time processing for
RCU-preempt in CONFIG_TREE_PREEMPT_RCU
kernels
(in CONFIG_TREE_RCU
kernels,
rcu_preempt_note_context_switch()
is an empty function).
This function has three major jobs:
1. Enqueuing tasks that block within RCU-preempt read-side
critical sections (lines 21-47).
2. Finishing rcu_read_unlock_special()
processing
for tasks that are preempted in the middle of an outermost
rcu_read_unlock()
(lines 48-51).
3. Announcing an RCU-preempt quiescent state (lines 52-54).
Quick Quiz 14:
So what happens to tasks that are blocking, but which
happen to be in neither an RCU-preempt read-side critical section
nor in the middle of an outermost rcu_read_unlock()
?
Answer
The first job is the biggest and is also the one we start with.
Lines 21 and 22 check to see if this task is blocking for the first
time within an RCU-preempt read-side critical section, and, if so,
lines 23-47 enqueue the task on the current CPU's leaf
rcu_node
structure.
Line 23 picks up a pointer to the current CPU's rcu_data
structure (which means that the caller must have at least disabled
preemption), line 24 picks up a pointer to the current CPU's
rcu_node
structure, and line 25 acquires that
rcu_node
structure's ->lock
.
Line 26 sets the RCU_READ_UNLOCK_BLOCKED
bit
so as to
cause the next outermost __rcu_read_unlock()
to invoke
rcu_read_unlock_special()
in order to dequeue this task,
and line 27 records a pointer to the rcu_node
structure on which the task was enqueued.
Quick Quiz 15:
What prevents the task from entering
rcu_read_unlock_special()
before rcu_preempt_note_context_switch()
has finished
enqueuing it, thereby corrupting the lists?
Answer
Quick Quiz 16:
Why must rcu_preempt_note_context_switch()
record the pointer to the rcu_node
structure?
After all, list_del_rcu()
works just fine given
only the list element, the list header is not required.
Answer
Line 28 complains if the current CPU is offline (if it is offline, how is it that this task is currently running on it?), and line 29 complains if this task is already queued somewhere.
We are now ready for lines 30-41 to actually queue the task. However, we first need to figure out where to enqueue it, and there are three cases that must be properly handled:
1. This CPU is blocking the current grace period, and at least one
task already queued on this rcu_node
structure is also blocking it.
The new task must be queued adjacent to the other tasks blocking
the current grace period, and the
->gp_tasks
pointer must be updated to
point to the task that is just now context switching.
If RCU priority boosting is in progress, then this new
task will also need to be priority boosted.
2. This CPU is blocking the current grace period, but no task
already queued on this rcu_node
structure is.
The new task must be added to the head of the
->blkd_tasks
list, and the ->gp_tasks
pointer must be updated to reference it.
3. This CPU is not blocking the current grace period.
The new task must be added to the head of the
->blkd_tasks
list, but no other action need be
taken.
Starting with the first case above, line 30 checks to see
if the current CPU is blocking the current grace period
(rnp->qsmask & rdp->grpmask
) and there is at least
one other task on this rcu_node
structure doing so
(rnp->gp_tasks != NULL
).
If so, lines 31-36 handle this case.
Line 31 adds the task to the list so as to immediately precede the
first task blocking the current grace period in this rcu_node
structure's ->blkd_tasks
list,
and line 32 points this rcu_node
structure's
->gp_tasks
to reference the new task.
If RCU priority boosting has been configured, then line 34
checks to see if this rcu_node
structure is currently
being priority boosted, and, if so, line 35 makes this
rcu_node
structure's
->boost_tasks
pointer reference the task undergoing
a context switch.
Otherwise, we are in either case 2 or 3.
In both of these cases, line 38 enqueues the task at the head of this
rcu_node
structure's ->blkd_tasks
list.
Line 39 checks to see if this task is blocking the current
grace period.
If so, we are in case 2, so line 40 makes this
rcu_node
structure's ->gp_tasks
pointer
reference this task.
Regardless of which of the three cases applies, lines 42-46
trace the fact that this task blocked, and line 47
releases this rcu_node
structure's
->lock
.
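The three enqueue cases above can be modeled in userspace with a small circular doubly linked list (a sketch with invented names; toy_list_add() mimics the kernel's list_add() by inserting immediately after the given element):

```c
#include <assert.h>
#include <stddef.h>

struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *h) { h->next = h->prev = h; }

/* Insert 'n' immediately after 'after', as list_add() does. */
static void toy_list_add(struct toy_list *n, struct toy_list *after)
{
	n->next = after->next;
	n->prev = after;
	after->next->prev = n;
	after->next = n;
}

struct toy_node {
	struct toy_list blkd_tasks;  /* circular list of blocked tasks */
	struct toy_list *gp_tasks;   /* first GP-blocking task, or NULL */
};

/* Enqueue entry 'e'; cpu_blocks_gp models (rnp->qsmask & rdp->grpmask). */
static void toy_enqueue(struct toy_node *rnp, struct toy_list *e,
			int cpu_blocks_gp)
{
	if (cpu_blocks_gp && rnp->gp_tasks != NULL) {
		/* Case 1: insert just before the first GP-blocking task... */
		toy_list_add(e, rnp->gp_tasks->prev);
		rnp->gp_tasks = e;        /* ...and it now heads that set. */
	} else {
		/* Cases 2 and 3: insert at the head of the list. */
		toy_list_add(e, &rnp->blkd_tasks);
		if (cpu_blocks_gp)
			rnp->gp_tasks = e;  /* Case 2: it blocks the GP. */
	}
}
```

Note how case 1's insertion after gp_tasks->prev places the new entry immediately before the old first GP-blocking task, exactly as line 31 of the kernel code does.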
We have now completed our review of lines 21-47, which
implement the rcu_preempt_note_context_switch()
function's
first job, which was to enqueue tasks that block within RCU-preempt
read-side critical sections.
We are therefore ready to proceed to the second job, which is
finishing rcu_read_unlock_special()
processing for tasks
that are preempted in the middle of an outermost
rcu_read_unlock()
, which is covered by lines 48-51.
Line 48 checks to see if this task was in the middle
of an outermost __rcu_read_unlock()
(recall that
__rcu_read_unlock()
sets its task_struct
structure's ->rcu_read_lock_nesting
to
INT_MIN
before invoking rcu_read_unlock_special()
and to zero after rcu_read_unlock_special()
returns).
If so, line 49 checks this task_struct
structure's
->rcu_read_unlock_special
field, which, if nonzero,
field, which, if nonzero,
indicates that there is some work for rcu_read_unlock_special()
to do, in which case line 50 invokes
rcu_read_unlock_special()
to get that work done.
Quick Quiz 17:
Say what???
Why does rcu_preempt_note_context_switch()
need to
invoke rcu_read_unlock_special()
?
Answer
We have now completed our review of lines 48-51, which
do the rcu_preempt_note_context_switch()
function's
second job of dealing with tasks preempted in the middle of an
outermost rcu_read_unlock()
.
We are now ready to look at the final job, which is announcing an
RCU-preempt quiescent state.
This is straightforward: Line 52 disables interrupts, line 53
calls rcu_preempt_qs()
to make the announcement, and
to make the announcement, and
line 54 restores interrupts.
Acknowledgments: I owe thanks to those who ensured that
rcu_virt_note_context_switch()
handles the
quiescent-state interaction between KVM and RCU, and to Cheng Xu and
Peter Zijlstra
for greatly increasing the human readability of this article.
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: Give an example of a task that is in a quiescent state for RCU-sched, but not for RCU-preempt.
Answer: A task that is blocked or preempted while within an RCU-preempt read-side critical section.
Quick Quiz 2: Is a quiescent state a property of a task, a CPU, a kernel thread, or of something else?
Answer: Within the confines of the Linux kernel, the answer to this question is rather fuzzy. To see this, consider the possibilities:
This situation might be a bit frustrating to those who prefer rigorous definitions, but the fact is that the design of high-performance production quality RCU implementations depends critically on blurring the distinctions between these contexts. Alternatively, feel free to think in terms of RCU needing to separately handle each of these contexts—and boot-time execution as well!
Quick Quiz 3:
Why aren't rcu_sched_qs()
and rcu_preempt_qs()
part of this API?
Answer: Because they are called from within the RCU implementation only, so they do not qualify as external-to-RCU API members.
Quick Quiz 4: What RCU-related actions might the current CPU need to respond to?
Answer: Other CPUs might start new grace periods, end old grace periods, or fail to pass through an RCU quiescent state in a timely manner. Each of these actions (or, in the last case, inaction) might require a response from the current CPU.
If another CPU starts a new grace period, then the current CPU must set up so that it will report passage through some later quiescent state.
If another CPU ends an old grace period, then the current CPU must advance its callbacks and check whether another grace period is required.
If another CPU fails to pass through an RCU quiescent state in a timely manner, then the current CPU needs to check for it being in dyntick-idle mode, being offline, or being in need of an IPI.
Quick Quiz 5: But suppose that a CPU that has already passed through a quiescent state for the current RCU grace period resumes executing this task that is blocking the current RCU grace period. We then have a task that is blocking the current RCU grace period running on a CPU that is no longer blocking that same RCU grace period. How can that possibly make any sense?
Answer:
The trick is that the task remains queued on an rcu_node
structure's ->blkd_tasks
list.
Therefore, RCU-preempt understands that it still needs to wait for
the task to finish its RCU read-side critical section even if it
doesn't think that it needs to wait on the CPU.
And as long as RCU waits long enough, it doesn't really matter exactly
what it thinks that it is waiting on.
Quick Quiz 6:
Why is the barrier()
in rcu_bh_qs()
on line 6 required?
Answer: Because otherwise the compiler could reorder line 9 with line 5, which could cause a quiescent state from an old grace period being applied to this new grace period, which could in turn cause this new grace period to end too soon. Which could result in random memory corruption, which none of us want!
Exactly how could this happen? As follows:
1. Suppose that the barrier()
statement on line 6
was omitted, and that the compiler took it upon itself to perform
a code-motion optimization that caused the assembly code for
line 9 to appear before that of line 5.
2. CPU 0 enters rcu_bh_qs()
, executes the
code for line 9, and then takes an interrupt.
3. Meanwhile, another CPU invokes
force_quiescent_state()
, which in turn deduces
that CPU 0 spent some time in dyntick-idle mode during
the current grace period, so that the old grace period can end
and a new one can begin.
4. CPU 0's interrupt handler invokes
rcu_check_callbacks()
, which in turn invokes
rcu_pending()
, which notices that there is a
new grace period.
5. CPU 0 therefore raises RCU_SOFTIRQ
.
6. Once the interrupt handler returns, RCU_SOFTIRQ
fires, invoking
rcu_process_callbacks()
, which initiates processing
that copies the new ->gpnum
to CPU 0's
rcu_data
structure.
7. CPU 0 then returns to rcu_bh_qs()
, where it executes the compiler-misordered
line 5, recording the new grace-period number for a
quiescent state that was detected during the old grace period.
The moral of this story is to always make very sure that you carefully track which grace period a given quiescent state corresponds to.
Quick Quiz 7: Why trace only the first quiescent state of a given grace period?
Answer: I tried that. Once. Then I changed it because it isn't helpful to have the trace full of irrelevant quiescent states.
Quick Quiz 8: Why don't the other RCU flavors also need quiescent states from all tasks as well as all CPUs?
Answer: Actually, the other RCU flavors really do need quiescent states from all tasks as well as all CPUs. It is just that all blocked tasks are by definition in extended quiescent states with respect to both RCU-bh and RCU-preempt. Therefore, these two flavors of RCU can focus their attention entirely on quiescent states for the CPUs, despite needing quiescent states from both CPUs and tasks.
Quick Quiz 9: But why doesn't this code check for dyntick-idle and offline CPUs, both of which are also extended quiescent states?
Answer:
Dyntick-idle mode and offline state are indeed quiescent states,
but it does not make sense for a CPU to check to see if it is in
dyntick-idle mode or if it is
offline because such CPUs are not executing code.
Therefore, because the rcu_check_callbacks()
and
rcu_preempt_check_callbacks()
functions execute on
the CPU being checked for quiescent states, these functions cannot
check for dyntick-idle or offline mode.
Checks for CPUs in these modes are instead performed when forcing quiescent
states, and so are discussed in a separate article.
Quick Quiz 10:
Why isn't the second term on line 6 simply
“!hardirq_count()
”?
Why do we need the comparison involving HARDIRQ_SHIFT
?
Answer: Because the scheduling clock interrupt is a hardware interrupt, this code is guaranteed to see hardirq nesting at least one deep. So we must be in a hardirq handler nested within another hardirq handler to disqualify this call from being an idle-task quiescent state. The check in the second term of line 6 therefore returns true if the hardirq nesting is less than or equal to 1.
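The arithmetic can be checked with a toy model (TOY_HARDIRQ_SHIFT is an assumed value chosen for illustration; the real HARDIRQ_SHIFT depends on the kernel configuration):

```c
#include <assert.h>

/* Each level of hardirq nesting adds (1 << HARDIRQ_SHIFT) to the preempt
 * counter, so comparing hardirq_count() against (1 << HARDIRQ_SHIFT)
 * tests for nesting depth <= 1, i.e., "only the tick itself". */
#define TOY_HARDIRQ_SHIFT 8  /* assumed value for illustration only */

static unsigned long toy_hardirq_count(int nesting_depth)
{
	return (unsigned long)nesting_depth << TOY_HARDIRQ_SHIFT;
}

static int toy_not_nested_hardirq(int nesting_depth)
{
	return toy_hardirq_count(nesting_depth) <= (1UL << TOY_HARDIRQ_SHIFT);
}
```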
Quick Quiz 11:
Why not also call rcu_bh_qs()
from rcu_note_context_switch()
?
After all, any RCU-sched quiescent state is also an RCU-bh
quiescent state.
Answer:
It would be theoretically correct to invoke rcu_bh_qs()
from rcu_note_context_switch()
, but it is important to
keep in mind that this is on the context-switch fastpath, so we must
avoid any unnecessary overhead.
Given that an RCU-bh quiescent state is observed between each softirq
handler and during any scheduling-clock interrupt, it currently
appears that invoking rcu_bh_qs()
from rcu_note_context_switch()
would represent
unnecessary overhead.
That said, it is quite possible that some future workload will require
a change in this area.
Quick Quiz 12: Suppose that the guest OS is Linux: Then the guest OS will have RCU activity. So how can RCU safely treat guest-OS execution in the same way that it treats user-mode execution?
Answer: Two different OSes, two different independent instances of RCU. So the host kernel's RCU can ignore the guest OS's RCU in much the same way that the host kernel's RCU can ignore any use of user-mode RCU in user-mode applications.
Quick Quiz 13:
Yes, rcu_virt_note_context_switch()
causes RCU
to note a quiescent state when a KVM guest begins executing.
But what if that guest continues executing indefinitely?
Wouldn't that indefinitely extend the RCU grace period, eventually
resulting in out-of-memory conditions?
Answer:
The scheduling-clock interrupt continues ticking while a KVM
guest is executing.
When the scheduling-clock-interrupt handler returns to the KVM guest,
KVM will once again invoke rcu_virt_note_context_switch()
,
which will report another quiescent state to RCU, permitting any
RCU grace periods to complete in a timely manner.
Quick Quiz 14:
So what happens to tasks that are blocking, but which
happen to be in neither an RCU-preempt read-side critical section
nor in the middle of an outermost rcu_read_unlock()
?
Answer: Absolutely nothing. Such tasks can safely be ignored by RCU-preempt.
Quick Quiz 15:
What prevents the task from entering
rcu_read_unlock_special()
before rcu_preempt_note_context_switch()
has finished
enqueuing it, thereby corrupting the lists?
Answer:
Nothing prevents the task from entering
rcu_read_unlock_special()
before
rcu_preempt_note_context_switch()
has finished enqueuing it, but use of the rcu_node
structure's ->lock
field prevents the lists
from being corrupted.
Quick Quiz 16:
Why must rcu_preempt_note_context_switch()
record the pointer to the rcu_node
structure?
After all, list_del_rcu()
works just fine given
only the list element, the list header is not required.
Answer:
Because in this case, use of list_del_rcu()
is unsafe
unless the rcu_node
structure's ->lock
is held.
Quick Quiz 17:
Say what???
Why does rcu_preempt_note_context_switch()
need to
invoke rcu_read_unlock_special()
?
Answer:
In principle, it does not.
The task will eventually run again, and will finish executing
rcu_read_unlock_special()
at that time.
However, if the task is blocked for too long, it will be needlessly
priority boosted (recall that it has already completed its
RCU read-side critical section and is in the final throes of cleanup).
The call to rcu_read_unlock_special()
is reasonably
cheap and is quite infrequent, so it makes sense to call it
to avoid the priority boosting operation—and perhaps more
important, to simplify RCU's state space by eliminating the possibility
of tasks blocked waiting to do final rcu_read_unlock_special()
cleanup.