August 10, 2012 (v3.5+)
This article was contributed by Paul E. McKenney
And then there are of course the inevitable answers to the quick quizzes.
For example, in CONFIG_PREEMPT=n kernels, an infinite
loop anywhere in the kernel that does not include a
call to schedule()
or some similar function will stall grace periods.
(Stalls caused by preempted RCU readers are instead
what CONFIG_RCU_BOOST
is designed
to handle.)
Quick Quiz 1:
How could a hardware failure result in an RCU CPU stall warning?
Answer
RCU issues stall warnings when the grace period has extended for too long, and uses RCU's data structures to identify which CPUs and tasks are responsible for the stall.
RCU CPU stall detection is invoked from rcu_pending(),
which in turn is invoked from within the scheduling-clock
interrupt and, in CONFIG_RCU_FAST_NO_HZ
kernels,
upon entry to idle.
Each time that it is invoked, it scans the rcu_node
structures, using the ->qsmask
bits to identify stalled CPUs and the ->blkd_tasks
lists (along with the ->gp_tasks
pointer) to identify stalled tasks.
Quick Quiz 2:
Why not instead invoke RCU CPU stall detection from
the grace-period-detection kthread?
Answer
The stalled CPUs and tasks are then printed out, along with additional information if so configured.
With that background, we are now ready to look at the code.
The stall-detection process is controlled by the
rcu_cpu_stall_suppress
and rcu_cpu_stall_timeout
variables.
These may be set via boot-time kernel parameters or via sysfs.
The rcu_cpu_stall_suppress
variable, as its name suggests,
suppresses further RCU CPU stall warnings when its value is nonzero.
Its initial value is zero, but it is set during panics and other
error conditions to prevent the corresponding diagnostics from being
interspersed with RCU CPU stall warnings.
The rcu_cpu_stall_timeout
variable contains the
number of seconds that an RCU grace period may be stalled before
stall warnings are issued.
Its default is controlled by the CONFIG_RCU_CPU_STALL_TIMEOUT
kernel configuration parameter.
The first two functions, jiffies_till_stall_check()
and record_gp_stall_check_time(),
compute and record the time of the
next check for CPU stalls.
 1 static int jiffies_till_stall_check(void)
 2 {
 3   int till_stall_check = ACCESS_ONCE(rcu_cpu_stall_timeout);
 4
 5   if (till_stall_check < 3) {
 6     ACCESS_ONCE(rcu_cpu_stall_timeout) = 3;
 7     till_stall_check = 3;
 8   } else if (till_stall_check > 300) {
 9     ACCESS_ONCE(rcu_cpu_stall_timeout) = 300;
10     till_stall_check = 300;
11   }
12   return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;
13 }
14
15 static void record_gp_stall_check_time(struct rcu_state *rsp)
16 {
17   rsp->gp_start = jiffies;
18   rsp->jiffies_stall = jiffies + jiffies_till_stall_check();
19 }
The jiffies_till_stall_check()
function is shown on lines 1-13 above.
Line 3 fetches the current value of rcu_cpu_stall_timeout,
which is subject to concurrent updates from sysfs, hence the
ACCESS_ONCE().
Lines 5-11 enforce range limits, with a minimum of 3 seconds and a
maximum of 300 seconds.
Finally, line 12 converts from seconds to jiffies, and adds
a delta (five seconds) if CONFIG_PROVE_RCU=y.
The record_gp_stall_check_time()
function is shown on
lines 15-19.
It simply records the start time of the grace period (line 17)
and the time of the first CPU-stall check (line 18).
The next set of functions handles the
CONFIG_RCU_CPU_STALL_INFO=y
case, printing additional
state information for the current stall warning.
 1 static void print_cpu_stall_fast_no_hz(char *cp, int cpu)
 2 {
 3   struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
 4   struct timer_list *tltp = &rdtp->idle_gp_timer;
 5
 6   sprintf(cp, "drain=%d %c timer=%lu",
 7           rdtp->dyntick_drain,
 8           rdtp->dyntick_holdoff == jiffies ? 'H' : '.',
 9           timer_pending(tltp) ? tltp->expires - jiffies : -1);
10 }
11
12 static void print_cpu_stall_info_begin(void)
13 {
14   printk(KERN_CONT "\n");
15 }
16
17 static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
18 {
19   char fast_no_hz[72];
20   struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
21   struct rcu_dynticks *rdtp = rdp->dynticks;
22   char *ticks_title;
23   unsigned long ticks_value;
24
25   if (rsp->gpnum == rdp->gpnum) {
26     ticks_title = "ticks this GP";
27     ticks_value = rdp->ticks_this_gp;
28   } else {
29     ticks_title = "GPs behind";
30     ticks_value = rsp->gpnum - rdp->gpnum;
31   }
32   print_cpu_stall_fast_no_hz(fast_no_hz, cpu);
33   printk(KERN_ERR "\t%d: (%lu %s) idle=%03x/%llx/%d %s\n",
34          cpu, ticks_value, ticks_title,
35          atomic_read(&rdtp->dynticks) & 0xfff,
36          rdtp->dynticks_nesting, rdtp->dynticks_nmi_nesting,
37          fast_no_hz);
38 }
39
40 static void print_cpu_stall_info_end(void)
41 {
42   printk(KERN_ERR "\t");
43 }
44
45 static void zero_cpu_stall_ticks(struct rcu_data *rdp)
46 {
47   rdp->ticks_this_gp = 0;
48 }
49
50 static void increment_cpu_stall_ticks(void)
51 {
52   struct rcu_state *rsp;
53
54   for_each_rcu_flavor(rsp)
55     __this_cpu_ptr(rsp->rda)->ticks_this_gp++;
56 }
The print_cpu_stall_fast_no_hz()
function
is shown on lines 1-10.
It simply builds a string containing rcu_prepare_for_idle()
state for diagnostic purposes.
In kernels built with CONFIG_RCU_FAST_NO_HZ=n
(in which rcu_prepare_for_idle()
is an empty function),
it instead builds an empty string.
The print_cpu_stall_info_begin()
function is shown
on lines 12-15.
This function prints out the starting bracket for the information
printed by print_cpu_stall_info()
.
When print_cpu_stall_info()
prints full lines,
as is the case when CONFIG_RCU_CPU_STALL_INFO=y
and as shown here,
a newline is printed; otherwise, an open curly brace (“{”) is printed.
The print_cpu_stall_info()
function is shown on
lines 17-38 for CONFIG_RCU_CPU_STALL_INFO=y.
Line 25 checks to see if the current CPU is aware of the current
grace period, and if so lines 26 and 27 record the
number of scheduling-clock ticks that this CPU has received during
the current grace period.
Otherwise, lines 29 and 30 record the number of grace periods
that the current CPU has missed.
Line 32 invokes print_cpu_stall_fast_no_hz()
to pick up rcu_prepare_for_idle()
state,
and lines 33-37 print the information.
In the CONFIG_RCU_CPU_STALL_INFO=n
case,
print_cpu_stall_info()
instead simply prints the
current CPU's ID.
Quick Quiz 3:
Why wouldn't the current CPU be aware of the current grace
period?
After all, when printing an RCU CPU stall warning, the current grace
period has extended for many seconds, perhaps even minutes!
Answer
The print_cpu_stall_info_end()
function, shown on
lines 40-43, is the counterpart of
print_cpu_stall_info_begin()
, and operates quite
similarly, but with a closing curly brace (“}”) instead
of an open curly brace in the CONFIG_RCU_CPU_STALL_INFO=n
case.
The zero_cpu_stall_ticks()
function is shown on
lines 45-48, and zeros the count of scheduling-clock interrupts
for the CPU specified by the rcu_data
structure passed in.
This function is called when the corresponding CPU notices that a
new RCU grace period has started.
The increment_cpu_stall_ticks()
function is shown
on lines 50-56, and increments each RCU flavor's
count of scheduling-clock interrupts for the current CPU.
Both of these functions are empty for
CONFIG_RCU_CPU_STALL_INFO=n.
The rcu_print_detail_task_stall_rnp()
and
rcu_print_detail_task_stall()
functions,
shown below, print out RCU CPU stall warning information for
the relevant rcu_node
structures:
 1 static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
 2 {
 3   unsigned long flags;
 4   struct task_struct *t;
 5
 6   raw_spin_lock_irqsave(&rnp->lock, flags);
 7   if (!rcu_preempt_blocked_readers_cgp(rnp)) {
 8     raw_spin_unlock_irqrestore(&rnp->lock, flags);
 9     return;
10   }
11   t = list_entry(rnp->gp_tasks,
12                  struct task_struct, rcu_node_entry);
13   list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
14     sched_show_task(t);
15   raw_spin_unlock_irqrestore(&rnp->lock, flags);
16 }
17
18 static void rcu_print_detail_task_stall(struct rcu_state *rsp)
19 {
20   struct rcu_node *rnp = rcu_get_root(rsp);
21
22   rcu_print_detail_task_stall_rnp(rnp);
23   rcu_for_each_leaf_node(rsp, rnp)
24     rcu_print_detail_task_stall_rnp(rnp);
25 }
The rcu_print_detail_task_stall_rnp()
function,
shown on lines 1-16, prints out CPU-stall warning information
for the specified rcu_node
structure.
Line 6 acquires the rcu_node
structure's ->lock
and line 15 releases it.
Line 7 checks to see if there are any RCU readers queued on
this structure that are blocking the current grace period, and if
not, line 8 releases the ->lock
and line 9 returns to the caller.
Lines 11 and 12 obtain a pointer to the task referenced
by this structure's ->gp_tasks
pointer,
and then line 13 iterates through the remainder of the
->blkd_tasks
list, starting with the task
referenced by ->gp_tasks.
For each such task, line 14 dumps its stack.
Quick Quiz 4:
But what if the ->gp_tasks
pointer is NULL
on line 11 of rcu_print_detail_task_stall_rnp()?
Won't that result in a segmentation fault?
Answer
The rcu_print_detail_task_stall()
function is shown on
lines 18-25.
It simply invokes rcu_print_detail_task_stall_rnp()
on the root rcu_node
structure and on each
leaf rcu_node
structure.
Quick Quiz 5:
Why doesn't rcu_print_detail_task_stall()
also invoke rcu_print_detail_task_stall_rnp()
on
all rcu_node
structures, rather than just
the root and leaves?
Answer
The next function is print_other_cpu_stall()
,
which handles the case where one CPU detects that some other CPU
has stalled.
 1 static void print_other_cpu_stall(struct rcu_state *rsp)
 2 {
 3   int cpu;
 4   long delta;
 5   unsigned long flags;
 6   int ndetected = 0;
 7   struct rcu_node *rnp = rcu_get_root(rsp);
 8
 9   raw_spin_lock_irqsave(&rnp->lock, flags);
10   delta = jiffies - rsp->jiffies_stall;
11   if (delta < RCU_STALL_RAT_DELAY || !rcu_gp_in_progress(rsp)) {
12     raw_spin_unlock_irqrestore(&rnp->lock, flags);
13     return;
14   }
15   rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
16   raw_spin_unlock_irqrestore(&rnp->lock, flags);
17   printk(KERN_ERR "INFO: %s detected stalls on CPUs/tasks:",
18          rsp->name);
19   print_cpu_stall_info_begin();
20   rcu_for_each_leaf_node(rsp, rnp) {
21     raw_spin_lock_irqsave(&rnp->lock, flags);
22     ndetected += rcu_print_task_stall(rnp);
23     if (rnp->qsmask == 0) {
24       raw_spin_unlock_irqrestore(&rnp->lock, flags);
25       continue;
26     }
27     for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++)
28       if (rnp->qsmask & (1UL << cpu)) {
29         print_cpu_stall_info(rsp, rnp->grplo + cpu);
30         ndetected++;
31       }
32     raw_spin_unlock_irqrestore(&rnp->lock, flags);
33   }
34   rnp = rcu_get_root(rsp);
35   raw_spin_lock_irqsave(&rnp->lock, flags);
36   ndetected += rcu_print_task_stall(rnp);
37   raw_spin_unlock_irqrestore(&rnp->lock, flags);
38   print_cpu_stall_info_end();
39   printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
40          smp_processor_id(), (long)(jiffies - rsp->gp_start));
41   if (ndetected == 0)
42     printk(KERN_ERR "INFO: Stall ended before state dump start\n");
43   else if (!trigger_all_cpu_backtrace())
44     dump_stack();
45   rcu_print_detail_task_stall(rsp);
46   force_quiescent_state(rsp);
47 }
Line 9 acquires the root rcu_node
structure's ->lock
for the RCU flavor specified by the
rsp
argument; this lock is released on line 16.
Line 10 computes the number of jiffies since the current
grace period began, and line 11 checks to see if this
has been long enough to warrant an RCU CPU stall warning;
if not, lines 12 and 13 release the ->lock
and return.
Otherwise, execution continues with line 15, which computes the
time at which the next stall warning should occur, assuming that the
current grace period does not end first.
As noted earlier, line 16 releases the ->lock.
Lines 17 and 18 print the stall-warning header and
line 19 prints the opening bracket for the CPU/task list.
Each pass through the loop spanning lines 20-33 prints
CPU stall warnings for one of the leaf rcu_node
structures.
Line 21 acquires the current structure's ->lock
and
line 22 invokes rcu_print_task_stall()
to
print information on each preempted task on this structure that is
blocking the current grace period (accumulating the number of such
tasks in ndetected).
If line 23 determines that there are no CPUs corresponding to
this structure blocking the current grace period,
line 24 releases the ->lock
and line 25 advances to the next leaf.
Otherwise, the loop spanning lines 27-31 iterates through each
CPU corresponding to this structure, with line 29 invoking
print_cpu_stall_info()
on each CPU that line 28
determines to be blocking the current grace period, and line 30
counting those CPUs.
Finally, line 32 releases this structure's ->lock.
Quick Quiz 6:
Yikes!!!
The print_other_cpu_stall()
function holds rcu_node
structure ->locks
while called functions invoke printk().
Is that a recipe for horrendous lock contention or what???
Answer
Line 34 picks up a pointer to the root rcu_node
structure and line 35 acquires its lock.
Line 36 invokes rcu_print_task_stall()
to print information
on each preempted task on the root rcu_node
structure
that is blocking the current grace period.
Line 37 then releases the lock and line 38 prints
the closing bracket for the CPU/task list.
Lines 39 and 40 print the stall-warning trailer.
If line 41 sees that there actually were no tasks or CPUs
blocking the current grace period, line 42 tells the sad story;
otherwise, line 43 attempts to force all CPUs to dump their
stacks, and if this attempt is unsuccessful, line 44 dumps
the current CPU's stack.
Line 45 dumps the stacks of all tasks blocking the current
grace period (but only if CONFIG_RCU_CPU_STALL_VERBOSE=y
)
and line 46 forces quiescent states in an attempt to end the
stall.
Quick Quiz 7:
Why would the attempt to make other CPUs dump their
stacks be subject to failure?
Answer
The following function, print_cpu_stall()
,
dumps out a stall-warning message when a CPU realizes that it
is the one that is still blocking the current grace period.
 1 static void print_cpu_stall(struct rcu_state *rsp)
 2 {
 3   unsigned long flags;
 4   struct rcu_node *rnp = rcu_get_root(rsp);
 5
 6   printk(KERN_ERR "INFO: %s self-detected stall on CPU", rsp->name);
 7   print_cpu_stall_info_begin();
 8   print_cpu_stall_info(rsp, smp_processor_id());
 9   print_cpu_stall_info_end();
10   printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
11   if (!trigger_all_cpu_backtrace())
12     dump_stack();
13   raw_spin_lock_irqsave(&rnp->lock, flags);
14   if (ULONG_CMP_GE(jiffies, rsp->jiffies_stall))
15     rsp->jiffies_stall = jiffies +
16                          3 * jiffies_till_stall_check() + 3;
17   raw_spin_unlock_irqrestore(&rnp->lock, flags);
18   set_need_resched();
19 }
Lines 6-10 dump the stall-warning header, opening bracket,
per-CPU information (but only for the current CPU), closing
bracket, and trailer, respectively.
Line 11 attempts to trigger a backtrace on all CPUs, and if that
fails, line 12 dumps the current CPU's stack.
Line 13 acquires the root rcu_node
structure's
->lock
(line 17 releases it).
Line 14 checks to see if the stall-warning time is in the past,
and, if so, lines 15 and 16 compute the time to the
next warning (assuming that the current grace period does not end
beforehand).
Finally, line 18 invokes set_need_resched()
in an
(probably futile) attempt to get this CPU unstalled.
Quick Quiz 8:
Given that print_cpu_stall()
is printing
an RCU CPU stall warning in response to the stall warning time
being in the past, how could it possibly be in the future?
Answer
The check_cpu_stall()
function, shown below,
is the top-level function for emitting CPU stall warnings.
 1 static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
 2 {
 3   unsigned long j;
 4   unsigned long js;
 5   struct rcu_node *rnp;
 6
 7   if (rcu_cpu_stall_suppress)
 8     return;
 9   j = ACCESS_ONCE(jiffies);
10   js = ACCESS_ONCE(rsp->jiffies_stall);
11   rnp = rdp->mynode;
12   if (rcu_gp_in_progress(rsp) &&
13       (ACCESS_ONCE(rnp->qsmask) & rdp->grpmask) && ULONG_CMP_GE(j, js)) {
14     print_cpu_stall(rsp);
15   } else if (rcu_gp_in_progress(rsp) &&
16              ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY)) {
17     print_other_cpu_stall(rsp);
18   }
19 }
Line 7 checks to see if CPU stall warnings have been suppressed,
and if so line 8 returns to the caller.
Lines 9 and 10 take snapshots of the current jiffies counter
and of the time (rsp->jiffies_stall) at which the current
grace period will be considered to have stalled,
and line 11 picks up a pointer to the current
CPU's leaf rcu_node
structure.
If lines 12 and 13 determine that there is a grace period in
progress and that this CPU is blocking the current
grace period and that it is time for a stall warning,
line 14 invokes print_cpu_stall()
to issue
that warning.
Otherwise, if lines 15 and 16 determine that there is
a grace period in progress and that the stall-warning time passed
at least RCU_STALL_RAT_DELAY jiffies ago, line 17 invokes
print_other_cpu_stall()
to print a stall warning on
behalf of some other CPU or task.
Quick Quiz 9:
Why doesn't line 12 of check_cpu_stall()
need to check that there is an RCU grace period in progress?
Answer
Finally, the following functions handle automatic suppression of stall warnings when other error conditions occur. This suppression is carried out to avoid corrupting long console messages with irrelevant stall-warning messages—the stall might well be a side-effect of the other error condition.
 1 static int rcu_panic(struct notifier_block *this, unsigned long ev, void *ptr)
 2 {
 3   rcu_cpu_stall_suppress = 1;
 4   return NOTIFY_DONE;
 5 }
 6
 7 void rcu_cpu_stall_reset(void)
 8 {
 9   struct rcu_state *rsp;
10
11   for_each_rcu_flavor(rsp)
12     rsp->jiffies_stall = jiffies + ULONG_MAX / 2;
13 }
14
15 static struct notifier_block rcu_panic_block = {
16   .notifier_call = rcu_panic,
17 };
18
19 static void __init check_cpu_stall_init(void)
20 {
21   atomic_notifier_chain_register(&panic_notifier_list, &rcu_panic_block);
22 }
Lines 1-5 show the rcu_panic()
function, which
is a notifier function that suppresses stall warnings when kernel
panics occur.
Lines 7-13 show the rcu_cpu_stall_reset()
function,
which (nearly) indefinitely delays further stall warnings for the
current grace period.
This function is invoked in situations where the watchdog timer
is also suppressed.
Lines 15-17 define the notifier block for rcu_panic(),
and lines 19-22 show the check_cpu_stall_init()
function that registers this notifier.
Once the notifier is registered, any subsequent kernel panic
will invoke rcu_panic()
, thereby shutting off stall warnings.
More information on RCU CPU stall warnings may be found in
Documentation/RCU/stallwarn.txt.
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Quick Quiz 1: How could a hardware failure result in an RCU CPU stall warning?
Answer: If a CPU fails in such a way that it stops executing, but without disturbing the rest of the system, then that CPU will never again report quiescent states to RCU. This is not a particularly probable failure mode, but it really has happened in real life. The RCU CPU stall warning messages were quite helpful in identifying the failed CPU.
Quick Quiz 2: Why not instead invoke RCU CPU stall detection from the grace-period-detection kthread?
Answer: In the future, RCU CPU stall detection might also be invoked from the quiescent-state-forcing mechanism that is invoked from the grace-period-detection kthread. However, this will not be likely to replace the other points of invocation, especially the scheduling-clock interrupt. The reason is that the scheduling clock interrupt will continue to run in kernels that are otherwise in very bad shape. Removing RCU CPU stall detection from the scheduling-clock interrupt handler would thus remove all diagnostics from certain types of hangs.
Quick Quiz 3: Why wouldn't the current CPU be aware of the current grace period? After all, when printing an RCU CPU stall warning, the current grace period has extended for many seconds, perhaps even minutes!
Answer: Yes, normally the current CPU would be aware of the current grace period: this code is executed only if the current CPU is blocking the current grace period. However, it is worth noting that CPUs in dyntick-idle mode and CPUs that are offline will not normally be aware of the current grace period. So, the question would then be “Why didn't some other CPU report a quiescent state on their behalf?” That question should help direct your debugging efforts.
Quick Quiz 4: But what if the ->gp_tasks
pointer is NULL
on line 11 of rcu_print_detail_task_stall_rnp()?
Won't that result in a segmentation fault?
Answer:
If ->gp_tasks
is NULL,
then rcu_preempt_blocked_readers_cgp()
would have
returned false, so that control would never have reached line 11
in the first place.
Quick Quiz 5: Why doesn't rcu_print_detail_task_stall()
also invoke rcu_print_detail_task_stall_rnp()
on all rcu_node
structures, rather than just the root and leaves?
Answer:
Only the root and leaves can queue
tasks that have been preempted within their RCU read-side critical sections.
So there is no reason to invoke rcu_print_detail_task_stall_rnp()
on anything other than the root and leaves.
Quick Quiz 6: Yikes!!!
The print_other_cpu_stall()
function holds rcu_node
structure ->locks
while called functions invoke printk().
Is that a recipe for horrendous lock contention or what???
Answer:
It would be a recipe for horrendous lock contention if
print_other_cpu_stall()
were invoked frequently.
However, currently the maximum rate at which it can be invoked
is once per three seconds, so it should not be a problem.
Unless you are dumping your console output over a low-speed serial
line, in which case you just might want to speed up your serial
console anyway.
Quick Quiz 7: Why would the attempt to make other CPUs dump their stacks be subject to failure?
Answer: Because a number of architectures don't implement it.
Quick Quiz 8: Given that print_cpu_stall()
is printing an RCU CPU stall warning in response to the stall-warning
time being in the past, how could it possibly be in the future?
Answer: Because some other CPU might have printed a stall-warning message concurrently, which would have caused it to update the stall warning time before we got a chance to check it.
Quick Quiz 9: Why doesn't line 12 of check_cpu_stall()
need to check that there is an RCU grace period in progress?
Answer:
Because the check is implicit in the check of
->qsmask:
if there was no grace period in progress, there could
not possibly be any bits set in that mask.