June 28, 2012 (Linux 3.4+)

This article was contributed by Paul E. McKenney

Introduction

  1. Dyntick-Idle Overview
  2. RCU Dyntick-Idle API
  3. RCU Dyntick-Idle Operation
  4. RCU Dyntick-Idle Implementation
  5. Summary

At the end we have the ubiquitous answers to the quick quizzes.

Dyntick-Idle Overview

In the good old days, each CPU in a multiprocessor system received a periodic scheduling-clock interrupt. The purpose of the scheduling-clock interrupt was to allow the scheduler to re-evaluate what process it should be running, to update process-accounting information, to invoke timers whose timeouts have expired, and so on. This is all rather pointless when the CPU is idle:

"Wake up!!!"

"Huh??? Nope, still nothing to do!"

"Never mind, then. Go back to sleep."

Back then, useless periodic interrupts were not considered a real problem. After all, if the CPU was idle anyway, what possible harm could come from sending it an interrupt? Especially given that many CPUs back then actually consumed more power in the idle loop than when running normal code!

This situation changed dramatically around the turn of the millennium, with the rise of energy efficiency as a first-class computing concern in the mainstream server space. (In contrast, the battery-powered embedded guys have been deeply concerned about energy efficiency for several decades.) Recent server-class CPUs are able to enter sleep states on idle, and the longer the idle period, the deeper the sleep state and the better the energy efficiency. Therefore, awakening each CPU every few milliseconds is no longer acceptable, and so the CONFIG_NO_HZ kernel parameter enables dyntick-idle mode, which avoids sending useless scheduling-clock interrupts to idle CPUs.

Given that RCU was first developed back when sending scheduling-clock interrupts to idle CPUs was not considered to be a problem, RCU was designed to rely on all CPUs receiving these interrupts, whether idle or not. Therefore, RCU had to change to accommodate the dyntick-idle scheduling-clock-interrupt-free CPUs, and that change was the addition of the RCU dyntick-idle API described in the next section.

RCU Dyntick-Idle API

A schematic of the RCU dyntick-idle API is shown below:

RCU-dyntick-API.png

RCU's dyntick-idle API consists of seven members in three groups. The first group has a single member named rcu_needs_cpu(). The dyntick-idle subsystem uses this API to ask RCU if it is OK to put the specified CPU into dyntick-idle mode. The caller is required to disable interrupts across the call to rcu_needs_cpu(). Normally, RCU grants these requests unless that CPU has RCU callbacks pending, in which case RCU needs the CPU to stay out of dyntick-idle mode in order to process the callbacks. However, the recently added CONFIG_RCU_FAST_NO_HZ kernel-configuration parameter causes RCU to try harder to permit CPUs to enter dyntick-idle mode.

The second group of API members informs RCU that the specified CPU will be entering dyntick-idle mode. In contrast with rcu_needs_cpu(), RCU is not permitted to refuse. RCU expects that the caller has obtained permission from RCU using the rcu_needs_cpu() API member described in the previous paragraph, and that no interrupts have been received in the meantime (but receiving non-maskable interrupts (NMIs) is OK). The members of this group are as follows:

  1. rcu_idle_enter(): The specified CPU is entering dyntick-idle mode.
  2. rcu_irq_exit(): The specified CPU is returning from an interrupt handler. If this interrupt occurred while the CPU was idle, and if this CPU called rcu_idle_enter() at the beginning of this idle period, and if this interrupt is not nested within another interrupt handler, then we need to re-enter dyntick-idle mode. The implementation of this API member is currently similar to that of rcu_idle_enter(). And yes, the bit about exiting interrupt handlers entering dyntick-idle mode can be confusing, but that really is what happens.
  3. rcu_nmi_exit(): The specified CPU is returning from an NMI handler, possibly into dyntick-idle mode.

All three of these functions must be invoked on the CPU that is entering dyntick-idle mode with interrupts disabled. This situation is likely to change in the future, and so adjustments will be made to ensure that RCU is properly aware of each CPU's dyntick-idle state. This change will enable the system to turn off scheduling-clock interrupts while executing user-mode code on CPUs that have only one runnable task.

The third group of API members informs RCU that the specified CPU will be exiting dyntick-idle mode. Again, RCU is not permitted to refuse, but in this case, RCU never has a reason to refuse. The members of this group are as follows:

  1. rcu_idle_exit(): The specified CPU is leaving dyntick-idle mode.
  2. rcu_irq_enter(): The specified CPU is entering an interrupt handler, possibly from dyntick-idle mode. The implementation of this API member is currently similar to that of rcu_idle_exit().
  3. rcu_nmi_enter(): The specified CPU is entering an NMI handler, possibly from dyntick-idle mode.

Again, all three of these functions must be invoked on the CPU that is exiting dyntick-idle mode with interrupts disabled.
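
To make the calling sequence concrete, here is a minimal user-space mock-up of a single idle-loop pass. This is only a sketch: the stub_*() and simulated_*() helpers are invented for illustration and merely stand in for the real kernel primitives, but the ordering (check rcu_needs_cpu() with interrupts disabled, then rcu_idle_enter(), later rcu_idle_exit(), all on the same CPU) follows the rules described above.

  #include <stdbool.h>
  #include <stdio.h>

  /* Stubs standing in for the kernel primitives; for illustration only. */
  static bool stub_rcu_needs_cpu(int cpu)
  {
    printf("rcu_needs_cpu(%d)\n", cpu);
    return false;  /* pretend RCU has no callbacks queued on this CPU */
  }
  static void stub_rcu_idle_enter(void)      { printf("rcu_idle_enter()\n"); }
  static void stub_rcu_idle_exit(void)       { printf("rcu_idle_exit()\n"); }
  static void simulated_irq_disable(void)    { printf("interrupts disabled\n"); }
  static void simulated_irq_enable(void)     { printf("interrupts enabled\n"); }
  static void simulated_low_power_wait(void) { printf("CPU sleeps, tick off\n"); }

  /* One pass through a hypothetical idle loop, honoring the rules above. */
  static void idle_loop_pass(int cpu)
  {
    simulated_irq_disable();        /* interrupts off across the check */
    if (stub_rcu_needs_cpu(cpu)) {
      simulated_irq_enable();       /* RCU has callbacks: keep the tick */
      return;
    }
    stub_rcu_idle_enter();          /* still with interrupts disabled */
    simulated_low_power_wait();     /* dyntick-idle: no scheduling clock */
    stub_rcu_idle_exit();           /* leaving dyntick-idle mode */
    simulated_irq_enable();
  }

  int main(void)
  {
    idle_loop_pass(0);
    return 0;
  }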

Quick Quiz 1: Why does receiving an interrupt invalidate RCU's grant of the rcu_needs_cpu() request?
Answer

RCU Dyntick-Idle Operation

RCU's dyntick-idle implementation makes use of a set of fields in a per-CPU rcu_dynticks structure to track each CPU's dyntick-idle state. The rcu_idle_enter(), rcu_idle_exit(), rcu_irq_enter(), and rcu_irq_exit() functions maintain a nesting-level counter that counts the number of nested interrupts currently running on the corresponding CPU. An additional (large) count is added if that CPU is not in dyntick-idle mode. The counter is therefore zero when the CPU is idle in dyntick-idle mode, one if in an interrupt handler taken from idle in dyntick-idle mode, a large number (call it “N”) when running at process level and thus not in dyntick-idle mode, N plus one if in an interrupt handler taken from process level, and N plus two if in a second-level nested interrupt taken from process level.

Quick Quiz 2: Why bother with the “large number N”??? Wouldn't “N=1” be a lot simpler?
Answer

The rcu_nmi_enter() and rcu_nmi_exit() functions maintain a separate counter that counts the number of nested NMI handlers taken from process level in dyntick-idle mode. And no, NMI handlers are not meant to nest, but if they ever do, RCU is ready for them.

Quick Quiz 3: Why can't a single counter serve both interrupts and NMIs?
Answer

There is a separate ->dynticks counter in the per-CPU rcu_dynticks structure whose lower bit is zero if RCU should ignore this CPU due to its being in dyntick-idle mode, and is one if RCU must pay attention to this CPU, whether because the CPU is not in dyntick-idle mode, has been interrupted from dyntick-idle mode, or has been NMIed from dyntick-idle mode. Thus ->dynticks is incremented whenever the sum of ->dynticks_nesting and ->dynticks_nmi_nesting changes from zero to non-zero or vice versa.
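
The rule in the preceding paragraph can be illustrated with a small stand-alone model. The program below is not the kernel implementation: the DYNTICK_NONIDLE constant merely stands in for the “large number N”, and the real rcu_nmi_enter() and rcu_nmi_exit() take a slightly different path. It does, however, show how incrementing ->dynticks on every zero/non-zero transition of the summed nesting counts leaves that counter odd exactly when RCU must pay attention to the CPU.

  #include <assert.h>
  #include <stdio.h>

  #define DYNTICK_NONIDLE (1 << 20)  /* stands in for the "large number N" */

  static long long dynticks_nesting = DYNTICK_NONIDLE;  /* start at process level */
  static long long dynticks_nmi_nesting;
  static unsigned int dynticks = 1;  /* odd: RCU is paying attention */

  /* Bump dynticks whenever the sum of the nesting counters crosses zero. */
  static void adjust(long long delta_nesting, long long delta_nmi)
  {
    long long oldsum = dynticks_nesting + dynticks_nmi_nesting;
    long long newsum;

    dynticks_nesting += delta_nesting;
    dynticks_nmi_nesting += delta_nmi;
    newsum = dynticks_nesting + dynticks_nmi_nesting;
    if ((oldsum == 0) != (newsum == 0))
      dynticks++;
  }

  static void idle_enter(void) { adjust(-DYNTICK_NONIDLE, 0); }
  static void idle_exit(void)  { adjust(+DYNTICK_NONIDLE, 0); }
  static void irq_enter(void)  { adjust(+1, 0); }
  static void irq_exit(void)   { adjust(-1, 0); }
  static void nmi_enter(void)  { adjust(0, +1); }
  static void nmi_exit(void)   { adjust(0, -1); }

  int main(void)
  {
    idle_enter();                   /* process level -> dyntick-idle */
    assert((dynticks & 0x1) == 0);  /* even: RCU may ignore this CPU */
    irq_enter();                    /* interrupt taken from idle */
    assert((dynticks & 0x1) == 1);  /* odd: RCU must pay attention */
    nmi_enter();                    /* NMI nested within the interrupt */
    nmi_exit();
    irq_exit();                     /* back to dyntick-idle */
    assert((dynticks & 0x1) == 0);
    idle_exit();                    /* back to process level */
    assert((dynticks & 0x1) == 1);
    printf("dynticks ended up at %u\n", dynticks);
    return 0;
  }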

Quick Quiz 4: This sounds quite a bit different than the algorithm previously proven correct. If you have a proven-correct algorithm, why would you ever use something else? Especially when this stupid new algorithm requires taking action based on the sum of two counters, which cannot be done atomically!!! Whatever were you thinking???
Answer

The following section examines the code that implements all this.

RCU Dyntick-Idle Implementation

This section will look first at the ->dynticks_nesting format; then the dyntick-idle-exit functions (rcu_idle_exit(), rcu_irq_enter(), and rcu_nmi_enter()); next the dyntick-idle-entry functions (rcu_idle_enter(), rcu_irq_exit(), and rcu_nmi_exit()); next code fragments used to determine whether or not some other CPU is in dyntick-idle mode; and finally the dyntick-idle query function (rcu_needs_cpu()), both the default and the CONFIG_RCU_FAST_NO_HZ variants. Each of these categories is covered in one of the following sections.

Format of the ->dynticks_nesting Counter
The ->dynticks_nesting counter has three jobs:
  1. Count the number of exits from idle from an RCU viewpoint, including actual exits from the idle loop as well as RCU_NONIDLE() invocations, which might be nested.
  2. Count the number of nested interrupts.
  3. Allow for the fact that it is not possible to compute an exact count of nested interrupts on some architectures due to those architectures' strange habit of entering interrupt handlers that they never exit and perhaps also vice versa. (This is used to emulate system calls from within the kernel on some architectures.)

Fortunately, any time a CPU enters the idle loop, we know that it has exited all of the interrupt nesting levels that it is going to exit, so we can simply set the value of the ->dynticks_nesting counter to zero at that point, thereby eliminating any interrupt-level miscounting that might have occurred during the just-ended non-idle period. Similarly, a CPU cannot exit the idle loop while executing in an interrupt handler.

The 64-bit ->dynticks_nesting counter handles all of this by reserving the lower seven bits of the upper byte to count exits from the idle loop and nested RCU_NONIDLE() invocations. The uppermost bit (bit 63) is used to detect excessive nesting. Bits 54 and 53 (counting from zero) are guard bits that protect against the interrupt nesting-level count going negative: when a CPU exits idle, bit 54 is set to one and bit 53 is set to zero. The remaining low-order bits are used to count interrupt nesting.

Of course, if the count of interrupt nesting goes negative, then bit 54 will become zero. However, if that happens, bit 53 will become one. So if either bit 53 or 54 are non-zero, then there is a process-level-related reason why RCU is non-idle on this CPU.

The bit-field definitions are as follows:

  1 #define DYNTICK_TASK_NEST_WIDTH 7
  2 #define DYNTICK_TASK_NEST_VALUE ((LLONG_MAX >> DYNTICK_TASK_NEST_WIDTH) + 1)
  3 #define DYNTICK_TASK_NEST_MASK  (LLONG_MAX - DYNTICK_TASK_NEST_VALUE + 1)
  4 #define DYNTICK_TASK_FLAG       ((DYNTICK_TASK_NEST_VALUE / 8) * 2)
  5 #define DYNTICK_TASK_MASK       ((DYNTICK_TASK_NEST_VALUE / 8) * 3)
  6 #define DYNTICK_TASK_EXIT_IDLE  (DYNTICK_TASK_NEST_VALUE + DYNTICK_TASK_FLAG)
The actual hexadecimal values are probably more illuminating:
DYNTICK_TASK_NEST_VALUE 0100000000000000
DYNTICK_TASK_NEST_MASK  7f00000000000000
DYNTICK_TASK_FLAG       0040000000000000
DYNTICK_TASK_MASK       0060000000000000
DYNTICK_TASK_EXIT_IDLE  0140000000000000

DYNTICK_TASK_NEST_MASK defines the seven-bit field that is used to count the process-related reasons why RCU must consider this CPU to be non-idle, and DYNTICK_TASK_NEST_VALUE is used to increment and decrement this field. DYNTICK_TASK_FLAG is a guard bit that prevents underflow of the interrupt-nesting-level count in the low-order bits from affecting the value of the DYNTICK_TASK_NEST_MASK field. The DYNTICK_TASK_MASK field is such that at least one of its two bits will be non-zero if there is a process-level-related reason why RCU is non-idle on this CPU. Finally, the DYNTICK_TASK_EXIT_IDLE macro defines the value that the ->dynticks_nesting flag is set to upon normal exit from the idle loop.
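
For those who would rather let the computer do the bit twiddling, the following stand-alone program reproduces the hexadecimal table above. The macro definitions are copied verbatim from the listing; only the printing is new.

  #include <limits.h>
  #include <stdio.h>

  #define DYNTICK_TASK_NEST_WIDTH 7
  #define DYNTICK_TASK_NEST_VALUE ((LLONG_MAX >> DYNTICK_TASK_NEST_WIDTH) + 1)
  #define DYNTICK_TASK_NEST_MASK  (LLONG_MAX - DYNTICK_TASK_NEST_VALUE + 1)
  #define DYNTICK_TASK_FLAG       ((DYNTICK_TASK_NEST_VALUE / 8) * 2)
  #define DYNTICK_TASK_MASK       ((DYNTICK_TASK_NEST_VALUE / 8) * 3)
  #define DYNTICK_TASK_EXIT_IDLE  (DYNTICK_TASK_NEST_VALUE + DYNTICK_TASK_FLAG)

  int main(void)
  {
    printf("DYNTICK_TASK_NEST_VALUE %016llx\n",
           (unsigned long long)DYNTICK_TASK_NEST_VALUE);
    printf("DYNTICK_TASK_NEST_MASK  %016llx\n",
           (unsigned long long)DYNTICK_TASK_NEST_MASK);
    printf("DYNTICK_TASK_FLAG       %016llx\n",
           (unsigned long long)DYNTICK_TASK_FLAG);
    printf("DYNTICK_TASK_MASK       %016llx\n",
           (unsigned long long)DYNTICK_TASK_MASK);
    printf("DYNTICK_TASK_EXIT_IDLE  %016llx\n",
           (unsigned long long)DYNTICK_TASK_EXIT_IDLE);
    return 0;
  }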

Dyntick-Idle Exit

Quick Quiz 5: Why are we covering dyntick-idle exit first? Don't you usually have to enter something before you can exit it?
Answer

The rcu_idle_exit() function is responsible for telling RCU that the current CPU is transitioning out of dyntick-idle mode. This function is short, but it nevertheless should be taken seriously.

  1 void rcu_idle_exit(void)
  2 {
  3   unsigned long flags;
  4   struct rcu_dynticks *rdtp;
  5   long long oldval;
  6 
  7   local_irq_save(flags);
  8   rdtp = &__get_cpu_var(rcu_dynticks);
  9   oldval = rdtp->dynticks_nesting;
 10   WARN_ON_ONCE(oldval < 0);
 11   if (oldval & DYNTICK_TASK_NEST_MASK)
 12     rdtp->dynticks_nesting += DYNTICK_TASK_NEST_VALUE;
 13   else
 14     rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 15   rcu_idle_exit_common(rdtp, oldval);
 16   local_irq_restore(flags);
 17 }

Line 7 disables interrupts and line 16 restores them, although most current callers will have already disabled interrupts.

Quick Quiz 6: Is there anything special that a caller must do if it invokes rcu_idle_exit() with interrupts enabled?
Answer

Line 8 picks up a pointer to the current CPU's rcu_dynticks structure. Line 9 snapshots the ->dynticks_nesting counter, and if the value of this counter is negative, line 10 complains piteously, presumably because of excessive nesting of RCU_NONIDLE() invocations. Line 11 checks to see if RCU is already non-idle, and if so, line 12 adds DYNTICK_TASK_NEST_VALUE to the ->dynticks_nesting counter to record an additional level of non-idleness; otherwise, line 14 sets the value of the ->dynticks_nesting counter to DYNTICK_TASK_EXIT_IDLE, which is a large positive number indicating that a non-idle task is running on this CPU. Line 15 then invokes rcu_idle_exit_common() to do processing common to rcu_idle_exit() and rcu_irq_enter().

Quick Quiz 7: Why not trace the start and end of rcu_idle_exit()? This would allow determining the overhead of this function.
Answer

As noted above, rcu_idle_exit_common() handles processing common to the rcu_idle_exit() and rcu_irq_enter() cases.

  1 static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)
  2 {
  3   smp_mb__before_atomic_inc();
  4   atomic_inc(&rdtp->dynticks);
  5   smp_mb__after_atomic_inc();
  6   WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
  7   rcu_cleanup_after_idle(smp_processor_id());
  8   trace_rcu_dyntick("End", oldval, rdtp->dynticks_nesting);
  9   if (!is_idle_task(current)) {
 10     struct task_struct *idle = idle_task(smp_processor_id());
 11 
 12     trace_rcu_dyntick("Error on exit: not idle task",
 13           oldval, rdtp->dynticks_nesting);
 14     ftrace_dump(DUMP_ALL);
 15     WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
 16         current->pid, current->comm,
 17         idle->pid, idle->comm);
 18   }
 19 }

Lines 3-5 atomically increment the ->dynticks counter with full memory ordering. The smp_mb__before_atomic_inc() ensures that the preceding update of ->dynticks_nesting is perceived by all CPUs as happening before the increment of ->dynticks. The smp_mb__after_atomic_inc() in turn ensures that any subsequent RCU read-side critical sections are perceived as occurring after the increment of the ->dynticks counter. This last is extremely important, because RCU will happily ignore any RCU read-side critical sections on a CPU whose ->dynticks appears to have an even value. Line 6 complains if the value of ->dynticks is still even, line 7 has no effect unless CONFIG_RCU_FAST_NO_HZ=y (which will be covered later), and line 8 traces the end of this dyntick-idle sojourn. Finally, lines 9-18 complain bitterly if a non-idle task is attempting to exit from idle. (After all, what the heck was the non-idle task doing in the idle loop to start with?)

The implementation of rcu_irq_enter() is similar to that of rcu_idle_exit():

  1 void rcu_irq_enter(void)
  2 {
  3   unsigned long flags;
  4   struct rcu_dynticks *rdtp;
  5   long long oldval;
  6 
  7   local_irq_save(flags);
  8   rdtp = &__get_cpu_var(rcu_dynticks);
  9   oldval = rdtp->dynticks_nesting;
 10   rdtp->dynticks_nesting++;
 11   WARN_ON_ONCE(rdtp->dynticks_nesting == 0);
 12   if (oldval)
 13     trace_rcu_dyntick("++=", oldval, rdtp->dynticks_nesting);
 14   else
 15     rcu_idle_exit_common(rdtp, oldval);
 16   local_irq_restore(flags);
 17 }

Line 7 disables interrupts and line 16 restores them. Line 8 picks up a pointer to the current CPU's rcu_dynticks structure. Line 9 snapshots the ->dynticks_nesting counter, and line 10 increments it. If the new value of this counter is zero, line 11 complains piteously. Line 12 checks to see if the old value was non-zero, and if so, line 13 traces the additional level of non-idleness, otherwise, line 15 invokes rcu_idle_exit_common().

The rcu_nmi_enter() function is as follows:

  1 void rcu_nmi_enter(void)
  2 {
  3   struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
  4 
  5   if (rdtp->dynticks_nmi_nesting == 0 &&
  6       (atomic_read(&rdtp->dynticks) & 0x1))
  7     return;
  8   rdtp->dynticks_nmi_nesting++;
  9   smp_mb__before_atomic_inc();
 10   atomic_inc(&rdtp->dynticks);
 11   smp_mb__after_atomic_inc();
 12   WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
 13 }

Line 3 picks up a pointer to this CPU's rcu_dynticks structure. Lines 5 and 6 are a bit subtle and understanding them requires some background:

  1. The ->dynticks_nmi_nesting counter will always be zero unless you are executing in an NMI handler.
  2. The ->dynticks_nesting counter cannot change while you are executing in an NMI handler. More generally, of course, non-NMI software cannot do anything at all from a CPU that is running an NMI handler.
  3. Interrupts are disabled while sampling and updating the ->dynticks_nesting counter, so it cannot change while it is being accessed.
  4. If the ->dynticks counter has an odd value, then RCU is already paying attention to this CPU, and the NMI handlers therefore don't need to do anything.

Given this background, let's take a closer look at lines 5 and 6. Line 5 checks that ->dynticks_nmi_nesting is zero, in other words, that we are not nested within another NMI handler that updated the ->dynticks counter. Line 6 checks to see whether ->dynticks has an odd value, in other words whether we are outside of dyntick-idle mode. If no NMI handler has updated the ->dynticks counter, but that counter already has an odd value, then the initial NMI must have been taken from within a non-dyntick-idle region of code. Either way, we know that RCU is already paying attention to this CPU, so that the NMI handler does not need to do anything, and so line 7 simply returns without updating any rcu_dynticks state.

Quick Quiz 8: But NMIs cannot nest!!! So why in the world are you allowing for NMI nesting???
Answer

If we get to line 8, then we need to update the dyntick-idle state. Line 8 increments the ->dynticks_nmi_nesting counter and lines 9-11 atomically increment the ->dynticks with full memory ordering in the same way and for the same reasons that rcu_idle_exit() did. Finally, line 12 complains if the ->dynticks counter still has an even value.

Quick Quiz 9: Suppose that an NMI occurs after the read from ->dynticks implied by the atomic_inc() on line 4 of rcu_idle_exit_common(), but before the corresponding write? Can't that result in the value of ->dynticks going backwards? Is this possibility correctly handled by the code?
Answer

Now that we have seen how to get RCU out of dyntick-idle mode, let's look at how to reverse this process.

Dyntick-Idle Entry

The rcu_idle_enter() function is responsible for telling RCU that the current CPU is transitioning to dyntick-idle mode.

  1 void rcu_idle_enter(void)
  2 {
  3   unsigned long flags;
  4   long long oldval;
  5   struct rcu_dynticks *rdtp;
  6 
  7   local_irq_save(flags);
  8   rdtp = &__get_cpu_var(rcu_dynticks);
  9   oldval = rdtp->dynticks_nesting;
 10   WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
 11   if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE)
 12     rdtp->dynticks_nesting = 0;
 13   else
 14     rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 15   rcu_idle_enter_common(rdtp, oldval);
 16   local_irq_restore(flags);
 17 }

Line 7 disables interrupts and line 16 restores them, though all current callers will have already disabled interrupts. Line 8 picks up a pointer to the current CPU's rcu_dynticks structure, line 9 takes a snapshot of the ->dynticks_nesting counter, and line 10 complains if the counter's DYNTICK_TASK_NEST_MASK field is already zero (it is not nice to enter the idle loop if you are already in the idle loop). Line 11 checks to see if this is the final process-related reason for RCU considering this CPU to be non-idle, and if so, line 12 zeroes the ->dynticks_nesting counter; otherwise, line 14 decrements the process-related-reason field of this same counter. Line 15 then invokes rcu_idle_enter_common(), which contains code common to rcu_idle_enter() and rcu_irq_exit().

The implementation of rcu_idle_enter_common() is as follows:

  1 static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)
  2 {
  3   trace_rcu_dyntick("Start", oldval, 0);
  4   if (!is_idle_task(current)) {
  5     struct task_struct *idle = idle_task(smp_processor_id());
  6 
  7     trace_rcu_dyntick("Error on entry: not idle task", oldval, 0);
  8     ftrace_dump(DUMP_ALL);
  9     WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",
 10         current->pid, current->comm,
 11         idle->pid, idle->comm);
 12   }
 13   rcu_prepare_for_idle(smp_processor_id());
 14   smp_mb__before_atomic_inc();
 15   atomic_inc(&rdtp->dynticks);
 16   smp_mb__after_atomic_inc();
 17   WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
 18   rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
 19          "Illegal idle entry in RCU read-side critical section.");
 20   rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map),
 21          "Illegal idle entry in RCU-bh read-side critical section.");
 22   rcu_lockdep_assert(!lock_is_held(&rcu_sched_lock_map),
 23          "Illegal idle entry in RCU-sched read-side critical section.");
 24 }

Line 3 traces entry to dyntick-idle mode, which must happen while the ->dynticks counter has an odd-numbered value. Lines 4-12 complain bitterly if the current task is not the idle task. Line 13 has effect only if CONFIG_RCU_FAST_NO_HZ=y, which is covered later. Lines 14-16 then atomically increment the ->dynticks counter with full ordering in order to prevent any RCU read-side critical sections from bleeding out into the dyntick-idle region. Line 17 then complains if the value of the ->dynticks remains odd. Lines 18-23 complain if someone forgets to exit an RCU read-side critical section on the way to the idle loop.

Quick Quiz 10: Is there anything special that a caller must do if it invokes rcu_idle_enter() with interrupts enabled?
Answer

The implementation of rcu_irq_exit() is similar to that of rcu_idle_enter():

  1 void rcu_irq_exit(void)
  2 {
  3   unsigned long flags;
  4   long long oldval;
  5   struct rcu_dynticks *rdtp;
  6 
  7   local_irq_save(flags);
  8   rdtp = &__get_cpu_var(rcu_dynticks);
  9   oldval = rdtp->dynticks_nesting;
 10   rdtp->dynticks_nesting--;
 11   WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
 12   if (rdtp->dynticks_nesting)
 13     trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);
 14   else
 15     rcu_idle_enter_common(rdtp, oldval);
 16   local_irq_restore(flags);
 17 }

Line 7 disables interrupts and line 16 restores them, though all current callers will have already disabled interrupts. Line 8 picks up a pointer to the current CPU's rcu_dynticks structure, line 9 takes a snapshot of the ->dynticks_nesting counter, and line 10 decrements the counter. Line 11 complains if the counter is now negative. Line 12 checks whether the new value of the ->dynticks_nesting counter is still non-zero, and if so, line 13 traces the unnesting of a level of dyntick-non-idleness; otherwise, line 15 invokes rcu_idle_enter_common().

In contrast, the implementation of rcu_nmi_exit() must be considered separately:

  1 void rcu_nmi_exit(void)
  2 {
  3   struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);
  4 
  5   if (rdtp->dynticks_nmi_nesting == 0 ||
  6       --rdtp->dynticks_nmi_nesting != 0)
  7     return;
  8   smp_mb__before_atomic_inc();
  9   atomic_inc(&rdtp->dynticks);
 10   smp_mb__after_atomic_inc();
 11   WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
 12 }

Line 3 picks up a pointer to the current CPU's rcu_dynticks structure. Line 5 checks to see if NMI nesting is being tracked, and if not, line 7 returns. If NMI nesting is being tracked, line 6 decrements it, and if it has not returned to zero, line 7 again returns. This “if” statement is driven by the same logic as its counterpart in rcu_nmi_enter(). Because rcu_nmi_enter() and rcu_nmi_exit() are paired, the only way that ->dynticks_nmi_nesting can be zero upon entry is if the corresponding rcu_nmi_enter() declined to increment it. If rcu_nmi_enter() declined to increment it, then the corresponding rcu_nmi_exit() had better not decrement it.

Given this description of how each CPU tracks whether or not RCU should be paying attention to it, we now look at how RCU safely accesses this information.

Dyntick-Idle Determination

This section describes how code on one CPU may safely determine whether or not some other CPU has recently been in dyntick-idle mode. This code is simple but subtle:

  1 dt = atomic_add_return(0, &rdp->dynticks->dynticks);

Here we are taking a snapshot of the ->dynticks counter by atomically adding zero to it, and assigning the new value (which happens to equal the old value) to the local variable dt. This is a rather expensive way to load a counter, given that it involves an atomic instruction and full memory barriers. However, this strong ordering is absolutely required. We need to precisely order our sampling with any transitions that the other CPU might be making: if the other CPU exited dyntick-idle mode and entered an RCU read-side critical section just before the current grace period started, we need to know that.

Of course, it is possible that successive samples could be unluckily timed so that each sample shows the CPU as not being in dyntick-idle mode despite that CPU repeatedly entering and exiting that mode. This possibility is handled by comparing the results from a pair of samples taken at different (though perhaps closely spaced) times. The comparison expression is as follows:

  1 ((curr & 0x1) == 0 || UINT_CMP_GE(curr, snap + 2))

Here curr is the most recent sampling of ->dynticks and snap is a previous sampling. The upshot is that if the most recent sampling shows the CPU to be in dyntick-idle mode, or if ->dynticks was incremented at least twice, we consider the CPU to have been in dyntick-idle mode at least once since the first sampling.
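
The following stand-alone program illustrates the comparison. It is only a model: the ->dynticks counter is a plain unsigned int updated by hand rather than via atomic_add_return(), the in_dyntick_idle_since() helper is invented here, and UINT_CMP_GE() is defined locally using the kernel's wrap-safe formulation. It shows that a CPU that slipped into and back out of dyntick-idle mode between the two samples is still detected, even though both samples see odd values:

  #include <limits.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* Wrap-safe "a >= b" comparison for unsigned counters. */
  #define UINT_CMP_GE(a, b) (UINT_MAX / 2 >= (a) - (b))

  /* Has the CPU been in dyntick-idle mode at some point since "snap"? */
  static bool in_dyntick_idle_since(unsigned int curr, unsigned int snap)
  {
    return (curr & 0x1) == 0 || UINT_CMP_GE(curr, snap + 2);
  }

  int main(void)
  {
    unsigned int dynticks = 41;   /* odd: CPU currently non-idle */
    unsigned int snap = dynticks; /* first sample */

    /* The CPU enters and then exits dyntick-idle mode between samples. */
    dynticks++;                   /* enter idle: counter becomes even */
    dynticks++;                   /* exit idle: counter is odd again */

    printf("snap=%u curr=%u -> idle seen: %s\n", snap, dynticks,
           in_dyntick_idle_since(dynticks, snap) ? "yes" : "no");

    /* Second scenario: the CPU never went idle, so the counter is unchanged. */
    printf("snap=%u curr=%u -> idle seen: %s\n", snap, snap,
           in_dyntick_idle_since(snap, snap) ? "yes" : "no");
    return 0;
  }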

It is now time to look into the dyntick-idle query implementation.

Dyntick-Idle Query: Default

This section describes the functions that determine whether or not RCU can tolerate the current CPU entering dyntick-idle mode. The default dyntick-idle query mechanism is straightforward:

  1 static int rcu_preempt_needs_cpu(int cpu)
  2 {
  3   return !!per_cpu(rcu_preempt_data, cpu).nxtlist;
  4 }
  5 
  6 static int rcu_cpu_has_callbacks(int cpu)
  7 {
  8   /* RCU callbacks either ready or pending? */
  9   return per_cpu(rcu_sched_data, cpu).nxtlist ||
 10          per_cpu(rcu_bh_data, cpu).nxtlist ||
 11          rcu_preempt_needs_cpu(cpu);
 12 }
 13 
 14 int rcu_needs_cpu(int cpu)
 15 {
 16   return rcu_cpu_has_callbacks(cpu);
 17 }

The rcu_preempt_needs_cpu() helper function on lines 1-4 simply checks to see if the current CPU has any RCU-preempt callback queued. In kernels built with CONFIG_TREE_RCU (as opposed to CONFIG_TREE_PREEMPT_RCU), there is no RCU-preempt, so rcu_preempt_needs_cpu() is instead a function that unconditionally returns zero.

The rcu_cpu_has_callbacks() helper function on lines 6-12 checks to see if the current CPU has any RCU-sched or RCU-bh callbacks queued, and uses the aforementioned rcu_preempt_needs_cpu() function to check if the current CPU has any RCU-preempt callbacks. If any of the flavors of RCU have callbacks queued on this CPU, rcu_cpu_has_callbacks() returns 1, otherwise zero.

The rcu_needs_cpu() function on lines 14-17 is simply a wrapper for rcu_cpu_has_callbacks().

In short, given a kernel built with the default CONFIG_RCU_FAST_NO_HZ=n, RCU permits a given CPU to enter dyntick-idle mode only if that CPU has no callbacks queued for any flavor of RCU built into the kernel.

Dyntick-Idle Query: CONFIG_RCU_FAST_NO_HZ

The CONFIG_RCU_FAST_NO_HZ variant is one of many RCU features that I would have never thought of on my own. I became aware of the need for it when an embedded developer telephoned me complaining that RCU was draining the battery of his dual-core battery-powered embedded device. After some discussion, I learned that his complaint was that when one of the CPUs was in dyntick-idle mode, and the second CPU was trying to go into dyntick-idle mode, RCU would prevent the second CPU from entering dyntick-idle mode for the several jiffies required to complete the current grace period. His point was that both CPUs were idle, so the grace period should instead complete immediately, rather than chewing up another few milliseconds of precious battery power.

Not wanting RCU to be considered ungreen, I implemented CONFIG_RCU_FAST_NO_HZ, and later reimplemented it so that it would provide similar benefits to large servers. When this option is enabled, rcu_needs_cpu() checks to see if this CPU has any RCU callbacks queued, and attempts to accelerate grace periods if so.

In a perfect world, this implementation would simply loop, alternately invoking force_quiescent_state() and rcu_process_callbacks(), but deadlock considerations rule out a direct function call to rcu_process_callbacks() (but don't take my word for it, try it with lockdep enabled). We therefore invoke rcu_process_callbacks() indirectly, via invoke_rcu_core(), which will raise RCU_SOFTIRQ. If there is nothing else for the system to do when the softirq handler completes, the CPU will again attempt to enter dyntick-idle mode, which will again invoke rcu_needs_cpu(). The effect is to create a loop via repeated RCU_SOFTIRQ invocations, as shown in the following (highly stylized) diagram:

RCU-dyntick-API.png

This “preparing for idle” loop runs through the “Process Level”, “rcu_needs_cpu()”, “rcu_prepare_for_idle()”, “force_quiescent_state()”, and “RCU Core” boxes. Each pass through this loop works to advance RCU's grace-period state machine, attempting to clear out all of this CPU's RCU callbacks. If this attempt succeeds, the CPU moves to the “idle, !tick, !CBs” box, where the CPU remains (ignored by RCU) until either an interrupt, a wakeup, or an RCU_NONIDLE() transitions away from this state.

If after a few passes through the “preparing for idle” loop callbacks remain, but rcu_pending() says that the RCU core doesn't need anything more from this CPU, then the CPU moves to the “idle, !tick, CBs” box. However, in this case, RCU sets up a timer to awaken the CPU in order to process its callbacks in a reasonably timely manner. The CPU remains in “idle, !tick, CBs” until either an interrupt (for example, a timer interrupt), a wakeup, or an RCU_NONIDLE() transitions away from this state.

Finally, if after a number of passes through the “preparing for idle” loop callbacks remain and the RCU core still needs something from this CPU, the CPU moves to the “idle, tick, CBs” box. Here the scheduling-clock tick remains active, so that the CPU will respond to the RCU core every jiffy. The CPU will not attempt to move to the other idle states for the remainder of the current jiffy, but then will restart the “preparing for idle” state machine from the beginning.

With this background, we are ready to look at the code. We first look at the constant and variable definitions, then the helper functions, and finally rcu_prepare_for_idle() itself.

  1 #define RCU_IDLE_FLUSHES 5
  2 #define RCU_IDLE_OPT_FLUSHES 3
  3 #define RCU_IDLE_GP_DELAY 6
  4 #define RCU_IDLE_LAZY_GP_DELAY (6 * HZ)
  5 
  6 static DEFINE_PER_CPU(int, rcu_dyntick_drain);
  7 static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff);
  8 static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer);
  9 static DEFINE_PER_CPU(unsigned long, rcu_idle_gp_timer_expires);
 10 static DEFINE_PER_CPU(bool, rcu_idle_first_pass);
 11 static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted);
 12 static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted_snap);

The RCU_IDLE_FLUSHES macro on line 1 defines the maximum number of times that we will go through the loop. The RCU_IDLE_OPT_FLUSHES macro on line 2 defines the minimum number of times we will go through the loop in the case where the CPU has callbacks. This minimum is required in cases where the current CPU is not yet aware of the current grace period. The RCU_IDLE_GP_DELAY macro on line 3 is a rough estimate of the typical RCU grace-period delay in jiffies, and is used to set a timer to prevent the CPU from indefinitely delaying RCU callbacks. The RCU_IDLE_LAZY_GP_DELAY macro on line 4 is a much longer delay that is used when all of the RCU callbacks queued on this CPU are “lazy”, in other words, if they all do nothing other than freeing memory. Such callbacks may usually be safely deferred for quite some time on an idle system.

Quick Quiz 11: Why limit the number of passes through the loop rather than keeping going until the grace period completes and all callbacks are invoked? After all, the CPU is idle anyway.
Answer

The per-CPU rcu_dyntick_drain variable on line 6 is the loop counter, counting down from RCU_IDLE_FLUSHES to zero. The per-CPU rcu_dyntick_holdoff variable on line 7 is used to force a holdoff period of about a jiffy should the loop be unable to force the CPU into dyntick-idle mode despite looping the maximum number of times allowed by RCU_IDLE_FLUSHES. The per-CPU rcu_idle_gp_timer on line 8 is a timer that is used to wake up CPUs as needed to prevent indefinite postponement of RCU callbacks. Line 9 defines rcu_idle_gp_timer_expires, which is used to cache the time at which rcu_idle_gp_timer will expire. This variable is used to repost the timer after momentary exits from idle (for example, to do tracing). Line 10 defines rcu_idle_first_pass, which is used to enable special handling of the first pass through the rcu_prepare_for_idle() loop. Finally, lines 11 and 12 are a pair of counters that allow rcu_prepare_for_idle() to detect when an otherwise innocuous momentary exit from idle posted a non-lazy RCU callback, which requires rcu_prepare_for_idle() to re-evaluate its dyntick-idle decision. For example, if there previously were no non-lazy callbacks, but one was posted during a momentary exit from idle, then it may be necessary to repost rcu_idle_gp_timer with a shorter timeout value.

  1 int rcu_needs_cpu(int cpu)
  2 {
  3   per_cpu(rcu_idle_first_pass, cpu) = 1;
  4   if (!rcu_cpu_has_callbacks(cpu))
  5     return 0;
  6   return per_cpu(rcu_dyntick_holdoff, cpu) == jiffies;
  7 }
  8 
  9 static bool __rcu_cpu_has_nonlazy_callbacks(struct rcu_data *rdp)
 10 {
 11   return rdp->qlen != rdp->qlen_lazy;
 12 }
 13 
 14 static bool rcu_preempt_cpu_has_nonlazy_callbacks(int cpu)
 15 {
 16   struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
 17 
 18   return __rcu_cpu_has_nonlazy_callbacks(rdp);
 19 }
 20 
 21 static bool rcu_cpu_has_nonlazy_callbacks(int cpu)
 22 {
 23   return __rcu_cpu_has_nonlazy_callbacks(&per_cpu(rcu_sched_data, cpu)) ||
 24          __rcu_cpu_has_nonlazy_callbacks(&per_cpu(rcu_bh_data, cpu)) ||
 25          rcu_preempt_cpu_has_nonlazy_callbacks(cpu);
 26 }
 27 
 28 void rcu_idle_demigrate(void *unused)
 29 {
 30   trace_rcu_prep_idle("Demigrate");
 31 }
 32 
 33 static void rcu_idle_gp_timer_func(unsigned long cpu_in)
 34 {
 35   int cpu = (int)cpu_in;
 36 
 37   trace_rcu_prep_idle("Timer");
 38   if (cpu == smp_processor_id()) {
 39     WARN_ON_ONCE(1);
 40   } else {
 41     preempt_disable();
 42     if (cpu_online(cpu))
 43       smp_call_function_single(cpu, rcu_idle_demigrate,
 44              NULL, 0);
 45     preempt_enable();
 46   }
 47 }
 48 
 49 static void rcu_prepare_for_idle_init(int cpu)
 50 {
 51   per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;
 52   setup_timer(&per_cpu(rcu_idle_gp_timer, cpu),
 53         rcu_idle_gp_timer_func, cpu);
 54   per_cpu(rcu_idle_gp_timer_expires, cpu) = jiffies - 1;
 55   per_cpu(rcu_idle_first_pass, cpu) = 1;
 56 }
 57 
 58 static void rcu_cleanup_after_idle(int cpu)
 59 {
 60   del_timer(&per_cpu(rcu_idle_gp_timer, cpu));
 61   trace_rcu_prep_idle("Cleanup after idle");
 62 }
 63 
 64 static void rcu_idle_count_callbacks_posted(void)
 65 {
 66   __this_cpu_add(rcu_nonlazy_posted, 1);
 67 }

The rcu_needs_cpu() function is shown on lines 1-7. Line 3 sets the per-CPU rcu_idle_first_pass variable so that the next invocation of rcu_prepare_for_idle() knows that it should restart its state machine. Line 4 checks to see if this CPU has callbacks, and, if not, line 5 returns indicating that RCU is happy for this CPU to enter dyntick-idle mode. Otherwise, line 6 checks to see if RCU recently tried and failed to force this CPU into dyntick-idle mode. If not, RCU is again happy for this CPU to enter dyntick-idle mode, otherwise, it prevents this CPU from entering dyntick-idle mode.

The __rcu_cpu_has_nonlazy_callbacks() helper function on lines 9-12 compares the rcu_data structure's ->qlen and ->qlen_lazy to determine if there are any non-lazy callbacks. The rcu_preempt_cpu_has_nonlazy_callbacks() function on lines 14-19 and the rcu_cpu_has_nonlazy_callbacks() function on lines 21-26 use the __rcu_cpu_has_nonlazy_callbacks() helper function to determine if there are any RCU-preempt non-lazy callbacks and any RCU non-lazy callbacks of any flavor, respectively. In kernels configured with CONFIG_PREEMPT=n, thus lacking RCU-preempt, the rcu_preempt_cpu_has_nonlazy_callbacks() function instead unconditionally returns zero.

The rcu_idle_demigrate() function on lines 28-31 is an smp_call_function_single() handler whose only purpose is to wake up the targeted CPU. It therefore only does event tracing.

The rcu_idle_gp_timer_func() function spans lines 33-47. Oddly enough, this timer handler is normally never invoked because the timer is always canceled first; the only purpose of this timer is to wake up the CPU, which does not actually require this handler function to be invoked. However, one abnormal possibility is that the timer will migrate to some other CPU before expiring. Therefore, after line 37 does event tracing, line 38 checks to see if the timer handler is running on the intended CPU, and if so, line 39 complains bitterly. Otherwise, lines 41-45 attempt to wake up the intended CPU. However, this is legal only if the intended CPU is still online, so line 41 disables preemption (and line 45 re-enables it), and only if line 42 determines that the intended CPU is online do lines 43-44 use smp_call_function_single() to wake up the intended CPU.

The rcu_prepare_for_idle_init() function, which initializes the per-CPU timer, is shown on lines 49-56. Line 51 initializes the CPU's rcu_dyntick_holdoff variable to the non-holdoff state. Lines 52 and 53 initialize the struct timer_list itself. Line 54 sets the CPU's rcu_idle_gp_timer_expires variable so that the timer expires in the past, and line 55 initializes the CPU's rcu_idle_first_pass variable so that the next call to rcu_prepare_for_idle() will initialize its state machine.

The rcu_cleanup_after_idle() function shown on lines 58-62 cancels the timer and does event tracing. The fact that this function is invoked upon every exit from idle is what prevents the handler function from being invoked in the common case of no timer migration.

Finally, the rcu_idle_count_callbacks_posted() shown on lines 64-67 increments the running count of the number of callbacks (or groups of callbacks in the case of adoption from an offline CPU) so that rcu_prepare_for_idle() can determine when an otherwise innocuous momentary exit from idle requires special handling. Note that the caller must have at least disabled preemption, otherwise the manipulation of the per-CPU variable is unsafe.

With this background, we are ready to look at rcu_prepare_for_idle():

  1 static void rcu_prepare_for_idle(int cpu)
  2 {
  3   struct timer_list *tp;
  4 
  5   if (!per_cpu(rcu_idle_first_pass, cpu) &&
  6       (per_cpu(rcu_nonlazy_posted, cpu) ==
  7        per_cpu(rcu_nonlazy_posted_snap, cpu))) {
  8     if (rcu_cpu_has_callbacks(cpu)) {
  9       tp = &per_cpu(rcu_idle_gp_timer, cpu);
 10       mod_timer_pinned(tp, per_cpu(rcu_idle_gp_timer_expires, cpu));
 11     }
 12     return;
 13   }
 14   per_cpu(rcu_idle_first_pass, cpu) = 0;
 15   per_cpu(rcu_nonlazy_posted_snap, cpu) =
 16     per_cpu(rcu_nonlazy_posted, cpu) - 1;
 17   if (!rcu_cpu_has_callbacks(cpu)) {
 18     per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;
 19     per_cpu(rcu_dyntick_drain, cpu) = 0;
 20     trace_rcu_prep_idle("No callbacks");
 21     return;
 22   }
 23   if (per_cpu(rcu_dyntick_holdoff, cpu) == jiffies) {
 24     trace_rcu_prep_idle("In holdoff");
 25     return;
 26   }
 27   if (per_cpu(rcu_dyntick_drain, cpu) <= 0) {
 28     per_cpu(rcu_dyntick_drain, cpu) = RCU_IDLE_FLUSHES;
 29   } else if (per_cpu(rcu_dyntick_drain, cpu) <= RCU_IDLE_OPT_FLUSHES &&
 30        !rcu_pending(cpu) &&
 31        !local_softirq_pending()) {
 32     trace_rcu_prep_idle("Dyntick with callbacks");
 33     per_cpu(rcu_dyntick_drain, cpu) = 0;
 34     per_cpu(rcu_dyntick_holdoff, cpu) = jiffies;
 35     if (rcu_cpu_has_nonlazy_callbacks(cpu))
 36       per_cpu(rcu_idle_gp_timer_expires, cpu) =
 37              jiffies + RCU_IDLE_GP_DELAY;
 38     else
 39       per_cpu(rcu_idle_gp_timer_expires, cpu) =
 40              jiffies + RCU_IDLE_LAZY_GP_DELAY;
 41     tp = &per_cpu(rcu_idle_gp_timer, cpu);
 42     mod_timer_pinned(tp, per_cpu(rcu_idle_gp_timer_expires, cpu));
 43     per_cpu(rcu_nonlazy_posted_snap, cpu) =
 44       per_cpu(rcu_nonlazy_posted, cpu);
 45     return;
 46   } else if (--per_cpu(rcu_dyntick_drain, cpu) <= 0) {
 47     per_cpu(rcu_dyntick_holdoff, cpu) = jiffies;
 48     trace_rcu_prep_idle("Begin holdoff");
 49     invoke_rcu_core();
 50     return;
 51   }
 52 #ifdef CONFIG_TREE_PREEMPT_RCU
 53   if (per_cpu(rcu_preempt_data, cpu).nxtlist) {
 54     rcu_preempt_qs(cpu);
 55     force_quiescent_state(&rcu_preempt_state, 0);
 56   }
 57 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 58   if (per_cpu(rcu_sched_data, cpu).nxtlist) {
 59     rcu_sched_qs(cpu);
 60     force_quiescent_state(&rcu_sched_state, 0);
 61   }
 62   if (per_cpu(rcu_bh_data, cpu).nxtlist) {
 63     rcu_bh_qs(cpu);
 64     force_quiescent_state(&rcu_bh_state, 0);
 65   }
 66   if (rcu_cpu_has_callbacks(cpu)) {
 67     trace_rcu_prep_idle("More callbacks");
 68     invoke_rcu_core();
 69   } else
 70     trace_rcu_prep_idle("Callbacks drained");
 71 }

Lines 5-13 handle returning from a momentary exit from idle (such as those caused by RCU_NONIDLE() and power-management events). Recall that a momentary exit from idle will not involve a call to rcu_needs_cpu(), so that the CPU's rcu_idle_first_pass variable will be zero. Line 5 verifies that this CPU is not returning from a major exit from idle and lines 6 and 7 verify that no non-lazy callbacks have been posted on this CPU during the momentary exit from idle. If both these conditions hold, then lines 8-12 handle the return to idle after a momentary exit. In that case, line 8 checks to see if this CPU has callbacks, and if so lines 9 and 10 repost the rcu_idle_gp_timer. Whether or not this CPU has callbacks, line 12 returns to the caller, who will re-enter dyntick-idle mode.

If execution reaches line 14, rcu_prepare_for_idle() must decide afresh whether or not this CPU can be permitted to enter dyntick-idle mode. Line 14 clears this CPU's rcu_idle_first_pass variable, and lines 15 and 16 force a mismatch between the CPU's rcu_nonlazy_posted_snap and rcu_nonlazy_posted variables. Therefore, subsequent calls to rcu_prepare_for_idle() will be treated as part of this same attempt to drive this CPU into dyntick-idle mode: Control will not enter the momentary-idle code on lines 8-12.

Lines 17-21 are executed if the CPU has no RCU callbacks, in which case line 18 cancels any holdoff that might have been in effect, line 19 resets the state machine so that it will go through a full set of loops next time, line 20 does tracing, and line 21 returns. Note that the CPU's rcu_nonlazy_posted_snap and rcu_nonlazy_posted variables are still mismatched, so that the momentary-idle code will not be invoked on subsequent calls to rcu_prepare_for_idle(). Instead, if there are still no callbacks after a momentary exit from idle, lines 17-21 will again take care of slipping the CPU back into dyntick-idle mode. If callbacks have appeared, then rcu_prepare_for_idle() will re-evaluate the situation.

Lines 23-26 handle holdoff, where a recent set of rcu_prepare_for_idle() invocations were unable to force this CPU into dyntick-idle mode. If line 23 detects this condition, signified by the per-CPU rcu_dyntick_holdoff variable being equal to the current jiffies counter, line 24 does tracing, and line 25 returns.

Line 27 checks to see if this is the first pass through the loop, and, if so, line 28 initializes the loop counter, namely, the per-CPU rcu_dyntick_drain variable.

Otherwise, lines 29-31 check to see whether we have gone through the loop enough times, whether this CPU has done everything that RCU requires of it for the immediate future, and whether there are no softirqs pending on this CPU. If so, line 32 does tracing, line 33 clears the loop counter to indicate exit from the loop, and line 34 sets the holdoff period to avoid wasting time before other CPUs have had a chance to progress the RCU grace period. Line 35 checks to see if the CPU has non-lazy RCU callbacks posted, and if so lines 36 and 37 set the timer expiry a short time into the future (about one RCU grace period's worth), otherwise lines 39 and 40 set the timer expiry a long time into the future (many seconds). Line 42 posts the timer, lines 43 and 44 snapshot the CPU's rcu_nonlazy_posted variable, permitting returns to idle after momentary exits from idle to be handled by lines 8-12, and line 45 returns.

Quick Quiz 12: Why post the timer? Won't that force the CPU to wake up in the near future??? I thought that the whole point of this code was to prevent needless wakeups!!!
Answer

Otherwise, line 46 checks to see if we have exceeded the loop count. If so, line 47 sets the per-CPU rcu_dyntick_holdoff variable to jiffies to prevent attempting to force this CPU into dyntick-idle mode until the start of the next jiffy, line 48 does tracing, line 49 invokes the RCU core code for the sole purpose of forcing this CPU out of dyntick-idle mode, and line 50 returns.

Quick Quiz 13: But what if there is momentary exit from idle just after the jiffies counter advances? Won't that leave the scheduler-clock tick enabled, even if we manage to clear out all the callbacks?
Answer

Otherwise, there is more looping to be done. Line 53 checks to see if there are RCU-preempt callbacks queued on this CPU, and if so, line 54 notes an RCU-preempt quiescent state and line 55 invokes force_quiescent_state() in order to cause the grace-period machinery to properly ignore any dyntick-idle CPUs, which might result in the current RCU-preempt grace period completing, so that subsequent code can invoke any ready-to-invoke RCU-preempt callbacks that are queued on this CPU. Lines 58-61 do the same thing, but for RCU-sched, while lines 62-65 handle RCU-bh.

Line 66 checks to see if any of the RCU flavors have callbacks queued on this CPU, and, if so, line 67 does tracing, and line 68 causes an invocation of RCU core processing to be scheduled, which will invoke any RCU callbacks that are now ready. Otherwise, line 70 does tracing.

The first pass through the loop consists of a direct call to rcu_prepare_for_idle(). Subsequent passes through the loop consist of rcu_prepare_for_idle() scheduling an RCU_SOFTIRQ via invoke_rcu_core(), which invokes rcu_process_callbacks(). If there is nothing else for this CPU to do at the completion of the softirq handler, then the CPU will again attempt to enter dyntick-idle mode, which in turn invokes rcu_prepare_for_idle(), restarting the cycle.

Quick Quiz 14: Why bother turning on the scheduling-clock interrupt at all if RCU doesn't need it? Why not instead turn it on only if an RCU callback is registered from call_rcu()?
Answer

Summary

This article has covered RCU's dyntick-idle interface, which tracks each CPU's dyntick-idle state and controls which CPUs may enter dyntick-idle state. In kernels built with CONFIG_RCU_FAST_NO_HZ, RCU also tries to help CPUs get prepared to enter dyntick-idle state.

Acknowledgments

I owe thanks to Frederic Weisbecker, Cyrill Gorcunov, Cheng Xu, and Peter Zijlstra for their review and thoughtful comments. I owe Neven M. Abou Gazala, Heiko Carstens, and Pascal Chapperon many thanks for their tireless testing of CONFIG_RCU_FAST_NO_HZ.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Why does receiving an interrupt invalidate RCU's grant of the rcu_needs_cpu() request?

Answer: Because the interrupt handler might invoke call_rcu() or one of its friends, which would queue an RCU callback, which would cause RCU to once again need that CPU to stay out of dyntick-idle mode in order to handle that new callback.

In contrast, NMI handlers are forbidden from invoking call_rcu() and friends, so NMIs do not invalidate RCU's grant of the rcu_needs_cpu() request. Which is a good thing, given that NMIs by definition cannot be masked.

Back to Quick Quiz 1.

Quick Quiz 2: Why bother with the “large number N”??? Wouldn't “N=1” be a lot simpler?

Answer: No. The otherwise reasonable choice of “N=1” fails miserably on those architectures that can enter an interrupt handler without ever leaving it and vice versa. This strange trick is used to emulate a system call from kernel space, among other things.

So RCU picks a large number for N, namely DYNTICK_TASK_NEST_VALUE, which is defined to be roughly 1/128 of the largest integer that can be represented in a variable of type “long long”. When a CPU enters the idle loop, it cannot possibly be in an interrupt handler, nor can it be running in the context of a non-idle task. We can therefore simply zero the entire counter when entering idle, erasing the effects of any miscounting that might have occurred due to entering interrupt handlers and not leaving them (and vice versa).

In theory, overflow could occur, but in practice, even if a given CPU encountered a nesting-level miscount every microsecond, it would still take tens of thousands of years for the counter to overflow, thanks to the magic of 64-bit variables.

And yes, misnesting did represent a potential RCU bug in many of the kernel versions prior to 3.3, at least on those architectures that indulged in interrupt misnesting.

And the reason that the DYNTICK_TASK_NEST_MASK field consumes almost the entire upper byte is due to the fact that some code is invoked from both processes and the idle loop. When this code is invoked from the idle loop, the RCU_NONIDLE() macro is used to wrapper that code with rcu_idle_exit() and rcu_idle_enter() calls. This wrappering momentarily brings the CPU out of RCU-idle mode in order to allow use of RCU read-side primitives. Invocations of RCU_NONIDLE() might well be nested, and this field is used as a nesting count.

Back to Quick Quiz 2.

Quick Quiz 3: Why can't a single counter serve both interrupts and NMIs?

Answer: Because it is not enough merely to count the interrupts and NMIs. RCU's dyntick-idle subsystem must also update dyntick-idle state upon transitions of the count to and from zero. The counting and the handling cannot be done as a single atomic operation, so use of a single counter would cause possibly-fatal confusion if an NMI is received between the time the counter is manipulated and the dyntick-idle state manipulation.

Maintaining separate nesting counters for interrupts and NMIs allows RCU to avoid such confusion.

Back to Quick Quiz 3.

Quick Quiz 4: This sounds quite a bit different than the algorithm previously proven correct. If you have a proven-correct algorithm, why would you ever use something else? Especially when this stupid new algorithm requires taking action based on the sum of two counters, which cannot be done atomically!!! Whatever were you thinking???

Answer: First, let's review a funny thing about proofs: They rely on assumptions. And the validity of assumptions can vary over time.

In this case, the key (unstated) assumption is that CPUs will flush their write buffers reasonably quickly compared to the duration of an RCU grace period. Unfortunately, most system vendors won't tell you the maximum time that a CPU will take to flush its write buffer, but on the other hand, you can get a reasonably good idea by taking appropriate measurements. And when I measured the write-buffer delay with a reasonably nasty workload on an eight-socket system, I got a result on the order of a few microseconds, which provides a nice engineering safety factor when compared to the multi-millisecond grace-period delays, even allowing for the much larger systems that can run Linux.

Enter the expedited grace period, which can complete in tens of microseconds instead of multiple milliseconds. Exit the engineering safety factor, exit the validity of the assumption, exit the validity of the proof, exit the correctness of the older algorithms. Hence the new algorithm.

Important safety tip #1: Proofs of correctness are only as valid as their most-shaky assumption.

Important safety tip #2: Never forget the immortal words of Donald Knuth: “Beware of bugs in the above code; I have only proved it correct, not tried it.” If proofs of correctness in the absence of testing were not good enough for Donald Knuth, perhaps they should not be good enough for you, either.

That said, proofs of correctness can be very valuable tools. But, as with any other tool, you need to keep limitations in mind. Just as you cannot expect to test all the bugs out of your software, you also cannot expect to prove your assumptions correct, especially your unstated assumptions.

Second, it turns out to be trivial to atomically evaluate the sum of the two counters in this particular special case because:

  1. The ->dynticks_nmi_nesting counter will always be zero unless you are executing in an NMI handler. Non-NMI code can therefore just assume that this counter is zero.
  2. The ->dynticks_nesting counter cannot change while you are executing in an NMI handler. More generally, non-NMI software cannot do anything at all while on a CPU that is running an NMI handler.
  3. Interrupts are disabled while sampling and updating the ->dynticks_nesting counter, so it cannot change while it is being accessed.
  4. If the ->dynticks counter has an odd value, then RCU is already paying attention to this CPU, and the NMI handlers therefore don't need to do anything.

Therefore, non-NMI code can simply check the value of the ->dynticks_nesting counter with interrupts disabled and take action (while interrupts remain disabled) based on the value. NMI code must check the ->dynticks counter and avoid changing any state if it has an odd-numbered value. However, if the ->dynticks counter has an even-numbered value, the NMI code can ignore the non-NMI state.

Back to Quick Quiz 4.

Quick Quiz 5: Why are we covering dyntick-idle exit first? Don't you usually have to enter something before you can exit it?

Answer: Are you exiting dyntick-idle mode? Or are you really entering non-dyntick-idle mode?

Given that these are just different words for the same thing, we clearly cannot rely on the words “entry” and “exit” to determine the order these topics should be presented in. Instead, we should look at the code. Given that it is dyntick-idle exit that increments the nesting-level counters and dyntick-idle entry that decrements them, dyntick-idle exit should be presented first.

And just for the record, even though it does not apply in this case, one way to exit from something before entering it is to start out in that something.

But if you still don't like the order, by all means feel free to rearrange this article's HTML to suit your preferences. ;–)

Back to Quick Quiz 5.

Quick Quiz 6: Is there anything special that a caller must do if it invokes rcu_idle_exit() with interrupts enabled?

Answer: Not in this implementation, but that is subject to change at any time.

Back to Quick Quiz 6.

Quick Quiz 7: Why not trace the start and end of rcu_idle_exit()? This would allow determining the overhead of this function.

Answer: Because tracing uses RCU read-side critical sections, which are being ignored at the beginning of rcu_idle_exit(). It is therefore illegal for rcu_idle_exit() to use tracing before the atomic_inc() of ->dynticks in rcu_idle_exit_common(). If you try putting tracepoints at the beginning of this function in recent kernels, lockdep-RCU will yell at you.

Back to Quick Quiz 7.

Quick Quiz 8: But NMIs cannot nest!!! So why in the world are you allowing for NMI nesting???

Answer: It is true that there is quite a bit of Linux code that assumes that NMIs cannot nest. However, I have seen systems where NMIs could nest. These systems predated Linux and were completely and utterly incapable of running even the most cut-down version of Linux imaginable, but they really did exist. And perhaps in the future there will be good reason for nested NMIs. Given that the nesting count is absolutely required for tracing and debugging purposes, it hurts nothing to make RCU handle this case, and it might come in very handy in the future.

If this is still unconvincing, I suggest you try rewriting rcu_nmi_enter() and rcu_nmi_exit() under the assumption that NMIs do not nest. Once you have a correctly implemented version, look carefully at the differences between your version and the one shown in this article.

Back to Quick Quiz 8.

Quick Quiz 9: Suppose that an NMI occurs after the read from ->dynticks implied by the atomic_inc() on line 4 of rcu_idle_exit_common(), but before the corresponding write? Can't that result in the value of ->dynticks going backwards? Is this possibility correctly handled by the code?

Answer: This scenario cannot happen because the increment is atomic.

Back to Quick Quiz 9.

Quick Quiz 10: Is there anything special that a caller must do if it invokes rcu_idle_enter() with interrupts enabled?

Answer: The caller must ensure that if the CPU is interrupted between the call to rcu_needs_cpu() and the call to rcu_idle_enter(), RCU will be given another chance to keep the CPU out of dyntick-idle mode. The reason for this rule is that if interrupts were enabled upon entry into rcu_idle_enter(), then an interrupt might have been taken just before rcu_idle_enter() disabled interrupts. The corresponding interrupt handler might have invoked call_rcu(), which would have enqueued an RCU callback. If this CPU were to stay in dyntick-idle mode indefinitely, that callback would never be invoked, which might result in all sorts of problems, up to and including system hangs.

However, this is currently not a problem because all callers do disable interrupts. As noted earlier, this is likely to change.

Back to Quick Quiz 10.

Quick Quiz 11: Why limit the number of passes through the loop rather than keeping going until the grace period completes and all callbacks are invoked? After all, the CPU is idle anyway.

Answer: Because it is legal for an RCU callback to queue another RCU callback; in fact, it is legal for an RCU callback to requeue itself. Therefore, if we don't limit the looping, we might loop infinitely. And it is better for the CPU to sleep a jiffy at a time than to not sleep at all.

Back to Quick Quiz 11.

Quick Quiz 12: Why post the timer? Won't that force the CPU to wake up in the near future??? I thought that the whole point of this code was to prevent needless wakeups!!!

Answer: This CPU still has callbacks, and failing to invoke those callbacks in a timely manner could result in a system hang. And if you are so concerned about power consumption that you are willing to hang your system, why not just power it off completely? That would certainly minimize its power consumption!

Back to Quick Quiz 12.

Quick Quiz 13: But what if there is momentary exit from idle just after the jiffies counter advances? Won't that leave the scheduler-clock tick enabled, even if we manage to clear out all the callbacks?

Answer: Clearing out the callbacks will require the subsequent code to run, which will force the RCU core to run, which will force the CPU into non-idle state, which will result in rcu_needs_cpu() being invoked, which will give the scheduler-clock tick another chance to be disabled.

Back to Quick Quiz 13.

Quick Quiz 14: Why bother turning on the scheduling-clock interrupt at all if RCU doesn't need it? Why not instead turn it on only if an RCU callback is registered from call_rcu()?

Answer: Other things besides RCU need the scheduling-clock interrupt, so we cannot necessarily avoid turning it on just because RCU does not currently need it.

Back to Quick Quiz 14.