Favorite Processor Paper from ISSCC 2014

My favorite paper from the ISSCC processor session (5.6) describes an adaptive clocking technique implemented in AMD’s 28nm Steamroller core that compensates for power supply noise. Most papers in the processor sessions are overviews that emphasize broad feature sets and the scope and scale of the project. In contrast, paper 5.6 was tightly focused on a specific problem (i.e., power supply noise) and clearly articulated a solution that was implemented in the Steamroller core.

Power Supply Noise

The power delivery network (PDN) in a modern processor starts with the 12V DC rail from the power supply and crosses the motherboard to a voltage regulator, where it is converted down to roughly 1V. The power is then delivered across the motherboard to the processor socket, through the processor package, and finally arrives at the processor die.

In an ideal world, the voltage that is delivered to the processor is constant. In reality, dynamic conditions cause significant fluctuations in the voltage. For example, when a large number of gates begin to switch (e.g., a core goes from power gated to active or a 256-bit vector unit goes from idle to computing a multiply-add), the current draw increases sharply. This is referred to as a dI/dt event, and it will cause an immediate droop in voltage across the chip (referred to as the first droop). More insidiously, dI/dt events interact with the PDN (e.g., power rails in the processor package and motherboard) to create additional delayed voltage droops (e.g., a second and third droop respectively).

Conceptually, a voltage droop causes the equivalent of a brownout on a portion of the chip. For example, if a chip is clocked at 3GHz and requires Vmin = 900mV to safely achieve this frequency, then a voltage droop that reduces the voltage supplied to the transistors below Vmin is likely to cause undetectable errors. In most chips, the worst case design point is a scenario where a first and second droop occur simultaneously, which can drop the voltage at the transistors by 10-15%.
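The Vmin arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 900mV Vmin and the 10-15% droop range come from the text, while the 1000mV nominal supply (taken from the "roughly 1V" figure earlier) and all function names are assumptions made for the example.

```python
def drooped_voltage_mv(nominal_mv: float, droop_fraction: float) -> float:
    """Voltage seen at the transistors during a droop of the given fraction."""
    return nominal_mv * (1.0 - droop_fraction)


def violates_vmin(nominal_mv: float, droop_fraction: float, vmin_mv: float) -> bool:
    """True if the drooped supply falls below Vmin (the brownout case)."""
    return drooped_voltage_mv(nominal_mv, droop_fraction) < vmin_mv


def required_supply_mv(vmin_mv: float, worst_droop_fraction: float) -> float:
    """Static supply needed so that the worst-case droop still lands at Vmin."""
    return vmin_mv / (1.0 - worst_droop_fraction)


if __name__ == "__main__":
    VMIN_MV = 900.0      # from the text
    NOMINAL_MV = 1000.0  # assumption: "roughly 1V"
    for droop in (0.05, 0.15):
        unsafe = violates_vmin(NOMINAL_MV, droop, VMIN_MV)
        print(f"droop {droop:.0%}: unsafe={unsafe}")
    print(f"supply needed for a 15% droop: {required_supply_mv(VMIN_MV, 0.15):.1f} mV")
```

Under these assumed numbers, a 15% droop pulls a 1000mV supply well below the 900mV Vmin, which is why some form of guardband or compensation is needed.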

To address these power supply problems, there are several techniques available to chip designers:

1. Reduce the static frequency based on post-manufacturing testing so that Vmin is not violated during voltage droop; this reduces performance by ~10-15% for a ~10-15% droop. This is the standard technique used by many companies during the frequency binning process.
2. Increase the static voltage supply to tolerate any possible droops; this wastes considerable power (~20-32%), but avoids losing performance.
3. Add decoupling capacitance to reduce power supply noise; this adds cost to the processor/package and only partially solves the problem. Moreover, area efficient on-die capacitors are very difficult and proprietary; IBM uses an expensive deep trench process and Intel has a customized MIM capacitor.
4. Dynamically increase the voltage during low activity periods to tolerate small voltage droops; this costs power during low activity, cannot address very big (i.e., large current and/or small time) dI/dt events, and may not work for dI/dt events during high activity periods.
5. Use a very fast voltage regulator (e.g., 140MHz Haswell IVR), which eliminates the second and third droop. IVRs are very difficult to implement, entirely proprietary, and don’t fix the first droop without large capacitors.
6. Microarchitectural throttling to reduce current draw (e.g., Itanium processors issue fewer instructions during dI/dt events and vector units often take many cycles to ‘warm up’); this reduces IPC and can cause instruction scheduling challenges.
7. Adaptive clocking which dynamically adjusts the cycle time (e.g., decreasing the frequency) to tolerate voltage droops, without increasing voltage.
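The ~20-32% power cost quoted for raising the static voltage is consistent with a simple back-of-the-envelope estimate. The sketch below assumes the standard CMOS approximation that dynamic power scales with V² × f; that model, the choice to raise the voltage by the same fraction as the droop, and the function names are assumptions of this example rather than anything stated in the paper, and leakage is ignored.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float = 1.0) -> float:
    """Dynamic power relative to baseline, assuming P is proportional to V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


def voltage_guardband_power_overhead(droop_fraction: float) -> float:
    """Extra dynamic power when the supply is raised by the droop fraction."""
    return relative_dynamic_power(1.0 + droop_fraction) - 1.0


if __name__ == "__main__":
    for droop in (0.10, 0.15):
        overhead = voltage_guardband_power_overhead(droop)
        print(f"{droop:.0%} voltage guardband -> {overhead:.1%} more dynamic power")
```

Under this approximation a 10% guardband costs about 21% more dynamic power and a 15% guardband about 32%, which lines up with the range given for the static voltage option.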

Steamroller Adaptive Clocking

The design team at AMD chose option 7, an adaptive clocking system, for Steamroller. The system has two key components: a detector that determines when the voltage droops, and an adaptive clocking component that reduces the clock frequency in response to a droop. Figure 1 shows the concepts behind AMD’s implementation in Steamroller.

One of the critical measures of an adaptive clocking system is the response latency. The faster the system can respond, the greater the reduction in voltage and therefore the greater the power savings. The adaptive clocking in Steamroller is designed for very low latency (e.g., 3 cycles) to compensate for the first and second droop.

The droop detector works by comparing the phase of the reference clock signal (which uses a clean analog power supply with no noise) with clock phases that are generated on the noisy digital power supply. When a droop occurs, the reduction in voltage will slow down the noisy clock and the phases on the noisy clock will fall behind the phases of the reference clock. Steamroller uses a DLL-based 40-phase detection circuit, which is configurable to detect when the voltage falls below a pre-defined threshold. The droop detection is very low latency, and takes between 1 and 3 clock cycles to trigger. Based on experiments with real workloads, the architects determined that the best threshold is 2.5% of the supply voltage. A smaller threshold (e.g., 1.25%) will trigger too many changes in clock frequency, but larger thresholds save less power.

Once the droop is detected and the magnitude is determined, a clock stretching circuit increases the clock period (i.e., decreasing frequency) to compensate. Normally, the PLL sets the clock period; for Steamroller’s adaptive clocking, the PLL output is divided into 40 phases using a DLL and extra phases are added, stretching out the clock period slightly. Once a droop is detected, the clock period can be adjusted in as little as two cycles. The AMD team ran a number of experiments and determined that for a 2.5% droop threshold, stretching the clock period by 7% struck the right balance between maintaining high frequencies and improving Vmin.
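The detect-then-stretch behavior described above can be illustrated with a small behavioral model. This is a rough sketch and not AMD’s circuit: the real detector compares clock phases rather than sampling voltage directly, and the class name, the per-cycle droop trace, and the fixed three-cycle latency are all assumptions made for the example. Only the 2.5% threshold and 7% stretch are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AdaptiveClockModel:
    """Hypothetical behavioral model of droop detection plus clock stretching."""
    nominal_period_ps: float
    droop_threshold: float = 0.025   # droop as a fraction of supply (from the text)
    stretch_fraction: float = 0.07   # clock period stretch (from the text)
    latency_cycles: int = 3          # assumed fixed detect-to-stretch latency

    def stretched_cycles(self, droop_trace: Sequence[float]) -> List[bool]:
        """Flag each cycle whose period is stretched.

        A cycle is stretched if the droop exceeded the threshold
        latency_cycles earlier, modeling the response delay.
        """
        detected = [droop > self.droop_threshold for droop in droop_trace]
        return [
            i >= self.latency_cycles and detected[i - self.latency_cycles]
            for i in range(len(droop_trace))
        ]

    def periods_ps(self, droop_trace: Sequence[float]) -> List[float]:
        """Clock period used in each cycle of the trace."""
        stretched_period = self.nominal_period_ps * (1.0 + self.stretch_fraction)
        return [
            stretched_period if stretched else self.nominal_period_ps
            for stretched in self.stretched_cycles(droop_trace)
        ]


if __name__ == "__main__":
    model = AdaptiveClockModel(nominal_period_ps=1e6 / 3400)  # 3.4GHz nominal
    # Made-up droop trace: fraction of supply lost in each cycle.
    trace = [0.0, 0.01, 0.03, 0.04, 0.03, 0.01, 0.0, 0.0, 0.0]
    for cycle, period in enumerate(model.periods_ps(trace)):
        print(f"cycle {cycle}: {period:.1f} ps")
```

In this model the cycles between the start of the droop and the first stretched cycle run at full speed on a sagging supply, which is the window that the voltage guardband still has to cover.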

Adaptive Clocking Results

The improvements in Vmin from adaptive clocking are greatest when the Steamroller core is operating at high frequency (up to 9% lower Vmin at 4GHz). High frequency circuits are much more sensitive to power supply noise and stand to benefit more as a result. Figure 2 shows the benefits of the adaptive clocking system on power consumption for a desktop productization of Steamroller, with savings as high as 19%.

The cost of implementing adaptive clocking is quite low, especially given the significant power benefits. The area overhead for adaptive clocking is minimal; the droop detection and clock stretching circuits are approximately 0.2mm², whereas a Steamroller module (i.e., two cores and a 2MB L2 cache) is 29.5mm². The additional clock stretching circuitry does increase jitter by 0.5-1%, but only when the clock stretching is engaged and the core is already operating at a reduced frequency, so there is no impact on peak frequency.

The adaptive clocking reduces the average frequency of the Steamroller module, since each droop will cause a small dip in frequency. The number of droops is a function of the underlying workload and system design (e.g., motherboard layout). AMD provided silicon measurements for a Steamroller module that is nominally operating at 3.4GHz and configured with a 2.5% droop threshold and 7% clock stretch. The workloads that were characterized were 3DMark, Cinebench, iTunes_acc, iTunes_mp3, POVRay, and WinRAR. Overall, only 0.2% of cycles were stretched with the 2.5% droop threshold; the floating point heavy workloads such as POVRay had the most droop events (average frequency 3392MHz), whereas WinRAR saw no frequency degradation at all. Generally, most real workloads lose less than 0.2% of cycles.
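To put the POVRay measurement in perspective, the frequency loss can be worked out directly from the two numbers quoted above (3.4GHz nominal, 3392MHz average). The function name is just an illustrative choice.

```python
def frequency_loss_fraction(nominal_mhz: float, average_mhz: float) -> float:
    """Fraction of nominal frequency lost to clock stretching."""
    return (nominal_mhz - average_mhz) / nominal_mhz


if __name__ == "__main__":
    loss = frequency_loss_fraction(3400.0, 3392.0)
    print(f"POVRay average frequency loss: {loss:.2%}")  # prints 0.24%
```

So even the workload with the most droop events gives up only about a quarter of a percent of its nominal frequency.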

Future Directions

AMD’s adaptive clocking system in Steamroller is quite attractive, offering a significant improvement in power at a minimal cost in terms of area and a negligible impact on performance. However, there are several potential avenues for improvement.

First of all, the latency of the droop detection and clock stretching could be reduced. Currently, there is a minimum 3 cycle lag before the system can begin to compensate. The droop detector is an asynchronous circuit, which creates a slight delay as the output must be synchronized before it is passed to the clock stretcher. This means that Vmin must have enough guardband to tolerate a few cycles (probably <10) of voltage droop. Reducing the response time of the clock stretching would reduce Vmin even further, resulting in greater power savings. Certain dI/dt events may be predicted in the pipeline. For example, the front-end could signal a hint when decoding 256-bit AVX instructions, indicating that there is likely to be a dI/dt event when those instructions are executed.

Second, this technique could be applied to AMD’s discrete and integrated GPUs, although it is hard to say how big the benefits would be for GPUs. The target clock frequency for a GPU is 1GHz rather than 3GHz and the clock domains are bigger and contain more cores. On the other hand, since GPUs are so parallel, dI/dt events may be much bigger (e.g., if all the shaders in a GPU simultaneously begin executing a floating point kernel). Even if the benefits are just half of what is possible in a CPU, a 5-10% decrease in power is significant for a 250W GPU.

Third, since adaptive clocking minimizes the impact of voltage droops, AMD could remove package decoupling capacitors or package layers to reduce the cost of the overall platform.

Fourth, the adaptive clocking could be used to improve the transition between different voltage/frequency combinations by reducing the latency.

Summary

Overall, AMD’s adaptive clocking paper was enlightening and enjoyable and stood out in the processor session. While it addresses a longstanding problem, the solution is a new and interesting approach to the challenges in power delivery.

The paper also demonstrated one of AMD’s key differentiators, expertise in power management and clocking, that is critical for any computing platform from mobile to servers. The techniques described will first appear in AMD’s Steamroller-based platforms, but are expected to roll out across other IP blocks potentially including GPUs, ARM cores, and the Jaguar core.