A day at Intel

At the Research@Intel day last week, Intel had a huge array of technologies and active research initiatives on display for press and analysts. As I toured the company's Santa Clara offices, I was able to piece together a few major themes and directions by stepping back and looking at the places where Intel is currently focusing its forward-looking research. In my next few articles, starting with this one, I'll take an in-depth look at each of these themes and at what it tells us about where computing is headed in the next decade.

I'm going to start this series with a discussion of power, because—as boring as it is to the performance-oriented enthusiast crowd—plain old wattage is the metric that constrains all others in the realm of computing for the foreseeable future. So much of what Intel showed at their research day had power as either a primary or a secondary theme that it's worth taking some time to look at a few of their power-related research projects. I also had some on-background conversations with current and former employees of companies with names you'd recognize, and those chats also had power consumption as a major theme. In short, everyone that I talked to in Silicon Valley is thinking about power in some form or another: how to generate it, how to save it, and how to wring the most out of every watt. So let's talk power.

The basics of dynamic power optimization

I said in my recent Griffin/Puma article that the key to power optimization lies in turning off components when they're not in use. But turning off idle components isn't enough: you also have to be able to detect when part of a device is heating up for some reason, so that you can shut that part down and move its work elsewhere. So to flesh out our abstract picture of power optimization, we can say that its essence lies in dynamically adapting hardware to fit two primary constraints:

- Dynamic workload: power usage must match the ever-changing demands of whatever task the device is currently carrying out.
- Fixed thermal envelope: the temperature of the device must not exceed a certain ceiling.

So the art of dynamic power optimization is about tuning the voltage and/or current going into a device in real time in order to meet those two constraints. But enough with the generalities for now. Let's take a look at some of Intel's research efforts to see these principles in action. I'll also draw some more general conclusions about power optimization and computing technology at the close of this article.
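To make the two constraints concrete, here's a minimal sketch of what such a control loop might look like in pseudocode-style Python. The operating points, thresholds, and function names are all my own illustrative assumptions, not anything Intel described:

```python
# Hypothetical sketch of a dynamic power-management loop: each control
# tick, the device's voltage/frequency operating point is adjusted to
# satisfy the two constraints above. All names and thresholds here are
# illustrative assumptions, not Intel's.

OPERATING_POINTS = [  # (voltage in volts, frequency in GHz), lowest first
    (0.8, 1.0),
    (1.0, 2.0),
    (1.2, 3.0),
]
THERMAL_CEILING_C = 85.0  # the fixed thermal envelope


def select_operating_point(utilization, temperature_c, current_index):
    """Pick the next voltage/frequency step for one control tick."""
    # Constraint 2: never exceed the thermal ceiling -- step down if hot.
    if temperature_c >= THERMAL_CEILING_C:
        return max(current_index - 1, 0)
    # Constraint 1: track the workload -- step up when busy, down when idle.
    if utilization > 0.9:
        return min(current_index + 1, len(OPERATING_POINTS) - 1)
    if utilization < 0.3:
        return max(current_index - 1, 0)
    return current_index
```

The point of the sketch is simply that the thermal check takes priority over the workload check: no matter how busy the device is, the ceiling wins.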

Processor-level power optimization

As I've reported before, Intel's Terascale program is a network on a chip (NoC) that consists of multiple different types of processing elements arranged in a grid configuration.* The silicon die containing this grid of elements—or "tiles," as Intel prefers to call them—is stacked directly on top of a chunk of RAM, and connections go vertically downwards from the router embedded in each tile into the RAM array so that the tiles can dip into this large pool of shared memory beneath them.

* Note: Once again, I am compelled by inaccurate tech press coverage of Intel's research day to point out to the press, analysts, and anyone else who'll listen: Larrabee is not "the first Terascale product" or any such nonsense. Larrabee is a many-core product family that will first see the light of day in GPU form. Terascale is a multifaceted research initiative, and though there's a chance that one or two technologies developed under its auspices will make it into the first Larrabee part, it is not equivalent to Larrabee. So please make it stop, and please see this article for more on the relationship between the two.

Perhaps the most important way that the Intel Terascale prototype, called "Polaris," controls power across the chip is by turning each tile on or off depending on its usage level and temperature level. So if a core is idle, then it can be put into one of two low-power sleep states; likewise, if one of the many thermal sensors scattered throughout the Polaris die reports that the area of the chip containing that core is too hot, that core is turned off.

If a core is shut down due to overheating, Polaris will try to transfer its work to one of the other cores in a cooler area of the chip. This way, work can be dynamically moved around the chip from tile to tile in order to adapt to changing thermal conditions.

One such thermal condition that may need adapting to is a hotspot in the large RAM cache beneath the Polaris die. Such a hotspot could arise from multiple, repeated accesses to a particular part of the memory array due to locality of reference. Polaris could shut down the tiles over this hotspot and move their work elsewhere until things cool down.

Note that cores can turn themselves and each other off, so Polaris doesn't need one master core to control the thermal management across the chip. If the cores have access to the thermal sensors, then they can work with each other to move the workload around in response to changing thermal conditions.
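The scheme described above, in which peer tiles cooperate without a master core, can be sketched roughly as follows. Everything here (class names, the threshold, the single-task-per-tile simplification) is my own illustration, not Polaris internals:

```python
# Illustrative sketch of peer-to-peer thermal management on a tiled chip:
# each tile reads its local sensor and, if too hot, hands its work to the
# coolest idle tile before powering down. Names and thresholds are
# assumptions for illustration, not Polaris's actual design.

HOT_THRESHOLD_C = 80.0


class Tile:
    def __init__(self, tile_id, temperature_c, task=None):
        self.tile_id = tile_id
        self.temperature_c = temperature_c
        self.task = task        # the work this tile is running, if any
        self.asleep = False     # whether the tile has been powered down

    def too_hot(self):
        return self.temperature_c >= HOT_THRESHOLD_C


def rebalance(tiles):
    """One pass of distributed thermal rebalancing (no master core)."""
    for tile in tiles:
        if tile.task is not None and tile.too_hot():
            # Find the coolest idle, cool-enough tile to take over.
            candidates = [t for t in tiles
                          if t.task is None and not t.too_hot()]
            if candidates:
                target = min(candidates, key=lambda t: t.temperature_c)
                target.task, tile.task = tile.task, None
                tile.asleep = True  # hot tile powers down to cool off
```

Because every tile runs the same logic against shared sensor data, no single core has to coordinate the whole chip, which is exactly the property the paragraph above describes.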

Datacenter-level power optimization

This notion of workloads moving to match changing thermal conditions is also being enacted at the datacenter level, with a suite of large-scale power-management technologies that Intel is working on in conjunction with a long list of partners (IBM and Microsoft among them). In a nutshell, the idea behind Intel's Group-Enabled Management System (GEMS) is much the same as what I described above for Polaris, except at the server level.

GEMS servers are able to communicate with each other in order to move workloads around to units that are either underutilized or overheated. For instance, if an air conditioning unit goes out in one part of the datacenter, the servers in that area can use virtualization to pass their workloads on to servers in another location before switching themselves off.

Individual servers that are running the GEMS agent can organize themselves into functional groups and elect a group leader that does thermal monitoring and power optimization for the entire group.

A key hardware part of this picture is a special instrumented power supply, which reports exactly how much power the system is drawing at any given moment. This way, the power optimization software (and the datacenter administrators) can get precise, real-time data on power usage for every server in the system, enabling them to adjust server workloads and actual power usage dynamically over the course of the day.
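Putting the last few paragraphs together, a group leader's rebalancing pass might look something like the sketch below. The class, field names, and the spare-budget heuristic are all my own assumptions for illustration; GEMS's actual interfaces weren't disclosed:

```python
# Hedged sketch of the group-level scheme described above: each server
# reports the wattage its instrumented power supply measures, and the
# elected group leader migrates workloads off overloaded machines onto
# underutilized peers. All names here are illustrative, not GEMS's API.

class Server:
    def __init__(self, name, watts, capacity_watts, workloads):
        self.name = name
        self.watts = watts                    # live reading from the PSU
        self.capacity_watts = capacity_watts  # safe power budget
        self.workloads = workloads            # names of hosted VMs

    def overloaded(self):
        return self.watts > self.capacity_watts


def rebalance_group(servers):
    """Leader's pass: migrate one VM off each overloaded server."""
    migrations = []
    for src in servers:
        if src.overloaded() and src.workloads:
            # Pick the healthy peer with the most spare power budget.
            dst = max(
                (s for s in servers if s is not src and not s.overloaded()),
                key=lambda s: s.capacity_watts - s.watts,
                default=None,
            )
            if dst is not None:
                vm = src.workloads.pop()
                dst.workloads.append(vm)
                migrations.append((vm, src.name, dst.name))
    return migrations
```

The instrumented power supply is what makes `watts` trustworthy in the first place; without a real measurement, the leader would be scheduling against guesses.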