Intel’s Atom processor debuted five years ago as the first x86-compatible CPU from Intel tailored explicitly for low-power operation. At that time, the iPhone was less than a year old, and Asus had only recently introduced the first-generation Eee PC. Intel was talking about a new class of products, known as MIDs or “mobile Internet devices,” as the natural home for the Atom.

You know what happened next. Without robust touch interfaces, MIDs never took off. Instead, the netbook craze came and went, and tablets became an outright phenomenon. Smartphones grew in size, tablets shrank, and “phablets” bridged the gap between the two.

Something else happened along the way—or perhaps didn’t. Beyond netbooks, the Atom never really replicated its initial success in other consumer devices. Intel revved the various incarnations of the Atom, reducing power envelopes and physical footprints. It integrated ever more functionality into true SoC products like Moorestown, aiming for smartphones, and won some business along the way. But the Atom has captured only a handful of high-profile design wins among smartphones, and the few Windows 8 tablets based on the current Clover Trail platform have seen only modest adoption to date. Instead, the great majority of mobile computing devices for consumers are based on ARM’s CPU technology—or are compatible with it.

Another thing that didn’t happen is a change to the Atom’s microarchitecture. Intel wrung out some performance and power efficiency gains through integration, improved process tech, and higher clock speeds, but the CPU cores themselves remained largely the same.

That fact seems a little odd, since it’s been clear for several years that Intel views ARM as its biggest competitive threat. But the world’s biggest chipmaker hasn’t been idle. It has been pushing its highest profile Core processors into ever-lower power envelopes, and the hotly anticipated Haswell chip is expected to hit the market early next month with TDPs reaching down to 10W or less. Meanwhile, the firm’s Austin-based design team has been hard at work on a clean-sheet redesign of the Atom microarchitecture, code-named Silvermont. Today, we can reveal the first details about Silvermont, and they look very promising indeed. Intel is claiming that this new architecture, when combined with its 22-nm fabrication process, will enable chips that offer three times the performance of the prior-generation Saltwell Atom with “5X lower” power consumption.

Silvermont isn’t just a new architecture; it’s also the beginning of an accelerated update schedule for Intel’s low-power processors. Going forward, the Atom will be getting the same sort of “tick-tock” cadence that Intel has employed to great effect with its Core processors. As before, the Atom will be shrunk to a new process node roughly every other year. In between, the CPU architecture will be revised, as well. As you can see in the image above, “Airmont” will be a shrink of Silvermont to 14 nm. After that, we should see a revamped microarchitecture on this same fab process, although Intel isn’t ready to reveal its codename.

It goes without saying, perhaps, but the move to a tick-tock cadence in the low-power segment means Intel is dead serious about winning in this part of the market.

Before that new plan can take hold, Intel has to deliver Silvermont-based products. That’s slated to begin happening later this year, in several system-on-a-chip (SoC) configurations intended for different market segments.

The Bay Trail SoC will replace Clover Trail and offer more than double the compute performance for tablets. Bay Trail should also make its way into entry-level notebooks and desktops. It’s slated to arrive inside of new systems ahead of the holiday buying season. Merrifield, the phone chip, should start shipping to smartphone makers by year’s end, and products based on it should be announced in the first quarter of 2014. Avoton is targeted at micro-servers and is already sampling, with an official launch coming in the second half of 2013. Rangeley, the communications infrastructure part, will also launch in the year’s second half. Intel intends to address other parts of the embedded CPU space, such as automotive infotainment systems, with additional Silvermont-based platforms that have yet to be announced.

The Silvermont story

Despite driving its Core architecture into power envelopes of 10W and lower, Intel is making a big commitment to the separate development of a low-power architecture going forward, because the Atom can go places Core cannot: into power envelopes measured in hundreds of milliwatts, into smaller physical footprints, and into much lower-cost platforms. The list of SoCs being created with Silvermont tells that tale. Necessarily, then, this low-power architecture must accept a different set of compromises than Core, with a focus on operating at very low voltages using a more modest transistor budget.

Within the scope of these limitations, Silvermont’s architects have reached for a much higher performance target, especially for individual threads. The big news here is the move from the original Atom’s in-order execution scheme to out-of-order execution. Going out-of-order adds some complexity, but it allows for more efficient scheduling and execution of instructions. Most big, modern CPU cores employ OoO execution, and newer low-power cores like AMD’s Jaguar, ARM’s Cortex-A15, and Qualcomm’s Krait do, as well. Silvermont is joining the party. Belli Kuttanna, Intel Fellow and Silvermont chief architect, tells us the new architecture will achieve lower instruction latencies and higher throughput than the prior generation.

Interestingly, Silvermont tracks and executes only a single thread per core, doing away with symmetric multithreading (SMT)—or Hyper-Threading, in Intel’s lingo. SMT helped the prior generations of Atom achieve relatively strong performance for an in-order architecture, but the resource sharing between threads can reduce per-thread throughput. Kuttanna says SMT and out-of-order execution have a similar cost in terms of die area, so the switch from SMT to OoO was evidently a fairly straightforward tradeoff.

This decision makes a lot of sense in the context of Silvermont’s new fundamental building block, which is a dual-core “module” with a single, shared L2 cache. Intel talks of the two cores being “tightly coupled,” echoing the way AMD describes the dual-core module used by its Bulldozer architecture, but no logic is shared between the two cores—just the cache. An SoC can include up to four of these modules, for as many as eight cores on a single chip. With core counts like that possible, Silvermont-based systems ought to exploit thread-level parallelism sufficiently without the use of SMT.

Silvermont will have additional opportunities for parallelism thanks to its expanded ISA support, which brings Intel’s low-power architecture largely up to parity with Westmere-class desktop processors. That means support for the SSE4.1 and 4.2 extensions along with AES-NI encryption. AVX isn’t supported, which is no great surprise given the requirements and Atom’s mission in life. The architecture does include expanded virtualization support, including extended page tables and the rest of Intel’s VT-x2 suite, which could benefit those SoCs targeted at micro-servers. Another new feature is real-time instruction tracing, to aid with debugging.

The dual-core Silvermont module talks to the rest of the chip via a new system fabric architecture, which should offer higher transfer rates and easier integration than the internal front-side bus used in prior Atoms.

The new core

Although Silvermont is a brand-new, clean-sheet design, Kuttanna tells us it carries over certain key principles and concepts from the last Atom. Indeed, the new architecture sometimes seems like an evolutionary step. For instance, the core retains the same 32KB L1 instruction cache and 24KB L1 data cache sizes as before.

Another attribute carried over is what Intel calls the “macro-op execution pipeline.” Most x86 processors break up the CISC-style instructions of the x86 ISA into multiple, simpler internal operations, but Silvermont executes the vast majority of x86 instructions atomically, as single units. Certain really complex legacy x86 instructions are handled via microcode. Compared to older Atoms, such as the prior-gen Saltwell core, Silvermont microcodes substantially fewer x86 instructions, which should translate into higher performance when those instructions are in use. We’d expect Silvermont to tolerate the vast amounts of legacy code in consumer applications better than current Atoms do.

Kuttanna shared the above block diagram of the Silvermont core with us. We haven’t had time to map out the new architecture in any great detail, but we can pass along the highlights he identified.

In the front end, Silvermont can decode two x86 instructions per clock cycle, like its predecessor. However, the branch predictors are larger (and thus, presumably, more accurate), and they include an improved facility for the prediction of indirect branches. Also upgraded is the loop stream buffer, which detects loops that will repeat, buffers the decoded instruction sequence (up to 32 macro-ops in Silvermont), and feeds the sequence into the execution engine. The chip can then shut down its fetch and decode units while the loop executes, to save power.

The execution units have been redesigned with a different mix of resources. The FPU is largely 128 bits wide, but the floating-point multiplier is 64 bits wide, mirroring the prior-gen Atom architecture.

Out-of-order loads are now supported, naturally. Note the presence of only a single address generation unit, with a reissue queue ahead of it. In a bit of dark magic, the architecture can handle a load and a store in parallel when that queue comes into use. The caches have larger translation lookaside buffers, which should allow for quicker accesses. And store-to-load forwarding has been enhanced, as well.

One consequence of the move to OoO execution is that the pipeline is effectively shorter for instructions that don’t need to access the cache. The penalty for branch misprediction in Saltwell’s in-order pipeline was 13 cycles, but that penalty is reduced to 10 cycles in Silvermont.

The end result of all of Silvermont’s enhancements, from fetch and decode through retirement, is a roughly 50% increase in instruction throughput per clock compared to the generation before. That improvement will be compounded, of course, by higher clock speeds and integration into SoCs with faster complementary subsystems, such as improved memory controllers.

Speaking of which, each dual-core Silvermont module connects to the SoC fabric using a dedicated, point-to-point interface known as IDI. This interface has independent read and write channels, and it features higher bandwidth and lower latency than the old Atom bus, along with support for out-of-order transactions. In the example above, a pair of Silvermont modules connect to the system agent via a pair of IDI links. The system agent then routes requests to the memory controller for access to DRAM.

Oh, I should mention that the L2 cache in each Silvermont module, shared between two cores, is 1MB in size. L2 access latencies have been reduced by two clocks compared to Saltwell, whose L2 cache was smaller at 512KB.

Better burst and power management

The new architecture has gained quite a bit of flexibility and capability in its dynamic frequency and power management schemes. The headliner here is a more capable “burst” mode, as the Atom guys call it, similar to the Turbo Boost feature in Core processors. The prior-gen Atom’s boost feature was fairly simple; it exposed an additional P-state to the operating system to allow higher-speed operation when thermal headroom allowed. The frequencies for Silvermont’s burst mode are managed in hardware and take into account the current thermal, electrical, and power delivery constraints, both locally and at the platform level. We don’t yet have many specifics about the SoCs that Silvermont will inhabit, but we assume an on-chip power microcontroller will be calling the shots.

Silvermont’s more sophisticated power management opens up several notable new capabilities, illustrated in the images above. The example on the left shows power sharing between two cores, where an unoccupied core drops into a sleep state, ceding its thermal headroom to the busy core, which can then operate at a higher frequency than its default baseline. In the middle example, the two CPU cores share power with the SoC’s integrated graphics processor; since the graphics workload is light, both cores can burst up a couple of steps beyond their default speed. In the example on the right, the cores can temporarily step up to a high frequency even under relatively full utilization, so long as platform-level thermals will allow it. All of these behaviors are familiar from larger Intel chips like Ivy Bridge, but the exact algorithms and mechanisms are distinct.

Each Silvermont module is fed by a single voltage plane, but oddly enough, each core in the module can run at its own frequency, independently of the other one. When speeds differ, the shared L2 cache will run at the higher of the two frequencies. The existence of this capability seems rather odd, since we’ve seen a number of x86 processors run into performance problems when threads hop around onto cores running at low frequencies. Still, architects keep building fully independent clocking into their processors. Our understanding is that independent core clocking within a module probably won’t be used in the Bay Trail platform that’s most likely to run Windows or other desktop-class operating systems. Instead, Intel tells us independent clocking schemes might be used in specific scenarios, such as very-low-cost parts where one of the two cores might not operate perfectly at higher frequencies or as an enabler for custom TDPs chosen by the system vendor.

Good power management is largely about taking advantage of the idle time between user inputs, and Silvermont is definitely geared to do that. Each core can drop into the C6 “deep sleep” state independently. When it does so, a power gate will shut off power to the core completely.

Silvermont modules can choose from a suite of C6 sub-states depending on the status of their two cores, as shown above. The L2 cache can be kept fully active, partially flushed, or shut down entirely, with each step into a lower-power state carrying a longer wake-up time.

The 22-nm advantage

One of the great things about being Intel, of course, is having the lead in chip fabrication tech. The firm was first to market with a 3D transistor structure, or FinFET, when it shipped products based on its 22-nm process last year. To date, the company says it has shipped over 100 million processors built on its 22-nm process, and it claims defect densities are now lower than they were with its 32-nm process two years ago. In short, Intel appears to be well over a year ahead of the rest of the industry in terms of process geometries—and even further ahead in productizing FinFETs.

The firm’s 22-nm process technology offers some advantages that seem almost ideally suited to low-power processors. Those start with a threshold voltage for transistor operation that’s about 100 mV lower than with the 32-nm planar transistors on Intel’s older process node. At relatively low voltages, the 22-nm process with tri-gate transistors can operate up to 37% faster than the 32-nm planar process. At higher voltages, it can match the older process’s switching performance while consuming about half the active power.

What’s more, Silvermont-based chips will be built using a variant of this 22-nm process tailored for SoCs. In fact, Intel says the Silvermont architecture and its SoC process variant have been “co-optimized” for one another. Compared to the P1270 process used for Ivy Bridge chips, the P1271 SoC process offers several additional points of flexibility. The SoC process provides more tuning points in the form of lower speed, lower leakage transistors better suited for low-power devices. At the same time, it adds the high-voltage transistors needed for external I/O. These transistors have increased oxide thickness and gate length, and they support both 1.8 and 3.3V operation. Also, the process can be tweaked to provide a range of density options, from 9 to 11 metal layers, at different costs. Interestingly enough, Intel says the 22-nm tri-gate process is better suited for analog devices than the last three generations of planar transistors, as well.