Intel's current generation of Westmere parts marks the end of one era and the start of another. On the one hand, Westmere kicks off the 32nm process node, which is more important than normal because it marks the point at which you can make an x86 system on a chip (SoC) for every segment, from embedded up to servers. On the other hand, Westmere marks the end of the P6 era: the architecture that has powered Intel's most successful processors since its debut with the Pentium Pro makes its curtain call with Westmere. In its place is a brand-new architecture, which Intel has codenamed Sandy Bridge.

In terms of its overall block diagram, Sandy Bridge will look familiar to any student of the P6 architecture. There's a similar mix of execution units, although the labor has been redistributed a bit (e.g., the AGUs are now general-purpose, rather than specialized for loads or stores). And there's the same group of three main dispatch ports, with ports 0, 1, and 5 still hosting the actual scalar and vector math hardware.

But if you look a bit deeper, you can see that Sandy Bridge isn't yet another P6 derivative from Intel. It looks a bit like a fusion of the Pentium 4 architecture and the P6; or, alternatively, you could say that it looks like neither. Whatever you call it, it's clear that Sandy Bridge is the first truly new microarchitecture from Intel since Atom, and it's the first new desktop microarchitecture since the Pentium 4. It's such a departure from the P6 lineage that we can say that the P6 line, which began with the Pentium Pro, officially ends with the current-generation, 32nm Westmere processors.

Two very good articles have taken an in-depth look at Sandy Bridge, one from David Kanter at Realworldtech and another from Anand. Both of them compare Sandy Bridge to the P6 and P4, and both are worth checking out for the details.

In this article, I'll draw on both of those pieces, which came out of technical sessions at IDF, to point out some highlights of Sandy Bridge's design. But before I dive in, I'll scoop myself by giving away the punchline. If you can, just skip Westmere and wait for Sandy Bridge, because Intel's new architecture is probably the best of the classic, per-thread-performance-focused architectures ever attempted. It also may be the last.

Front-end: uop cache

Sandy Bridge's most obvious point of similarity with the P4 is its uop cache. But this similarity is mostly superficial.

Time and again in processor technology, it turns out that radical leaps tend to fail, while the incremental approach wins the day. The Pentium 4's trace cache, along with much else about the P4 architecture, is a prime example of this phenomenon in action. The trace cache was a totally new structure that aimed to take the slow, bulky, power-hungry x86 decode hardware out of the critical execution path for most code. The idea was that on the first pass through the instruction stream, the P4's front-end would decode x86 instructions into small, fixed-length uops (or "micro-ops"), and that these uops would then be stitched together into "traces" and stored in a special cache. Ideally, most L1 cache accesses could then be diverted to this trace cache, so that uops that were already decoded could then stream into the execution core.
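The decode-once idea at the heart of the trace cache can be sketched in a few lines of Python. This is a toy model with made-up uop names and addresses, not a description of real x86 decode, which is vastly more complicated; it only illustrates paying the decode cost on the first pass and streaming cached uops afterward.

```python
# Toy sketch of the trace cache's decode-once idea (hypothetical uop names).
decode_count = 0   # how many times the slow decoder actually ran
uop_cache = {}     # maps instruction address -> list of decoded uops

def fetch_uops(addr, x86_insn):
    """Return the uops for one x86 instruction, decoding only on a miss."""
    global decode_count
    if addr not in uop_cache:
        decode_count += 1                                  # slow path: full x86 decode
        uop_cache[addr] = [f"{x86_insn}.uop{i}" for i in range(2)]
    return uop_cache[addr]                                 # fast path: cached uops

# A two-iteration "loop" over the same two instructions: each instruction is
# fetched twice, but the slow decoder runs only once per instruction.
for _ in range(2):
    fetch_uops(0x1000, "add")
    fetch_uops(0x1004, "mov")
print(decode_count)  # 2
```

The payoff is exactly the one the P4 designers were after: in loop-heavy code, almost all fetches after the first iteration skip the decoder entirely.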

As cool as it was, the trace cache came with a ton of added complexity. It had its own little mini-front-end behind it, which duplicated much of the main front-end hardware. And because the machine was supposed to bypass the main front-end most of the time, the main decoders were weakened, so that when you did have to go out to the L1 and then do x86 decode, the process was slower.

Sandy Bridge's uop cache isn't anything like the trace cache, other than the fact that it caches decoded uops. The uop cache is a subset of the L1, so you can think of it as a very fast "L0," if you like. Or, you can think of it as a slice of the L1 that's faster and that caches uops instead of x86 instructions. Either way, the new cache's aim is far less ambitious than replacing the L1 for most accesses and cutting out the x86 decode phase. Rather, the uop cache is very much a proper "cache," in that its aim is to lower the average effective latency of a larger storage pool by inserting a little patch of faster memory in between that pool and the execution units.
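The arithmetic behind that aim is just the standard average-access-time calculation. A minimal sketch, with hypothetical cycle counts and hit rate (these are illustrative numbers, not Intel's published figures):

```python
def effective_latency(l0_hit_rate, l0_cycles, l1_cycles):
    """Average fetch latency: hits are served by the small fast cache,
    misses fall back to the larger, slower pool behind it."""
    return l0_hit_rate * l0_cycles + (1.0 - l0_hit_rate) * l1_cycles

# Hypothetical figures: a 1-cycle uop cache in front of a 4-cycle L1 fetch
# (plus decode), with 80% of fetches hitting the uop cache.
print(effective_latency(0.8, 1, 4))
```

Even with a modest hit rate, the average latency lands much closer to the small cache's latency than the big one's, which is the whole point of interposing it.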