AMD's long-awaited Bulldozer processor finally hit the market this week, and the Web has been flooded with benchmark results. One thing is clear: this won't kill Intel's Sandy Bridge, as some were hoping. Indeed, in some tests, Bulldozer can't even keep up with its predecessor. The launch of the Phenom in 2007 was similarly underwhelming—it arrived late, broken, and slow—but AMD managed to turn things around with Phenom II to produce a viable competitor to many of Intel's processors.

AMD's future success will depend on the company's ability to make lemonade from the Bulldozer lemons. And its ability to do that will be governed by the Bulldozer architecture: is it fundamentally flawed, or are the performance issues merely teething trouble?

It could go either way. With Phenom, the problems were fortunately not fundamental. The biggest single issue was that the cache used to support virtual memory was buggy (a problem known as "the TLB bug"). A BIOS workaround that corrected the processor's behavior was released, but it exacted a severe performance penalty. The bug itself was fixed in silicon partway through Phenom's life, adding another 10 percent to the processor's performance. In late 2008, Phenom II was introduced, boasting substantial improvements in clock speed and a much larger level 3 cache. The K10 architecture used in both Phenom and Phenom II was essentially sound; AMD just had to work out some relatively minor problems before it could achieve its potential.

Contrast this with Intel's Prescott Pentium 4s. Prescott was substantially modified from its predecessor, Northwood, with a much longer pipeline, larger cache, and new instructions. However, it didn't boast consistent performance gains over Northwood, largely because it never achieved the clock speed targets it was intended to reach. The lack of clock speed meant that the processor could never offset the penalties incurred by the long pipeline. The problems Intel faced with scaling its Pentium 4 designs eventually gave the company no option but to abandon the architecture entirely.

Intel, thanks to a combination of massive manufacturing capacity, deep pockets, and multiple design teams, could weather the storm. With the introduction of the Core 2 Duo line, Prescott was abandoned, and Intel has held the performance crown ever since. AMD's position is a whole lot more precarious. The company lacks Intel's riches, so a failed architecture that it can't monetize and evolve over a period of many years could be fatal.

AMD's fortunes may depend on whether Bulldozer is another K10—or whether it is AMD's Prescott.

A brave new design

The Bulldozer architecture is arguably AMD's first radically new architecture since the introduction of the K7 Athlons way back in 1999. Both K8, which added 64-bit and integrated the memory controller, and K10, which added single-chip quad core, more cache, and a host of changes to improve instructions per cycle (IPC), can trace their lineage back to the K7. Bulldozer is something new.

For all the low-level detail of how Bulldozer works, Dave Kanter's write-up at Real World Tech is your best bet. If Kanter's article is a little too low-level, a higher-level overview can be had at Tech Report. I'm not going to talk about every part of the processor's design here, but a number of key points are worth picking out for discussion due to the way they reflect AMD's vision.

The Bulldozer design has been influenced by AMD's long-term beliefs about the way processors should be built. First, the company believes that workloads will become increasingly multithreaded; processors should be optimized for multithreaded throughput—more concurrent threads—rather than single-threaded performance.

Second, it believes that heavy floating point tasks shouldn't be done on the CPU at all. They should execute on GPUs. This belief underscores AMD's Fusion strategy: the integration of CPU cores and GPU cores into accelerated processing units (APUs) so that mathematical tasks can use the GPU cores.

For Bulldozer specifically, additional design influences came into play. In the words of Chief Architect Mike Butler, AMD's goal was to "hold the line" on IPC (presumably meaning to keep it at around the same level as in Phenom II) but to increase the clock speed, thereby achieving improved single-threaded performance, too. The processor also had to be power efficient.

Taken together, these goals explain just about every aspect of Bulldozer's design.

Trade-offs

Bulldozer is based around processing modules, but describing these modules introduces some terminology problems. Like a processor core, each module includes a front-end that fetches and decodes instructions, level 1 and level 2 cache, a branch predictor, out-of-order instruction schedulers, integer and floating point pipelines, and a back-end to retire instructions. Each module can run two threads simultaneously, and here's where the complexity lies. Unlike Intel's Hyper-Threading, where the two threads share all the resources of the core, Bulldozer modules include dedicated integer pipelines, each with its own scheduler and retire unit. For integer-heavy code, the result is that a Bulldozer module is more like two independent cores than it is one; for floating point-heavy code, it's more like one core with Hyper-Threading.

The first Bulldozer design, codenamed Orochi, includes four modules (and therefore, can handle eight threads at a time), a shared 8 MB level 3 cache, four HyperTransport links (though only one is enabled in Zambezi, the desktop-oriented chip; all four are enabled on Valencia, the server part), a dual-channel memory controller, and other miscellaneous support infrastructure.

Eight concurrent threads provide high throughput for highly multithreaded applications. The belief that floating point-heavy workloads should use the GPU justifies the separate integer/shared floating point design. With floating point heavy lifting performed by the GPU, it no longer matters that two threads have to share access to the floating point unit.

The implications of AMD's desire to save power and to boost clock speeds are lower level and widespread. Compared to K10, Bulldozer has fewer per-thread execution resources, longer pipelines, and slower caches, all as a result of these influences. The modular design—in particular, the ability to share the x86 decode units—saves power. x86 is a complicated instruction set, and replicating a decoder for every single core takes a lot of transistors. Bulldozer's decoder is more capable than K10's—it can decode four instructions per cycle to K10's three—but those four instructions are now potentially sourced from two threads, meaning that Bulldozer's per-thread decode bandwidth can effectively be lower than K10's.
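The decode arithmetic is simple enough to sketch. A minimal back-of-the-envelope comparison, using only the per-cycle figures quoted above (real-world throughput depends on instruction mix, cache behavior, and decoder details, so treat this as an upper bound, not a measurement):

```python
# Peak decode bandwidth per thread, using the figures from the text.
K10_DECODE_PER_CORE = 3          # instructions/cycle, one thread per core
BULLDOZER_DECODE_PER_MODULE = 4  # instructions/cycle, shared by two threads

k10_per_thread = K10_DECODE_PER_CORE                    # 3 instr/cycle
bulldozer_per_thread = BULLDOZER_DECODE_PER_MODULE / 2  # 2 instr/cycle when both threads are busy

print(f"K10 per-thread decode:       {k10_per_thread} instr/cycle")
print(f"Bulldozer per-thread decode: {bulldozer_per_thread} instr/cycle (both threads active)")
```

With only one thread running on a module, Bulldozer's thread gets all four decode slots; it's only under full two-thread load that the effective per-thread figure drops below K10's.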

A similar regression can be seen in each integer pipeline. The core elements of the integer pipeline are arithmetic logic units (ALUs), used for performing integer arithmetic, and address generation units (AGUs) that calculate memory addresses for the reads and writes that the processor must perform. A K10 core has three ALUs and three AGUs. Bulldozer discards one ALU and one AGU, having just two of each in each of its integer pipelines. AMD claims that the K10's third AGU was superfluous, only there to make laying out the chip easier (by increasing the commonality between each AGU/ALU pair), but the same is not true of the ALU; K10 could execute up to three integer instructions per thread per cycle. Bulldozer tops out at two.

The situation for floating point is perhaps the worst of all. Each K10 core had three 128-bit floating point units. These could perform x87 scalar floating point, 128-bit SSE vector floating point, 64-bit MMX vector integer, and 128-bit SSE vector integer operations. Bulldozer has four units in its floating point pipeline. Two are for integer operations (64-bit MMX and 128-bit SSE); the other two are for floating point. In addition to the scalar x87 and vector SSE instructions, the two floating point units can be ganged together to perform the new 256-bit Advanced Vector Extensions (AVX) floating point instructions. Given that this pipeline is now shared between two threads, it's a big reduction in per-thread execution resources.
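The "ganging" is easiest to picture as a 256-bit operation split across the module's two 128-bit units, each taking one half (four single-precision lanes). A toy model of that splitting, assuming nothing beyond what's described above—the function names here are illustrative, not AMD's:

```python
# Toy model: a 256-bit AVX add executed by ganging two 128-bit FP units.
def fp_unit_add(a_half, b_half):
    """One 128-bit unit: adds four single-precision lanes."""
    return [x + y for x, y in zip(a_half, b_half)]

def avx256_add(a, b):
    """A 256-bit add: unit 0 takes the low 128 bits, unit 1 the high 128 bits."""
    assert len(a) == len(b) == 8  # eight 32-bit lanes
    low = fp_unit_add(a[:4], b[:4])
    high = fp_unit_add(a[4:], b[4:])
    return low + high

print(avx256_add([1.0] * 8, [2.0] * 8))  # [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
```

The corollary is that while one thread is running a 256-bit AVX instruction, both units are busy and the module's other thread has no floating point unit to use.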

Not everything has fewer resources; the instruction buffers used for out-of-order execution are larger, meaning that Bulldozer has more instructions eligible for execution. This should allow it to fill its pipelines on a more consistent basis. Bulldozer also supports some potent new instructions. It has AVX, but also features some AMD-specific ones such as a combined FMA ("fused multiply add") instruction that performs a floating point addition and multiplication in a single instruction, which can double floating point throughput for code that can use it. But for code that already dispatches more than two instructions per cycle, and which doesn't use the new instructions, Bulldozer can definitely fall behind its predecessor.
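The "double the throughput" claim for FMA comes straight from instruction counting. A sketch using a dot product—the classic multiply-then-accumulate workload—where each fused multiply-add is counted as one instruction instead of two:

```python
# Instruction-count sketch of FMA's benefit on a dot product.
def dot_separate(a, b):
    """Dot product with separate multiply and add: 2 FP instructions per element."""
    acc, ops = 0.0, 0
    for x, y in zip(a, b):
        prod = x * y      # one multiply
        acc = acc + prod  # one add
        ops += 2
    return acc, ops

def dot_fma(a, b):
    """Dot product where acc + x*y is modeled as a single fused instruction."""
    acc, ops = 0.0, 0
    for x, y in zip(a, b):
        acc = acc + x * y  # one FMA
        ops += 1
    return acc, ops

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(dot_separate(a, b))  # (32.0, 6)
print(dot_fma(a, b))       # (32.0, 3)
```

Same result, half the instructions—but only for code that is recompiled (or hand-tuned) to use the new instructions, which is exactly why the benefit didn't show up in launch-day benchmarks.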

The quest for higher clock speeds also caused AMD to lengthen the pipeline. The company has not disclosed the actual length, but it's estimated at around 20 stages compared to the low-to-mid teens for K10 and Sandy Bridge. Longer pipelines are, all other things being equal, easier to run at higher clock speeds, but they also mean that the penalty when a branch is incorrectly predicted is higher. Similarly, the cache and main memory latencies are longer than they are for K10 (four cycles compared to three for level 1 cache; 21 cycles compared to 14 or 15 for level 2; 65 compared to 55 or 59 for level 3; and 195 versus 182 or 157 cycles for main memory). K10's latencies were already worse overall than Sandy Bridge's (which boasts 4, 11, 25, and 148 cycle latencies, from level 1 through to main memory), and Bulldozer makes them worse still.
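The cost of the longer pipeline can be roughed out: every misprediction flushes the pipeline, wasting on the order of pipeline-depth cycles. A sketch using the stage estimates quoted above—the misprediction rate and branch fraction here are illustrative assumptions, not measured figures:

```python
# Rough per-instruction cost of branch mispredictions for two pipeline depths.
def mispredict_overhead(pipeline_depth, mispredict_rate, branch_fraction=0.2):
    """Average wasted cycles per instruction: depth * (mispredicts per branch) * (branches per instruction)."""
    return pipeline_depth * mispredict_rate * branch_fraction

for name, depth in [("K10 (~14 stages)", 14), ("Bulldozer (~20 stages)", 20)]:
    penalty = mispredict_overhead(depth, mispredict_rate=0.05)
    print(f"{name}: ~{penalty:.2f} extra cycles per instruction")
```

Under these assumptions the 20-stage design pays roughly 40 percent more misprediction overhead than the 14-stage one, which is the tax that higher clock speeds were supposed to pay off.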

Again, the news here isn't all bad; the caches are larger than in the older processors, and they offer more bandwidth. For some workloads, this will work in Bulldozer's favor—but it's a trade-off.