STANFORD — In a presentation today at Hot Chips, AMD unveiled new details of two upcoming microprocessor architectures aimed at the server and mobile markets. Those architectures, codenamed Bulldozer and Bobcat, are AMD's first new-from-the-ground-up designs since the original Opteron, and Bulldozer in particular marks the biggest departure from existing hardware since AMD introduced the original K7 back in 1999.

In this short article, I'll take a look at Bulldozer in order to explore the reasons why AMD made the design decisions that it did.

AMD's official line on Bulldozer is that it's a "third way" between traditional multicore and simultaneous multithreading (SMT). In the former type of design (also called chip multiprocessing, or CMP), each concurrently running thread runs on a separate core; contrast this to SMT, where two or more threads can share the same core by executing simultaneously.

AMD pitches Bulldozer as a middle ground between these two approaches, where two threads share a single front-end but have separate integer execution resources.

After a close look at Bulldozer, I'd describe it as a new form of SMT. AMD's "third way" label isn't inaccurate, but the design is most usefully understood as an approach to SMT that builds on what has come before.

The rationale for SMT

Traditional SMT designs are primarily an answer to the question, "How do we keep the processor busy during long stretches when the currently executing thread has to wait on critical code or data to arrive from main memory?" Actually saving the stalled thread's state and swapping it off the processor for a new thread takes too long, and it involves a bunch of memory accesses anyway. So SMT essentially keeps two or more threads loaded into the processor, and the machine's execution hardware switches between them dynamically based on current conditions (and possibly thread priority). This way, if one thread stalls while waiting on main memory, there are instructions from a different, non-stalled thread right there in the machine that can be immediately executed without any kind of context switch.

In order to make SMT work, all of the code and data storage parts of the processor must be either replicated or partitioned. Remember this point, because we'll come back to it in a moment.

Take a two-way SMT processor, for instance. Such a processor needs two different sets of architectural and rename registers, one for thread A and one for thread B. It also needs plenty of space in any shared instruction queues that make up the instruction window, so that the instruction window can hold enough instructions from both threads to keep the execution units busy.

Finally, any shared structures (instruction buffers in different parts of the pipeline, usually) must be partitioned in such a way that neither thread gets squeezed out of the structure by the other thread. In other words, the two threads have to share nicely with one another, preferably by using some flexible scheme that doesn't involve leaving a bunch of buffer entries empty in situations where there's no second thread.

Because of the added storage and bookkeeping overhead described above, SMT adds a bit to a processor's die area and power budget. But if it keeps the processor from burning up precious cycles and power while waiting on main memory, then the tradeoff is more than worth it.
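The intuition behind SMT's win can be captured in a toy utilization model. This is my own illustration, not anything from AMD's presentation: assume each thread alternates between a fixed burst of compute cycles and a fixed memory stall, and that the core switches threads perfectly whenever one stalls.

```python
# Toy model (illustrative only): each thread alternates between
# `compute` busy cycles and `stall` cycles spent waiting on memory.
def utilization(threads, compute, stall):
    """Fraction of cycles the shared execution units stay busy,
    assuming ideal round-robin switching whenever a thread stalls."""
    demand = threads * compute / (compute + stall)
    return min(1.0, demand)

# Memory-bound threads: 20 compute cycles, then an 80-cycle stall.
print(utilization(1, 20, 80))  # one thread leaves the core idle 80% of the time
print(utilization(2, 20, 80))  # a second thread fills in some of that idle time

# Compute-bound threads: little stalling, so SMT has almost nothing to hide.
print(utilization(1, 90, 10))
print(utilization(2, 90, 10))  # demand exceeds the hardware; capped at 1.0
```

With memory-bound threads, the second thread nearly doubles utilization for a small hardware cost; with compute-bound threads, both threads are fighting over already-busy execution units, which is exactly the bottleneck discussed below.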

The good and the bad of SMT

SMT works great for multithreaded workloads where none of the threads that share the processor need to run full-bore. For a two-way SMT design, if both of the threads are going to spend a lot of time waiting on main memory, then two-way SMT will run those threads just as well as a dual-core processor—and it will do so much more efficiently, because it has just a little more than one core's worth of hardware. In such an ideal scenario, each two-way SMT core is equal to a little less than two cores of an identical non-SMT processor, which makes for fantastic power savings.

AMD claims that in the real world, a single two-way SMT core performs like about 1.3 regular cores, because in practice threads often wait not on main memory but on execution resources to free up. In other words, when main memory is not the bottleneck, the execution units quickly become oversubscribed, and an SMT core ends up a bit less efficient than a non-SMT core: it performs much like a single-core design, but it carries the extra SMT hardware.

As an average, this 1.3 number is probably on the pessimistic end; I've heard efficiency figures as high as 1.7 cores. But all of this is highly workload-dependent, so take these numbers with a large helping of salt. The point is that only in fairly ideal scenarios does an SMT core avoid being bottlenecked by too many instructions waiting on too little execution hardware.

The 'Dozer solution: more more more

AMD's way of breaking the execution unit bottleneck described above is to just replicate the integer execution hardware so that there's one set of integer units per thread. Conventional SMT designs only replicate the storage parts of the processor so that there's, say, one register file per thread; Bulldozer takes it to the next level by replicating the integer units, so that there's one register file and one complete set of integer units per thread.

When you boil it down, the replicated integer hardware is really the main difference between Bulldozer and a conventional two-way SMT design. So Bulldozer isn't "dual-core" in any real sense—it's more like a 1.5-core design, whereas a conventional SMT processor core is really a 1.2-core design.

AMD claims that a single Bulldozer core can execute two threads like a 1.8-core part, on average, which strikes me as a bit optimistic. But again, these numbers will be highly workload-dependent. What isn't in doubt is that Bulldozer will perform much better than a regular SMT design, but at the cost of a ton of additional, very power-hungry integer execution hardware. Whether and how often that tradeoff works to Bulldozer's advantage will depend greatly on the performance and tuning of Bulldozer's cache hierarchy, on whether the core's internal buffers and queues are correctly sized, and on how efficiently and fairly space in those queues is allocated to each thread (especially in cases where one or both threads are greedy and are overusing shared resources).
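Putting the rough figures above side by side makes the tradeoff concrete. The hardware-cost estimates ("1.2 cores" for an SMT core, "1.5 cores" for a Bulldozer module) and the throughput numbers (1.3x and 1.8x) are the loose characterizations used in this article, not measured data:

```python
# Back-of-the-envelope comparison using the article's rough numbers:
# hardware cost in "single-core equivalents" vs. two-thread throughput.
designs = {
    "two plain cores (CMP)": {"hardware": 2.0, "throughput": 2.0},
    "one 2-way SMT core":    {"hardware": 1.2, "throughput": 1.3},
    "one Bulldozer module":  {"hardware": 1.5, "throughput": 1.8},
}

for name, d in designs.items():
    efficiency = d["throughput"] / d["hardware"]
    print(f"{name}: {efficiency:.2f} cores of throughput per core of hardware")
```

On these (very rough) numbers, the Bulldozer module comes out ahead of both alternatives in throughput per unit of hardware, which is presumably the bet AMD is making. Whether the power cost of the duplicated integer units tracks the area cost is a separate question.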

If nothing else, Bulldozer should have very good floating-point performance. AMD claims that since the FPU is one of the shared parts of the machine, engineers could beef it up because the cost of the additional hardware is amortized over the two threads. Given enough memory bandwidth, this chip may be a floating-point monster.

Can Bulldozer save AMD?

Nothing about Bulldozer looks like a huge gamble. When considered in the context of other very wide SMT designs from IBM and Intel, Bulldozer is actually a conservative, evolutionary step forward from what has gone before.

In the world of processor design, evolution is always a lot better than revolution. It's the radical designs that fail to live up to expectations (e.g., Itanium, Pentium 4, IBM Cell), while more conservative, incremental approaches tend to win out in the end.

That said, as incremental improvements go, adding a whole separate set of four integer units is a pretty large increment. This introduces many changes, and there are a ton of knobs that will need to be dialed in to exactly the right value (cache size, cache associativity, cache latency, instruction buffer size, partition policy, decode bandwidth, etc.). Of course, this is always the case with a brand new design, but that's what's perilous for AMD—these values often get tweaked as the design matures, but Bulldozer won't be mature for some time. AMD needs Bulldozer to deliver immediately, though, so the margin for error is zero.

But even if the first Bulldozer products ship on-time and are fully price/performance and performance/watt competitive with Intel, Bulldozer (and Bobcat, which I'll talk more about in a separate piece) may still not be the home run that AMD needs.

Fighting the last war

AMD always succeeds when it attacks Intel not where the latter is strong, but where it is weak. Historically, AMD's biggest wins have come when the company moved into an obvious hole in Intel's product line. For example, when Intel announced that EPIC and Itanium would be its 64-bit upgrade path, AMD countered with x86-64 and scored a huge victory in the server market. When delays with the QuickPath Interconnect forced Intel to stick with its aging frontside bus architecture for far too long, AMD exploited its superior HyperTransport interconnect to pursue the multisocket server market. And when Intel was pushing Rambus RDRAM and, later, power-hungry FB-DIMMs, AMD stuck with cheaper DDR and gained a platform-level performance-per-watt advantage.

Right now, there are no obvious weak spots in Intel's conventional server platform; indeed, Intel's Xeon line is as strong as it has ever been. (Mobile is a different story, but that's a topic for later.) Insofar as Bulldozer is aimed at the server market, AMD is attacking Intel when and where the larger chipmaker is at its absolute strongest.

But notice that I said "conventional server platform" above. There is one obvious gap in Intel's current suite of datacenter offerings: Intel isn't directly pursuing low-power, high-density cloud servers, and this is a gap that both ARM and startups like SeaMicro are looking to fill with very dense server offerings based on mobile technologies (e.g., physicalization solutions).

If I ran AMD, I would redirect the company's effort toward building a low-cost, low-power, high-density, flash-based cloud server platform around Bobcat. Intel's Justin Rattner has admitted that for certain cloud workloads, these types of high-density solutions are superior to a monolithic server chip like Xeon. So AMD should stop obsessing over netbooks and monolithic server parts—both of these amount to fighting the last war—and just jump straight into the cloud server market that ARM is set to tackle with its upcoming Eagle part.

To do this would be to attack Intel where it is weak, because Intel's current answer to this is still in the labs. Intel will probably keep puttering away at its experimental Single Chip Cloud Computer, while pushing Xeon at cloud vendors and losing rack space to ARM-based systems. AMD could jump right in with something like Bobcat and be well-established as the go-to maker of high-density x86 servers before the SCCC makes it to market.

Will AMD take this advice? Probably not, and if it doesn't, Bulldozer had better be very good.