AMD's forthcoming 32nm Bulldozer design is novel in a number of respects, and at this past ISSCC, AMD presented two technical papers that went into detail on the integer and floating-point parts of the design. The bulk of each paper covered circuit design details, but quite a few higher-level bits of information came out as well.

We've previously given a high-level overview of Bulldozer's approach to power-efficient performance, so check out our earlier coverage for the big picture. In this short article, we'll fill in some of the details from AMD's papers.

We've previously described Bulldozer as a "1.5-core" design, and that characterization still holds. The core represents a kind of extreme approach to simultaneous multithreading: instead of just replicating some of the instruction-flow parts of the machine, AMD has also replicated the entire integer execution block. In one of the papers, AMD gave fresh details about the design of this integer block, and specifically about the out-of-order window.

Each Bulldozer integer block features a 40-entry out-of-order instruction scheduler that's designed from the ground up to conserve power. In general, AMD has made tradeoffs in favor of leaving data where it is and using the out-of-order window to hold and juggle pointers to that data. This approach also entails multiple copies of the physical register file to keep data close to where it's needed. Contrast Bulldozer's "stationary data in multiple physical register files, plus pointers in the out-of-order window" approach with that of, say, the classic Pentium Pro family reorder buffer (ROB), where register data is actually stored in the ROB itself. The PPro approach may be more compact, but it was always fairly power hungry. Bulldozer's approach spreads out to take up more die space, but the upside from spreading the work around to more structures is that you get finer granularity of control in managing power by turning blocks on and off.
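The pointer-based approach can be illustrated with a toy software model. The sketch below is purely illustrative and not AMD's actual hardware: data sits stationary in a physical register file (PRF), a rename map points architectural registers at PRF slots, and scheduler entries carry only pointers (PRF indices) rather than operand values.

```python
# Toy model of a PRF-based rename scheme: scheduler entries hold only
# pointers (PRF indices); the data itself never moves. Illustrative
# sketch only -- not a model of AMD's actual implementation.

class PRFMachine:
    def __init__(self, num_arch_regs=4, num_phys_regs=8):
        self.prf = [0] * num_phys_regs           # data lives here, stationary
        self.free = list(range(num_arch_regs, num_phys_regs))
        # rename map: architectural register name -> physical register index
        self.rename = {f"r{i}": i for i in range(num_arch_regs)}
        self.sched = []                           # entries are pointers only

    def issue_add(self, dst, src1, src2):
        p_dst = self.free.pop(0)                  # allocate a fresh PRF slot
        # the scheduler entry is three pointers -- no operand data is copied
        self.sched.append(("add", p_dst, self.rename[src1], self.rename[src2]))
        self.rename[dst] = p_dst                  # later readers follow this

    def execute_all(self):
        for op, p_dst, p_a, p_b in self.sched:
            self.prf[p_dst] = self.prf[p_a] + self.prf[p_b]
        self.sched.clear()

    def read(self, reg):
        return self.prf[self.rename[reg]]

m = PRFMachine()
m.prf[m.rename["r1"]] = 5
m.prf[m.rename["r2"]] = 7
m.issue_add("r0", "r1", "r2")    # scheduler holds pointers, not the 5 and 7
m.execute_all()
print(m.read("r0"))              # 12
```

A ROB-style design, by contrast, would copy result values into the reorder buffer itself, which is why the PPro-style window is more compact but keeps one large, hot structure busy.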

The integer scheduler is capable of issuing four instructions per cycle to one of four pipelines in the integer execution block. The integer block has a single-cycle bypass network for feeding results directly back into the arithmetic-logic units (ALUs) for dependent instruction sequences (i.e., the results bypass the register file). Again, this is more costly from a die area standpoint, but it boosts performance and probably results in a net power savings.
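The effect of a bypass network can also be sketched in software. In the toy pipeline below (an illustrative model, not AMD's design), writeback to the register file lags execution by a cycle, so a dependent instruction reads its input from a forwarding latch instead of waiting for the stale register file copy.

```python
# Toy single-cycle bypass: a dependent instruction picks up the previous
# result from a forwarding latch instead of waiting for register writeback.
# Illustrative only; a real bypass network is a wide mux tree in hardware.

def run(program, regs):
    bypass = None                # (dest_reg, value) produced last cycle
    pending_writeback = None     # writeback lags execute by one cycle
    for op, dst, a, b in program:
        def read(r):
            # the forwarded value wins over the (stale) register file copy
            if bypass and bypass[0] == r:
                return bypass[1]
            return regs[r]
        result = read(a) + read(b) if op == "add" else read(a) - read(b)
        if pending_writeback:
            regs[pending_writeback[0]] = pending_writeback[1]
        pending_writeback = (dst, result)
        bypass = (dst, result)
    if pending_writeback:        # drain the final writeback
        regs[pending_writeback[0]] = pending_writeback[1]
    return regs

regs = {"r0": 0, "r1": 3, "r2": 4, "r3": 0}
prog = [("add", "r0", "r1", "r2"),   # r0 = 7
        ("add", "r3", "r0", "r1")]   # needs r0 immediately -> via bypass
print(run(prog, regs)["r3"])         # 10
```

Without the forwarding latch, the second instruction would have to stall until r0's writeback completed; the bypass lets back-to-back dependent instructions issue in consecutive cycles.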

As mentioned above, there are four independent integer pipelines: two address-generation units (AGUs) and two integer ALUs. This four-pipe division of labor is similar to that of previous AMD architectures.

Bulldozer's single floating-point unit (FPU) has been beefed up substantially versus its predecessors, the idea being that it can afford to be a bit bigger, because each pair of integer blocks shares one FPU, so there's one less FPU on the die than there would otherwise be. The FPU has a 60-entry instruction scheduler that can dispatch up to four instructions per cycle to the floating-point execution pipes. The FPU supports a number of new SIMD and encryption-related instructions, and it also features faster floating-point multiply-accumulate (FMAC) hardware.
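The numerical point of fused multiply-accumulate is that a*b + c is computed with a single rounding step, instead of rounding the product and then rounding the sum. The sketch below emulates fused semantics in software using exact rational arithmetic; it illustrates the rounding behavior only, and says nothing about how AMD's FMAC hardware is built.

```python
# Demonstrate why fused multiply-add differs from multiply-then-add:
# the fused form computes a*b + c exactly and rounds once at the end.
from fractions import Fraction

a = 1.0 + 2.0**-29
b = 1.0 + 2.0**-30
c = -1.0

# Unfused: a*b is rounded to a double first, losing the tiny 2**-59 term,
# and then the addition rounds again.
unfused = a * b + c

# Fused semantics, emulated with exact rationals: one rounding at the end.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print(unfused < fused)   # True: the unfused result lost the 2**-59 term
```

Beyond accuracy, an FMAC unit performs two floating-point operations per instruction, which is why it's a throughput win for dot products, matrix multiplies, and similar kernels.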

At 213 million transistors per "dual-core" module, Bulldozer seems svelte for a dual-core part, but that depends on how you view what AMD has produced—a really lean dual-core design, or a really fat single-core SMT design. Either way, the benchmarks will tell the tale, and we'll refrain from making any predictions one way or the other about this novel architecture until we see some hard data.