2011 has been a good year for AMD; the company has successfully launched both its new netbook platform (Brazos) and its mainstream mobile architecture (Llano). The greatest challenge, however, is still ahead. AMD’s next-generation CPU, codenamed Bulldozer, is scheduled to launch in the next few weeks. While its initial impact will be limited, AMD’s 2012 server, desktop, and mainstream mobile roadmaps are all relying on Bulldozer to drive corporate profits and hopefully regain lost market share.

Bulldozer: a product of its time

The key to understanding Bulldozer is rooted in AMD’s competitive position vis-à-vis Intel. Intel currently has an approximate 18-month lead on its smaller rival when it comes to moving to new process technologies, while its gross margin is 16 percent higher. When it comes to research and development, the gap between the two companies is enormous. Intel spent $3.9B on R&D in Q2; AMD’s total revenue for the same period was $1.57B. What this means in the real world is that Intel can build more CPUs per silicon wafer than AMD can, earns more money per sale, and has a research fund that’s nearly 2.5x AMD’s quarterly revenue.
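The "nearly 2.5x" figure follows directly from the two numbers quoted above. A minimal sketch of the arithmetic, using only those figures (the function and variable names are illustrative, not from any source):

```python
# Figures as quoted in the article, in billions of US dollars (Q2 2011).
intel_rd_spend = 3.9   # Intel R&D spending, per the text
amd_revenue = 1.57     # AMD total revenue for the same quarter, per the text


def spend_to_revenue_ratio(spend: float, revenue: float) -> float:
    """Return how many times larger `spend` is than `revenue`."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return spend / revenue


ratio = spend_to_revenue_ratio(intel_rd_spend, amd_revenue)
print(f"Intel R&D spend is {ratio:.2f}x AMD's quarterly revenue")
```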

Historically, AMD has ignored these factors and designed cores that were meant to go toe-to-toe with Intel’s best in terms of single-threaded performance. This has proven to be an ineffective strategy.* AMD has never been able to retain any performance crown it took from Intel, and the cost of attempting to do so nearly killed the company. Bulldozer breaks with these trends. It’s explicitly designed to lower AMD’s manufacturing costs, play to the company’s strengths, and help it achieve competitive parity with Intel over the long term. Doing so necessitated some short-term tradeoffs, but they’ve been made in an intelligent manner. The result is an x86 processor that’s different from anything we’ve seen from either AMD or Intel before.

Share and share alike: Bulldozer’s modular approach

A conventional dual-core processor duplicates all of a CPU’s execution resources and top-level cache, with the two cores potentially linked by a lower-level cache. Bulldozer, in contrast, is designed to share resources, including its scheduler, decode hardware, and floating-point unit. It retains separate integer units as a holdover from “true” dual-core days, but these units are designed differently from what we saw in AMD’s previous Istanbul core. AMD claims that the new design will let the chip make better use of available throughput and has stated Bulldozer will offer substantially higher integer performance than previous Istanbul-based processors.



A conventional dual-core design

The situation with the floating-point unit is more complex. AMD claims that FPU code accounts for only about 40 percent of a server workload. It therefore made sense for the company to cut die size here; a well-shared FPU should be capable of handling the workloads of two CPU cores with only a minor performance hit. Each Bulldozer module shares a single FPU, but the FPU aboard Bulldozer is substantially larger than its Istanbul counterpart. In theory, this should mitigate any potential performance impact.
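To get a rough feel for why a shared FPU might cost relatively little, here is a deliberately simple toy model. It is not AMD's analysis. It assumes each of the two cores in a module independently wants the FPU in a given cycle with probability equal to the 40 percent figure quoted above, and it treats the FPU as able to serve only one core per cycle. Both assumptions are hypothetical, and the second ignores the larger FPU described in the text, so the model likely overstates contention.

```python
def fpu_contention(fp_fraction: float) -> dict:
    """Toy model of two cores sharing one FPU.

    Assumes (hypothetically) that each core independently requests the
    FPU in any given cycle with probability `fp_fraction`.
    Returns the probability that neither, exactly one, or both cores
    request the FPU in the same cycle.
    """
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    both = fp_fraction ** 2             # both cores want the FPU: a conflict
    neither = (1.0 - fp_fraction) ** 2  # FPU sits idle
    exactly_one = 1.0 - both - neither  # FPU used with no conflict
    return {"neither": neither, "exactly_one": exactly_one, "both": both}


# 0.4 is the share of FPU code AMD is quoted as claiming for server workloads.
result = fpu_contention(0.4)
for outcome, probability in result.items():
    print(f"{outcome}: {probability:.2f}")
```

Under these assumptions, the two cores collide on the FPU in only a minority of cycles, which is the intuition behind trading a second FPU for die area. Real workloads are bursty rather than independent, so actual contention could be better or worse than this sketch suggests.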



Bulldozer’s shared fetch/decode units

Reality, unfortunately, is likely to be a great deal messier. Benchmark results that surfaced early this spring showed Bulldozer’s performance varying wildly when compared to a previous-generation Magny-Cours processor. RealWorldTech’s David Kanter wrote: “the data suggests that for a number of applications, Bulldozer will have comparable IPC [instructions per clock cycle] to its predecessor; sometimes better, sometimes slightly worse. Yet at the same time, the data also implies a very real risk that some workloads may hit particular bottlenecks in the architecture and suffer greatly.”

Bulldozer’s launch performance, in other words, may resemble that of the original Pentium 4. At launch, the P4’s performance could vary by as much as 40 percent depending on whether or not an application supported the then-new SSE2 instruction set. Performance in native x87 FPU code, by contrast, was much poorer; the 1.5GHz P4 was often outperformed by Pentium 3 and Athlon processors running at 1GHz or less. Over time, updated libraries and widespread SSE2 support made x87 performance a non-issue. In Bulldozer’s case, AMD has added support for AVX (Advanced Vector Extensions) instructions and has long-term plans to move floating-point workloads towards the GPU with the Fusion architecture.

Competitive positioning

The first generation of Bulldozer products will feature up to eight cores on Socket AM3+. AMD has yet to officially confirm any clock speeds or prices, but available information on Bulldozer’s target clock speeds makes it reasonable to assume that launch speeds for the eight-core parts will be comparable to current six-core Thubans (2.8-3.3GHz). More generally, Zambezi base clock speeds should be 100-200MHz faster than current Phenom II parts of the same core count.

AMD isn’t relying on Zambezi to redefine the desktop performance landscape. Instead, the new architecture simply needs to move the performance bar substantially forward and give OEMs reason to believe that future Bulldozer-based products will be capable of competing with Intel’s offerings in ways current 45nm Phenom II and 32nm, K10-derived Llano processors can’t. Desktop vendors compete heavily on price, and while AMD clearly wants to recapture part of the enthusiast market, the likes of Dell and HP are going to be primarily concerned with how well the company competes with the lower tiers of Intel’s Core i3/i5 lineup.

If Sunnyvale’s roadmaps are accurate, the Bulldozer core about to debut will be replaced within a year by what the company is calling an Enhanced Bulldozer core. The EB variation will power both the Trinity APU and the enthusiast-oriented deca-core Komodo CPU. The speed of the respin may reflect AMD’s ongoing research into Bulldozer’s ideal configuration. Blending Bulldozer’s fetch, decode, and execution units gave AMD the opportunity to significantly improve the processor’s performance density, but it introduced an additional layer of complexity as well. It’s quite unlikely that Sunnyvale nailed everything perfectly the first time, and future enhancements and efficiency improvements will hopefully keep the core’s relative performance steady once Intel debuts Ivy Bridge next year.

* The reason this strategy has been ineffective varies depending on which company you ask. AMD’s antitrust lawsuit against Intel, which the two companies have since settled, claimed that Intel used a system of exclusionary rebates and predatory pricing that prevented AMD from competing effectively. Intel blames AMD’s own internal problems.