Tick-tock. Tick-tock.

The sound of Intel’s ongoing CPU development cycle has been constantly in the backdrop for its biggest competitor, AMD, ever since the world’s largest chipmaker set an aggressive cadence for itself more than five years ago. Since then, Intel has turned over new manufacturing technologies followed by extensively revised CPU architectures in relentless succession. The introduction of Sandy Bridge processors at the beginning of this year put Intel firmly in the lead in terms of overall performance, power efficiency, and the value proposition offered to consumers.

Being the perennial number-two CPU maker in such a competitive context can’t be easy, but AMD hasn’t taken the challenge lightly. In fact, the firm has been working for several years on a brand-new breed of PC processors based on a fresh microarchitecture, code-named “Bulldozer,” that aims to restore some competitive balance. Nearly every CPU AMD has made for the past decade-plus (with the exception of the low-power Ontario/Zacate E-series APUs) has been derived from the original K7, the chip first known as Athlon. Bulldozer draws on that tradition in various ways, but it is a novel, clean-sheet design intended to take AMD processors into their next era.

To that end, Bulldozer introduces some unorthodox concepts into the PC processor space. The first of those is a dual-core “module” as a fundamental building block. To date, we’ve seen x86-compatible CPU cores capable of tracking and executing two threads via a feature known as simultaneous multithreading (SMT), better known by its Intel marketing name, Hyper-Threading, and we’ve had a number of chips with multiple cores onboard in a chip-level multiprocessor (CMP) configuration, tracing back to the original Athlon 64 X2. The Bulldozer module is sort of a mid-point between those two familiar arrangements. AMD says the module has “two tightly coupled integer cores” with some sharing of resources—including, notably, the FPU. The idea is to save space on the silicon die by pooling resources where possible while still offering “robust” performance on both threads, with fewer of the performance hazards created by SMT or Hyper-Threading.

At the same time, Bulldozer resurrects a concept that’s fallen out of favor in PC processors in recent years: it’s a “speed demon,” optimized for higher clock frequencies rather than maximum instruction throughput in each clock cycle. The Pentium 4 “Netburst” microarchitecture—particularly in its troubled “Prescott” incarnation—gave frequency-optimized designs a reputation for high power draw and iffy performance. Yet Chief Architect Mike Butler told us the engineering team’s goal with Bulldozer was to “hold the line” on instructions per clock (presumably at about the same rate as the Phenom II) and to “aggressively pursue higher frequencies.” Speed demons have typically reduced the amount of work done at each stage of the pipeline in order to simplify logic and thus enable higher operating frequencies, but this approach can also theoretically help manage power consumption. The rationale, if we understand Butler correctly, is that a design with a relatively low number of gates flipping at each pipeline stage may require less voltage to operate at a given frequency. Chip power consumption has three main determinants: clock speed, the number of transistors flipping, and the square of the voltage. That voltage squared term is, obviously, the single biggest factor in the power equation, so a design capable of keeping voltage in check could make some sense for today’s power-constrained world.
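The power relationship described above is commonly approximated as activity × capacitance × voltage² × frequency. Here is a minimal sketch of that arithmetic. Every number in it is invented for illustration and none is a Bulldozer figure; the function name `dynamic_power` is our own.

```python
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """Approximate switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

# Hypothetical operating points, chosen only to show the scaling.
baseline = dynamic_power(0.2, 1.0e-7, 1.30, 3.6e9)
lower_v = dynamic_power(0.2, 1.0e-7, 1.17, 3.6e9)   # voltage cut by 10%
lower_f = dynamic_power(0.2, 1.0e-7, 1.30, 3.24e9)  # clock cut by 10%

print(f"10% lower voltage: {lower_v / baseline:.0%} of baseline power")
print(f"10% lower clock:   {lower_f / baseline:.0%} of baseline power")
```

Because voltage enters as a square, a 10% voltage reduction trims roughly 19% of the switching power, while a 10% clock reduction trims only 10%.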

We don’t know precisely how aggressively AMD has pursued the speed-demon approach. When we asked, AMD declined to tell us the number of stages in Bulldozer’s main pipeline. This behavior seems unusually guarded. We’ve been writing about these things for over a decade, and an outright refusal to disclose pipeline depth in a major x86 processor is very rare. Our sense is that it’s somewhere between the 12-14 stages of contemporary Core- and Phenom-branded chips and the astounding 31 stages in Prescott. I expect we’ll learn more about Bulldozer’s inner workings as time passes.

At any rate, here’s the big picture. The first incarnation of the Bulldozer architecture is a formidable chip, with four modules onboard. That gives it a total of eight integer cores, with four floating-point units. Each module has 2MB of L2 cache, and there’s a shared third-level cache of 8MB. This chip retains compatibility with AMD’s existing system architecture, so it has an integrated memory controller with support for dual channels of DDR3. Also present are four HyperTransport links, only one of which will be used in desktop products.

Inside the module

Because Bulldozer is what it is—an all-new, high-performance x86-compatible processor—it’s incredibly complex and difficult to summarize. Nevertheless, we’re going to make a quick attempt, with the assistance of the block diagram below, which provides a high-altitude overview of Bulldozer’s key components.

The sharing in a Bulldozer module starts with the front end, where the branch prediction, instruction fetch, and decode units track two threads and service both cores. With two integer cores featuring relatively long pipelines to keep fed, the front-end hardware must be very effective at its job in order for the whole chip to function efficiently.

The decode units dispatch ops, or decoded instructions, to the two integer cores on an interleaved, every-other-cycle basis. Each of those cores has a pair of ALUs, and each ALU has an associated address generation unit. Thus, individual Bulldozer cores have fewer execution resources than those in the preceding Deneb/Thuban architecture. However, instruction scheduling is more flexible, and beyond the obvious increase in integer core counts, Bulldozer seeks to make things up in other ways.

One of those ways is a vastly reworked memory subsystem that looks very different from those in prior AMD chips. Among other things, the memory pipeline can speculatively move loads ahead of stores if doing so won’t cause a problem, a capability Intel has called memory disambiguation. Memory access latencies should be further reduced by the use of multiple data prefetchers that operate according to different rules in order to keep the caches populated with, hopefully, the appropriate data for the cores’ upcoming work. Both the prefetchers and the L2 cache into which they pull data are shared between the two cores in a module, assuming both cores have active threads. If only one thread is active, these resources are fully used by that single thread.

Another major shared resource in the Bulldozer module is the floating-point unit, which has been spun off into a co-processor arrangement in which both integer cores act as clients. This setup is quite different from the intermixed integer and FP execution resources in Sandy Bridge, and AMD has hinted that it may pave the way for a GPU-type shader array to one day take the place of the traditional FPU. For now, though, Bulldozer’s FPU is quite formidable in its own right. The scheduler can track two threads, of course, and the execution units include dual FMAC units capable of processing 128-bit vectors in a single clock cycle, along with dual 128-bit integer units (marked as “MMX” in the diagram above). Yes, that means integer SIMD goodness happens in the FPU, as well as floating-point math.

The fact that both Bulldozer and Sandy Bridge, two substantially new x86 microarchitectures, have hit the streets within the same calendar year isn’t entirely coincidental. The common thread is the advent of the follow-on to SSE, the extended instruction set known as Advanced Vector Extensions, or AVX. AVX increases parallelism by extending the width of vectors from 128 to 256 bits, and supporting those wider datatypes requires the broad reworking of the processor’s execution engine. The result should be much higher peak computational throughput on data-parallel workloads.

However, the path to that destination will have a few twists and turns. After initially proposing its own 256-bit vector extensions known as SSE5, AMD has reversed course and attempted to follow Intel by making Bulldozer compatible with AVX, instead. As that change was happening, Intel apparently was modifying its own course, as well. So Bulldozer catches up with Sandy Bridge on nearly every front, adding support for SSE 4.1 and 4.2 and most of AVX, including the AES instructions for accelerating encryption. It also includes support for AMD’s own XOP extensions, a surviving bit of SSE5 with more of a focus on integer datatypes. Where Bulldozer moves beyond Sandy Bridge, though, is with those two 128-bit FMAC pipes—and there, we get into disputed territory.

The dispute is over the FMAC instruction, which is the key to unlocking AVX’s peak potential. FMAC stands for “fused multiply-accumulate,” an operation that can be described logically as: “d = a + b * c”. Instructions that combine a multiply and an add together tend to map well to multimedia workloads, and they have been a staple of GPU shader cores for quite some time. Doing both operations at once has a performance benefit, obviously: the processor is executing two floating-point operations (FLOPS) per clock cycle. The FMAC form of this instruction has a further precision advantage because the results of one operation are fed directly into the other, at the chip’s full internal precision, without being stored. These virtues have made FMAC very popular in other chips, including DirectX 11-class GPUs.
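The precision advantage is easy to demonstrate in software. The sketch below does not use any hardware FMAC instruction. It emulates the fused behavior by computing the exact result with Python's `fractions` module and rounding once, and compares that against the ordinary two-step version that rounds after the multiply and again after the add. The function names and input values are ours, picked so the difference is visible.

```python
from fractions import Fraction

def unfused_multiply_add(a, b, c):
    """Multiply and round to double, then add and round again."""
    return a + b * c

def fused_multiply_add(a, b, c):
    """Emulated fused form: exact arithmetic, then a single rounding."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)

a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30   # b * c is exactly 1 - 2**-60

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With these inputs, the intermediate product rounds to 1.0 in the unfused path, so the small term is lost and the result is zero. The fused path keeps it and returns -2⁻⁶⁰.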

Bulldozer is the first x86 CPU to support FMAC. Sandy Bridge doesn’t, and the upcoming Ivy Bridge won’t, either. Instead, Intel intends to add FMAC support to Haswell, its next architectural refresh, due in 2013. Trouble is, Bulldozer supports a version of FMAC with four operands, while Haswell will support a three-operand variant of FMAC. This sort of incompatibility isn’t a good thing when you’re trying to persuade software developers to use your new instructions. AMD seems to recognize that fact, so it plans to add FMAC3 support in the next version of Bulldozer, code-named Piledriver, alongside FMAC4. The FMAC4-only chip we’re looking at today, though, will always be something of an oddity, as a result.

All of this madness still leaves Bulldozer in decent shape, FPU-wise, but not quite indisputably at the head of the pack. Even without FMAC support, Sandy Bridge still has two 256-bit vector units, so it can produce a 256-bit add and a 256-bit multiply in a single clock cycle. Bulldozer can theoretically match Sandy’s peak throughput, either by processing dual 128-bit FMACs or a single 256-bit FMAC per cycle, but it can’t match Sandy without FMAC.
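The peak-rate comparison above comes down to simple lane counting. This sketch assumes single-precision (32-bit) elements and assumes that, without FMAC, each Bulldozer pipe issues either one add or one multiply per cycle. Both assumptions are ours, as are the function names.

```python
def lanes(vector_bits, element_bits=32):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits

def bulldozer_module_flops_per_cycle(use_fmac):
    # Two 128-bit FMAC pipes per module; a fused op counts as two FLOPs per lane.
    flops_per_lane = 2 if use_fmac else 1
    return 2 * lanes(128) * flops_per_lane

def sandy_bridge_core_flops_per_cycle():
    # One 256-bit add plus one 256-bit multiply per cycle.
    return lanes(256) + lanes(256)

print("Bulldozer module, FMAC:   ", bulldozer_module_flops_per_cycle(True))
print("Bulldozer module, no FMAC:", bulldozer_module_flops_per_cycle(False))
print("Sandy Bridge core:        ", sandy_bridge_core_flops_per_cycle())
```

Under those assumptions, one module with FMAC code matches one Sandy Bridge core, and the same module falls to half that rate on code that issues separate adds and multiplies.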

For a discussion of the Bulldozer microarchitecture in much more depth, let me point you to David Kanter’s excellent piece on the subject, from which I’ve stolen small bits of info here and there.

Power management and Turbo Core

Now that we’ve spent entirely too much time on the FPU, let’s move on to power management, another topic too large to cover in the time and space we have today. Power efficiency has become critically important in modern processors, and any clean-sheet architecture like this one will include a zillion little pockets of logic conceived with power efficiency in mind.

The headliner here, though, is the use of power gates for each of the modules and a fifth power gate for the north bridge and L3 cache. Closing one of these gates shuts off power to the portion of the chip behind it, even leakage power. Intel has used power gates to good effect since Nehalem. AMD first used power gates in the Llano APU, where they are quite effective, but Bulldozer is its first high-end CPU to employ them.

Another feature makes a surprise return: separate clock domains for each of the modules, along with one for the north bridge. (The north bridge and L3 cache run at 2.2GHz in desktop parts and 2-2.2GHz in Bulldozer-derived Opterons.) AMD first instituted separate clock domains per core in Barcelona, the original Phenom chip, but back-tracked in the Phenom II generation and used BIOS code to lock all four cores to a single clock—making the Phenom II operate much like Intel’s recent CPUs do. Turns out threads pinging around from one core to the next in the Windows scheduler sometimes led to performance issues, because threads would be reassigned to cores operating at low frequencies. AMD tells us it has returned to this approach for a simple reason, “because power is important.” Our sense is that Bulldozer should be better equipped to avoid problems on this front. The chip has a higher floor for clock speed (1.4GHz versus 800MHz in the Phenom II), improved latency for clock-speed ramps, and can probe the caches of other modules more quickly. AMD also seems to be banking on smart scheduling in future versions of Windows to accommodate the Bulldozer architecture, a subject we’ll discuss shortly.

First, though, we should talk about Bulldozer’s version of AMD’s Turbo Core dynamic clock scaling feature, which raises clock speed on all or part of the chip when there’s thermal headroom available to do so. As in other recent AMD CPUs, Turbo Core uses power estimates based on the chip’s internal activity monitoring to determine the extent of that thermal headroom. Bulldozer’s Turbo Core implementation is the most granular one yet, with three P-states possible. P2 is the base clock of the chip, the speed at which it’s guaranteed to run. P1 is an intermediate Turbo clock speed that can apply to all four modules, provided that they’re not too heavily loaded. The third state, P0, is an even higher Turbo clock that comes into use when only two modules are active. As before, Turbo Core seeks to run at the highest possible clock speed for the given conditions, and it dithers between the P-states in order to stay within the chip’s prescribed thermal envelope, or TDP.
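To make the three-state arrangement concrete, here is a rough sketch of how a P-state might be chosen. This is not AMD's algorithm: the function `pick_p_state`, the "half or fewer modules active" rule, and the simple power comparison are our own simplifications, and the real hardware dithers between states rather than choosing once. The clocks are the FX-8150's published base, Turbo, and peak Turbo speeds.

```python
FX_8150_CLOCKS_GHZ = {"P2": 3.6, "P1": 3.9, "P0": 4.2}

def pick_p_state(active_modules, estimated_power_w, tdp_w=125.0, total_modules=4):
    """Hypothetical selection rule for illustration only."""
    if estimated_power_w >= tdp_w:
        return "P2"                      # no headroom: stay at the base clock
    if active_modules <= total_modules // 2:
        return "P0"                      # few modules busy: highest Turbo clock
    return "P1"                          # all modules busy, but headroom remains

for modules, watts in [(4, 130.0), (4, 100.0), (2, 90.0)]:
    state = pick_p_state(modules, watts)
    print(modules, "modules,", watts, "W ->", state, FX_8150_CLOCKS_GHZ[state], "GHz")
```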

Now that we have Turbo Core in the picture, we have the context to talk more about thread scheduling. Bulldozer’s unique architecture creates some intriguing questions about how software threads should be distributed across its cores. There are obvious advantages to scheduling one thread per module before doubling up threads on a single module: shared resources like the front end, L2 cache, and FPU will be dedicated to a lone thread, improving performance. However, scheduling two threads per module gets you several nice things, too, including the possibility of data sharing between related threads via the L2 cache. Power efficiency should improve if more inactive modules can be turned off, and Turbo Core can convert that power savings back into performance by raising the clock speed of the active module.
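The two placement strategies can be sketched in a few lines. The code below is a toy model, not how any Windows scheduler works: `assign_threads` and `active_modules` are hypothetical helpers, and each core is identified by a (module, core) pair on a four-module chip.

```python
def assign_threads(num_threads, policy, modules=4, cores_per_module=2):
    """Return the (module, core) slot given to each thread.

    'spread' uses one core in every module before doubling up;
    'pack' fills both cores of a module before moving to the next.
    """
    slots = [(m, c) for m in range(modules) for c in range(cores_per_module)]
    if policy == "spread":
        slots.sort(key=lambda slot: (slot[1], slot[0]))
    elif policy != "pack":
        raise ValueError("policy must be 'spread' or 'pack'")
    return slots[:num_threads]

def active_modules(assignment):
    """Count modules with at least one thread, i.e. modules that must stay powered."""
    return len({module for module, _ in assignment})

for policy in ("spread", "pack"):
    placement = assign_threads(4, policy)
    print(policy, placement, "active modules:", active_modules(placement))
```

With four threads, spreading keeps all four modules powered but gives each thread its own front end, L2 cache, and FPU. Packing leaves two modules idle, which is the case where power gating and the higher Turbo clock can come into play.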

Unfortunately, the Windows 7 scheduler wasn’t built with Bulldozer’s distinctive sharing arrangement in mind, and as far as we can tell, the BIOS doesn’t provide any hints to that OS about how to schedule threads. Win7 simply sees eight equal cores, with no preference between them. AMD claims Windows 8 will be better optimized for the Bulldozer architecture and cites improvements of 2-10% in several recent games with the Windows 8 developer preview. We haven’t been able to squeeze too many details out of AMD about how complex Win8’s understanding of Bulldozer scheduling will be, but we get the sense that the OS may attempt to schedule related threads on the same module when possible. We need to play with the Win8 developer preview on a Bulldozer system in order to learn more.

The chip: Orochi

Like mythical heroes in fantasy novels, modern CPUs are known by many names. We’ve been talking about Bulldozer almost exclusively up to this point, but that code name actually applies to the CPU cores and the microarchitecture inside of them—or something like that. These names are powerful symbols and are often multi-valent. (Yikes, religion major mode OFF. Sorry.) The proper code name for the silicon die that implements the Bulldozer architecture is “Orochi,” and Orochi will be deployed in multiple ways, each with its own name. On the desktop, it’s called “Zambezi.” In 1-2P servers, a single Orochi die will be called “Valencia,” and in 1-4P servers, two dies placed together in a package will be called “Interlagos.” I liked it better when a single name, like K7, could refer to the whole caboodle, before the marketing guys got into the code-name business, but I suppose that horse left the barn long ago.

Whatever you call it, this chip is AMD’s second attempt at a CPU fabricated on GlobalFoundries’ 32-nm process, with high-k metal gates and a silicon-on-insulator substrate. The unnecessarily overpopulated table below shows how Orochi compares to a range of other desktop processors from Intel and AMD.

| Code name | Key products | Cores | Threads | Last-level cache size | Process node (nm) | Estimated transistors (millions) | Die area (mm²) |
|---|---|---|---|---|---|---|---|
| Bloomfield | Core i7 | 4 | 8 | 8 MB | 45 | 731 | 263 |
| Lynnfield | Core i5, i7 | 4 | 8 | 8 MB | 45 | 774 | 296 |
| Westmere | Core i3, i5 | 2 | 4 | 4 MB | 32 | 383 | 81 |
| Gulftown | Core i7-980X | 6 | 12 | 12 MB | 32 | 1168 | 248 |
| Sandy Bridge | Core i5, i7 | 4 | 8 | 8 MB | 32 | 995 | 216 |
| Sandy Bridge | Core i3, i5 | 2 | 4 | 4 MB | 32 | 624 | 149 |
| Sandy Bridge | Pentium | 2 | 4 | 3 MB | 32 | – | 131 |
| Deneb | Phenom II | 4 | 4 | 6 MB | 45 | 758 | 258 |
| Propus/Rana | Athlon II X4/X3 | 4 | 4 | 512 KB x 4 | 45 | 300 | 169 |
| Regor | Athlon II X2 | 2 | 2 | 1 MB x 2 | 45 | 234 | 118 |
| Thuban | Phenom II X6 | 6 | 6 | 6 MB | 45 | 904 | 346 |
| Llano | A8, A6, A4 | 4 | 4 | 1 MB x 4 | 32 | 1450 | 228 |
| Llano | A4 | 2 | 2 | 1 MB x 2 | 32 | 758 | – |
| Orochi/Zambezi | FX | 8 | 8 | 8 MB | 32 | 1200 | 315 |

With roughly 1.2 billion transistors and a die area of 315 mm², Orochi is a very big and complex chip. Sandy Bridge, which has four cores and integrated graphics, is about 100 mm² smaller. Still, Orochi isn’t quite as large as the chip it succeeds, the “Thuban” Phenom II X6, so that’s progress of a sort.

The FX-series processors

Now that we’ve explored Bulldozer’s many code names, we should take a look at the names of the products that, you know, actual people will buy. AMD is introducing a trio of Bulldozer-based products today, and we have their vitals in the table below.

| Model | Cores | Base core clock speed | Turbo clock speed | Peak Turbo clock speed | L3 cache size | Memory channels | TDP | Price |
|---|---|---|---|---|---|---|---|---|
| FX-6100 | 6 | 3.3 GHz | 3.6 GHz | 3.9 GHz | 6 MB | 2 | 95 W | $165 |
| FX-8120 | 8 | 3.1 GHz | 3.4 GHz | 4.0 GHz | 8 MB | 2 | 125 W | $205 |
| FX-8150 | 8 | 3.6 GHz | 3.9 GHz | 4.2 GHz | 8 MB | 2 | 125 W | $245 |

As you can see, the clock speeds involved aren’t stratospheric, but the FX-8150’s peak of 4.2GHz is a fair bit higher than anything else offered by AMD or Intel these days. These products are targeted directly opposite Intel’s Sandy Bridge parts, so let’s have a look at the competition’s lineup for comparison.

| Model | Cores | Threads | Base core clock speed | Peak Turbo clock speed | L3 cache size | Memory channels | TDP | Price |
|---|---|---|---|---|---|---|---|---|
| Core i3-2100 | 2 | 2 | 3.1 GHz | – | 3 MB | 2 | 65 W | $117 |
| Core i5-2320 | 4 | 4 | 3.0 GHz | 3.3 GHz | 6 MB | 2 | 95 W | $177 |
| Core i5-2400 | 4 | 4 | 3.1 GHz | 3.4 GHz | 6 MB | 2 | 95 W | $184 |
| Core i5-2500 | 4 | 4 | 3.3 GHz | 3.7 GHz | 6 MB | 2 | 95 W | $205 |
| Core i5-2500K | 4 | 4 | 3.3 GHz | 3.7 GHz | 6 MB | 2 | 95 W | $216 |
| Core i7-2600K | 4 | 8 | 3.4 GHz | 3.8 GHz | 8 MB | 2 | 95 W | $317 |

We’ve taken this table almost without modification from our original Sandy Bridge review early this year. Intel hasn’t lowered prices on its key products once since then. The only real changes have been the additions of models to fill gaps in the original lineup, such as the Core i5-2320 at $177. We’ve not listed every single Sandy Bridge model above, since there are so very many. We think the ones included are the most relevant for our purposes today.

You’ll notice several things about these competing lineups right away. For one, AMD has made no attempt to go after the highest-end Sandy Bridge part, the Core i7-2600K, with a Bulldozer-based offering. We expect AMD would have liked to compete at that level, but doing so apparently wasn’t feasible at present. Similarly, AMD hasn’t attempted to take on Intel’s high-end Core i7-900-series processors. Also, notice that the top two Bulldozer-derived models are rated for 125W power envelopes, while the fastest Sandy Bridge chips have a TDP of 95W. Apparently, AMD needs the extra thermal headroom in order to compete on price and performance with the Intel products it has targeted.

With that said, the competitive match-ups are still reasonably straightforward. At $245, the FX-8150 is priced a bit above the Core i5-2500K, but the two are clearly rivals. The FX-8120 is in an interesting spot, taking on the Core i5-2500, the non-K-series version of that product, with a locked multiplier. Finally, the FX-6100 has one module disabled and a 95W TDP, and it’s priced opposite the Core i5-2320.

Unfortunately, we’ve only managed to get our hands on one of the three initial FX-series products today, the FX-8150. You’ll see a full set of results for it on the following pages, versus a host of other CPUs. We don’t have a real FX-8120, but we did attempt to simulate its clock speeds and performance using AMD’s Overdrive software. We’re only somewhat confident that we’ve managed to do so successfully, but we have provisionally included some performance results for the FX-8120. Take ’em with a grain of salt, and we’ll attempt to replace them with results from the real product when we can.

One more thing. In order to sweeten the pot a little, AMD has decided to unlock the multipliers on all three models of FX-series processors. That should make overclocking relatively easy to do, and it should give AMD a leg up in cases where the FX chips aren’t competing with unlocked K-series parts from Intel.

The platform: Socket AM3+

As we’ve mentioned, Bulldozer-based CPUs should be compatible with AMD’s existing socket infrastructure. On the desktop, that’s AMD’s Socket AM3+ platform, which the company introduced back in May alongside its 9-series chipsets. FX-series processors have 942 pins, one more than older Socket AM3 CPUs, and that pin prevents them from dropping into anything but true Socket AM3+ motherboards. On the flip side, Socket AM3+ boards are capable of hosting older Socket AM3 processors like the Phenom II just fine.

If you have an existing Socket AM3+ system with an older CPU and would like to upgrade to an FX CPU, that should be possible after a quick BIOS update. The trick is that users may need to flash their BIOSes to add Bulldozer support using an older CPU before installing the new processor. That requirement generally shouldn’t be a big deal for would-be upgraders, but folks who are buying new motherboards for use with an FX processor will have to hope they receive a board with an FX-capable BIOS. Otherwise, unhappy times may ensue. On that subject, AMD tells us motherboards have been shipping with Bulldozer-ready BIOSes “for some time now,” and it expects any such problems with new mobos to be rare.

Folks who own a Socket AM3+ system with an older CPU and don’t plan to upgrade will want to be careful about BIOS upgrades, too. A major mobo maker told us recently that BIOS/EFI space for adding Bulldozer support is cramped, so some features aimed at older Athlon II and Phenom II processors, such as core unlockers, may have to be deleted from newer BIOSes in order to make room. Owners of those older CPUs may want to avoid the impulse to update to the latest firmware automatically. They may be better off sticking to an older version with a full feature set for Athlon II and Phenom II CPUs. As always, you’ll want to check with your motherboard maker for the final word on your board’s compatibility story.

Happily, dropping an FX-series processor into a Socket AM3+ motherboard will prompt an upgrade of sorts: the board’s two memory channels will then support the latest in DDR3 memory speeds, up to 1866MHz. Both Bulldozer and Llano officially support those higher memory frequencies. Intel’s K-series Sandy Bridge parts are capable of working with faster RAM, too, but memory speeds above 1333MHz aren’t officially blessed.

Our testing methods

We ran every test at least three times and reported the median of the scores produced.
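For readers who want to apply the same reduction to their own results, this is all it takes. The scores below are invented placeholders, not measurements from this review.

```python
from statistics import median

# Hypothetical scores from three runs of one benchmark on one CPU.
runs = [71.2, 75.0, 74.1]

reported = median(runs)
print("reported score:", reported)
```

The median is less sensitive than the mean to a single stray run, such as one disturbed by a background task.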

The test systems were configured like so:

| Processor | Athlon II X3 455 3.3GHz; Phenom II X2 565 3.4GHz; Phenom II X4 840 3.2GHz; Phenom II X4 975 3.6GHz; Phenom II X4 980 3.7GHz; Phenom II X6 1075T 3.0GHz; Phenom II X6 1100T 3.3GHz | Pentium Extreme Edition 840 3.2GHz; Core 2 Duo E6400 2.13GHz; Core 2 Quad Q9400 2.67GHz | Pentium G6950 2.8GHz; Core i3-560 3.33 GHz; Core i5-655K 3.2GHz; Core i5-760 2.8GHz; Core i7-875K 2.93GHz | AMD FX-8120 3.1GHz; AMD FX-8150 3.6GHz | Core i7-990X 3.46 GHz |
|---|---|---|---|---|---|
| Motherboard | Gigabyte 890GPA-UD3H | Asus P5E3 Premium | Asus P7P55D-E Pro | Asus Crosshair V Formula | Intel DX58SO2 |
| North bridge | 890GX | X48 | P55 | 990FX | X58 |
| South bridge | SB850 | ICH9R | – | SB850 | ICH10R |
| Memory size | 8GB (4 DIMMs) | 8GB (4 DIMMs) | 8GB (4 DIMMs) | 8GB (2 DIMMs) | 12GB (6 DIMMs) |
| Memory type | Corsair CMD8GX3M4A1333C7 DDR3 SDRAM | Corsair CMD8GX3M4A1600C8 DDR3 SDRAM | Corsair CMD8GX3M4A1600C8 DDR3 SDRAM | Corsair CMZ8GX3M2A1866C9 DDR3 SDRAM | Corsair CMP12GX3M6A1600C8 DDR3 SDRAM |
| Memory speed | 1333 MHz | 800 MHz (Pentium Extreme Edition 840); 1066 MHz (Core 2 Duo E6400); 1333 MHz (Core 2 Quad Q9400) | 1066 MHz (Pentium G6950); 1333 MHz (Core i3-560, i5-655K, i5-760, i7-875K) | 1866 MHz | 1333 MHz |
| Memory timings | 8-8-8-20 2T | 7-7-7-20 2T (Pentium Extreme Edition 840); 7-7-7-20 2T (Core 2 Duo E6400); 8-8-8-20 2T (Core 2 Quad Q9400) | 7-7-7-20 2T (Pentium G6950); 8-8-8-20 2T (Core i3-560, i5-655K, i5-760, i7-875K) | 9-10-0-27 2T | 8-8-8-20 2T |
| Chipset drivers | AMD AHCI 1.2.1.263 | INF update 9.1.1.1025; Rapid Storage Technology 9.6.0.1014 | INF update 9.1.1.1025; Rapid Storage Technology 9.6.0.1014 | AMD AHCI 1.2.1.301 | INF update 9.1.1.1020; Rapid Storage Technology 9.5.0.1037 |
| Audio | Integrated SB850/ALC892 with Realtek 6.0.1.6235 drivers | Integrated ICH9R/AD1988B with Microsoft drivers | Integrated P55/RTL8111B with Realtek 6.0.1.6235 drivers | Integrated SB850/ALC889 with Realtek 6.0.1.6235 drivers | Integrated ICH10R/ALC892 with Realtek 6.0.1.6235 drivers |

| Processor | Core i7-950 3.06 GHz; Core i7-970 3.2 GHz; Core i7-980X Extreme 3.3 GHz | Core i3-2100 2.93 GHz; Core i5-2400 3.1 GHz; Core i5-2500K 3.3 GHz; Core i7-2600K 3.4 GHz | AMD A8-3800 2.4GHz; AMD A8-3850 2.9 GHz | Atom D525 1.8 GHz | AMD E-350 1.6GHz |
|---|---|---|---|---|---|
| Motherboard | Gigabyte X58A-UD5 | Asus P8P67 Deluxe | Gigabyte A75M-UD2H | Jetway NC94FL-525-LF | MSI E350IA-E45 |
| North bridge | X58 | P67 | A75 | NM10 | Hudson M1 |
| South bridge | ICH10R | – | – | – | – |
| Memory size | 12GB (6 DIMMs) | 8GB (4 DIMMs) | 8GB (4 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) |
| Memory type | Corsair CMP12GX3M6A1600C8 DDR3 SDRAM | Corsair CMD8GX3M4A1600C8 DDR3 SDRAM | Corsair CMD8GX3M4A1600C8 DDR3 SDRAM | Corsair CM2X2048-8500C5D DDR2 SDRAM | Corsair CMD8GX3M4A1333C7 DDR3 SDRAM |
| Memory speed | 1333 MHz | 1333 MHz | 1333 MHz | 800 MHz | 1066 MHz |
| Memory timings | 8-8-8-20 2T | 8-8-8-20 2T | 8-8-8-20 2T | 5-5-5-18 2T | 7-7-7-20 2T |
| Chipset drivers | INF update 9.1.1.1020; Rapid Storage Technology 9.5.0.1037 | INF update 9.2.0.1016; Rapid Storage Technology 10.0.0.1046 | AMD AHCI 1.2.1.296; AMD USB 3.0 1.0.0.52 | INF update 9.1.1.1020; Rapid Storage Technology 9.5.0.1037 | AMD AHCI 1.2.1.275 |
| Audio | Integrated ICH10R/ALC889 with Realtek 6.0.1.6235 drivers | Integrated P67/ALC889 with Microsoft drivers | Integrated A75 FCH/ALC889 with Realtek 6.0.1.6235 drivers | Integrated NM10/ALC662 with Realtek 6.0.1.6235 drivers | Integrated Hudson M1/ALC887 with Realtek 6.0.1.6235 drivers |

They all shared the following common elements:

Hard drive: Corsair Nova V128 SATA SSD
Discrete graphics: Asus ENGTX460 TOP 1GB (GeForce GTX 460) with ForceWare 260.99 drivers
OS: Windows 7 Ultimate x64 Edition
Power supply: PC Power & Cooling Silencer 610 Watt

Thanks to Asus, Corsair, Gigabyte, and OCZ for helping to outfit our test rigs with some of the finest hardware available. Thanks to Intel and AMD for providing the processors, as well, of course.

The test systems’ Windows desktops were set at 1900×1200 in 32-bit color. Vertical refresh sync (vsync) was disabled in the graphics driver control panel.

We used the following versions of our test applications:

Some further notes on our testing methods:

Many of our performance tests are scripted and repeatable, but for some of the games, including Battlefield: Bad Company 2, we used the Fraps utility to record frame rates while playing a 60-second sequence from the game. Although capturing frame rates while playing isn’t precisely repeatable, we tried to make each run as similar as possible to all of the others. We raised our sample size, testing each Fraps sequence five times per video card, in order to counteract any variability. We’ve included second-by-second frame rate results from Fraps for those games, and in that case, you’re seeing the results from a single, representative pass through the test sequence.

We used a Yokogawa WT210 digital power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

After consulting with our readers, we’ve decided to enable Windows’ “Balanced” power profile for the bulk of our desktop processor tests, which means power-saving features like SpeedStep and Cool’n’Quiet are operating. (In the past, we only enabled these features for power consumption testing.) Our spot checks demonstrated to us that, typically, there’s no performance penalty for enabling these features on today’s CPUs. If there is a real-world penalty to enabling these features, well, we think that’s worthy of inclusion in our measurements, since the vast majority of desktop processors these days will spend their lives with these features enabled. We did disable these power management features to measure cache latencies, but otherwise, it was unnecessary to do so.

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

Our first few tests are synthetic benchmarks that let us inspect the performance of the cache and memory subsystems.

This test is nicely multithreaded, so the caches from all available cores contribute to the throughput measured. You may be surprised to see that the Phenom II X6 1100T achieves higher bandwidth than the FX-8150 at the smaller block sizes, but remember it has six L1 caches where the FX has four. The more apt comparison may be the Phenom II X4 980, with four cores and a 3.7GHz clock frequency. The FX’s L1 caches will cover block sizes up to 64KB, and the FX-8150 is faster than the Phenom II X4 980 at each step from 2KB to 64KB. Then again, with only four cores, Sandy Bridge’s L1 caches are faster still.

The 256KB to 1MB block sizes are L2 cache territory, and the FX’s L2 caches don’t look to be especially fast, either, though they do largely outperform the Phenom II X4 980’s. Bulldozer’s L2 caches may lack for speed, but they’re large. At the 4MB data point, the rest of the CPUs are into their L3 caches. The FX is still in its L2 coverage area. The next step up in block size is 16MB, which is right at the outer edge of the FX’s effective total cache capacity, since its 8MB of L3 cache doesn’t replicate the contents of its 8MB of L2 cache. The FX-8150 again delivers the highest throughput at the 16MB block size, but not by much.

Some of the credit for the FX-8150’s strong showing here no doubt goes to its use of 1866MHz DIMMs. However, we’ve tried 1866MHz memory on the older CPU cores in Llano, and our Stream results topped out at around 15GB/s. Bulldozer’s smart data prefetchers and large L2 caches deserve credit for taking good advantage of the available memory bandwidth.

Measuring memory access latencies has gotten to be tricky with the advent of Turbo-style clock speed ramping, because latencies are reported in the number of CPU cycles. Nevertheless, we’ve chosen to report access latencies with the caveat that our guesses about likely frequencies for these CPUs may be incorrect.

If we’re right, the FX comes out looking pretty good, with access latencies comparable to competing Sandy Bridge parts, despite its larger caches. Again, the use of 1866MHz memory may be helping the FX here.

For what it’s worth, our tool reports Bulldozer’s L1 data cache latency at 3 cycles, L2 at 18 cycles, and L3 at 65 cycles.
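Because the tool reports cycles, turning those numbers into wall-clock time means guessing at the clock speed. The sketch below converts the reported cycle counts to nanoseconds at two candidate frequencies for the FX-8150, its 3.6GHz base clock and its 4.2GHz peak Turbo clock. Which frequency actually applied during the test is an assumption, which is exactly the caveat noted above.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in CPU cycles to nanoseconds at a given clock speed."""
    return cycles / clock_ghz


# Cycle counts as reported by our tool for Bulldozer.
reported_cycles = {"L1": 3, "L2": 18, "L3": 65}

# Candidate clocks: base and peak Turbo. The real frequency during the
# latency test is unknown, so both are shown.
for clock_ghz in (3.6, 4.2):
    for level, cycles in reported_cycles.items():
        print(f"{level} at {clock_ghz}GHz: {cycles_to_ns(cycles, clock_ghz):.2f} ns")
```

The spread between the two clocks is roughly 17%, which gives a feel for how much a wrong frequency guess can skew a latency comparison.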

Battlefield: Bad Company 2

After a promising start in our synthetic memory tests, we finally get our first look at the FX-8150’s real-world performance—and it’s a bit underwhelming. The FX-8150 is no faster than the Phenom II X4 980, and it’s slower than quite a few Intel processors, even somewhat older models.

Of course, a result like this one should come with a reminder: a frame rate of 75 FPS, with a low of 58 FPS, means the FX-8150 is easily competent to run this game smoothly—at least, as far as we can tell when measuring in frames per second. We have some promising new testing methods we may bring to bear on CPU performance soon, so stay tuned.

Civilization V

The developers of Civ V have cooked up a number of interesting benchmarks, two of which we used here. The first one tests a late-game scenario where the map is richly populated and there’s lots happening at once. As you can see by the setting screen below, we didn’t skimp on the image quality settings for graphics, either. Doing so wasn’t necessary to tease out clear differences between the CPUs.

Civ V also runs the same test without updating the screen, so we can eliminate any overhead or bottlenecks introduced by the video card and its driver software. Removing those things from the equation reshuffles the order slightly.

In both cases, the FX-8150 slots in just ahead of the Phenom II X6 1100T and just behind an aging Intel CPU, the Core i5-760. All of the Sandy Bridge-based CPUs are faster, including the dual-core Core i3-2100.

The next test populates the screen with a large number of units and animates them all in parallel. It can also run in “no render” mode without updating the screen.

This test is clearly multithreaded—it’s much faster in “no render” mode on the Athlon II X3 455 than on the Phenom II X2 565, for instance, and the 12-threaded Core i7-900-series CPUs capture the top three spots. Still, the FX-8150 and its eight cores end up near the middle of the pack. When the screen is being rendered, a number of Phenom II X6 and X4 models are slightly faster than the FX-8150.

F1 2010

CodeMasters has done a nice job of building benchmarks into its recent games, and F1 2010 is no exception. We scripted up test runs at three different display resolutions, with some very high visual quality settings, to get a sense of how much difference a CPU might make in a real-world gaming scenario where GPU bottlenecks can come into play.

We also went to some lengths to fiddle with the game’s multithreaded CPU support in order to get it to make the most of each CPU type. That effort eventually involved grabbing a couple of updated config files posted on the CodeMasters forum, one from the developers and another from a user, to get an optimal threading map for the Phenom II X6. What you see below should be the best possible performance out of each processor.

The results at the two higher resolutions underscore a dynamic that AMD and Nvidia have been banging on about: if your GPU is the primary performance limiter, a CPU upgrade may not do you much good. That’s a noteworthy practical point, but we’ll still want to focus on the lower-resolution results in order to compare CPU performance.

At 1280×800, both the FX-8150 and our “pretend” FX-8120 are faster than any other processor AMD has fielded to date, and the FX-8150 isn’t far from hitting that GPU wall at around 65 FPS.

Metro 2033

Metro 2033 also offers a nicely scriptable benchmark, and we took advantage by testing at several different combinations of resolution and visual quality.

Again, the lower-resolution tests let us see the impact of CPU performance more clearly. The FX-8150 doesn’t look half bad in those tests, either, delivering slightly higher frame rates than the Phenom II X6 1100T and mixing it up with the Core i5-2400.

Source engine particle simulation

Next up is a test we picked up during a visit to Valve Software, the developers of the Half-Life games. They had been working to incorporate support for multi-core processors into their Source game engine, and they cooked up some benchmarks to demonstrate the benefits of multithreading.

This test runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

Honestly, I expected FX processors to perform well in this test. This program is widely multithreaded, and Intel CPUs seem to benefit greatly from Hyper-Threading when running it. Seems a likely target for Bulldozer, no? However, the FX-8150 once more trails the Phenom II X6 1100T, which itself is well behind the FX-8150’s ostensible competitor, the Core i5-2500K.

Productivity

SunSpider JavaScript performance

Several of AMD’s older Phenom II processors outperform the FX processors in this test, as do over half of the CPUs we’ve tested. The FX-8150 is only slightly quicker than the lowly Athlon II X3 455.

7-Zip file compression and decompression

Here’s a nice bright spot where the FX-8150 runs with and even defeats the top Sandy Bridge, the Core i7-2600K. AMD’s older processors also perform fairly well in this test, interestingly enough.

TrueCrypt disk encryption

This full-disk encryption suite includes a performance test, for obvious reasons. We tested with a 500MB buffer size and, because the benchmark spits out a lot of data, averaged and summarized the results in a couple of different ways.

TrueCrypt has added support for Intel’s custom-tailored AES-NI instructions since we last visited it, so the encoding of the AES algorithm, in particular, should be very fast on the CPUs that support those instructions. Those CPUs include the six-core Gulftowns, the dual-core Clarkdales, Sandy Bridge, and of course Bulldozer.

Bulldozer’s support of AES-related instructions pays off nicely for it in TrueCrypt. At these encryption rates, few storage devices will be able to keep up.

Image processing

The Panorama Factory photo stitching

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs.

In the past, we’ve added up the time taken by all of the different elements of the panorama creation wizard and reported that number, along with detailed results for each operation. However, doing so is incredibly data-input-intensive, and the process tends to be dominated by a single, long operation: the stitch. Thus, we’ve simply decided to report the stitch time, which saves us a lot of work and still gets at the heart of the matter.

Image stitching would seem to be a natural fit for Bulldozer’s eight integer cores and high memory throughput, and the FX processors do improve on the performance of the Phenom II X6. Still, the FX-8150 can’t catch the Core i5-2400, which is a $184 product. The FX-8150 costs $245.

picCOLOR image processing and analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including SSE extensions, multiple cores, and Hyper-Threading. Many of its individual functions are multithreaded.

At our request, Dr. Müller graciously agreed to re-tool his picCOLOR benchmark to incorporate some real-world usage scenarios. As a result, we now have four tests that employ picCOLOR for image analysis: particle image velocimetry, real-time object tracking, a bar-code search, and label recognition and rotation.

The FX-8150 delivers measurable gains over the Phenom II X6 1100T in picCOLOR’s real-world tests, but it’s still not as quick as the Core i5-2500K.

picCOLOR also includes some synthetic tests of common image processing functions, and the FX-8150 proves to be substantially faster than older AMD CPUs in these tests. Again, though, the slowest Sandy Bridge quad-core is faster.

Video encoding

x264 HD benchmark

This benchmark tests one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark.

Another ray of light for Bulldozer here, in the more multithreaded second pass of the x264 encoding process. The FX-8150 matches the pricier Core i7-2600K—and clearly outruns the i5-2500K. Too bad about that first pass, which is also part of the overall picture.

Windows Live Movie Maker 14 video encoding

For this test, we used Windows Live Movie Maker to transcode a 30-minute TV show, recorded in 720p .wtv format on my Windows 7 Media Center system, into a 320×240 WMV-format video appropriate for mobile devices.

Live Movie Maker combines both passes into a single encoding time metric, and it places the FX-8150 just a few seconds ahead of the Phenom II X4 980.

3D modeling and rendering

Cinebench rendering

The Cinebench benchmark is based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.

Here’s another one of those tests where the FX-8150 improves on the Phenom II X6, but then the X6’s performance was fairly strong compared to the Core i5-2500K already. The most interesting result, in my view, is the FX-8150’s fairly low score of 1.03 with a single thread, a little less than the score of the Phenom II X6 1100T, which runs at a lower clock frequency. Bulldozer hasn’t entirely held the line on instructions per clock. Also, the Core i5-2500K is nearly 50% faster than the FX-8150 with a single thread. Bulldozer simply makes up the difference with more threads—at least, in this case it does.

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support.

Another solid showing from the FX-8150 in the “chess2” test, which is more widely multithreaded than POV-Ray’s benchmark scene.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into games like Half-Life 2.

The FX slips a little here, falling behind the Phenom II X6 yet again.

Scientific computing

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database. MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
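The job-queue scheme David describes can be sketched in a few lines. This is not MyriMatch's code, just a minimal illustration of the idea: the helper names (split_into_jobs, run_jobs, handle) are hypothetical, the "proteins" are placeholder integers, and Python threads stand in for native threads to show the scheduling pattern only, not real parallel speedup.

```python
import queue
import threading


def split_into_jobs(items, num_threads, jobs_per_thread=10):
    """Cut the database into roughly num_threads * jobs_per_thread chunks."""
    num_jobs = num_threads * jobs_per_thread
    size = -(-len(items) // num_jobs)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_jobs(jobs, num_threads, handle):
    """Each worker pulls the next job from a shared queue until none remain."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread is done
            out = [handle(item) for item in job]
            with results_lock:  # one synchronization point per job, not per item
                results.extend(out)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


proteins = list(range(6714))  # placeholder stand-ins for protein sequences
jobs = split_into_jobs(proteins, num_threads=4)
processed = run_jobs(jobs, num_threads=4, handle=lambda p: p)
print(len(jobs), "jobs; first job holds", len(jobs[0]), "sequences")
```

Keeping jobs fairly small relative to the thread count means a thread that finishes early can grab more work instead of idling, while synchronizing once per job keeps locking overhead low.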

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.
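The per-spectrum mutex David mentions might look something like the following sketch. The Spectrum class, its fields, and the scoring function are hypothetical and exist only to illustrate the locking pattern; MyriMatch's actual data structures are not described in the text above.

```python
import threading


class Spectrum:
    """A shared spectrum that keeps the best-scoring peptide seen so far."""

    def __init__(self, spectrum_id):
        self.spectrum_id = spectrum_id
        self.best_score = float("-inf")
        self.best_peptide = None
        self.lock = threading.Lock()  # one mutex per spectrum

    def offer(self, peptide, score):
        # Only threads touching this particular spectrum contend for the lock.
        with self.lock:
            if score > self.best_score:
                self.best_score = score
                self.best_peptide = peptide


def toy_score(peptide):
    """Hypothetical stand-in for a real peptide-to-spectrum scoring function."""
    return int(peptide[3:]) % 97


def compare_all(spectrum, peptides):
    for peptide in peptides:
        spectrum.offer(peptide, toy_score(peptide))


spectrum = Spectrum(spectrum_id=0)
peptides = [f"PEP{i}" for i in range(1000)]

# Four threads compare different peptides against the same spectrum.
threads = [
    threading.Thread(target=compare_all, args=(spectrum, peptides[k::4]))
    for k in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("best score:", spectrum.best_score)
```

With one lock per spectrum rather than one global lock, threads collide only when they happen to hit the same spectrum at the same moment, which matches the "sometimes" in David's description.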

Here’s how the processors performed.

This is another case where I definitely expected the FX-8150 to improve on the performance of the Phenom II X6. Alas, it wasn’t to be.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but they’re oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with optimal thread counts for each processor.

The FX-8150’s gains over the Phenom II X6 here could easily be explained by the move to faster memory alone. Unfortunately for AMD, there’s a whole range of Intel CPUs from various generations that outperform the FX, even with slower DRAM.

Power consumption and efficiency

We used a Yokogawa WT210 digital power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

Please note that, although we tested a range of AMD processors, only the FX-8150 and the Phenom II X6 (the results marked “990FX”) were tested on the same motherboard. The others were tested on an 890GX-based board from Gigabyte whose power consumption characteristics differ. Oh, and we tested the FX-8150 with four DIMMs here, since that’s the config all of the other dual-channel systems shared, and it only seemed fair to match the DIMM count for power testing. Fortunately, the move to lower memory clocks didn’t impact rendering completion times.

We’ll start with the show-your-work stuff, plots of the raw power consumption readings.

We can slice up these raw data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render. Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, when the processors were rendering.
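As a rough illustration of that slicing, the sketch below averages a power trace over two windows. The trace is made up (one reading per second, 80W at idle and 200W under load), and the ten-second length of the trailing idle window is our assumption; only the 15-to-25-second peak window comes from the description above.

```python
def mean_power(samples, start_s, end_s):
    """Average the wattage of all samples with start_s <= time < end_s."""
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples in the requested window")
    return sum(window) / len(window)


# Made-up trace of (seconds, watts): idle, then a render, then idle again.
samples = [(t, 200.0 if 10 <= t < 40 else 80.0) for t in range(60)]

# Peak: the ten-second span from 15 to 25 seconds into the test period.
peak_w = mean_power(samples, 15, 25)

# Idle: the trailing edge of the test period (last ten readings, assumed).
last_t = samples[-1][0]
idle_w = mean_power(samples, last_t - 9, last_t + 1)

print(f"peak: {peak_w:.0f}W, idle: {idle_w:.0f}W")
```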

At idle, the 990FX-based system with a Phenom II X6 processor draws 94W at the wall socket. Simply as a result of dropping in an FX-8150 instead, idle power consumption plummets to 76W. That’s the impact of Bulldozer’s ability to gate off power to its idle cores and north bridge. Also, notice that the Phenom II X6 draws even less power at idle on the 890GX board: 82W. Our 990FX may be a bit of a power hog at idle. We’ll have to swap the FX-8150 into a different Socket AM3+ board soon and see if we can’t get it down into the mid-60-watt range, like the Sandy Bridge systems. Seems possible.

At peak, the FX-8150 system draws about 22W more than the same system equipped with a Phenom II X6 1100T. That’s not an entirely shocking result, even though the chips have the same TDP rating. Bulldozer’s improved Turbo Core is more effective at wringing every last ounce out of a given thermal envelope, and that translates into higher measured power draw at peak. With a more modest 95W thermal envelope, the Sandy Bridge-based competition delivers its superior performance at much lower total system power levels.

We can highlight power efficiency by looking at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules. (In this case, to keep things manageable, we’re using kilojoules.) Note that since we had to expand the duration of the test periods for the Pentium EE 840 and Core 2 Duo E6400, we’re including data from a longer period of time for those two.

We can pinpoint efficiency more effectively by considering the amount of energy used for the task alone. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
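The two energy figures boil down to integrating power over time. Here is a minimal sketch using trapezoidal integration over a made-up trace; the sample values and the 30-second render duration are invented for illustration and are not our measured data.

```python
def energy_kj(samples, start_s=None, end_s=None):
    """Integrate (seconds, watts) samples into kilojoules over a time window."""
    points = [
        (t, watts)
        for t, watts in samples
        if (start_s is None or t >= start_s) and (end_s is None or t <= end_s)
    ]
    # Trapezoid rule: average the two neighboring readings across each interval.
    joules = sum(
        (t1 - t0) * (w0 + w1) / 2
        for (t0, w0), (t1, w1) in zip(points, points[1:])
    )
    return joules / 1000


# Made-up trace: a 200W render through the 30-second mark, then 80W idle.
samples = [(t, 200.0 if t <= 30 else 80.0) for t in range(61)]

total_kj = energy_kj(samples)                    # render plus idle time
task_kj = energy_kj(samples, start_s=0, end_s=30)  # render period only

print(f"total: {total_kj:.2f} kJ, task: {task_kj:.2f} kJ")
```

A system that draws more power but finishes sooner can still come out ahead on the task-energy number, which is why the render period is isolated per system.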

Even though it ostensibly benefits from finer process technology, the FX-8150 actually uses slightly more energy than the Phenom II X6 1100T does in rendering this scene. That’s not the sort of progress we had hoped to see.

AVX performance

None of the benchmarks you’ve seen on the preceding pages make use of AVX instructions, with the exception of the AES subset used in TrueCrypt. At this point in time, finding applications or benchmarks that make use of AVX isn’t easy. Fortunately, I was able to find several, and at least one program looks to be reasonably well optimized for Bulldozer: AIDA64 from FinalWire. AIDA64 includes several small, synthetic tests that can be accelerated with Bulldozer’s new instructions.

In order to measure the impact of those new instructions on performance, I tested both the FX-8150 and the Core i5-2500K in two configurations: with and without Windows 7 SP1 installed. Service Pack 1 is required for AVX support, so testing without it has the impact of disabling AVX. The results marked “No AVX” below are those without SP1.

The first test, CPU Hash, uses AMD’s XOP instructions on Bulldozer. The next two, FPU Julia and FPU Mandel, make use of FMA4. Tamas Miklos of FinalWire, maker of AIDA64, tells us these benchmarks were developed using pre-release Bulldozer systems. He further says:

Our code is 100% optimized for Bulldozer. We don’t see much room for improvement. We’ve had a chance to talk to AMD, and we’ve explained (in details) how our benchmarks work, and what tricks we use on Bulldozer. They seemed to be content about what we do and how we do on Bulldozer, and they didn’t tell us any hints on possible improvements.

So we should have a reasonable opportunity to see Bulldozer’s full potential with the latest instructions.

Here’s another one of those occasional instances where the Phenom II X6 was already faster than Sandy Bridge, and again, the FX-8150 is a little faster still. Looks like there’s a roughly 10% gain with AVX/XOP instructions in use.

The two tests above make use of FMA4, and these really aren’t the sort of results we were anticipating. In both cases, the Phenom II X6 1100T is faster than the FX-8150 with AVX and FMA4 enabled. Hrmph.

We do have another round of AVX tests, from the latest version of SiSoft’s Sandra. We don’t have any word on how well these tests are optimized for Bulldozer or whether they use XOP and FMA4. They do appear to make use of AVX on the FX-8150, though.

Although these quick benchmarks are labeled “Multimedia” in Sandra, in truth they’re simply fractal computations like the AIDA64 Julia and Mandel tests.

Well, Bulldozer looks great in the integer test, but the FPU results don’t look much different than they did in AIDA64. At least the FX-8150 is faster than the Phenom II X6, I guess.

We’re disheartened by these results, but AVX is still early in its life, so we’re hesitant to draw any definitive lessons from them. Late in the review process, AMD did supply us with custom-built versions of the x264 video encoder that use XOP and FMA4. We’ll have to try those out soon.

Overclocking

Given that this is a “speed demon” architecture and that AMD managed a Guinness world record frequency of 8.4GHz using an FX-8150 cooled with liquid helium, we were really looking forward to doing a little overclocking with our copy of the FX-8150. We didn’t have any dangerous liquids in play, but we did have a fairly beefy tower cooler, at least. We set our mobo to use its most aggressive fan speed profiles and fired up AMD’s Overdrive software, which makes dialing up new speeds on an unlocked CPU like this one a snap, to see what we could do.

Our starting point was the stock operation of the chip. Our FX-8150 runs at 3.6GHz and 1.2625V by default. When Turbo Core kicks in, the CPU ranges up to 1.4V and 4.2GHz. We figured we’d begin at just 200MHz beyond that top Turbo speed, 4.4GHz, at 1.4V. Seems like an easy first step, right? When we fired off Overdrive’s CPU stability test, however, it quickly came back with an error. We had to raise the voltage to 1.425V in order to get the chip to pass just three minutes in that stability test. The rest of our overclocking work log looked like so:

4.6GHz, 1.425V – BSOD
4.6GHz, 1.4375V – Error
4.6GHz, 1.45V – AOD crash
4.6GHz, 1.4625V – AOD crash
4.6GHz, 1.4625V – AOD crash
4.5GHz, 1.4375V – AOD crash
4.5GHz, 1.45V – Error
4.5GHz, 1.4625V – AOD crash

Yep, 4.4GHz was about it. Perhaps we were a little timid, but raising the voltage beyond 1.465V on a brand-new, pre-release 32-nm processor felt like asking for trouble to us, especially with just air cooling. Then again, the crashes and errors we were seeing came quickly, well before the chip had a chance to heat up beyond the capacity of our cooler. We’d almost surely run up against a frequency limitation in the chip, and that’s unusual. CPUs these days tend to be primarily thermally constrained, and you can usually push them a fair ways past their default frequencies before running into stability issues.

Worried that we weren’t reaching our chip’s full potential, we pinged AMD PR on the matter, who pointed us to a section in the reviewer’s guide (a document we shamelessly ignore after extracting any useful info) that suggests 4.5GHz is a reasonable expectation for FX-8150 overclocking with an air cooler. We also discovered, at the same time, that AMD had disabled three of the chip’s four modules during the Guinness World Record run. That’s not something we’d expect, you know, real users who care about performance to do.

The FX-8150’s performance does appear to scale well with clock frequency increases. Had AMD been able to ship a Bulldozer-based part at close to 4.4GHz, the story told on the preceding pages might have had a different feel to it.