AMD’s Trinity chip is making a debut, but it’s not exactly a fresh face. We reviewed the mobile version of Trinity back in May and had mostly positive things to say about it. The second generation of AMD’s do-everything, converged APU offered solid progress over the first-generation “Llano” chip on many fronts. Not too long after Trinity’s mobile release, desktop versions of it started shipping exclusively in systems from large PC makers. Those wishing to build their own systems based on the chip, or to buy them from smaller PC vendors, had to wait. AMD took its time ushering this chip into broader sales channels, but the time is finally upon us. Trinity is now available as a retail product, as are motherboards based on the new Socket FM2 platform.

Trinitarian doctrine

Since Trinity is a known quantity, we won’t recount its architecture in great detail. You can read our review of the mobile version for that info. The basics are fairly straightforward, though. Trinity is, in many ways, a direct answer to Intel’s Ivy Bridge processors. The two CPUs incorporate many of the same functions, including things like PCI Express connectivity and graphics that were formerly delegated to support chips. The name of the game is integration, because integration saves power, reduces costs, and shrinks the footprint of a system. The latest PC processors are beginning to look very similar to the system-on-a-chip (or “SoC”) products that power smart phones and tablets.

Although Trinity is built on the same 32-nm SOI process technology as its predecessor, Llano, it offers architectural upgrades all around. The four older Phenom-era CPU cores have been replaced by dual “Piledriver” modules. Each module has two integer cores, a beefy-but-shared floating-point unit, and 2MB of shared L2 cache. Piledriver is the code name for an updated version of AMD’s still-new Bulldozer microarchitecture, with improvements to its per-clock instruction throughput and voltage-frequency response. Trinity brings Piledriver to the desktop for the first time.

AMD has refreshed the Radeon graphics on this chip, as well, moving from the older VLIW5-style shader core used in the Radeon HD 5000 series to the VLIW4 shaders from the Radeon HD 6900 series. This isn’t the latest GCN architecture from today’s Radeons, but it supports a full DirectX 11 feature set and should be an incremental improvement in efficiency.

The companion video acceleration block, however, is ripped right out of the Radeon HD 7000 series, and there have been updates to the display and memory controllers, as well. In fact, the only item of note that isn’t really up to date is the PCIe connectivity, which remains at Gen2. Third-gen PCIe offers twice the data rate.
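As a rough check on that last claim, here is a back-of-the-envelope sketch of per-lane PCIe bandwidth. The transfer rates and line encodings below come from the PCI Express specifications, not from our testing, and the function name is purely illustrative.

```python
def lane_bandwidth_mb_s(gigatransfers_per_s, payload_bits, encoded_bits):
    """Usable bandwidth of a single PCIe lane, per direction, in MB/s."""
    raw_bits_per_s = gigatransfers_per_s * 1e9
    # Line encoding spends some of the raw bits on overhead.
    usable_bits_per_s = raw_bits_per_s * payload_bits / encoded_bits
    return usable_bits_per_s / 8 / 1e6


gen2 = lane_bandwidth_mb_s(5.0, 8, 10)     # Gen2: 5 GT/s, 8b/10b encoding
gen3 = lane_bandwidth_mb_s(8.0, 128, 130)  # Gen3: 8 GT/s, 128b/130b encoding

print(f"Gen2 x1: {gen2:.0f} MB/s, Gen3 x1: {gen3:.0f} MB/s, "
      f"ratio: {gen3 / gen2:.2f}x")
```

The raw signaling rate rises by only 60%, but the leaner encoding in Gen3 brings the usable rate to just shy of double.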

One thing that doesn’t fit easily into the diagram above is the more refined integration of these different pieces. Trinity has considerably fewer visible seams compared to Llano. Among the major improvements is power management. Trinity can dial back the clock speed and voltage of its graphics component in response to CPU-heavy workloads. Llano could rein in the CPU when graphics-heavy workloads required it, but not vice-versa.

Plenty o’ flavors

Model | Modules/integer cores | Base core clock speed | Max Turbo clock speed | Total L2 cache capacity | IGP ALUs | IGP clock | TDP | Price
A10-5800K | 2/4 | 3.8 GHz | 4.2 GHz | 4 MB | 384 | 800 MHz | 100 W | $122
A10-5700 | 2/4 | 3.4 GHz | 4.0 GHz | 4 MB | 384 | 760 MHz | 65 W | $122
A8-5600K | 2/4 | 3.6 GHz | 3.9 GHz | 4 MB | 256 | 760 MHz | 100 W | $101
A8-5500 | 2/4 | 3.2 GHz | 3.7 GHz | 4 MB | 256 | 760 MHz | 65 W | $101
A6-5400K | 1/2 | 3.6 GHz | 3.8 GHz | 1 MB | 192 | 760 MHz | 65 W | $67
A4-5300 | 1/2 | 3.4 GHz | 3.6 GHz | 1 MB | 128 | 724 MHz | 65 W | $53
Athlon X4 750K | 2/4 | 3.4 GHz | 4.0 GHz | 4 MB | – | – | 100 W | $81
Athlon X4 740 | 2/4 | 3.2 GHz | 3.7 GHz | 4 MB | – | – | 65 W | $71

The table above shows the full lineup of Trinity desktop processors. Note the 65W and 100W power envelopes, just the same as the prior-gen Llano products. With the move to 22-nm process tech, Intel reduced its desktop power envelopes; even the most expensive Core i7 has a peak power rating of just 77W. AMD has supplied us with two chips to review: the A8-5600K and the A10-5800K. Both are K-series parts with unlocked multipliers for easy overclocking, but they unfortunately both have 100W TDP limits. We suspect the 65W versions may be more appealing to many folks.

Trinity APU pricing doesn’t rise above the $122 mark. AMD has kept the price tags modest, a tacit acknowledgement of the performance picture. By contrast, the Ivy-based Core i7-3770K sells for $332, well over twice the price of the A10-5800K.

For what it’s worth, we’ve neglected to list the complex suite of Radeon model numbers attached to the integrated graphics. The A10 series, for instance, has Radeon HD 7660D graphics, and the A8 series has Radeon HD 7560D graphics. What you need to know, really, are the ALU counts and clock speeds shown above. Oh, and it’s worth mentioning that most of these APUs support AMD’s Dual Graphics feature. That is, they can pair a low-end Radeon graphics card with the IGP in a CrossFire-style multi-GPU config. That’s not our favorite option given the added complexity, the asymmetry between GPUs, and the potential for multi-GPU micro-stuttering—but it is an option for those who want it.

A new platform: Socket FM2

The changes to Trinity are sweeping enough that they require a new CPU socket. Thus, Llano’s Socket FM1 gives way to the new Socket FM2.

Physically, Socket FM2 looks very similar to multiple generations of desktop sockets from AMD, but the pin layout is different to prevent the insertion of an incompatible processor by all but the most determined.

The basic platform layout is depicted in the diagram to the right. Trinity requires only a single support chip for I/O, but AMD offers several variants of that product. The entry level version is the A55, which has enough features for a basic PC. The A75 enables a few extras, including USB 3.0 support, six SATA 6Gbps ports, and some overclocking features. Top o’ the line is the A85X, with eight SATA 6Gbps ports, even more overclocking options, and support for dual discrete GPUs in CrossFire configurations. Having a trio of chipsets for a CPU lineup that spans the rather limited gamut from $53 to $122 seems like overkill to us. AMD must have been planning for better days.

Perhaps those days will come eventually. AMD expects Socket FM2 to stick around for a while, at least long enough to support the generation of APUs after Trinity. Presumably, that means the APU code-named “Kaveri,” which should have 2-4 Steamroller cores and Radeon graphics based on the current GCN architecture.

Motherboard makers have introduced a robust slate of Socket FM2-compatible offerings to play host to Trinity, including MSI’s snappily named FM2-A85XA-G65 mobo pictured above, which served in our testbed. This is a relatively high-end board, with dual PCIe slots for CrossFire and a gaggle of SATA ports. Around back, it serves up four display outputs, with everything from VGA to DVI, HDMI, and DisplayPort.

The competition

Pictured above is the Core i3-3225, a dual-core, quad-threaded Ivy Bridge chip clocked at 3.3GHz with a 3MB L3 cache. The “5” at the end of the model number means something important, believe it or not: this chip has Intel’s full-fat HD 4000 graphics implementation, not a cut-down variant. The list price for the i3-3225 is $134, making it arguably the A10-5800K’s closest competitor. As a low-end part, the i3-3225 is missing certain amenities like Turbo Boost and, somewhat freakishly, support for the AES-NI instructions that accelerate encryption. (Intel’s product segmentation is way, way too complicated.)

The one place where this Core i3 and the A10-5800K diverge most obviously is on power: the Core i3-3225 has a TDP, or max power rating, of just 55W. The 5800K’s power envelope is nearly twice the size at 100W, which gives it more headroom to push on both CPU and graphics performance.

We also have some competition lined up for the A8-5600K in the form of the Pentium G2120. The G2120 lists for only $86, so it’s a bit cheaper than the A8-5600K, but we think it’s the closest competitor in Intel’s lineup. The Pentium G2120 is also a seriously gimpy chip. Although it’s based on 22-nm Ivy Bridge silicon and has two cores running at 3.1GHz, the G2120 lacks support for a whole lexicon of marketing names and acronyms, including AVX, Turbo, Hyper-Threading, AES-NI, HD 4000 graphics, and QuickSync. Sometimes, it simply refuses to do math, until you ask again nicely. Even so, the G2120 fits into the same 55W power envelope as the Core i3-3225, so it has a huge handicap versus the 100W A8-5600K.

We’ll see how the new Trinity-based APUs compare to these chips and a huge host of others on the following pages, in what has to be the most data-rich review we’ve ever produced. Apologies in advance for the overload.

Our testing methods

We ran every test at least three times and reported the median of the scores produced.
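For the curious, the reduction from raw runs to a reported number looks something like the sketch below. The scores are made up for illustration, and the helper name is not from any tool we use.

```python
from statistics import median


def reported_score(run_scores):
    """Return the median of the scores from repeated runs of one test."""
    if len(run_scores) < 3:
        raise ValueError("each test should be run at least three times")
    return median(run_scores)


# Hypothetical scores from three runs of the same benchmark.
runs = [41.8, 43.1, 42.6]
print(reported_score(runs))
```

Taking the median rather than the mean keeps a single fluke run from dragging the reported result up or down.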

The test systems were configured like so:

Component | AMD 990FX system | Intel Z77 system | Intel X79 system
Processor | Phenom II X4 850, Phenom II X4 980, Phenom II X6 1100T, AMD FX-4170, AMD FX-6200, AMD FX-8150 | Pentium G2120, Core i3-3225, Core i5-2400, Core i5-2500K, Core i7-2600K, Core i5-3470, Core i5-3570K, Core i7-3770K | Core i7-3960X, Core i7-3820
Motherboard | Asus Crosshair V Formula | MSI Z77A-GD65 | Intel DX79SI
North bridge | 990FX | Z77 Express | X79 Express
South bridge | SB950 | – | –
Memory size | 8 GB (2 DIMMs) | 8 GB (2 DIMMs) | 16 GB (4 DIMMs)
Memory type | AMD Entertainment Edition DDR3 SDRAM | Corsair Vengeance DDR3 SDRAM | Corsair Vengeance DDR3 SDRAM
Memory speed | 1600 MT/s | 1600 MT/s | 1600 MT/s
Memory timings | 9-9-9-24 1T | 9-9-9-24 1T | 9-9-9-24 1T
Chipset drivers | AMD chipset 12.3 | INF update 9.3.0.1020, iRST 11.1.0.1006 | INF update 9.2.3.1022, RSTe 3.0.0.3020
Audio | Integrated SB950/ALC889 with Realtek 6.0.1.6602 drivers | Integrated Z77/ALC898 with Realtek 6.0.1.6602 drivers | Integrated X79/ALC892 with Realtek 6.0.1.6602 drivers
IGP drivers | – | 15.26.12.64.2761 | –

Component | AMD A75 system | AMD A85X system | Intel P55 system
Processor | AMD A8-3850 | AMD A8-5600K, AMD A10-5800K | Core i5-655K, Core i5-760, Core i7-875K
Motherboard | Gigabyte A75M-UD2H | MSI FM2-A85XA-G65 | Asus P7P55D-E Pro
North bridge | A75 FCH | A85 FCH | P55 PCH
South bridge | – | – | –
Memory size | 8 GB (2 DIMMs) | 8 GB (2 DIMMs) | 8 GB (2 DIMMs)
Memory type | Corsair Vengeance DDR3 SDRAM | AMD Entertainment Edition DDR3 SDRAM | Corsair Vengeance DDR3 SDRAM
Memory speed | 1600 MT/s | 1600 MT/s | 1333 MT/s
Memory timings | 9-9-9-24 1T | 9-9-9-24 1T | 8-8-8-20 1T
Chipset drivers | AMD chipset 12.3 | AMD chipset 12.8 | INF update 9.3.0.1020, iRST 11.1.0.1006
Audio | Integrated A75/ALC889 with Realtek 6.0.1.6602 drivers | Integrated A75/ALC889 with Realtek 6.0.1.6602 drivers | Integrated P55/VIA VT1828S with Microsoft drivers
IGP drivers | Catalyst 12.8 | Catalyst 12.8 | –

They all shared the following common elements:

Hard drive | Kingston HyperX SH100S3B 120GB SSD
Discrete graphics | XFX Radeon HD 7950 Double Dissipation 3GB with Catalyst 12.3 drivers
OS | Windows 7 Ultimate x64 Edition Service Pack 1 (AMD systems only: KB2646060, KB2645594 hotfixes)
Power supply | Corsair AX650

Thanks to Corsair, XFX, Kingston, MSI, Asus, Gigabyte, Intel, and AMD for helping to outfit our test rigs with some of the finest hardware available. Thanks to Intel and AMD for providing the processors, as well, of course.

We used the following versions of our test applications:

Some further notes on our testing methods:

The test systems’ Windows desktops were set at 1920×1080 in 32-bit color. Vertical refresh sync (vsync) was disabled in the graphics driver control panel.

We used a Yokogawa WT210 digital power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we encoded a video with x264.

After consulting with our readers, we’ve decided to enable Windows’ “Balanced” power profile for the bulk of our desktop processor tests, which means power-saving features like SpeedStep and Cool’n’Quiet are operating. (In the past, we only enabled these features for power consumption testing.) Our spot checks demonstrated to us that, typically, there’s no performance penalty for enabling these features on today’s CPUs. If there is a real-world penalty to enabling these features, well, we think that’s worthy of inclusion in our measurements, since the vast majority of desktop processors these days will spend their lives with these features enabled. We did disable these power management features to measure cache latencies, but otherwise, it was unnecessary to do so.

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

IGP performance – Skyrim

We’ll start by looking at integrated graphics performance. For these tests, we decided to focus on the higher-end A10 and Core i3 processors, since they have the faster IGPs and are more likely to be compelling offerings. We’ve also taken a look at the impact of memory speed on IGP performance, since memory bandwidth can be a pretty notable constraint. Our default test config used 1600MHz memory, and we also tested the A10 and Core i3 with 1866MHz memory at pretty tight timings: 9-10-9-27 1T. We’d hoped to test even higher memory frequencies, but neither platform took well to 2133MHz memory, even with relatively conservative timings and extra voltage.

We tested performance while taking a stroll around the town of Whiterun in Skyrim. You can see the image quality settings we used above, which are about as spartan as possible in Skyrim.





Frame time in milliseconds | FPS rate
8.3 | 120
16.7 | 60
20 | 50
25 | 40
33.3 | 30
50 | 20

Our gaming tests are very different from what you’re likely to see elsewhere. We’ve captured the time required to render every single frame from each of our five test runs, because we believe FPS averages tend to mask the short slowdowns that can break the sense of fluid animation. For more information on how we test and why, please see this article.

You can click the buttons beneath the plots above to see results for the different types of processors. Since we’re plotting frame times, lower numbers are better, and big spikes upward are bad: they represent delays in frame delivery. If you’re new to the idea of latency-focused game testing, the table to the right may help. It shows frame times and how they correspond to FPS rates. Just a look at the raw plots above will tell you much of what you need to know about how these CPUs perform. The Core i3-3225 produces fewer frames at generally higher latencies than the A10, and its frame time spikes tend to be more dramatic.
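The correspondence between frame times and FPS rates is just a reciprocal: a frame that takes t milliseconds implies a rate of 1000/t frames per second if every frame took that long. A tiny sketch, with a function name of our own choosing:

```python
def frame_time_to_fps(frame_time_ms):
    """Convert a single frame's render time into the equivalent FPS rate."""
    return 1000.0 / frame_time_ms


for ms in (8.3, 16.7, 20, 25, 33.3, 50):
    print(f"{ms:>5} ms -> {frame_time_to_fps(ms):.0f} FPS")
```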

Although FPS averages can be deceiving, in this case, the relatively high average numbers tend to be backed up by our alternative method, the 99th percentile frame time. (This metric just says that 99% of all frames were rendered in x milliseconds or less.) The overall latency picture for all of the IGPs isn’t bad. Except for the last 1% of frames, all of these solutions produce a constant flow of updates at a rate of over 30 FPS. Skyrim doesn’t look pretty at these settings, but it will run smoothly enough on any of these IGPs.
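Here is one way the 99th percentile frame time could be computed from a list of per-frame render times. This is a minimal sketch that assumes the simple nearest-rank definition of a percentile; the frame times are invented, and other percentile conventions would give slightly different answers on small samples.

```python
import math


def percentile_frame_time(frame_times_ms, pct=99):
    """Nearest-rank percentile: the smallest frame time such that at least
    pct percent of all frames were rendered in that time or less."""
    ordered = sorted(frame_times_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical run of 100 frames: mostly quick, with two slow ones.
frames = [20.0] * 98 + [45.0, 80.0]

avg_fps = 1000.0 * len(frames) / sum(frames)
print(f"Average: {avg_fps:.1f} FPS")
print(f"99th percentile frame time: {percentile_frame_time(frames)} ms")
```

In this made-up run, the FPS average sits close to what the quick frames alone would suggest, while the percentile figure is set by the slow ones.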

The A10 is measurably faster than the Core i3-3225, and you can feel the difference while playing. The difference between the A10 and its Llano predecessor, the A8-3850, is much subtler, only a couple of milliseconds in the 99th percentile metric. Even slighter is the impact of faster memory on the A10.

The 99th percentile frame time is just one point along a curve, and we can have a look at the broader curve to give us a better sense of the overall latency picture. As you can see, the A10 produces much lower frame latencies generally than the Core i3.

The 99th percentile frame time attempts to capture a sense of the overall latency picture while ruling out the outliers. We can also focus on the worst-case frame times, which makes sense, since we want to avoid those hiccups and pauses while playing. Our method of quantifying “badness” is adding up all of the time spent working on frames beyond a given threshold—usually, we set the mark at 50 milliseconds, which equates to 20 FPS. We figure if frame rates drop below about that mark, the illusion of motion is at risk. Also, 50 milliseconds is equal to three vertical refresh intervals on a 60Hz display. If you’re waiting longer than that for the next frame, there’s likely some pain there.
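The “badness” tally can be sketched in a few lines. We’re assuming here that only the portion of each frame time past the threshold is counted, so a 65-ms frame contributes 15 ms against a 50-ms threshold; the sample frame times are invented for illustration.

```python
def time_beyond_threshold_ms(frame_times_ms, threshold_ms=50.0):
    """Sum the time spent past the threshold across all long frames."""
    return sum(t - threshold_ms for t in frame_times_ms if t > threshold_ms)


# Three refresh intervals on a 60Hz display, in milliseconds.
three_refreshes_ms = 3 * 1000.0 / 60

# Hypothetical frame times in milliseconds.
sample = [16.7, 22.0, 65.0, 120.0, 30.0]
print(time_beyond_threshold_ms(sample))        # against the 50-ms threshold
print(time_beyond_threshold_ms(sample, 16.7))  # against a 60-FPS threshold
```

Lowering the threshold makes the test stricter, which is handy when every contender sails past the 50-ms mark.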

As you might expect given the other numbers above, most of these solutions don’t spend much time beyond our threshold. They really can run Skyrim pretty well at these (kinda lousy) image quality settings. Interestingly enough, the Core i3 benefits quite a bit from the move to 1866MHz memory; its time spent beyond our threshold drops to zero from nearly a third of a second before.

IGP performance – Batman: Arkham City

We tested Arkham City while grappling and gliding across the rooftops of Gotham in a bit of Bat-parkour. We’re moving rapidly through a big swath of the city, so the game engine has to stream in more detail periodically. You can see the impact in the frame time plots: every CPU shows occasional spikes throughout the test run.

Again, we’ve had to reduce image quality settings to their lowest possible level in order to accommodate these relatively pokey integrated graphics processors.





Even with all of the frame time spikes, the numbers above look reasonably good for the most part. The FPS average and 99th percentile frame times pretty much mirror each other, which is usually a sign of health, and the latency curves are all similar in shape, with no big spikes upward until we reach the last few percentage points worth of frames.

All of the numbers point to the same thing, too, which is a clear playability advantage for the A10-5800K over the Core i3-3225.

IGP performance – Battlefield 3





Uh oh. Those plots for the Core i3 configs look ugly and prickly. Let’s see what it means.

Looking at the FPS average, you might think the Core i3-3225 isn’t far behind the Llano-based A8-3850, but the 99th percentile frame time tells a different story.

A look at the latency curve illustrates the problem. The Core i3 has particular trouble with the last 10-12% of frames rendered, where latencies shoot up dramatically.

Given the shape of the latency curve, this result isn’t surprising. The Trinity-based A10 and the Llano-based A8 waste very little time working on frames beyond our 50-ms threshold, but the Core i3 spends just over—or just under, with the faster memory config—one second of our 60-second test run working on long-latency frames. That 32 FPS average might tempt you to think the Core i3 is reasonably competent, but in this case, it isn’t.

IGP performance – Crysis 2

But will it run Crysis? We fired up Crysis 2 on a lark to see if it could run on any of these IGPs. Since it’s one of the most graphically intensive games around, we really didn’t expect much. Turns out that it did indeed run, even on the Intel IGP. Credit Intel for getting a Crysis game to run on its IGP, even if it isn’t terribly fast. There was a day not long ago when running a game like this on an Intel graphics solution was a sure recipe for failure.





Hmm. The FPS average and 99th percentile results don’t match at all. What’s the story? Well, it’s pretty easy to see how the AMD results are riddled with spikes throughout, even though the plots show a relatively decent core of low-latency frames. That core translates into a healthy-looking FPS average, but not all is well.

The curves tell the story. The AMD IGPs struggle with about 4-5% of the frames in the scene—and we know from the plots those problem frames are interspersed throughout the test session. As a result, the A10-5800K’s curve meets the Core i3-3225’s at around the 98th or 99th percentile, even though the A10 is faster otherwise.

Playing Crysis 2 on any of these IGPs kind of stinks, though in different ways. All of the IGPs burn quite a bit of time beyond our threshold.

Interestingly enough, the two least “bad” configs here are the IGPs paired with 1866MHz memory. That illustrates how important a bottleneck memory bandwidth is for integrated graphics. This constraint is likely to be more of a problem going forward, as transistor budgets for integrated graphics grow, especially if mainstream systems stick with the same dual-channel DDR3 memory standard.
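To put rough numbers on that constraint, here is a sketch of theoretical peak bandwidth for the two memory speeds we tested. It assumes the standard 64-bit DDR3 channel width and two channels; real-world throughput will be lower than these peaks.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels=2, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


for speed in (1600, 1866):
    print(f"DDR3-{speed}, dual channel: {peak_bandwidth_gb_s(speed):.1f} GB/s")
```

That pool is shared between the CPU cores and the IGP, whereas a discrete card gets a dedicated memory subsystem of its own.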

IGP performance – Civilization V

We have one more gaming test to include before moving on to bigger and better things. This test is a simple scripted one that spits out an FPS average, because there are only so many hours in the day for testing.

Yikes. We’re running Civ V at just about the lowest possible image quality settings, and although it doesn’t crash, it’s pretty much hopeless on the Intel HD 4000 IGP. The A10-5800K handles it reasonably well, it would seem, with an average of 43 FPS.

Converged applications: LuxMark

One of AMD’s goals for APUs going forward is to use the parallel computing power of the integrated graphics processor to assist the CPU cores where possible. Although GPU computing has taken off in specialized sectors like scientific computing and HPC, we are still in the early days of GPU computing for consumer applications. AMD has been making strides in persuading developers to use OpenCL to accelerate certain classes of applications, though, and it has supplied reviewers with a handful of programs to demonstrate the potential there.

These “accelerated” programs fall into several groups. Some of them are just video transcoders that make use of the dedicated encoding hardware built into new CPUs, features like Intel’s QuickSync and AMD’s HD Media Accelerator. We’ve recently taken a look at the hardware video encoding options on the PC, so you can read about them if you wish. However, the more interesting programs in our book don’t just use dedicated custom logic; they employ real GPU computing, likely through the OpenCL API, to handle tasks previously reserved for the CPU cores.

We tried out accelerated versions of The GIMP image processor and WinZip compression in our review of Trinity’s mobile variant, but the program we find most interesting to date is LuxMark, which uses OpenCL to tackle ray-traced rendering. Ray-tracing is a classic “embarrassingly parallel” application, so it’s a good test case to demonstrate the potential of data-parallel compute hardware. Also, we’ve already incorporated LuxMark into our wider CPU suite, which includes a huge selection of chips, so we have ample context for the performance numbers it spits out.

LuxMark should do a nice job of harnessing the capabilities of new CPUs. Since OpenCL code is by nature parallelized and is compiled at run time, it adapts easily to new instructions. For instance, Intel and AMD both offer installable client drivers (ICDs) for OpenCL on x86 processors, and they both claim to support AVX. The AMD APP driver even supports Bulldozer’s distinctive instructions, FMA4 and XOP.

We’ll start with CPU-only results from a broad swath of processors. These results come from the AMD APP driver for OpenCL, since it tends to be faster on both Intel and AMD CPUs, funnily enough.

Using their CPU cores alone, the new Trinity APUs are only a smidgen faster than the chip they replace, the Llano-based A8-3850. Why? One reason is that the two “Piledriver” modules in Trinity have only one shared FPU each. Each of Llano’s four cores has its own dedicated FPU, so although Trinity benefits from the extra-wide vector math enabled by its support for AVX instructions, it’s not much faster than Llano.

Intel’s Core i3-3225 is only a dual-core processor, but it has two FPUs and can track and execute four threads via Hyper-Threading, so the architectural similarities to Trinity are closer than you might think. The Core i3’s FPUs support AVX, as well, and they achieve higher throughput than Trinity’s, even though they don’t use the fused multiply-add instruction. (FMA support is slated for Intel’s next-gen Haswell chip.)

Without AVX or Hyper-Threading, the Pentium G2120 finishes dead last, well behind the A8-5600K.

Moving the workload over to the IGPs uniformly produces lower performance than the same processors achieve with only their CPU cores. The IGP in AMD’s Trinity is substantially faster than Intel’s HD 4000 graphics, but neither CPU’s IGP can match its x86 cores.

If we invoke both the CPU cores and the IGPs at the same time, we see higher overall performance than with just one type of computing unit engaged—and the A10’s combined throughput is ever so slightly higher than the Core i3-3225’s. There’s a hint of potential here; combined performance is roughly equal to the AMD FX-6200’s, a chip with three Bulldozer modules.

To give you a better sense of the prospects for mixed-mode computing, let’s have a look at a much more capable GPU, the Radeon HD 7950, when driven by the various processors we’ve tested.

Now that’s more like it. Moving some workloads over to a fast enough GPU can really pay off. The Radeon HD 7950 achieves more than twice the throughput of the Core i7-3770K’s quad CPU cores, regardless of which processor is driving it. (The 7950 is somewhat faster when combined with Intel processors, likely because of their higher single-threaded performance.)

Of course, this GPU has its own fast, dedicated memory subsystem, so we’re not just adding a whole truckload of FLOPS; we’re adding bandwidth in support of those FLOPS. The discrete card also has its own rather substantial power envelope. Extracting additional performance out of the beefier IGPs of the future may run up against socket limitations that a discrete card doesn’t face. That’s especially true for applications that map well to GPUs and IGPs, since they tend to be very bandwidth- and power-intensive.

Here’s what happens when we invoke the CPU cores and the Radeon HD 7950 together. Somewhat surprisingly, performance drops for most configurations, except for the recent Intel processors that can track eight threads or more. Apparently, the lower-end CPUs would be better off spending their time just acting in support of the discrete Radeon.

Power consumption and efficiency

Our workload for this test was encoding a video with x264, based on a command ripped straight from the x264 benchmark you’ll see later. This encoding job is a two-pass process. The first pass is lightly multithreaded and will give us the chance to see how power consumption looks when mechanisms like Turbo and core power gating are in use. The second pass is more widely multithreaded.

We’ve tested all of the CPUs in our default configuration, which includes a discrete Radeon card. We’ve also popped out the discrete card to get a look at power consumption for the A10, Core i3, and A8-3850.

These plots of power use during our test period give you a sense of what to expect. The wide gap between the max power ratings of the AMD APUs (100W) and the competing Intel parts (55W) is unmistakably reflected in the power-use readings we took at the wall socket.

When idling at the Windows desktop, the Trinity chips rival their Intel competition for power efficiency. Without a discrete card installed, the A10-5800K sips power at idle. That 24W number is a testament to this chip’s mobile roots.

AMD’s desktop APUs leave behind those mobile roots in dramatic fashion when presented with some work to do. Our A10-equipped system draws 152W at the wall socket, about 50% more than a similarly equipped system based on the fastest Ivy Bridge, the Core i7-3770K. The Core i3-3225 system’s peak power draw is well under half the A10 system’s.

We can quantify efficiency by looking at the amount of energy used, in kilojoules, during the entirety of our test period, both when the chips are busy and at idle. By that measure, the A10-5800K system is less power efficient overall than our Llano-based A8-3850 system. Removing the discrete graphics card helps, but not nearly enough: the Core i3-3225 system with a discrete Radeon still consumes less energy over the test period than the A10 system does without a video card installed.

Perhaps our best measure of CPU power efficiency is task energy: the amount of energy used while encoding our video. This measure rewards CPUs for finishing the job sooner, but it doesn’t account for power draw at idle.
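Both energy figures boil down to integrating power readings over time. The sketch below assumes one reading per second and uses the trapezoidal rule; the power trace is invented for illustration, and the function names are ours rather than anything from the meter’s software.

```python
def energy_kj(samples):
    """Integrate (time_s, watts) samples, sorted by time, into kilojoules."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2 * (t1 - t0)  # trapezoidal rule
    return joules / 1000


def task_energy_kj(samples, start_s, end_s):
    """Energy used only while the task (here, the encode) is running."""
    window = [(t, p) for t, p in samples if start_s <= t <= end_s]
    return energy_kj(window)


# Hypothetical trace: 40 W at idle, 100 W while encoding from 10 s to 40 s.
trace = [(t, 100.0 if 10 <= t <= 40 else 40.0) for t in range(61)]

print(f"Whole test period: {energy_kj(trace):.2f} kJ")
print(f"Task energy: {task_energy_kj(trace, 10, 40):.2f} kJ")
```

A chip that finishes sooner shrinks the task window, which is how a faster processor can use less energy for the job even at a higher peak power draw.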

The Trinity systems combine relatively high power draw and fairly lengthy encoding times, so their energy efficiency is among the worst of the CPUs we’ve tested. There’s no getting around this fact. On the desktop, these chips with their 100W TDPs are a far cry from their mobile counterparts, yet they’re not fast enough to conserve energy by finishing the job quickly.

The Elder Scrolls V: Skyrim

Now it’s time to pop in a graphics card and look at gaming performance. We’ve raised the display resolution and image quality settings substantially, but the CPU should still be the primary performance limiter. Again, we’re using our latency-focused game testing methods. If you’re unfamiliar with what we’re doing, you might want to check out our recent CPU gaming performance article, which has a subset of the data here and explains our methods reasonably well.





The scope of our ambition is laid bare, as we present frame-by-frame results for 22 different CPUs. I’ll admit, we have gone entirely overboard here. My only defense is that people keep asking for more data! No, I don’t know what’s wrong with them, either.

The FPS average and 99th percentile results mirror each other handsomely. However, what they show isn’t good for AMD. This is quite the reversal of what happens when you’re running games on the IGPs. The Pentium G2120, an $86 processor, performs better in this test than any CPU AMD has ever produced. And, yes, ye olde Phenom II X4 980 remains AMD’s fastest gaming chip, at least in this test case.

In the past, we’ve attributed the struggles of the newer AMD chips in this test to their relatively weak per-thread performance, and we still think that’s the case with Trinity. Notice how the dual-module chips like the A10-5800K and the FX-4170 outperform the quad-module FX-8150. The chips with fewer modules reach slightly higher clock speeds, giving them an edge in lightly threaded performance.

We had hoped Piledriver’s modest IPC improvements would make a noticeable impact, but that doesn’t seem to be the case. Compare the Bulldozer-based FX-4170 to the A10-5800K. The FX-4170 runs at 4.2GHz with a 4.3GHz Turbo peak, while the A10-5800K runs at 3.8/4.2GHz. Despite a difference in Turbo frequencies of just 100MHz, the FX-4170 remains faster than the 5800K. The A10 does achieve similar performance in a 100W TDP, while the FX-4170’s power envelope is 125W, so that’s progress—just not progress in per-clock throughput.





The latency curves capture the trouble with the newer AMD CPUs—it’s that spike upward for the last 5% of frames. Flip between the plots, and you’ll see that the Phenom II X4 980’s curve looks much nicer than the newer chips’.

Ahh, our old measure of “badness” keeps us grounded once again. Although AMD is slower than Intel in this test scenario, none of the chips perform horribly. Virtually no time is spent beyond our customary 50-ms threshold, and even 33 ms isn’t much of a challenge, so we’ve ratcheted our threshold down to 16.7 milliseconds, the equivalent of 60 FPS. Some of the fastest processors come very close to delivering a steady stream of frames at 60 FPS or better. The Trinity-based APUs can’t match that (in fact, they are among the weakest CPUs here), but we’re asking them to meet a very tough standard.

Batman: Arkham City





In spite of the spiky nature of the frame time plots for Arkham City, the FPS averages and 99th percentile graphs again roughly track together. One exception is the Pentium G2120. Its FPS average tops all but one AMD processor, but the Pentium drops down the ranks in the more latency-sensitive measurement. Nevertheless, the general story told here isn’t terribly different from what we saw in Skyrim.





One ray of light for AMD is the relative performance of the Trinity chips versus the A8-3850. The A10 spends less than half the time on long-latency frames that the A8-3850 does. Trouble is, the Core i3-3225 spends less than a quarter of the time the A10 does beyond our threshold.

Battlefield 3





Judging by the FPS average, you’d think the various CPUs would all be equally adequate, pretty much. But have a look at the 99th percentile results and—whoops. The Pentium G2120 really struggles, and you can see it happening if you look at the frame time plot above. It’s riddled with spikes above 30 and 40 milliseconds.

The reason, most likely, is that the Pentium G2120 is the only CPU here that can track only two threads, one for each physical core. Apparently, one reason BF3 runs so well on the other processors is excellent multithreading. Even the slowest quad-core part (the A8-5600K, in this case) performs admirably, as do the dual-core Intel chips with Hyper-Threading. Have a look at the latency curves below, and you’ll see that they all look about the same, except for the G2120’s radical turn northward.





So yeah, the FPS average tells us the difference between the Pentium G2120 and the A8-5600K is a single frame per second: 81 FPS versus 82. To take another swing at a deceased equine, the difference between the two is much larger than the FPS average suggests.
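
To illustrate how two nearly identical FPS averages can hide very different 99th-percentile frame times, here is a small Python sketch. The frame-time lists are invented for the example, not measured results, and the percentile uses the simple nearest-rank method, which is an assumption on our part rather than a description of our graphing scripts.

```python
import math


def fps_average(frame_times_ms):
    """Average FPS over a run: frames rendered divided by total seconds."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


def percentile_frame_time(frame_times_ms, pct=99):
    """Nearest-rank percentile: the frame time that pct% of frames beat or match."""
    ordered = sorted(frame_times_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical runs: one steady, one mostly quick but with 3% long frames.
    smooth = [12.2] * 1000
    spiky = [11.5] * 970 + [35.0] * 30
    for name, run in (("smooth", smooth), ("spiky", spiky)):
        print(name,
              f"avg {fps_average(run):.1f} FPS,",
              f"99th percentile {percentile_frame_time(run):.1f} ms")
```

Both made-up runs average within a frame per second of each other, yet the second one spends its worst 1% of frames at roughly three times the latency of the first.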

Crysis 2





Notice the spike at the beginning of the test run; it happens on each and every CPU. You can feel the hitch while playing. Apparently, the game is loading some data for the area we’re about to enter. Faster CPUs tend to reduce the size of the spike.

Doh! The FPS and 99th percentile results don’t track again. Is he going to give us another lecture about frame latencies?

Nah. You get the idea. The Pentium G2120 again pays the price for being the only dual-threaded contestant.

Another thing worth noting is how closely packed the various CPUs are at the 99th percentile. At that one point, at least, there’s little practical difference between the fastest Core i7 and the two Trinity APUs.





Ooh! Ooh! Look at the curve for the A10 versus the FX-4170. (The A10 largely overlaps with the FX-8150.) The A10 delivers lower latencies from the 50th to the 80th percentiles or thereabouts. Could be a Piledriver IPC improvement spotted in the wild, perhaps. Hush, kids, and enjoy the view. Also, I’m still geeking out over the fine differences between the curves for various speed grades of Intel processors.

All of the CPUs are pretty competent, if you boil it down to our indicator of badness. The exception, of course, is the Pentium G2120. Perhaps we didn’t ask nicely enough.

Multitasking: Gaming while transcoding video

A number of readers over the years have suggested that some sort of real-time multitasking test would be a nice benchmark for multi-core CPUs. That goal has proven to be rather elusive, but we think our new game testing methods may allow us to pull it off. What we did is play some Skyrim, with a 60-second tour around Whiterun, using the same settings as our earlier gaming test. In the background, we had Windows Live Movie Maker transcoding a video from MPEG2 to H.264. Here’s a look at the quality of our Skyrim experience while encoding.









So, who had the Pentium G2120 being the whipping boy here? Good call, although it wasn’t hard to see coming. Disappointingly, the Trinity chips turn out to be slower than their older sibling, the A8-3850, in our latency-oriented metrics—and yes, folks, the Core i3-3225 is again quite a bit faster than any of ’em.

Civilization V

Civ V will run this benchmark in two ways, either while using the graphics card to draw everything on the screen, just as it would during a game, or entirely in software, without bothering with rendering, as a pure CPU performance test.

Either way you run it, the Trinity APUs are near the back of the pack, only ahead of their Llano predecessor.

Productivity

Compiling code in GCC

Another persistent request from our readers has been the addition of some sort of code-compiling benchmark. With the help of our resident developer, Bruno Ferreira, we’ve finally put together just such a test. Qtbench tests the time required to compile the QT SDK using the GCC compiler. Here is Bruno’s note about how he put it together:

QT SDK 2010.05 – Windows, compiled via the included MinGW port of GCC 4.4.0. Even though apparently at the time the Linux version had properly working and supported multithreaded compilation, the Windows version had to be somewhat hacked to achieve the same functionality, due to some batch file snafus. After a working multithreaded compile was obtained (with the number of simultaneous jobs configurable), it was time to get the compile time down from 45m+ to a manageable level. This required severe hacking of the makefiles in order to strip the build down to a more streamlined version that preferably would still compile before hell froze over. Then some more fiddling was required in order for the test to be flexible about the paths where it was located. Which led to yet more Makefile mangling (the poor thing).

The number of jobs dispatched by the Qtbench script is configurable, and the compiler does some multithreading of its own, so we did some calibration testing to determine the optimal number of jobs for each CPU.
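
The calibration idea is simple enough to sketch in a few lines of Python. This is not the Qtbench script itself; `calibrate_job_count`, `run_build`, and the `clock` parameter are hypothetical names, and the sketch assumes a single timed pass per job count is representative.

```python
import time


def calibrate_job_count(run_build, candidate_jobs, clock=time.perf_counter):
    """Time one build per candidate job count and return the fastest.

    run_build is a caller-supplied function (hypothetical here) that
    performs a full compile with the given number of simultaneous jobs.
    """
    timings = {}
    for jobs in candidate_jobs:
        start = clock()
        run_build(jobs)
        timings[jobs] = clock() - start
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    # Stand-in workload: sleep instead of compiling, purely for illustration.
    def pretend_build(jobs):
        time.sleep(0.2 / jobs + 0.01 * jobs)

    best, timings = calibrate_job_count(pretend_build, [1, 2, 4, 8])
    print("fastest job count:", best)
```

In practice one would repeat each measurement a few times and start from a clean build tree on every pass, since leftover object files or a warm disk cache can skew the comparison.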

TrueCrypt disk encryption

TrueCrypt supports acceleration via Intel’s AES-NI instructions, so the encoding of the AES algorithm, in particular, should be very fast on the CPUs that support those instructions. We’ve also included results for another algorithm, Twofish, that isn’t accelerated via dedicated instructions.

7-Zip file compression and decompression

SunSpider JavaScript performance

Now that we’ve moved from games to productivity applications, Trinity has an opportunity to win a few victories over its Ivy Bridge-based rivals. The biggest win comes in the TrueCrypt AES test, where the processors with support for AES-NI fare much better than those without. Although Ivy Bridge can support AES-NI, that capability is fused off in the Core i3 and the Pentium.

In many cases, once again, the Trinity chips aren’t any faster than the A8-3850, depressingly enough. However, the A10-5800K cranks through SunSpider much sooner than the A8-3850 and, if you’re looking for evidence of Piledriver improvements, it’s quicker than the FX-4170, as well. This may or may not be an IPC improvement. It’s possible the A10 is just spending more time resident at its peak Turbo speed, thanks to Piledriver’s power improvements.

Image processing

The Panorama Factory photo stitching

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs.

In the past, we’ve added up the time taken by all of the different elements of the panorama creation wizard and reported that number, along with detailed results for each operation. However, doing so is incredibly data-input-intensive, and the process tends to be dominated by a single, long operation: the stitch. Thus, we’ve simply decided to report the stitch time, which saves us a lot of work and still gets at the heart of the matter.

picCOLOR image processing and analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including SSE extensions, multiple cores, and Hyper-Threading. Many of its individual functions are multithreaded.

At our request, Dr. Müller graciously agreed to re-tool his picCOLOR benchmark to incorporate some real-world usage scenarios. As a result, we now have four tests that employ picCOLOR for image analysis: particle image velocimetry, real-time object tracking, a bar-code search, and label recognition and rotation. For the sake of brevity, we’ve included a single overall score for those real-world tests.

Video encoding

x264 HD benchmark

This benchmark tests one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark.

Windows Live Movie Maker 14 video encoding

For this test, we used Windows Live Movie Maker to transcode a 30-minute TV show, recorded in 720p .wtv format on my Windows 7 Media Center system, into a 320×240 WMV video appropriate for mobile devices.

The Core i3 and A10 split our image processing and video encoding tests pretty evenly. The A8-5600K fares better against the Pentium G2120, capturing the lead in everything but picCOLOR.

3D rendering

Cinebench rendering

The Cinebench benchmark is based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.

POV-Ray rendering

These rendering applications aren’t OpenCL-accelerated like our LuxMark test, but they do give the FPU a good workout. Overall, the Core i3 and A10 are pretty evenly matched. The Pentium G2120 can’t really hang with the A8-5600K, though.

Scientific computing

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database. MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
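
The job-queue scheme described above can be sketched in Python. This is only an illustration of the partitioning arithmetic and the pull-from-a-queue pattern, not MyriMatch's actual source; `split_into_jobs`, `run_jobs`, and `handle_protein` are hypothetical names, and protein sequences are represented by plain integer indices.

```python
import math
import queue
import threading


def split_into_jobs(num_proteins, num_threads, jobs_per_thread=10):
    """Split the database into roughly threads x 10 equally sized jobs."""
    num_jobs = num_threads * jobs_per_thread
    job_size = math.ceil(num_proteins / num_jobs)
    return [range(start, min(start + job_size, num_proteins))
            for start in range(0, num_proteins, job_size)]


def run_jobs(jobs, num_threads, handle_protein):
    """Each thread pulls the next job from a shared queue until none remain."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, so this thread is done
            for protein_index in job:
                handle_protein(protein_index)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    jobs = split_into_jobs(6714, 4)
    print(len(jobs), "jobs; first job holds", len(jobs[0]), "proteins")
    handled = []
    run_jobs(jobs, 4, handled.append)
    print("proteins handled:", len(handled))
```

With 6714 proteins and four threads, the split yields 40 jobs of up to 168 sequences each, matching the figures in the description. Handing out many small jobs rather than one big slice per thread lets a thread that finishes early pick up more work instead of sitting idle.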

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark test case is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45°. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but they’re oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with optimal thread counts for each processor.

Yeah, you’re not likely to use any of these low-end processors to do this sort of work, unless you’re a poor grad student or something. (Keep emailing me with your tests, guys.) Let’s see if we can wrap this monster up.