I have to tell you, sometimes, being a critical reviewer in the realm of technology is not an easy task. The problem comes down to the sheer rate of improvement among the products we review. If we were Car and Driver, it would look something like this. One year, we’d be reviewing a car that could accelerate from zero to 60 in eight seconds. A year later, we’d be testing a car in the same price range with a six-second 0-60 time. Another year after that, the standard would be down to four seconds. The next year? Three. Soon, pressing the accelerator would subject the driver to forces strong enough to be lethal in the right amounts.

Which is, for a car guy, a barrel of fun.

We have that sort of dynamic going on with computer chips, and it’s also quite entertaining, if you’re so inclined. I’ve gone from listening to short programs load in from tape on an Atari 800 to 12-megapixel monitor arrays playing amazing-looking games in full motion. This is not normal in any other walk of life.

Now, don’t get me wrong. I can pick nits with the best of ’em. But some days, I’m still amazed that I don’t have to listen to a series of bleeps and bloops for 30 minutes before I get to play Borderlands. At times like that, this new processor Intel will be officially introducing soon is almost incomprehensible. The Core i7-980X builds on the foundation established by the first Core i7 processors back in late 2008, but it raises the core count from four to six and adds a bundle of performance in the process. Given this thing’s performance and other qualities, I’m having a difficult time finding reasons to complain. Keep reading, and you’ll see what I mean.

Gulftown chips on a wafer. Source: Intel.

Introducing Gulftown

If you’ve been following Intel CPUs lately, you’re probably well-versed in code names. Knowing them is helpful because the complexity of Intel’s product portfolio is surpassed only by that of its naming scheme. Consequently, we’ve started referring to Clarkdale, Lynnfield, and Bloomfield rather than attempting to enumerate all possible products based on those bits of silicon. The Core i7-980X adds a new code name to that constellation: Gulftown.

Like the dual-core Clarkdale Core i3/i5 processors introduced earlier this year, Gulftown is a part of the Westmere family of 32-nm chips. This six-core processor is primarily known, in its server/workstation guise, as Westmere-EP; Gulftown is the code name for the desktop variants of the chip. Gulftown is intended to be a drop-in replacement for the existing members of the Core i7-900 series, all of which are based on the quad-core chip code-named Bloomfield.

If your head hasn’t exploded yet from code-name overload, I congratulate you. The main things you need to know about Gulftown are reproduced in the table below, which should act as something of a code-name decoder.

| Code name | Key products | Cores | Threads | Last-level cache size | Process node (nm) | Estimated transistors (millions) | Die area (mm²) |
|---|---|---|---|---|---|---|---|
| Penryn | Core 2 Duo | 2 | 2 | 6 MB | 45 | 410 | 107 |
| Bloomfield | Core i7 | 4 | 8 | 8 MB | 45 | 731 | 263 |
| Lynnfield | Core i5, i7 | 4 | 8 | 8 MB | 45 | 774 | 296 |
| Westmere | Core i3, i5 | 2 | 4 | 4 MB | 32 | 383 | 81 |
| Gulftown | Core i7-980X | 6 | 12 | 12 MB | 32 | 1170 | 248 |
| Deneb | Phenom II | 4 | 4 | 6 MB | 45 | 758 | 258 |
| Propus/Rana | Athlon II X4/X3 | 4 | 4 | 512 KB x 4 | 45 | 300 | 169 |
| Regor | Athlon II X2 | 2 | 2 | 1 MB x 2 | 45 | 234 | 118 |

Compared to Bloomfield, Gulftown has 50% more cores and cache, yet it fits into the same basic power envelope at the same clock speed. Gulftown packs substantially more transistors into a smaller die area than Bloomfield, too. All of this magic comes courtesy of Intel’s new 32-nm chip fabrication process, which combines second-generation high-k + metal gate transistors with first-generation immersion lithography.
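To put a rough number on that density gain, here is a small Python sketch that divides the estimated transistor counts by the die areas listed in the table above. The dictionary layout and the function name are only illustrative choices, and the inputs are the table's estimates rather than exact figures.

```python
# Estimated transistor counts (millions) and die areas (mm^2) from the table above.
CHIPS = {
    "Bloomfield": {"transistors_m": 731, "die_mm2": 263},
    "Gulftown": {"transistors_m": 1170, "die_mm2": 248},
}


def density(chip):
    """Return millions of transistors per square millimeter."""
    return chip["transistors_m"] / chip["die_mm2"]


bloomfield = density(CHIPS["Bloomfield"])
gulftown = density(CHIPS["Gulftown"])

print(f"Bloomfield: {bloomfield:.2f} M transistors/mm^2")
print(f"Gulftown:   {gulftown:.2f} M transistors/mm^2")
print(f"Density ratio: {gulftown / bloomfield:.2f}x")
```

By this estimate, Gulftown fits roughly 1.7 times as many transistors into each square millimeter as Bloomfield does.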

The image above shows Gulftown’s layout nicely. As a drop-in replacement for Bloomfield, Gulftown has no integrated PCI Express connectivity (a la Lynnfield) and no integrated graphics (a la Clarkdale). Instead, it relies on a QuickPath Interconnect to link it to the X58 chipset.

Interestingly, Intel’s architects call the uncore area running up the center of the chip “the tube.” (Well, I thought it was interesting, anyway.) Your eye may also be drawn to the top left corner of the chip, where there’s a pretty big area with not much going on. In a briefing, Dave Hill, Westmere’s lead architect, acknowledged this “white space” and noted only that he wasn’t going to talk about the reasons for it. Presumably, Intel would want to minimize wasted space on a design like this one, so I’m intrigued. It almost looks to me like one could eliminate the apparent white space on both sides of the memory controller, and the I/O, uncore, and memory controller would wrap pretty snugly around four cores and their associated L3 cache. As far as we know, though, Intel has no plans to release a native quad-core derivative of Westmere. Instead, the firm will press ahead with a quad-core version of Sandy Bridge, the upcoming architectural refresh slated for the 32-nm process.

Speaking of which, the chips in the Westmere family are a “tick” in Intel’s vaunted tick-tock cadence. They’re a refinement of the quad-core Nehalem architecture introduced at 45 nanometers, with a relatively conservative set of enhancements outside of the obvious changes in core counts and cache sizes. Sandy Bridge will be a “tock” with more radical architectural remodeling. Still, the same Oregon-based team that created Nehalem also did Westmere, so the ins and outs of the processor were already familiar to them. They couldn’t resist making a few tweaks along the way. Most notable among them is the addition of seven new instructions tailored to accelerate the most common data encryption algorithms.

Another improvement, carried over from the Lynnfield Core i5/i7 chips, is the addition of a gate that can cut off power to most elements of the “uncore” when the chip is idling in its lowest sleep states, substantially reducing power consumption and even leakage power. This provision extends the power gate concept first implemented in Nehalem processors. Gulftown has seven power gates, one for each core and one for the uncore. Not all elements of the uncore are affected by the power gate. Notably, the chip’s built-in power management processor isn’t shut off, for obvious reasons. Meanwhile, the memory controller, QuickPath Interconnect, and L3 cache have their voltage reduced to “retention levels.” The chip’s architects say there’s no substantial increase in the time required for the CPU to wake up from its deeper sleep states.

Other Westmere changes are perhaps even more esoteric. The APIC timer now remains running all of the time, even during sleep. Large pages, up to 1GB in size, are now supported, and some improvements have been made for the sake of virtualization performance. Despite the presence of more and larger caches, the data pre-fetch algorithms for the caches remain the same.

One other modification in Gulftown will please folks trying to achieve higher memory clocks. With Bloomfield, the maximum memory speed is half the uncore frequency. As a result, Bloomfield’s uncore must run at 4GHz in order to accommodate 2GHz DIMMs. Like Lynnfield, Gulftown’s uncore only needs to run at 1.5X the max memory speed, so 2GHz memory frequencies are possible with the uncore at 3GHz.
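The arithmetic is simple enough to sketch in a few lines of Python. The ratios below come straight from the paragraph above; the table name and helper function are hypothetical, introduced only for illustration.

```python
# Minimum uncore-to-memory clock ratios described in the text.
UNCORE_RATIO = {
    "Bloomfield": 2.0,  # memory speed can be at most half the uncore clock
    "Lynnfield": 1.5,
    "Gulftown": 1.5,
}


def min_uncore_mhz(chip, memory_mhz):
    """Lowest uncore clock (MHz) that permits the given memory speed."""
    return UNCORE_RATIO[chip] * memory_mhz


for chip in ("Bloomfield", "Gulftown"):
    needed = min_uncore_mhz(chip, 2000)
    print(f"{chip}: 2000MHz memory needs a {needed:.0f}MHz uncore")
```

The same helper shows why a more modest 1600MHz memory speed would call for a 3.2GHz uncore on Bloomfield but only 2.4GHz on Gulftown.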

The Core i7-980X Extreme gets a fancy cooler

Gulftown processors will drop into an LGA1366-style socket, like those used on all X58 motherboards, and should generally be compatible with current boards with the help of a BIOS update. Intel’s own DX58SO “Smackover” board can handle a Core i7-980X after a quick BIOS flash, as can the Gigabyte X58A-UD5 in our test system. As is often the case, though, the move to a smaller fab process has prompted some voltage changes, so you’ll want to check with your motherboard maker to verify compatibility. Like Bloomfield, the i7-980X supports three channels of DDR3 memory at up to 1066MHz. Oddly, Intel has withheld its official endorsement of higher memory frequencies, although the chip’s memory controller will easily run at higher speeds.

| Model | Cores | Threads | Base core clock speed | Peak Turbo clock speed | L3 cache size | Memory channels | TDP | Price |
|---|---|---|---|---|---|---|---|---|
| Core i5-750 | 4 | 4 | 2.66 GHz | 3.20 GHz | 8 MB | 2 | 95W | $196 |
| Core i7-860 | 4 | 8 | 2.80 GHz | 3.46 GHz | 8 MB | 2 | 95W | $284 |
| Core i7-870 | 4 | 8 | 2.93 GHz | 3.60 GHz | 8 MB | 2 | 95W | $562 |
| Core i7-920 | 4 | 8 | 2.66 GHz | 2.93 GHz | 8 MB | 3 | 130W | $284 |
| Core i7-930 | 4 | 8 | 2.80 GHz | 3.06 GHz | 8 MB | 3 | 130W | $294 |
| Core i7-960 | 4 | 8 | 3.20 GHz | 3.46 GHz | 8 MB | 3 | 130W | $562 |
| Core i7-975 Extreme | 4 | 8 | 3.33 GHz | 3.60 GHz | 8 MB | 3 | 130W | $999 |
| Core i7-980X Extreme | 6 | 12 | 3.33 GHz | 3.60 GHz | 12 MB | 3 | 130W | $999 |

The table above shows Intel’s current Core i7 lineup. The Core i7-980X is the first, and so far only, Gulftown-based product to come to market. As an Extreme edition, the 980X has an unlocked multiplier to facilitate overclocking. If you’re willing to cough up a grand for its best processor, Intel won’t stand in the way of you having a little fun with it. As you can see, the 980X essentially supplants the Core i7-975 Extreme at the same price and frequency, with more cores and cache.

That’s about it for the Core i7-980X’s competition. We have included the fastest desktop processor from AMD, the Phenom II X4 965, in our testing, of course, but it lists for only $185 and simply can’t match the performance of the fastest Intel CPUs. AMD does have a six-core version of its Opteron processor that fared pretty well in our last round of server/workstation CPU tests, but the firm has so far elected not to bring it to the desktop.

Pictured above is the Core i7-980X (trust me, it’s under there) installed in our Gigabyte X58A-UD5 mobo, along with Intel’s nifty stock cooler for this CPU. That’s 12GB of Corsair Dominator DIMMs in the picture, by the way (a new arrival in Damage Labs), although we tested with just three DIMMs and 6GB for the sake of continuity with our existing results.

The new stock cooler will come with retail boxed versions of the Core i7-980X, and thank goodness, it has a screw-based installation mechanism with a retention bracket that goes on the underside of the motherboard. Intel claims the retention mech has been tested with shock forces up to 50 Gs, which should prevent it from breaking off and bouncing around inside the case of a pre-built PC, like the tab-based Intel cooler that I installed in my brother-in-law’s PC did, killing a GeForce GTX 260 in the process.

The cooler has both Quiet and Performance modes, which can be set with a switch on the heatsink. We found it to be fairly hushed in quiet mode and pretty darned effective in performance mode, as you’ll soon see.

And now, we have an incredibly large set of CPU test results to navigate, comparing the Core i7-980X to everything from a Core i7-870 to a five-year-old Pentium 4. I’m going to keep the commentary to a minimum since we’re still fresh off of our last massive CPU roundup, and the only big change here is the addition of the i7-980X. Let’s get started.

Test notes

We’ve underclocked the Core i5-661 to 2.8GHz in order to simulate the Core i3-540. Although we did change the core clock to the proper speed, the processor’s uncore clock remained at the i5-661’s stock frequency. We believe shipping Core i3-540 processors have a 2.13GHz uncore clock, while the i5-661 has a 2.4GHz uncore clock, so our simulated processor may perform slightly better than the real item due to a higher L3 cache speed. The differences are likely to be very minor, based on our experience with Lynnfield parts (the L3 cache is incredibly fast, regardless), but we thought you should know about that possibility.

Additionally, our Core i7-960 is an underclocked Core i7-975 Extreme, but in that case, we’re fairly certain all of the clocks match what they should, since Bloomfield gives us a little more control over such things. In order to run the Core i7-960’s memory at 1333MHz, we raised its uncore clock to 2.66GHz. That comes with the territory, and I expect many Core i7-960 owners have done the same.

As is our custom, we’ve omitted the simulated processor speed grades from our power consumption testing.

After consulting with our readers, we’ve decided to enable Windows’ “Balanced” power profile for the bulk of our desktop processor tests, which means power-saving features like SpeedStep and Cool’n’Quiet are operating. (In the past, we only enabled these features for power consumption testing.) Our spot checks demonstrated to us that, typically, there’s no performance penalty for enabling these features on today’s CPUs. If there is a real-world penalty to enabling these features, well, we think that’s worthy of inclusion in our measurements, since the vast majority of desktop processors these days will spend their lives with these features enabled. We did disable these power management features to measure cache latencies, but otherwise, it was unnecessary to do so.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and we reported the median of the scores produced.
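For readers who want to reproduce that reporting step, a minimal Python sketch follows. The three scores are made-up placeholders, not measurements from this review.

```python
from statistics import median

# Hypothetical scores from three runs of one benchmark; not measured data.
runs = [41.8, 42.3, 39.9]

# The median ignores how far the slowest or fastest run strays from the pack.
reported = median(runs)
print(f"Reported score: {reported}")
```

Reporting the median rather than the mean keeps a single hiccup, such as a background task kicking in during one run, from dragging the reported figure around.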

Our test systems were configured like so:

| Processor | Athlon II X2 255 3.1GHz; Athlon II X3 440 3.0GHz; Athlon II X4 630 2.8GHz; Athlon II X4 635 2.9GHz; Phenom II X2 550 3.1GHz; Phenom II X4 910e 2.6GHz; Phenom II X4 965 3.4GHz | Pentium E6500 2.93GHz; Core 2 Duo E7600 3.06GHz; Core 2 Quad Q6600 2.4GHz | Pentium 4 670 3.8GHz | Core 2 Duo E8600 3.33GHz; Core 2 Quad Q9400 2.66GHz |
|---|---|---|---|---|
| Motherboard | Gigabyte MA785G-UD2H | Asus P5G43T-M Pro | Asus P5G43T-M Pro | Asus P5G43T-M Pro |
| North bridge | 785GX | G43 MCH | G43 MCH | G43 MCH |
| South bridge | SB750 | ICH10R | ICH10R | ICH10R |
| Memory size | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) |
| Memory type | Corsair CM3X2G1600C9DHXNV DDR3 SDRAM | Corsair CM3X2G1800C8D DDR3 SDRAM | Corsair CM3X2G1800C8D DDR3 SDRAM | Corsair CM3X2G1800C8D DDR3 SDRAM |
| Memory speed | 1333 MHz | 1066 MHz | 800 MHz | 1333 MHz |
| Memory timings | 8-8-8-20 2T | 7-7-7-20 2T | 7-7-7-20 2T | 8-8-8-20 2T |
| Chipset drivers | – | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 |
| Audio | Integrated SB750/ALC889A with Realtek 6.0.1.5995 drivers | Integrated ICH10R/ALC887 with Realtek 6.0.1.5995 drivers | Integrated ICH10R/ALC887 with Realtek 6.0.1.5995 drivers | Integrated ICH10R/ALC887 with Realtek 6.0.1.5995 drivers |

| Processor | Core i5-750 2.66GHz; Core i7-870 2.93GHz | Core i3-530 2.93GHz; Core i3-540 3.06GHz; Core i5-661 3.33GHz | Core i7-920 2.66GHz | Core i7-960 3.2GHz; Core i7-975 Extreme 3.33GHz; Core i7-980X Extreme 3.33GHz |
|---|---|---|---|---|
| Motherboard | Gigabyte P55A-UD6 | Asus P7H57D-V EVO | Gigabyte EX58-UD3R | Gigabyte X58A-UD5R |
| North bridge | P55 PCH | H57 PCH | X58 IOH | X58 IOH |
| South bridge | (part of PCH) | (part of PCH) | ICH10R | ICH10R |
| Memory size | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 6GB (3 DIMMs) | 6GB (3 DIMMs) |
| Memory type | Corsair CM3X2G1600C8D DDR3 SDRAM | Corsair CMD4GX3M2A1600C8 DDR3 SDRAM | OCZ OCZ3B2133LV2G DDR3 SDRAM | Corsair TR3X6G1600C8D DDR3 SDRAM |
| Memory speed | 1333 MHz | 1333 MHz | 1066 MHz | 1333 MHz |
| Memory timings | 8-8-8-20 2T | 8-8-8-20 2T | 7-7-7-20 2T | 8-8-8-20 2T |
| Chipset drivers | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 | INF update 9.1.1.1020, Rapid Storage Technology 9.5.0.1037 |
| Audio | Integrated P55 PCH/ALC889 with Realtek 6.0.1.5995 drivers | Integrated H57 PCH/ALC889 with Realtek 6.0.1.5995 drivers | Integrated ICH10R/ALC888 with Realtek 6.0.1.5995 drivers | Integrated ICH10R/ALC889 with Realtek 6.0.1.5995 drivers |

They all shared the following common elements:

| Component | Details |
|---|---|
| Hard drive | WD RE3 WD1002FBYS 1TB SATA |
| Discrete graphics | Asus ENGTX260 TOP SP216 (GeForce GTX 260) with ForceWare 195.62 drivers |
| OS | Windows 7 Ultimate x64 Edition RTM |
| OS updates | DirectX August 2009 update |
| Power supply | PC Power & Cooling Silencer 610 Watt |

I’d like to thank Asus, Corsair, Gigabyte, OCZ, and WD for helping to outfit our test rigs with some of the finest hardware available. Thanks to Intel and AMD for providing the processors, as well, of course.

The test systems’ Windows desktops were set at 1600×1200 in 32-bit color. Vertical refresh sync (vsync) was disabled in the graphics driver control panel.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Power consumption and efficiency

We have reams of test results to wade through, but we’ll begin with our power consumption tests, since they’re especially relevant to a new 32-nm processor like the Core i7-980X.

For these tests, we used an Extech 380803 power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system: the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

We’ll start with the show-your-work stuff, plots of the raw power consumption readings. We’ve broken things down by socket type in order to keep them manageable. Please note that, because our Asus H57 motherboard tends to draw more power than we’d like, we’ve tested power consumption for the Core i3-530 and the Core i5-661 on our P55 mobo, instead.

We can slice up these raw data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, when the processors were rendering.

The Core i7-980X’s power draw, both at max and idle, mirrors that of the Core i7-975 quite closely. Heck, it’s a few watts lower at peak, despite the addition of two more cores and extra cache.

We can highlight power efficiency by looking at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules. (In this case, to keep things manageable, we’re using kilojoules.)

The X58 platform’s relatively high power use at idle keeps the i7-980X from performing well by this measure.

We can pinpoint efficiency more effectively by considering the amount of energy used for the task. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
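The four measures described in this section (idle power, peak power, total energy, and energy for the task alone) can all be derived from one trace of power readings. Here is a minimal Python sketch of that bookkeeping. It assumes one reading per second and a known render start and end time; the trace itself is invented for illustration, and the function names are hypothetical rather than anything from our actual tooling.

```python
def average_power(samples, start_s, end_s, interval_s=1.0):
    """Mean power (watts) over the readings from start_s up to end_s."""
    first = int(start_s / interval_s)
    last = int(end_s / interval_s)
    window = samples[first:last]
    return sum(window) / len(window)


def energy_kj(samples, start_s=0.0, end_s=None, interval_s=1.0):
    """Energy (kilojoules): sum of power readings times the sample interval."""
    first = int(start_s / interval_s)
    last = len(samples) if end_s is None else int(end_s / interval_s)
    joules = sum(samples[first:last]) * interval_s  # watt-seconds
    return joules / 1000.0


# Hypothetical trace: 10 s idle at 100 W, a 30 s render at 200 W, 20 s idle at 100 W.
trace = [100.0] * 10 + [200.0] * 30 + [100.0] * 20

peak = average_power(trace, 15, 25)   # ten-second span during the render
idle = average_power(trace, 50, 60)   # trailing edge of the test period
total = energy_kj(trace)              # whole test period
task = energy_kj(trace, 10, 40)       # render period only

print(f"peak {peak} W, idle {idle} W, total {total} kJ, task {task} kJ")
```

In this made-up trace, a system that finished the render in half the time at the same power would halve its task energy, which is why the last measure rewards performance as well as low power draw.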

In our most direct measurement of power efficiency, the Core i7-980X takes top honors. With six cores and, thanks to Hyper-Threading, 12 hardware threads, the 980X makes short work of Cinebench’s test render. By finishing so quickly, the 980X-based system requires the least energy to render this scene. Holding the line on clock speeds and raising the core count is a very effective strategy for attaining energy-efficient performance in multi-threaded applications, and Intel has followed that template almost perfectly with Gulftown.

Memory subsystem performance

Now that we’ve considered power efficiency, we’ll move on to our performance results, beginning with some synthetic tests of the CPUs’ memory subsystems. These results don’t track directly with real-world performance, but they do give us some insights into the CPU and system architectures involved. For this first test, the graph is pretty crowded. I’ve tried to be selective, generally only choosing one representative from each architecture. This test is multithreaded, so more cores (with associated L1 and L2 caches) can lead to higher throughput.

With six L1 data caches, six L2 caches, and a massive 12MB L3 cache, the Core i7-980X is the fastest solution at nearly every data point.

This graph becomes almost impossible to read once we get to the larger block sizes, where we’re really measuring main memory bandwidth. Stream is a better test of that particular attribute.

Gulftown essentially matches Bloomfield here, with near-identical bandwidth scores.

The 980X’s very low memory access latencies are even more impressive given the fact that its L3 cache is 50% bigger than Bloomfield’s. (Larger caches typically have longer latencies.) Intel informs us that Gulftown’s L3 cache runs at the same speed as Bloomfield’s, so there’s no improvement due to higher frequencies. I do think, however, that we may have to adjust our sample to the 32MB block size soon. Latencies for Gulftown at the 16MB size may be getting partially cushioned by the 12MB L3 cache. At the 32MB sample size, latencies for the i7-975 and i7-980X are almost identical and work out to about 51 ns.

For what it’s worth, this benchmark reports that the latency for the Core i7-975’s L3 cache is 36 cycles (of the CPU core), while the i7-980X’s is 43 cycles.
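Those cycle counts can be turned into rough wall-clock times. The Python sketch below assumes both cores were running at their 3.33GHz base clock during the measurement, which is an assumption on my part; if Turbo Boost raised the clock, the real figures would be a little lower.

```python
def cycles_to_ns(cycles, core_ghz):
    """Convert a latency in core clock cycles to nanoseconds."""
    return cycles / core_ghz


# Assumed clock: the 3.33GHz base speed both chips share in the lineup table.
for name, cycles in (("Core i7-975", 36), ("Core i7-980X", 43)):
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles, 3.33):.1f} ns")
```

Under that assumption, the larger L3 cache costs Gulftown roughly two extra nanoseconds per access.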

Borderlands

This is my favorite game in a long, long time, so I had to use it in our latest CPU test suite. Borderlands is based on Unreal Engine technology and includes a built-in speed test, which we used here. We tested with the game set to its highest quality settings at a range of resolutions. The results from the lowest resolutions will highlight the separation between the CPUs best, so I’d pay the most attention to them. The higher resolution results demonstrate what happens when the GeForce GTX 260 graphics card begins to restrict frame rates.

Well, yeah. Borderlands runs quickly enough on a sub-$100 processor like the Athlon II X2 255, so the Core i7-980X shouldn’t find it a challenge. There’s little improvement from the i7-975 to the i7-980X, but this game engine doesn’t use enough threads to take full advantage of Gulftown, not that it needs to.

DiRT 2

This excellent new racer packs a nicely scriptable performance test. We tested at the game’s “high” quality presets with 4X antialiasing.

So continues our object lesson in how most of today’s games don’t really require the fastest CPUs. The 980X does well; so does everything but the Pentium 4.

Modern Warfare 2

With Modern Warfare 2, we used FRAPS to record frame rates over the course of a 60-second gameplay session. We conducted this gameplay session five times on each CPU and have reported the median score from each processor. We’ve also graphed the frame rates from a single, representative session for each. We tested this game at a relatively low 1024×768 resolution, with no AA, but otherwise using the highest in-game visual quality settings.

Look, folks. Those IBM CPUs in the Xbox 360 aren’t gonna set any land speed records. For now, the largest-budget games and biggest hits are likely to have relatively modest processor needs.

Left 4 Dead 2

We tested Left 4 Dead 2 by playing back a custom demo using the game’s timedemo function. Again, we had all of the image quality options cranked, and we tested with 16X anisotropic filtering and 4X antialiasing. The game’s multi-core rendering option was, of course, enabled.

Valve’s Source engine is no challenge to any modern processor, either. The 980X again shows that it’s good for gaming, but so is the Core i3-530.

Source engine particle simulation

Next up is a test we picked up during a visit to Valve Software, the developers of the Half-Life games. They had been working to incorporate support for multi-core processors into their Source game engine, and they cooked up some benchmarks to demonstrate the benefits of multithreading.

This test runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

At last, a more targeted test where Gulftown gets to show us what it can do. If game developers make heavy use of these effects in games (and if they don’t accelerate them via the GPU, which even Valve now seems to be doing), then newer Intel processors with Hyper-Threading should handle them especially well. Older ones with Hyper-Threading, not so much.

Productivity

We have, for quite some time now, used WorldBench in our CPU tests. Over that time, we’ve found that some of WorldBench’s tests can be rather temperamental and may refuse to run periodically. We’ve also found that some of the same tests tend to have inconsistent results that aren’t always influenced much by processor performance. Other applications in WorldBench 6, like the Windows Media Encoder 9 test, make little or no use of multithreading, despite the fact that such applications are typically nicely multithreaded these days. As a result, we’ve decided to limit our use of WorldBench to a selection of its applications, rather than the full suite.

MS Office productivity

Firefox web browsing

Multitasking – Firefox and Windows Media Encoder

Both the Office and Firefox/Windows Media Encoder tests have an element of multitasking built into them, but Gulftown’s extra cores and hardware threads aren’t much help when the applications themselves involve mostly serial operations and (in the case of this older version of Windows Media Encoder) only a few threads. The 980X performs well here, but no better than its predecessors.

File compression and encryption

7-Zip file compression and decompression

Whoa. 7-Zip puts Gulftown’s six cores to use, with stunning results: over 10X the performance of a Pentium 4 and just under twice the performance of the Core i7-920.

WinZip file compression

This old version of WinZip in the WorldBench suite uses maybe one or two threads, and the results are predictable. Again, the 980X comes out looking pretty good, but it’s not really any faster than Ye Olde Core 2 Duo E8600.

TrueCrypt disk encryption

Here’s a new addition at our readers’ request. This full-disk encryption suite includes a performance test, for obvious reasons. We tested with a 50MB buffer size and, because the benchmark spits out a lot of data, averaged and summarized the results in a couple of different ways.

This, folks, is without any help from Gulftown’s new instructions that accelerate encryption. My understanding is that a version of TrueCrypt with support for Westmere’s new instructions is forthcoming, and we’ll try to test it once it’s available. Still, Gulftown is fast enough on its own to encrypt data more than quickly enough for most storage subsystems.

Yeah, this little data dump is for those of you who are really, really interested in a particular encryption routine. Enjoy.

Image processing

The Panorama Factory photo stitching

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs.

In the past, we’ve added up the time taken by all of the different elements of the panorama creation wizard and reported that number, along with detailed results for each operation. However, doing so is incredibly data-input-intensive, and the process tends to be dominated by a single, long operation: the stitch. So this time around, we’ve simply decided to report the stitch time, which saves us a lot of work and still gets at the heart of the matter.

The 980X will stitch together your panorama in half the time it takes a Core 2 Quad Q6600 or an Athlon II X4 635.

picCOLOR image processing and analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including SSE extensions, multiple cores, and Hyper-Threading. Many of its individual functions are multithreaded.

Recently, at our request, Dr. Müller graciously agreed to re-tool his picCOLOR benchmark to incorporate some real-world usage scenarios. As a result, we now have four new tests that employ picCOLOR for image analysis. I’ve included explanations of each test from Dr. Müller below.

Particle Image Velocimetry (PIV) is being used for flow measurement in air and water. The medium (air or water) is seeded with tiny particles (1..5um diameter, smoke or oil fog in air, titanium dioxide in water). The tiny particles will follow the flow more or less exactly, except may be in very strong sonic shocks or extremely strong vortices. Now, two images are taken within a very short time interval, for instance 1us. Illumination is a very thin laser light sheet. Image resolution is 1280×1024 pixels. The particles will have moved a little with the flow in the short time interval and the resulting displacement of each particle gives information on the local flow speed and direction. The calculation is done with cross-correlation in small sub-windows (32×32, or 64×64 pixel) with some overlap. Each sub-window will produce a displacement vector that tells us everything about flow speed and direction. The calculation can easily be done multithreaded and is implemented in picCOLOR with up to 8 threads and more on request.
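To make the sub-window idea concrete, here is a toy Python sketch that finds the displacement between two small windows by brute-force cross-correlation. It is not picCOLOR’s implementation, which we haven’t seen; the window size, particle positions, search range, and function names are all invented for illustration, and production PIV code typically computes the correlation far more efficiently.

```python
def displacement(frame_a, frame_b, max_shift=3):
    """Find the (dx, dy) shift that maximizes the cross-correlation of two windows."""
    height, width = len(frame_a), len(frame_a[0])
    best_score, best_shift = float("-inf"), (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = 0.0
            for y in range(height):
                for x in range(width):
                    ys, xs = y + dy, x + dx
                    # Only pixels that stay inside the window contribute.
                    if 0 <= ys < height and 0 <= xs < width:
                        score += frame_a[y][x] * frame_b[ys][xs]
            if score > best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift


def make_window(size, particles):
    """Build a size x size window with bright pixels at the given (x, y) spots."""
    window = [[0.0] * size for _ in range(size)]
    for x, y in particles:
        window[y][x] = 1.0
    return window


# Hypothetical 16x16 sub-window: three particles that move 2 px right and 1 px down.
first = make_window(16, [(3, 4), (8, 9), (11, 5)])
second = make_window(16, [(5, 5), (10, 10), (13, 6)])

shift = displacement(first, second)
print(shift)
```

Because every sub-window is processed independently of the others, the work divides naturally across threads, which fits the description above.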

To give you some context for these results, picCOLOR’s scores are indexed against a Pentium III 1GHz system; a score of 1.0 represents its performance. The Core i7-980X is 54 times that fast in this test.

Real Time 3D Object Tracking is used for tracking of airplane wing and helicopter blade deflection and deformation in wind tunnel tests. Especially for comparison with numerical simulations, the exact deformation of a wing has to be known. An important application for high speed tracking is the testing of wing flutter, a very dangerous phenomenon. Here, a measurement frequency of 1000Hz and more is required to solve the complex and possibly disastrous motion of an aircraft wing. The function first tracks the objects in 2 images using small recognizable markers on the wing and a stereo camera set-up. Then, a 3D-reconstruction follows in real time using matrix conversions. . . . This test is single threaded, but will be converted to 3 threads in the future.

Multi Barcodes: With this test, several different bar codes are searched on a large image (3200×4400 pixel). These codes are simple 2D codes, EAN13 (=UPC) and 2 of 5. They can be in any rotation and can be extremely fine (down to 1.5 pixel for the thinnest lines). To find the bar codes, the test uses several filters (some of them multithreaded). The bar code edge processing is single threaded, though.

Label Recognition/Rotation is being used as an important pre-processing step for character reading (OCR). For this test in the large bar code image all possible labels are detected and rotated to zero degree text rotation. In a real application, these rotated labels would now be transferred to an OCR-program – there are several good programs available on the market. But all these programs can only accept text in zero degree position. The test uses morphology and different filters (some of them multithreaded) to detect the labels and simple character detection functions to locate the text and to determine the rotational angle of the text. . . . This test uses Rotation in the last important step, which is fully multithreaded with up to 8 threads.

The 980X’s strong performance continues, but the newcomer isn’t able to distinguish itself from its predecessors in operations with fewer threads. I’m not sure what happened in the Multi Barcodes test, where it was even a little slower.

picCOLOR’s synthetic tests measure a number of the program’s individual functions, and the program then computes an average score, again indexed versus a 1GHz Pentium III. The 980X grabs the top spot here by just a bit.

Media encoding and editing

x264 HD benchmark

This benchmark tests one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark.

Remember, kids: pass two is where the magic happens. That’s where the six cores of Gulftown truly get exercised, with happy results: nearly a 50% increase in encoding rate over the Core i7-975.

Windows Live Movie Maker 14 video encoding

For this test, I used Windows Live Movie Maker to transcode a 30-minute TV show, recorded in 720p .wtv format on my Windows 7 Media Center system, into a 320×240 WMV-format video appropriate for mobile devices.

Wow, Microsoft. You really couldn’t see this coming? This is a video encoding app, pretty easily parallelized, and you thought, “Hey, eight threads ought to be enough for anybody.” Really?

LAME MT audio encoding

LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
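To make that pipelining arrangement concrete, here is a minimal Python sketch of the same two-stage structure. It is not LAME MT's code: `analyze`, `encode`, and `pipelined_encode` are made-up stand-ins, and the "frames" are just short lists of numbers. The point is only the shape, with one thread running the analysis ahead and handing its results through a small buffer to a second thread that does everything else.

```python
import queue
import threading

def analyze(frame):
    # Hypothetical stand-in for the psycho-acoustic analysis of one frame.
    return sum(frame) / len(frame)

def encode(frame, analysis):
    # Hypothetical stand-in for the rest of the encoder.
    return [round(sample - analysis) for sample in frame]

def pipelined_encode(frames):
    handoff = queue.Queue(maxsize=1)  # small buffer between the two stages
    encoded = []

    def analysis_stage():
        for frame in frames:
            # Blocks when the buffer is full, so analysis stays only
            # slightly ahead of the encoding stage.
            handoff.put((frame, analyze(frame)))
        handoff.put(None)  # sentinel: no more frames

    def encode_stage():
        while True:
            item = handoff.get()
            if item is None:
                break
            frame, analysis = item
            encoded.append(encode(frame, analysis))

    stages = [threading.Thread(target=analysis_stage),
              threading.Thread(target=encode_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return encoded

print(pipelined_encode([[1, 2, 3], [4, 4, 4]]))
```

However many cores the machine has, there are only two stages here, so only two threads ever have work to do. (In CPython the interpreter lock would limit even that, so treat this as a picture of the structure rather than a speed demo.)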

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here.

LAME MT remains in our test suite after many years as an example of the limits of multithreaded software and, by extension, multi-core processors. Yes, you can encode multiple files at the same time faster on a six-core, 12-thread machine like a Gulftown, but we’re not aware of an encoder that uses more than two threads well while encoding a single audio file. Hence, the i7-980X is no faster than the dual-core Core i5-661.

3D modeling and rendering

Cinebench rendering

The Cinebench benchmark is based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.

Ah, rendering. The embarrassingly parallel task that spawned the GPU. This is happy territory for the Core i7-980X and unhappy territory for any of its competitors. It’s nearly twice the speed of the Phenom II X4 965 here.

By the way, there is a newer version of Cinebench out, release 11.5, that hopefully resolves some problems with performance scaling at higher core and thread counts. That’s not much of a problem for the Core i7-980X here, obviously, but we have seen issues with dual-socket, Nehalem-based systems. Unfortunately, we didn’t have the stomach for re-testing twenty-some processors with Cinebench 11.5 for the sake of this review.

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support.

In the chess2 scene, Gulftown accomplishes in 37 seconds what the Pentium 4 670 does in 10 minutes. POV-Ray’s benchmark, by contrast, depends largely on a long, single-threaded operation, like some sort of strange testament to Amdahl’s Law. That’s why the i7-980X can’t improve performance there as much.
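For anyone who wants to see Amdahl's Law in numbers, here is a small Python sketch. The parallel fractions are illustrative guesses, not measurements of POV-Ray; the formula itself is the standard one, speedup = 1 / (serial + parallel / cores).

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup over one core when only part of the work can be spread out."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Illustrative fractions only; not measured from any benchmark in this review.
for parallel_fraction in (0.50, 0.90, 0.99):
    four = amdahl_speedup(parallel_fraction, 4)
    six = amdahl_speedup(parallel_fraction, 6)
    print(f"parallel {parallel_fraction:.2f}: "
          f"4 cores {four:.2f}x, 6 cores {six:.2f}x, "
          f"gain from two extra cores {six / four:.2f}x")
```

The larger the serial portion, the less the jump from four cores to six buys you, which is the pattern a workload dominated by one long single-threaded step will show.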

3ds max modeling and rendering

The first 3ds max test measures 3D modeling speed, not rendering, which is why it shows no gain with the 980X.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into games like Half-Life 2.

The presence of the Pentium 4 kind of distorts the scale of this bar chart, but the 980X again delivers a nice reduction in rendering time.

Scientific computing

Folding@Home

Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates the number of points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
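The scoring step at the end of that process boils down to very little arithmetic. Here is a hedged Python sketch of it; the function name and the points-per-day figures are invented for illustration, and the third and fourth WU type names are placeholders, since only Tinker and Amber are named above.

```python
from statistics import mean

def estimate_total_ppd(ppd_by_wu_type, core_count):
    """Average the per-WU-type points per day, then scale by the core count."""
    return mean(ppd_by_wu_type.values()) * core_count

# Placeholder numbers, not results from the benchmark CD.
sample_ppd = {
    "Tinker": 100.0,
    "Amber": 120.0,
    "wu_type_3": 300.0,  # hypothetical label
    "wu_type_4": 280.0,  # hypothetical label
}

print(estimate_total_ppd(sample_ppd, core_count=2))
```

Note that every WU type counts equally in the average, regardless of how often a real Folding client would actually be handed each type.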

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested.

We have, in the past, included results for multiple WU types, but given the fact that per-core performance results are distorted when Hyper-Threading allows multiple threads to be run simultaneously, we’ve decided simply to report the overall score this time.

A nice result, but I should note that you can probably expect to accumulate many more points per day if you use the SMP client for Folding. I’m hoping notfred will succumb and change the benchmark to use the SMP client soon. If not, we may have to retire this test, since the SMP client seems to be what everyone is using these days.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database. MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
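The job-queue scheme David describes can be sketched in a few lines of Python. To be clear, this is not MyriMatch's source: `split_into_jobs`, `process_database`, and `handle_protein` are hypothetical names, and the "database" is just a list. The sketch only shows the partitioning (threads times 10 jobs) and workers pulling the next job from a shared queue whenever they finish one.

```python
import math
import queue
import threading

def split_into_jobs(proteins, thread_count, jobs_per_thread=10):
    """Cut the in-memory database into about thread_count * jobs_per_thread jobs."""
    job_count = thread_count * jobs_per_thread
    job_size = max(1, math.ceil(len(proteins) / job_count))
    return [proteins[i:i + job_size] for i in range(0, len(proteins), job_size)]

def process_database(proteins, thread_count, handle_protein):
    """Run handle_protein on every protein, with workers sharing one job queue."""
    jobs = queue.Queue()
    for job in split_into_jobs(proteins, thread_count):
        jobs.put(job)

    def worker():
        while True:
            try:
                job = jobs.get_nowait()  # take the next job once this thread is free
            except queue.Empty:
                return                   # queue drained, so this worker is done
            for protein in job:
                handle_protein(protein)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

# The example from the text: 6714 proteins, four threads.
example_jobs = split_into_jobs(list(range(6714)), thread_count=4)
print(len(example_jobs), len(example_jobs[0]))
```

With 6714 proteins and four threads, that works out to 40 jobs of up to 168 sequences each, matching the figures quoted above. Because threads only touch the queue once per job rather than once per protein, synchronization stays cheap, and the small job size keeps any one thread from being stuck with a long tail of work while the others sit idle.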

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The drop from 59 seconds with the Core i7-975 to 41 seconds with the 980X is pretty darned good for a benchmark that purports to be largely bound by memory bandwidth. Gulftown is efficient enough with its memory accesses, perhaps in part due to its larger cache, to extract more performance from its additional cores.
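A quick back-of-the-envelope check shows how close that is to ideal scaling. This Python sketch uses the two times quoted above and assumes the two chips are otherwise comparable per core, so that going from four cores to six would ideally be worth 1.5x; the function name is my own.

```python
def scaling_efficiency(old_seconds, new_seconds, old_cores, new_cores):
    """Return (measured speedup, fraction of the ideal core-count speedup achieved)."""
    speedup = old_seconds / new_seconds
    ideal = new_cores / old_cores
    return speedup, speedup / ideal

speedup, efficiency = scaling_efficiency(59, 41, old_cores=4, new_cores=6)
print(f"speedup {speedup:.2f}x, {efficiency:.0%} of ideal")
```

That comes to roughly a 1.44x speedup, or about 96% of what two extra cores could deliver in a perfect world.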

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

Hmm, what was I saying above about memory bandwidth, efficiency, and caches? The same must apply here, where the 980X’s performance again scales up awfully well. AMD’s fastest Phenom II achieves well under half the computational rate of the i7-980X.

Overclocking

After our Clarkdale overclocking exploits yielded very healthy overclocks (speeds of 4.4 and 4.5GHz for the two chips we tried), I had high expectations for their Gulftown cousin. Since the 980X has an unlocked multiplier, I simply turned up the multiplier and CPU core voltage in order to overclock it.

At its stock 1.25V, our Gulftown didn’t take well to higher frequencies; a humble 3.6GHz was all it would do. Fortunately, taking the voltage up to 1.41V did the trick, and our 980X was stable with a 31X multiplier, which should yield 4.13GHz. In fact, I left Turbo Boost enabled during my overclocking attempts, and once the system had booted into Windows, the 980X simply ran at 4.26GHz pretty much all of the time, with one thread or 12, even during our Prime95 torture test. That’s a bit better than the 4GHz we coaxed out of our Core i7-975 a while back.
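The clock math behind those figures is just multiplier times base clock. Here is a tiny Python sketch, assuming a nominal base clock of about 133.33MHz (that value is my assumption, not something stated above) and assuming Turbo Boost was adding one extra multiplier step on top of the 31X setting.

```python
BASE_CLOCK_MHZ = 133.33  # assumed nominal base clock

def core_clock_ghz(multiplier, base_clock_mhz=BASE_CLOCK_MHZ):
    """Core frequency in GHz for a given multiplier."""
    return multiplier * base_clock_mhz / 1000.0

print(f"31x: {core_clock_ghz(31):.3f} GHz")
print(f"32x: {core_clock_ghz(32):.3f} GHz")  # one assumed Turbo step above 31x
```

Under those assumptions, 31X lands at about 4.13GHz and one step higher lands within a few megahertz of the 4.26GHz the chip actually ran at.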

Here’s how the 980X performs at that speed. (Note that I’ve also included a few other overclocked CPUs from here. The “H57” notations are explained there.)

You’re really not going to extract much more out of DiRT 2 with a faster CPU, I’m afraid, but Cinebench is clearly another story entirely. Good grief.

What about power consumption at this speed and voltage?

Now you can see why Intel chose to hold the line on clock frequencies for Gulftown. There’s room in the 32-nm process, obviously, to reach higher speeds, but you’ll need to increase the voltage to get there. Higher voltage means disproportionately higher power draw, taking the Core i7-980X well outside of the established power and heat boundaries.
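A common first-order model says a chip's dynamic power scales with voltage squared times frequency. Here is a rough Python sketch applying that rule of thumb to the voltages quoted above. The 3.33GHz stock frequency is my assumption, and the model ignores leakage and everything else in the system, so take the result as a ballpark for the CPU's switching power only, not a prediction of our measured wall-socket numbers.

```python
def dynamic_power_ratio(v_old, f_old_ghz, v_new, f_new_ghz):
    """Relative dynamic power under the rule of thumb P ~ V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new_ghz / f_old_ghz)

# 1.25V and 1.41V come from the text; 3.33GHz stock is an assumption.
ratio = dynamic_power_ratio(v_old=1.25, f_old_ghz=3.33, v_new=1.41, f_new_ghz=4.26)
print(f"estimated dynamic power: {ratio:.2f}x stock")
```

By this estimate, a 28% clock increase costs something like 60% more dynamic power, because the voltage bump gets squared along the way.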

Fortunately, though, Intel’s cooler for the Core i7-980X is up to the task of cooling the CPU when it’s overclocked to this degree. In our torture tests, using the cooler’s Performance mode, CPU temperatures were in the high fifties Celsius and steady. The fan wasn’t exactly silent at that point, but it was a good deal quieter than the worst CPU and GPU coolers I’ve heard.

The value proposition

Now that we’ve buried you under mounds of information, what can we make of it all? One way to filter the information is to consider the value proposition for each CPU model. Exercises like this one are inherently fraught with various, scary dangers (giving the wrong impression, committing bad math, overemphasizing price, coming off as irredeemably cheesy), but our value comparisons have proven to be popular over time, so with the capable assistance of TR System Guide guru Cyril Kowaliski, I’ve taken another crack at it.

What we’ve done is mash up all of our performance data in one, big summary value for each processor. The performance data for each benchmark was converted to a percentage using the Pentium 4 670 as the baseline. We’ve included nearly every benchmark we used in our overall index, with the exception of the purely synthetic tests like Stream. We excluded MyriMatch and Euler3D, since not all processors were tested in those benchmarks. In cases where the benchmarks had multiple components, we used an overall mean rather than including every component score individually. Each benchmark should thus be represented and weighted equally in the final tally. (The one case where we didn’t average together a single application’s output was WorldBench’s two 3ds max tests, since one measures 3D modeling performance and the other rendering.)

This overall performance index makes me a little bit wary, because it’s simply a mash-up of results from various tests, rather than an index carefully weighted to express a certain set of priorities. Still, our test suite itself is intended to cover the general desktop PC’s usage model, so the index ought to suffice for this exercise.
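For readers who like to see the sausage being made, here is a minimal Python sketch of an index built the way described above. It is not our actual spreadsheet: the function names are invented, the arithmetic mean is an assumption, and so is the handling of time-based results (inverted so that a shorter time gives a higher percentage). The only real numbers are the POV-Ray chess2 times quoted earlier; the other benchmark is a placeholder.

```python
from statistics import mean

def relative_score(value, baseline, higher_is_better=True):
    """Result as a percentage of the baseline CPU's result (baseline = 100)."""
    ratio = value / baseline if higher_is_better else baseline / value
    return 100.0 * ratio

def overall_index(results, baseline_results, higher_is_better):
    """Average components within each benchmark, then weight benchmarks equally."""
    per_benchmark = []
    for name, components in results.items():
        scores = [
            relative_score(value, base, higher_is_better[name])
            for value, base in zip(components, baseline_results[name])
        ]
        per_benchmark.append(mean(scores))
    return mean(per_benchmark)

results = {
    "povray_chess2_seconds": [37.0],     # from the text
    "placeholder_fps": [30.0, 12.0],     # hypothetical two-part benchmark
}
baseline = {
    "povray_chess2_seconds": [600.0],    # the Pentium 4 670's 10 minutes
    "placeholder_fps": [5.0, 4.0],       # hypothetical
}
direction = {"povray_chess2_seconds": False, "placeholder_fps": True}

print(f"index: {overall_index(results, baseline, direction):.0f}")
```

Averaging within a benchmark first keeps a test with many sub-scores from outvoting a test that reports a single number, which is the equal weighting we were after.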

We then took prices for each CPU from the official Intel and AMD price lists. Note that AMD’s prices include a small cut since our last CPU roundup. For our historical comparison, we’ve also included the Core 2 Quad Q6600 and the Pentium 4 670 in a couple of places at their initial launch prices.

If we simply take overall performance and divide by price, we get results that look like this:

By this measure, you should almost always buy one of the cheapest CPUs on the market. This bar chart gives us a strong sense of value, but it may focus our attention a little too exclusively on CPU prices alone. For many of us, time is money, and faster computer hardware is relatively inexpensive. What we really want to know is where we can find the best combination of price and performance for our needs. To give us a better visual sense of that, we’ve devised our nefarious scatter plots.

The faster a processor is, the higher on the chart it will be. The cheaper it is, the closer to the left edge. The better values, then, tend to be closer to the top-left corner of the plot. If you wish, you can find your price range and look for the best performer in that area.

For our purposes today, the most noteworthy result is how the Core i7-980X Extreme delivers a major performance boost over the Core i7-975 Extreme at the exact same price, which gives it a much nicer position on the scatter plot, in spite of its relatively high price.

That gets us closer to the heart of the matter, but in reality, the price of a processor is just one component of a PC’s total cost, and the various platforms do have some price disparities between them. To give some context, we’ve selected a series of components for each processor and platform that might go into a fairly high-end PC of the sort a Core i7-980X might inhabit. The specs were largely based on the Double-Stuff config from our recent system guide. Our goal was to achieve rough parity by selecting full-featured ATX motherboards with dual PCIe x16 slots, each with a full 16 lanes of connectivity if possible. Here are the components we picked for the different platforms, along with system prices:

Platform | Total price | Motherboard | Memory
AMD 790FX | $1839.89 | Gigabyte GA-790FXTA-UD5 ($184.99) | 8GB Corsair DDR3-1600 ($219.98)
Intel X48 | $1784.89 | Asus P5E3 Pro ($129.99) | 8GB Corsair DDR3-1600 ($219.98)
Intel P55 | $1839.89 | Gigabyte GA-P55A-UD4P ($184.99) | 8GB Corsair DDR3-1600 ($219.98)
Intel X58 | $2034.89 | Gigabyte GA-X58A-UD5 ($279.99) | 12GB OCZ DDR3-1600 ($319.98)

Common components (all platforms): XFX Radeon HD 5870 1GB graphics card ($399.99), Intel X25-M G2 160GB ($499.00), 2x Western Digital Caviar Green 2TB ($359.98), LG WH08LS20 Blu-ray burner ($179.99), Asus Xonar DX ($89.99), Cooler Master Cosmos 1000 ($179.99), Corsair HX750W ($149.99)

What happens when we factor these rather considerable system prices into our value equation?

Voila! You’re pretty much compelled to buy a Gulftown now, folks. Fire up the credit card and go for broke. The numbers don’t lie! In the context of a high-end system like this one, the additional performance offered by the Core i7-980X is actually worth the price of entry.

The scatter plots tell the same story in nearly as compelling a manner. You’re getting a major increase in overall performance by stepping up to the Core i7-980X, and the added cost isn’t a huge part of the total expenditure.
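If you want to see why folding in system prices flips the picture, here is a small Python sketch. The two CPUs, their prices, and their index scores are made up for illustration; only the platform totals are taken from the component list above, and I am assuming those totals do not already include the processor.

```python
def performance_per_dollar(index, price):
    """Overall performance index divided by price."""
    return index / price

def system_value(index, cpu_price, platform_price):
    """Performance per dollar once the rest of the system is counted."""
    return performance_per_dollar(index, cpu_price + platform_price)

# Platform totals from the component list; CPU figures are hypothetical.
P55_PLATFORM = 1839.89
X58_PLATFORM = 2034.89
budget_cpu = {"index": 100.0, "price": 100.0}
extreme_cpu = {"index": 200.0, "price": 1000.0}

print("CPU only:",
      performance_per_dollar(**budget_cpu),
      performance_per_dollar(**extreme_cpu))
print("Whole system:",
      round(system_value(budget_cpu["index"], budget_cpu["price"], P55_PLATFORM), 4),
      round(system_value(extreme_cpu["index"], extreme_cpu["price"], X58_PLATFORM), 4))
```

On CPU price alone, the cheap chip wins by a mile. Once a couple thousand dollars of other hardware sits in the denominator for both, the faster chip's extra performance outweighs its extra cost, which is the same reversal the bar charts show.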

Of course, the AMD fans in the house (who have a deep and abiding affection for cheap processors) and some other value mavens would have me remind you that less expensive CPUs will look like better values in the context of cheaper systems. See our last roundup or our even lower-cost analysis for some examples. Our focus today is on more expensive CPUs and systems since we’re reviewing the 980X.

Performance per dollar isn’t the whole story these days, though. The power efficiency of a processor increasingly helps determine its value proposition for a host of reasons, from total system costs to noise levels to the size of your electric bill. We measured full system power draw and considered efficiency earlier in this article; now, we can factor in system prices to give us a sense of power-efficient performance per dollar.

The 980X’s combination of a high price and excellent power efficiency puts it in the top third of the pack on our bar charts. Both the bar chart and the scatter plot tell us that the 980X represents a major improvement over the Core i7-975 Extreme at the same price. If you were running a render farm, a Gulftown (or more likely, a Westmere-EP Xeon) system could be a very rational purchase once energy costs were taken into account. Still, some of Intel’s less expensive processors offer pretty good power efficiency for much less money, so they’re at the top of the bar chart.