As you may know, Intel has enjoyed a resurgence in its server and workstation processor business over the past several years, due in no small part to regular and effective refinements to its core CPU technology. The introduction of the “Nehalem” quad-core Xeons last year was the biggest step forward the firm has taken in many years, with a whole new system architecture nicely complementing a revamped processor microarchitecture. The results were major gains in scalability, performance, and power efficiency compared to the prior generation of Xeons, along with renewed strength for Intel’s competitive standing versus its main rival, AMD.

By contrast, this year’s revision of the Xeon is comparatively simple, even modest. The new 32-nm Xeons, code-named Westmere-EP, raise the on-chip core count by two, to a total of six cores per chip, while fitting into the same socket and cooling infrastructure as the Nehalem Xeons before them. The Westmere Xeons’ clock frequencies are largely similar, as is the per-clock performance of each core.

A change like that is easy to grasp, but it’s also easy to underestimate. In the thread-rich realm of server-class applications, with a robust system architecture like this one, adding two more cores can boost performance by nearly 50%. From another angle, that boost could translate into a similarly large increase in energy efficiency, because half again as much work is being accomplished for each watt-hour the system consumes. If talk like that doesn’t float your boat, you’re probably not a system administrator responsible for a room full of servers. I’d wager most folks in such roles would happily accept a 50% gain in power-efficient performance each year, if they could get it.
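To put rough numbers on that claim, here’s a quick back-of-the-envelope sketch in Python. The perfect-scaling and constant-power assumptions are ours, for illustration only, not a measured result:

```python
# Hypothetical: going from 4 to 6 cores per chip, assuming perfect thread
# scaling on a widely threaded workload and unchanged system power draw.
old_cores, new_cores = 4, 6

speedup = new_cores / old_cores        # 1.5x the throughput
work_per_wh_gain = speedup - 1         # 50% more work per watt-hour consumed

print(f"Throughput: {speedup:.2f}x the baseline")
print(f"Extra work per watt-hour: {work_per_wh_gain:.0%}")
```

Real workloads rarely scale perfectly, of course, which is why the review treats “nearly 50%” as a ceiling rather than a promise.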

The question is whether the Westmere-EP Xeons really deliver on their advertised promise. We’ve had a number of systems cranking away in Damage Labs for the last little while in order to find out, and, without giving away the game entirely, the news is even better than you might think. Since our last look at workstation/server-class processors, the state of the art in such systems has changed on multiple fronts, from the growing prevalence of platforms tailored for power efficiency to the proliferation of solid-state disks. Our revised suite of test systems provides a nice overview of the landscape. Read on to see how it all fits together.

Westmere-EP: both less and more

Intel’s 32-nm chip fabrication process is what makes Westmere possible. This relatively new fabrication technology allows substantially more gates (and thus transistors, logic, and ultimately cores) to fit into a given amount of chip area than the 45-nm processes used formerly by Intel and still today by AMD. In this generation of process tech, Intel has carried over its high-k/metal gate transistors, first used at 45 nm, and moved to immersion lithography, in which a liquid medium is used to better focus light, for the first time. By now, Intel is well into ramping its 32-nm production, with the dual-core Clarkdale and six-core Gulftown processors making up a large proportion of its consumer mobile and desktop CPU lineups. In fact, our review of these Xeon processors is rather late; the Westmere-based Xeon 5600 series has been shipping to customers for a number of months now.

Code name      Key products   Cores   Threads   Last-level cache   Process node (nm)   Est. transistors (millions)   Die area (mm²)
Harpertown     Xeon 5400      2 x 2   2 x 2     2 x 6 MB           45                  2 x 410                       2 x 107
Nehalem-EP     Xeon 5500      4       8         8 MB               45                  731                           263
Westmere-EP    Xeon 5600      6       12        12 MB              32                  1170                          248
Shanghai       Opteron 2300   4       4         6 MB               45                  758                           258
Istanbul       Opteron 2400   6       6         6 MB               45                  904                           346
Lisbon         Opteron 4100   6       6         6 MB               45                  904                           346
Magny-Cours    Opteron 6100   2 x 6   2 x 6     2 x 6 MB           45                  2 x 904                       2 x 346

The remarkable thing about the Westmere-EP Xeons, as illustrated in the table above, is that they incorporate two more cores and 50% more cache (L3 size is up from 8MB in Nehalem to 12MB here), yet they are actually smaller chips than their predecessors.

A close-up of a Westmere-EP wafer. Source: Intel.

AMD hasn’t made a process transition lately, and GlobalFoundries currently lags behind Intel by roughly a year, if not more. Thus, Westmere’s competition is a much larger chip, at 346 mm², with the same core count. In fact, the most direct competition for the Westmere Xeons is arguably the Opteron 6000 series, which is based on two of those larger chips packaged together in each socket. The contrasts here are stark enough to incite me to use italics twice in two paragraphs, so we’re not talking small potatoes. Smaller chips, of course, are generally more desirable for a number of reasons, including lower manufacturing costs and typically lower power draw with tamer thermals.

By and large, Westmere-EP is essentially a Nehalem Xeon that’s been ported over to the new 32-nm process, but it has received a host of notable tweaks along the way, not least of which is the aforementioned addition of 50% more cores and cache. Thanks to Intel’s version of simultaneous multithreading, known as Hyper-Threading, a six-core Xeon can track and execute 12 hardware threads. Two Westmere Xeons in a 2P system present an imposing total of 24 threads to the OS.

The other modifications in Westmere-EP are minor but numerous. Some of them boost performance in various ways. A suite of seven new instructions, collectively dubbed AES-NI, can accelerate cryptography. The chip’s integrated memory controller now supports two DIMMs per channel at 1333MHz, raising the limit from 1066MHz in Nehalem. Also, the number of memory buffers has risen from 64 to 88, offering the potential for higher peak bandwidth at a given memory frequency. And, as is almost customary these days, certain latencies have been reduced in the CPU’s virtualization hardware, potentially enhancing performance for consolidated servers.
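The bandwidth side of that memory-controller change is easy to estimate. A minimal sketch of peak theoretical bandwidth per socket; the triple-channel configuration and 64-bit (8-byte) channel width are standard for these platforms, and these are theoretical peaks, not measured throughput:

```python
def peak_bandwidth_gb_s(channels, transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak: channels x transfer rate x 8 bytes per 64-bit transfer."""
    return channels * transfers_per_s * bytes_per_transfer / 1e9

nehalem  = peak_bandwidth_gb_s(3, 1066e6)  # three channels at 1066 MT/s
westmere = peak_bandwidth_gb_s(3, 1333e6)  # three channels at 1333 MT/s
print(f"Nehalem:  {nehalem:.1f} GB/s per socket")
print(f"Westmere: {westmere:.1f} GB/s per socket")
```

That works out to roughly 25.6 versus 32 GB/s per socket, which is the headroom the extra buffering helps the chip actually exploit.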

Another set of changes in this new silicon focuses on advancing power efficiency. The Nehalem Xeons introduced a gate capable of shutting off power to idle cores; Westmere adds a power gate for the “uncore” portion of the chip capable of reducing the voltage to the memory controller, L3 cache, and QuickPath interconnect when both sockets in a 2P system are idle. Another potential heavy hitter for server installations will be the memory controller’s ability to support low-voltage DDR3 memory, which has become available in recent months. The chip’s APIC timer now continues running when the CPU goes into a deep sleep state, too.

A pair of Xeon X5670 processors

From this Westmere-EP silicon, Intel has spun an entire range of new Xeons dubbed the 5600 series. We detailed the various models here when those products were first introduced. The 5600 lineup and its pricing appear to have remained largely static since then.

What’s on the bench

The Xeons we have for review represent the best of the 5600 series on one axis or another, and we’ve tested them in different types of systems as appropriate. The most extreme of the bunch is the Xeon X5680, which has a base clock speed of 3.33GHz and can raise its frequency as high as 3.6GHz via Turbo Boost when the thread count and thermal headroom permit. The X5680’s max power and thermal rating, or thermal design power (TDP), is 130W, which puts it on the high end of the power spectrum. As Intel’s fastest 2P processor, this model commands a hefty price premium, too. A single X5680 will set you back $1663.

Our test platform for this beast is a relatively large, floor-standing workstation enclosure with a SuperMicro X8DA3 motherboard and a 700W power supply. That combination is comfortably up to the task of cooling and powering a system with a pair of 130W processors.

We should note that, although 5600-series Xeons are billed as drop-in replacements for the 5500-series Xeons before them, at Intel’s recommendation, we upgraded the motherboard in this test system rather than using the older version of the X8DA3 used in our Xeon 5500 review. That older X8DA3 was pre-production hardware from the early days of Nehalem, so the change was needed for optimal operation. However, Intel tells us many Xeon 5500-based systems should allow for seamless drop-in upgrades to Westmere Xeons. As is usually the case in these scenarios, you’ll want to check with your motherboard or system vendor for compatibility information.

Asus’ RS700-E6 1U server

The X5680’s 130W TDP will probably rule it out of most server installations. Xeons in the 95W power band are more common, and the X5670 is Intel’s fastest offering at that TDP. The X5670 runs only slightly slower than the X5680, with a 2.93GHz base clock and a 3.33GHz Turbo peak. Stepping down to the X5670 will give you a nice break on max power ratings, but at $1440, it’s not much less expensive.

We’ve tested the X5670 in an Asus 1U server system, pictured above. We also dropped a pair of Xeon X5570 processors into this system (the prior-gen Nehalem offering at the same frequency and TDP) to see how the two generations of Xeons compare.

The low-power Willowbrook server

To many folks, the Xeon L5640 may be the sexiest of these new CPUs. Its six cores run at 2.26GHz and can spool up to 2.8GHz via Turbo Boost, yet this Xeon’s TDP rating is a calm and collected 60W. Naturally, that fact makes the L5640 a fantastic candidate for a power-efficient server. You will pay a premium for this sort of power efficiency, though: the L5640 lists at $996 per chip.

Our test system matches a pair of L5640s with a custom motherboard from Intel officially known as the S5500WB, and unofficially code-named Willowbrook. Although this Willowbrook board is based on the same Tylersburg (excuse me, I mean “Intel 5500 series”) chipset as our other Xeon systems, Intel has specifically optimized this board for reduced power consumption. Those optimizations include a carefully tuned voltage regulator design and more widely spaced components intended to permit airflow and reduce the energy required by cooling fans. The firm claims a 32W savings at idle and a 42W savings under load versus its own S5520UR motherboard.

To that potent mix of power-efficient components, we’ve added six DIMMs of low-power Samsung DDR3 memory. These DIMMs operate at only 1.35V, and Samsung happily touts them as a greener alternative to traditional DDR3 modules.

As you may be gathering by now, this entire platform ought to be quite nicely tailored for low-power operation. To give us a sense of how the enhancements in the Westmere Xeons alone contribute to this system’s efficiency and performance, we’ve tested a couple of quad-core Xeon L5520 processors in this same system. The L5520 has the same 2.26GHz base clock at 60W TDP as the L5640, but its 2.53GHz Turbo max is lower, and its memory speed tops out at 1066MHz.

A competitive imbroglio

As our regular readers will attest, we usually try to test products against their closest competition whenever possible. For the Xeon 5600 series, that competition would most definitely be the latest Opterons from AMD. In order to keep pace with Intel’s formidable performance gains in recent years, AMD has elected to double up on the number of chips it delivers in a single package. The resulting processors, code-named Magny-Cours, were formally announced in late March as the Opteron 6100 series. With 12 cores and four channels of DDR3 memory per socket, these new Opterons promise substantial gains over the six-core Istanbul chips introduced a year ago, even though the basic building block is essentially the same hunk of silicon.

Doubling up on chips per socket can be a savvy strategy in the server market, one that Intel itself validated with its Harpertown Xeons back in 2007. Seeking to upgrade performance by raising clock speeds is a tricky endeavor, because it requires increases in chip voltage that can raise power draw exponentially. By keeping clock speeds low, and thus voltages in check, AMD has made room for multiple chips per socket while staying within its traditional power bands. For widely threaded workloads, this approach could pay solid performance dividends.
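The reasoning about voltage rests on the usual dynamic power relation for CMOS logic, P ∝ C·V²·f. A hedged sketch of the trade-off; the specific voltage and frequency figures are illustrative, not AMD’s actual specs:

```python
def dynamic_power_ratio(v_ratio, f_ratio):
    """Dynamic CMOS power scales with V^2 * f (capacitance held constant)."""
    return (v_ratio ** 2) * f_ratio

# Route 1: chase throughput via clocks. Doubling frequency with, say, a
# 30% voltage bump (illustrative) costs far more than double the power.
double_clock = dynamic_power_ratio(1.30, 2.0)   # ~3.38x power for ~2x speed

# Route 2: two chips at stock voltage and clock. Power and throughput
# both roughly double, keeping the package inside familiar power bands.
two_chips = 2 * dynamic_power_ratio(1.0, 1.0)   # 2.0x power for ~2x throughput

print(f"Clock route: {double_clock:.2f}x power; two-chip route: {two_chips:.1f}x power")
```

The comparison only holds for workloads with enough threads to keep both chips busy, which is exactly the server scenario AMD is targeting.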

Several of the Opteron 6100 models look like good matches for the CPUs we’re testing. The Opteron 6176 SE with 12 cores, a 2.3GHz core clock, and a 105W ACP rating looks like a plausible rival to the Xeons X5680 and X5670 we have on hand. The 6176 SE’s $1386 price tag makes it a close competitor, too. Meanwhile, the Opteron 6164 HE at $744 might well be the closest competition for the Xeon L5640. With 12 cores at 1.7GHz and a 65W ACP, the 6164 HE could make things interesting, at least.

More recently, AMD has announced the Opteron 4100 series, code-named Lisbon during its development. These CPUs use only a single chip but add DDR3 support, like Magny-Cours. The 4100-series Opterons are aimed primarily at compact, high-density server installations, and a bit of mystery surrounds the potential customers for these products. AMD expects the 4100 series to appeal to big web companies that buy large numbers of servers through custom design groups at major OEMs, but it has said not to expect sales figures from that business to become public. Whether such talk foreshadows stealthy success or silent-but-abysmal failure, we do not know.

We do know the 4100 series isn’t positioned directly against the Westmere Xeons, by and large. The fastest Lisbon chip is the Opteron 4184, with six cores at 2.8GHz and a 75W ACP, and it lists for $316. At that price, the 4184 competes against the quad-core Xeon X5500 processors that remain in Intel’s product portfolio.

Unfortunately, we don’t have any of these newer Opterons to test. We have worked with both AMD and Intel for years to make these reviews possible, and both companies have supplied us (and other publications) with samples of their latest products. Initially, this product cycle was no different. We even made a two-day visit to AMD’s server group in Texas to talk about the new Opterons a couple of months back, yet we’re still awaiting word on review samples. That is, frankly, one reason this review is a little late to publication; we had sincerely hoped to include a head-to-head comparison of the latest CPUs.

Of course, some of the blame for the absence of newer Opterons here lies with us. We should have seen the writing on the wall and pursued other avenues for getting hold of a system sooner, either by working with a server maker or just buying the stuff ourselves. We’re still hoping to put the new Opterons through their paces, but we couldn’t delay publication of this review any longer. For the time being, we’ve tested the latest Xeons against the products they replace, and against the older generation of Opterons. That’s not our favored outcome, but we should be able to get a good sense of the Westmere Xeons’ relative performance, regardless.

Fortunately, we do have an Opteron platform you may not have seen tested in the wild just yet. Tyan was kind enough to supply us with its S8212 motherboard, which is based on AMD’s SR5690 chipset, better known as the Fiorano platform. Fiorano is AMD’s first attempt to produce its own server platform in quite a few years, and it adds a few critical features to the Opteron’s quiver. Among them: support for the HyperTransport 3 and PCI Express 2.0 interconnects, both with higher throughput than the older versions of those standards. Although our Fiorano system doesn’t make use of DDR3 memory, it is otherwise built from basically the same components as any newer Opteron 4100 system, with the same SR5690 chipset and six-core, 45-nm processors. In this case, we’ve used Opteron 2435 CPUs clocked at 2.6GHz with a 75W power envelope. The analogous model in the 4100 series would be the Opteron 4180, which shares the same clock frequency and max power draw rating.

For what it’s worth, we built our Fiorano test rig using the same type of floor-standing enclosure and power supply as our Xeon X5680 box, so comparisons between those two should be reasonably direct, if something of a mismatch.

We have a power-optimized representative from the Opteron fold, as well, in the form of a 1U server with an efficient 650W PSU and a pair of Opteron 2425 HE processors. The Opteron 2425 HE is a six-core, 2.1GHz part with a 55W ACP. This system is based on an older SuperMicro H8DMU+ motherboard with an Nvidia chipset. Although it lacks a few new features, I believe this board is more power-efficient overall than most existing Fiorano-based mobos, which is why we chose to test the Opteron HEs on it.

Test notes

All of our test systems benefited greatly in terms of power consumption and performance from the addition of solid-state drives for fast, local storage.

The folks at OCZ helped equip our test systems with enterprise-class Vertex EX SSDs. The single-level-cell flash memory in these drives can endure more write-erase cycles than the multi-level-cell flash used in consumer drives, so it’s better suited for server applications. SLC memory writes data substantially faster than MLC flash, as well. The only catch is that SLC flash is quite a bit pricier, as are the drives based on it. For the right application, though, a drive like the Vertex EX can be very much worth it. Heck, we even noticed the effects of these drives during our test sessions. Boot times were ridiculously low for all of the systems, and program start-up times were practically instantaneous.

We’ve also beefed up our lab equipment by stepping up to a Yokogawa WT210 power meter. The Extech unit we’ve used in the past would occasionally return an obviously erroneous value, and for that reason, the Extech hasn’t been sanctioned for use with SPECpower_ssj when the results are to be published via SPEC. The WT210 is a much more accurate meter that meets with SPEC’s approval and integrates seamlessly with the SPECpower_ssj power measurement components.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. We typically run each test three times and report the median result. In the case of the SPEC benchmarks, though, we’ve reported the results from the single best run achieved.

Our test systems were configured like so:

Opteron 2425 HE system
Processors: dual Opteron 2425 HE (2.1GHz)
Motherboard: SuperMicro H8DMU+
North bridge: Nvidia nForce Pro 3600
South bridge: Nvidia nForce Pro 3600
Memory: 16GB (4 DIMMs) Kingston PC2-6400 registered ECC DDR2 SDRAM at 800 MT/s, 6-6-6-18 1T
Chipset drivers: Nvidia 9.28
Graphics: integrated ATI ES1000 with 8.240.50.3000 drivers
Power supply: ColdWatt CWA2-650-10-SM01-1 650W

Opteron 2435 system
Processors: dual Opteron 2435 (2.6GHz)
Motherboard: Tyan S8212
North bridge: AMD SR5690
South bridge: AMD SP5100
Memory: 16GB (8 DIMMs) Avant Technology PC2-6400 registered ECC DDR2 SDRAM at 800 MT/s, 6-5-5-18 1T
Graphics: integrated ASPEED with 6.0.10.89 drivers
Power supply: Ablecom PWS-702A-1R 700W

Xeon L5520/L5640 (Willowbrook) system
Processors: dual Xeon L5520 (2.26GHz) or dual Xeon L5640 (2.26GHz)
Motherboard: Intel S5500WB
North bridge: Intel 5500
South bridge: Intel ICH10R
Memory: 12GB (6 DIMMs) Samsung PC3L-10600R registered ECC DDR3 SDRAM at 1066 MT/s, 7-7-7-20 1T (L5520) or 1333 MT/s, 9-9-9-24 1T (L5640)
Chipset drivers: INF update 9.1.1.1025, Rapid Storage Technology 9.6
Graphics: integrated Matrox G200e with 5.97.6.6 drivers
Power supply: Delta Electronics DPS650SB 650W

Xeon X5570/X5670 system
Processors: dual Xeon X5570 (2.93GHz) or dual Xeon X5670 (2.93GHz)
Motherboard: Asus Z8PS-D12-1U
North bridge: Intel 5520
South bridge: Intel ICH10R
Memory: 24GB (6 DIMMs) Samsung PC3-10600R registered ECC DDR3 SDRAM at 1333 MT/s, 9-9-9-24 1T
Chipset drivers: INF update 9.1.1.1025, Rapid Storage Technology 9.6
Graphics: integrated ASPEED with 6.0.10.90 drivers
Power supply: Delta Electronics DPS770BB 770W

Xeon X5680 system
Processors: dual Xeon X5680 (3.33GHz)
Motherboard: SuperMicro X8DA3
North bridge: Intel 5520
South bridge: Intel ICH10R
Memory: 24GB (6 DIMMs) Samsung PC3-10700 registered ECC DDR3 SDRAM at 1333 MT/s, 9-9-9-24 1T
Chipset drivers: INF update 9.1.1.1025, Rapid Storage Technology 9.6
Graphics: Nvidia GeForce 8400 GS with ForceWare 257.15 drivers
Power supply: Ablecom PWS-702A-1R 700W

All systems used an OCZ Vertex EX 64GB SSD with firmware rev. 1.5 and ran Windows Server 2008 R2 Enterprise x64.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

This synthetic test gives us a quick, visual map of the cache hierarchies of these processors. As you can see, the six L1 and L2 caches of the Westmere Xeons deliver considerable cumulative throughput: over a terabyte per second at peak in our 2P Xeon X5680 system.

The Xeon X5670 is over 50% faster than the X5570 at the 256KB and 512KB data points, most likely due to the increased coverage offered by two more L1 and L2 caches. Since each core carries its own caches, the total effective size of the X5670’s L1 data caches is 192KB (six at 32KB each), while the X5570’s add up to 128KB.

Unfortunately, this program’s sample points are too coarsely distributed to show us the impact of Westmere-EP’s larger L3 cache.

The impact of the added buffering in Westmere-EP’s memory controller isn’t hard to spot. The Xeon X5670 transfers a couple of gigabytes per second more data than the X5570, given the same core and memory frequencies.

The L5640’s lower performance with the scale and copy patterns initially puzzled us, but we suspect this processor’s throughput is limited by its lower 2.13GHz memory controller clock. The X5570, X5670, and X5680 have a faster 2.66GHz memory controller. The L5520 shares the same 2.13GHz memory controller frequency, and its performance with those patterns is nearly identical to the L5640’s, even though its DIMMs operate at only 1066MHz.

Incidentally, we’ve modified our Stream testing method from last time out. We’ve found that we get the best throughput on the Xeons by assigning one thread to each physical core in the system. That’s why our results are slightly better than you may have seen before.

Memory access latencies haven’t changed much from Nehalem to Westmere, despite the growth of the L3 cache from 8MB to 12MB. In fact, we had to move our sample point for this graph out to 32MB, because with a 12MB L3, a 16MB test block still gets enough cache hits to mask the true main-memory latency.

We can get a closer look at access latencies throughout the memory hierarchy with the 3D graphs below. I’ve colored the block sizes that correspond to different cache levels, with yellow being L1 data cache and brown representing main memory.

The effect is impossible to see in the charts above, but our utility reports that L3 cache latency has grown from 32 cycles on the X5570 to 39 cycles on the X5670. (The L1 and L2 caches on the two chips have identical latencies.) Thinking in terms of cycles is tricky, though, because the results are reported in core cycles and the L3 cache is clocked independently. In this case, the comparison works because the two CPU models share the same core and uncore frequencies.
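For reference, those cycle counts convert to wall-clock time by dividing by the clock frequency. A quick sketch, using the two chips’ shared 2.93GHz base clock (Turbo would shift the numbers a bit):

```python
def cycles_to_ns(cycles, core_ghz):
    """Convert a latency reported in core clock cycles to nanoseconds."""
    return cycles / core_ghz

x5570_l3 = cycles_to_ns(32, 2.93)  # Nehalem-EP L3 latency
x5670_l3 = cycles_to_ns(39, 2.93)  # Westmere-EP L3 latency
print(f"X5570 L3: {x5570_l3:.1f} ns, X5670 L3: {x5670_l3:.1f} ns")
```

In other words, the larger L3 costs roughly two and a half nanoseconds per access, which squares with the latency graphs barely budging.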

SPECjbb2005

SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We have used a relatively fast JVM, the 64-bit version of Oracle’s JRockit JRE, and we’ve tuned each system reasonably well. Still, it was not our intention to match the best published scores, a feat we probably couldn’t accomplish without access to the IBM JVM, which looks to be the fastest option at present. Similarly, although we’ve worked to be compliant with the SPEC run rules for this benchmark, we have not done the necessary work to prepare these results for publication via SPEC, nor do we intend to do so. Thus, these scores should be considered experimental, research-mode results only.

As always, please, no wagering.

We used the following command line options:

Xeons, 12 cores/24 threads/24GB, 6 instances:
start /AFFINITY [F00000, 0F0000, 00F000, 000F00, 0000F0, 00000F] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:4 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k

Xeons, 12 cores/24 threads/12GB, 6 instances:
start /AFFINITY [F00000, 0F0000, 00F000, 000F00, 0000F0, 00000F] java -Xms2800m -Xmx2800m -Xns2500m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:4 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k

Xeons, 8 cores/16 threads/24GB, 2 instances:
start /AFFINITY [FF00, 00FF] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:8 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k

Xeons, 8 cores/16 threads/12GB, 2 instances:
start /AFFINITY [FF00, 00FF] java -Xms2800m -Xmx2800m -Xns2500m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:8 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k

Opterons, 12 cores/16GB, 2 instances:
start /AFFINITY [FC0, 03F] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:6 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
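Those hexadecimal /AFFINITY masks simply carve the 24 logical CPUs of a 2P six-core Xeon system into six disjoint four-thread groups, one per JVM instance. A small sketch verifying the partitioning (the decoding here is ours; the masks are the ones from the command lines above):

```python
from itertools import combinations

masks = [0xF00000, 0x0F0000, 0x00F000, 0x000F00, 0x0000F0, 0x00000F]

def cpus_in_mask(mask, total=24):
    """List the logical CPU indices whose bits are set in an affinity mask."""
    return [i for i in range(total) if mask >> i & 1]

groups = [cpus_in_mask(m) for m in masks]
assert all(len(g) == 4 for g in groups)                    # four threads per JVM
assert all(a & b == 0 for a, b in combinations(masks, 2))  # no group overlaps
covered = 0
for m in masks:
    covered |= m
assert covered == (1 << 24) - 1                            # all 24 CPUs assigned
print(groups)
```

Pinning each JVM instance to its own slice of the machine avoids thread migration and keeps each instance’s heap local to one set of cores.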

In keeping with the SPECjbb run rules, we tested at up to twice the optimal number of warehouses per system, with the optimal count being the total number of hardware threads.

In all cases, Windows Server’s “lock pages in memory” setting was enabled for the benchmark user. In the Xeon systems’ BIOSes, we disabled the “hardware prefetch” and “adjacent cache line prefetch” options.

The X5670 isn’t quite 50% faster than the Xeon X5570 here, but this is a healthy performance gain within the same infrastructure and power envelope, regardless. The low-power Xeon L5640 posts a similar gain over the L5520, which is sufficient to put the L5640 ahead of the X5570, and that feels like noteworthy progress. The picture will no doubt come into sharper focus when we add the question of power efficiency to the mix.

SPECpower_ssj2008

Like SPECjbb2005, this benchmark is based on multithreaded Java workloads and uses similar tuning parameters, but its workloads are somewhat different. SPECpower is also distinctive in that it measures power use at different load levels, stepping up from active idle to 100% utilization in 10% increments. The benchmark then reports power-performance ratios at each load level.

SPEC’s run rules for this benchmark require the collection of ambient temperature, humidity, and altitude data, as well as power and performance, in order to prevent the gaming of the test. Per SPEC’s recommendations, we used a separate system to act as the data collector. Attached to it were a Digi WatchPort/H temperature and humidity sensor and our Yokogawa WT210 power meter. Although our new power meter might well pass muster with SPEC, what we said about our SPECjbb results being “research mode only” applies here, too.

We used the same basic performance tuning and system setup parameters here that we did with SPECjbb2005.

SPECpower_ssj results are a little more complicated to interpret than your average benchmark. We’ve plotted the output in several ways in order to help us understand it.

Although the plot above looks like some sort of odd coral formation, this may well be the most intuitive way of presenting these data. Each of the load levels in the benchmark is represented by a point on the plot, and the two axes are straightforward enough. The higher the point is on the plot, the higher the performance. The further to the right it is, the more power was consumed at that load level.

Immediately, we can divine that the Xeon X5680 has the highest overall performance and the highest power consumption. The Xeon X5670 represents a substantial reduction in power draw versus the X5680 with only a minor drop in operations per second. Meanwhile, the Xeon X5570 draws nearly as much power as the X5670 at the upper load levels but doesn’t deliver nearly as much throughput. The Opteron 2435’s power draw is also quite similar, but its performance is lower still.

The Willowbrook system with the low-power Xeons is in a class of its own. Inside that system, the Xeon L5640 achieves roughly 200K more ops per second than the L5520 with only marginally higher power draw. Indeed, the L5640 appears to be the undisputed champ here, peaking higher than the Xeon X5570 while consuming over 100 fewer watts.

We can confirm the L5640’s standing with a look at the performance-to-power ratios and the summarized overall standings.

Yep, the Willowbrook/L5640 combination takes the top spot. Furthermore, the power efficiency progress from Nehalem to Westmere is illustrated vividly in both the 95W and 60W Xeons. The 95W Xeon X5670 even turns out to be more efficient than the 60W Xeon L5520 at all but the highest load levels, giving it a modest lead in the overall score.

Cinebench rendering

We can take another look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.

The six-core Xeons dominate the performance results, more or less as expected. We’ll pause to note the architectural efficiency of the current Xeons. Even at a lower clock frequency, the six-core, 2.26GHz Xeon L5640 outperforms the six-core, 2.6GHz Opteron 2435.

Still, single-threaded performance essentially hasn’t advanced from the past generation to this one, as Amdahl’s Law stubbornly refuses to give way to Moore’s. The one exception is the Xeon L5640, whose unusually high Turbo frequency leeway of 533MHz allows it basically to match the Xeon X5670 in the single-threaded test.

As the multithreaded version of this test ran, we measured power draw at the wall socket for each of our test systems across a set time period.

A quick look at the data tells us much of what we need to know. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The 5600-series Xeons bring a slight but measurable increase in power draw at idle, but they’re clearly within the same range as their predecessors. The most remarkable numbers here come from the Willowbrook system. In case this hasn’t sunk in yet, with low-power Xeons aboard, it’s idling at around 65W.

Next, we can look at peak power draw by taking an average from the ten-second span from 10 to 20 seconds into our test period, during which the processors were rendering.

Peak power draw is also up somewhat in the 5600-series Xeons, but not enough to create any real concern.

One way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

The Willowbrook system’s minimal power draw at idle makes this one a rout.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
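The watt-seconds figures come from integrating the power samples over the measurement window. A minimal sketch of that computation; the one-second sampling interval and the sample values are our assumptions for illustration:

```python
def energy_joules(watt_samples, interval_s=1.0):
    """Trapezoidal integration of power samples into energy.

    One watt sustained for one second is one joule (watt-second)."""
    return sum((a + b) / 2 * interval_s
               for a, b in zip(watt_samples, watt_samples[1:]))

# Illustrative: a system drawing a steady 250W across a 60-second render
render_window = [250.0] * 61  # 61 one-second samples bound 60 intervals
print(energy_joules(render_window))
```

For the per-render comparison, the same sum is simply restricted to each system’s own render period, so a chip that finishes early stops accumulating joules sooner.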

Once again, the Westmere Xeons are measurably more efficient than the prior generation of Xeons (and Opterons, for what it’s worth). Even the X5680 looks pretty good here, aided by the fact that it finishes rendering, and thus ends our measurement, in such short order.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database. MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
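The job-queue scheme David describes can be sketched roughly as follows. This is not MyriMatch’s actual code; the function names and the stand-in "matching" work are hypothetical, but the partitioning follows the description above: the database is split into threads × 10 jobs, and each thread pulls another job whenever it finishes one.

```python
import queue
import threading

def chunked_jobs(proteins, n_threads, jobs_per_thread=10):
    """Split the protein list into n_threads * jobs_per_thread slices."""
    n_jobs = n_threads * jobs_per_thread
    size = max(1, len(proteins) // n_jobs)
    return [proteins[i:i + size] for i in range(0, len(proteins), size)]

def worker(jobs, results, lock):
    # Pull jobs from the shared queue until none remain; this keeps every
    # thread busy without fine-grained synchronization per protein.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        handled = len(job)  # stand-in for the real spectrum matching
        with lock:
            results.append(handled)

proteins = [f"protein_{i}" for i in range(6714)]  # size of the yeast database
n_threads = 4
jobs = queue.Queue()
for job in chunked_jobs(proteins, n_threads):  # ~40 jobs of ~1/40th each
    jobs.put(job)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(jobs, results, lock))
           for _ in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The coarse job size is the point: a thread synchronizes on the queue once per ~168-protein job rather than once per protein.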

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.
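The per-spectrum mutex David mentions might look like this in miniature. The class and score fields here are hypothetical stand-ins; the idea is simply that each spectrum carries its own lock, so threads only contend when they happen to hit the same spectrum at the same time.

```python
import threading

class Spectrum:
    """A spectrum paired with its own mutex, per the scheme described above."""
    def __init__(self):
        self.lock = threading.Lock()
        self.best_score = 0.0

def record_match(spectrum, score):
    # Only one thread at a time may update a given spectrum, but threads
    # comparing against *different* spectra never contend with each other.
    with spectrum.lock:
        if score > spectrum.best_score:
            spectrum.best_score = score

spectra = [Spectrum() for _ in range(8)]
threads = [threading.Thread(target=record_match, args=(spectra[i % 8], i * 0.1))
           for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with contention minimized this way, every thread still streams spectra out of main memory, which is why the memory controller remains the other bottleneck.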

Here’s how the processors performed.

In this ostensibly memory-bound test, the X5670 shaves 11 seconds, or nearly 30%, off of the shortest execution time posted by the processor it succeeds.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

As you’ll note, we’re seeing some pretty broad variance in the results of this test at lower thread counts, which suggests it may be stumbling over these systems’ non-uniform memory architectures. In an attempt to circumvent that problem, I decided to try running two instances of this benchmark concurrently, with each one affinitized to a socket, and adding the results into an aggregate compute rate. Doing so offers a nice performance boost.

The Xeon X5670 betters the X5570’s simulation rate by about 25%, again in a workload where memory bandwidth has traditionally been a constraint.

POV-Ray ray tracing

We’ve been using POV-Ray as a benchmark for a very, very long time, and we’re not going to stop now. The latest version is multithreaded and makes good use of all of the cores and hardware threads available to it.

As we saw in Cinebench, highly parallel, compute-intensive graphics workloads lend themselves well to increased core counts, so the new Xeons fulfill much of their potential.

x264 HD video encoding

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark.

I have to say, I sure hope they get a few Westmere Xeons inside the YouTube data center. If you’ve ever uploaded an HD video there, you’ll know what I mean.

7-Zip file compression

This final entry in our test suite more or less confirms what we already know about the Westmere Xeons. They’re quite adept at file compression, as well as many other things.