Ever since the introduction of the first Opteron, Intel has faced a formidable foe in the x86 server and workstation markets. AMD’s decision to integrate a memory controller into its processors and use a narrow, high-speed interconnect between CPUs and I/O chips has made it a perennial contender in this space. Even recently, while Intel’s potent Core microarchitecture has given it a lead in the majority of performance tests, Xeons have been somewhat hamstrung on two fronts: on the power-efficiency front by their prevailing use of FB-DIMM memory, and on the scalability front by the use of a front-side bus and a centralized memory controller.

Those barriers for the Xeon are about to be swept away by today’s introduction of new processors based on the chip code-named Nehalem, a new CPU design that brings with it a revised system architecture that will look very familiar to folks who know the Opteron. Try this on for size: a single-chip quad-core processor with a relatively small L2 cache dedicated to each core, backed up by a larger L3 cache shared by all cores. Add in an integrated memory controller and a high-speed, low-latency socket interconnect. Sounds positively... Opteronian, to coin a word, but that’s also an apt description of Nehalem.

Of course, none of this is news. Intel has been very forthcoming about its plans for Nehalem for some time now, and the high-end, single-socket desktop part based on this same silicon has been selling for months as the Core i7. Just as with the Opteron, though, Nehalem’s true mission and raison d’être is multi-socket systems, where its architectural advantages can really shine. Those advantages look to be formidable because, to be fair, the Nehalem team set out to do quite a bit more than merely copy the Opteron’s basic formula. They attempted to create a solution that’s newer, better, and faster in almost every way, melding the new system architecture with Intel’s best technologies, including a heavily tweaked version of the familiar Core microarchitecture.

Since this is Intel, that effort has benefited from world-class semiconductor fabrication capabilities in the form of Intel’s 45nm high-k/metal gate technology, the same process used to produce “Harpertown” Xeons. At roughly 751 million transistors and a die area of 263 mm², though, the Nehalem EP is a much larger chip. (Harpertown comprises a pair of dual-core chips, each with 410 million transistors in an area of 107 mm².) The similarity with AMD’s “Shanghai” Opteron core is, again, striking in this department: Shanghai is estimated at 758 million transistors and measures 258 mm².

The Xeon W5580

We have already covered Nehalem at some length, since it has been on the market in single-socket form for months. Let me direct you to my review of the Core i7 if you’d like more detail about the microarchitecture. If you want even more depth, I suggest reading David Kanter’s Nehalem write-up, as well. Rather than cover all of the same ground again here, I’ll try to offer an overview of the changes to Nehalem most relevant to the server and workstation markets.

A brief tour of Nehalem

As we’ve noted, Nehalem’s quad execution cores are based on the four-issue-wide Core microarchitecture, but they have been modified rather extensively to improve performance per clock and to take better advantage of the new system architecture. One of the most prominent additions is the return of simultaneous multithreading (SMT), known in Intel parlance as Hyper-Threading. Each Nehalem core can track and execute two hardware threads, to keep its execution units more fully occupied. This capability has dubious value on the desktop in the Core i7, but it makes perfect sense for Xeon-based servers, where most workloads are widely multithreaded. With 16 hardware threads in a dual-socket config, the new Xeons take threading in this class of system to a new level.

Additionally, the memory subsystem, including the cache hierarchy, has been broadly overhauled. Each core now has 32KB L1 instruction and data caches, along with a dedicated 256KB L2 cache. A new L3 cache is 8MB in size and serves all four cores; it’s part of what Intel calls the “uncore” and is clocked independently, typically at a lower speed than the cores.

The chip’s integrated memory controller, also an “uncore” component, interfaces with three 64-bit channels of DDR3 memory, with support for both registered and unbuffered DIMM types, along with ECC. Intel has decided to jettison FB-DIMMs for dual-socket systems, with their added power draw and access latencies. The use of DDR3, which offers higher operating frequencies and lower voltage requirements than DDR2, should contribute to markedly lower platform power consumption. The bandwidth is considerable, as well: a dual-socket system with six channels of DDR3-1333 memory has theoretical peak throughput of 64 GB/s.

That’s a little more than one should typically expect, though, because memory frequencies are limited by the number of DIMMs per channel. A Nehalem-based Xeon can host only one DIMM per channel at 1333MHz, two per channel at 1066MHz, and three per channel at 800MHz. The selection of available memory speeds is also limited by the Xeon model involved. Intel expects 1066MHz memory, which allows for 12-DIMM configurations, to be the most commonly used option. The highest capacity possible at present, with all channels populated, is 144GB.
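As a sanity check, the peak-bandwidth and capacity figures above fall out of simple arithmetic. A minimal sketch (the channel width, transfer rates, and DIMM counts come from the description above; the helper functions are just for illustration):

```python
# Back-of-the-envelope math for Nehalem's memory subsystem.
def peak_bandwidth_gbs(sockets, channels_per_socket, mt_per_s, channel_bytes=8):
    """Theoretical peak throughput in GB/s (each channel is 64 bits = 8 bytes wide)."""
    # MT/s x bytes/transfer = MB/s per channel; divide by 1000 for GB/s
    return sockets * channels_per_socket * mt_per_s * channel_bytes / 1e3

# Dual-socket system, three DDR3-1333 channels per socket:
print(round(peak_bandwidth_gbs(2, 3, 1333)))  # ~64 GB/s

def max_capacity_gb(sockets, channels_per_socket, dimms_per_channel, dimm_gb):
    """Total memory with every channel fully populated."""
    return sockets * channels_per_socket * dimms_per_channel * dimm_gb

# Three 8GB DIMMs per channel (which forces DDR3-800 operation):
print(max_capacity_gb(2, 3, 3, 8))  # 144 GB
```

The trade-off the text describes is visible here: maximum capacity (three DIMMs per channel) and maximum memory clock (one DIMM per channel) can’t be had at the same time.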

Nehalem’s revised memory hierarchy also supports an important new feature: Extended Page Tables, which closely parallels a familiar Opteron capability, Nested Page Tables. Like NPT, EPT accelerates virtualization by relieving the hypervisor of the burden of software-based page table emulation. NPT and EPT have the potential to reduce the overhead of virtualization substantially.
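The cost hardware-assisted nested paging trades against is worth quantifying: EPT/NPT eliminate hypervisor traps on page-table updates, but a TLB miss now requires a two-dimensional page walk, since each guest page-table access must itself be translated through the host’s tables. For four-level guest and host tables, the standard worst case works out to 24 memory references, as this small sketch shows:

```python
def nested_walk_refs(guest_levels=4, host_levels=4):
    """Worst-case memory references for a 2D (nested) page walk.

    Each of the guest's page-table accesses, plus the final
    guest-physical access, must be translated through all of the
    host's levels: (g + 1) * (h + 1) - 1 = g*h + g + h references.
    """
    return guest_levels * host_levels + guest_levels + host_levels

print(nested_walk_refs())  # 24 for 4+4 levels, versus 4 for native paging
```

That is why both AMD and Intel pair nested paging with aggressive TLB and page-walk caching; the win comes from making traps rare, not from making individual walks cheap.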

The third and final major uncore element in Nehalem is the QuickPath Interconnect, or QPI. Much like HyperTransport, QPI is a narrow, high-speed, low-latency, point-to-point interconnect used in both socket-to-socket connections and links to I/O chips. QPI operates at up to 6.4 GT/s in the fastest Xeons, where it yields a peak two-way aggregate transfer rate of 25.6 GB/s. Again, that’s a tremendous amount of bandwidth. The CPUs coordinate cache coherency over the QPI link by means of a MESIF protocol, which extends the traditional Xeon MESI protocol with the addition of a new Forwarding state that should reduce traffic in certain cases.
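The practical effect of that extra F state is that exactly one sharer, rather than memory, answers a read request for a line held in multiple caches. A toy sketch of the idea, not Intel’s implementation, obviously:

```python
# Toy illustration of why MESIF's Forward state cuts traffic versus MESI.
# In plain MESI, a read miss for a line shared by several caches is
# typically satisfied from memory; in MESIF, the single cache holding
# the line in F state forwards it cache-to-cache instead.

def serve_read(sharers):
    """sharers: dict of cache_id -> state ('S' or 'F'). Returns who responds."""
    forwarders = [c for c, s in sharers.items() if s == 'F']
    if forwarders:
        responder = forwarders[0]
        sharers[responder] = 'S'  # old forwarder demotes to Shared
        return responder          # (the new requester would take F)
    return 'memory'               # no forwarder: fall back to memory

caches = {'cpu0': 'F', 'cpu1': 'S'}
print(serve_read(caches))  # 'cpu0' forwards the line; memory is never touched
```

Keeping F in exactly one cache also avoids the MESI alternative of every sharer responding at once, which is the "reduced traffic in certain cases" the text alludes to.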

One of the implications of the move to QPI and an integrated memory controller is that the new Xeons’ memory subsystems are non-uniform. That is, getting to local memory will be notably quicker than retrieving data owned by another processor. Non-uniform memory architectures (NUMA) have some tricky performance ramifications, not all of which have been sufficiently addressed by modern OS schedulers, even now. The Opteron has occasionally run into problems on this front, and now Xeons will, too. One can hope that Intel’s move to a NUMA design will prompt broader and deeper OS- and application-level awareness of memory locality issues.

Power efficiency has become a key consideration in server CPUs, and the new Xeons include a range of provisions intended to address this issue. In fact, the chip employs a dedicated microcontroller to manage power and thermals. Nehalem EP includes more power states (15) than Harpertown (4) and makes faster transitions between them, with a typical switch time of under two microseconds, compared to four microseconds for Harpertown. Nehalem’s lowest power states make use of a power gate associated with each execution core; this gate can cut voltage to an idle core entirely, eliminating even leakage power and taking its power consumption to nearly zero.

The power management microcontroller also enables an intriguing new feature, the so-called “Turbo mode.” This feature takes advantage of the additional power and thermal headroom available when the CPU is at partial utilization, say with a single- or dual-threaded application, by dynamically raising the clock speed of the busy cores beyond their rated frequency. The clock speed changes involved are relatively conservative: one full increment of the CPU multiplier results in an increase of 133MHz, and most of the new Xeons can only go two “ticks” beyond their usual multiplier ceilings. Still, the highest end W- and X-series Xeons can reach up to three ticks, or 400MHz, beyond their normal limits. Unlike the generally advertised clock frequency of the CPU, this additional Turbo mode headroom is not guaranteed and may vary from chip to chip, depending upon its voltage needs and resulting thermal profile. What headroom is available brings a “free,” if modest, performance boost to lightly threaded applications.
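Since a Turbo “tick” is one bump of the 133MHz base clock, the attainable ceilings are easy to tabulate. A sketch, using the tick counts described above (and remembering that the actual headroom is opportunistic, not guaranteed):

```python
BASE_CLOCK_MHZ = 133  # one CPU multiplier increment on Nehalem

def turbo_ceiling_mhz(rated_mhz, max_ticks):
    """Highest opportunistic frequency, if power/thermal headroom allows."""
    return rated_mhz + max_ticks * BASE_CLOCK_MHZ

# A 2.93GHz part limited to two ticks, as on most of the new Xeons:
print(turbo_ceiling_mhz(2933, 2))  # 3199 MHz
# A 3.2GHz part with the three ticks (~400MHz) the top W/X models allow:
print(turbo_ceiling_mhz(3200, 3))  # 3599 MHz
```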

A new platform, too

Of course, this sweeping set of changes brings with it a host of platform-level alterations, not least of which is the modification of the role and naming of what has been traditionally called the north bridge chip, or the memory controller hub (MCH) in Intel’s world. Say hello, instead, to the I/O Hub, or IOH.

A block diagram of the Tylersburg chipset. Source: Intel.

The new Xeons’ first IOH has been known by its code name, Tylersburg-36D, and will now be officially called the Intel 5520 chipset. True to its name, this IOH is focused almost entirely on PCI Express connectivity, with one QPI link to each of the two processors and a total of 42 PCIe lanes onboard: 36 of them PCIe Gen2 and six of them Gen1. Those lanes can be apportioned in groups of various sizes for specific needs. Tylersburg also has an ESI port for connecting with an Intel south bridge chip, one of the members of the ICH9/10/R family; these chips provide SATA and USB ports, along with various forms of legacy connectivity.

Tylersburg’s dual QPI links open up the possibility of dual IOH chips, which Intel has decided to enable for certain configurations. In this scenario, each Tylersburg chip is linked via QPI to a different CPU, and the two IOH chips are linked via QPI, as well. The primary IOH chip handles various system management and legacy I/O duties, while the secondary one simply provides 36 additional lanes of PCIe Gen 2 connectivity, for a total of 72 lanes in the system (plus six Gen1 lanes). That’s a tremendous amount of connectivity, but it’s in keeping with the platform’s high-bandwidth theme.

Two large coolers and one DDR3 DIMM per channel in our test rig

The new Xeons’ LGA1366-style socket

Nehalem-based Xeons come in a much larger package (left) than the prior Xeon generation (right)

The new Xeons drop into a new, LGA1366-style socket that looks, unsurprisingly, just like the Core i7’s. The CPU itself is housed in a larger package, as well, that dwarfs the Harpertown Xeons and their predecessors.

Pricing and availability

Here’s a quick overview of the new dual-socket Xeon models, along with key features and pricing.

Model      | Clock speed | Cores | L3 cache | QPI link speed | Max DDR3 speed | TDP   | Turbo? | Hyper-Threading? | Price
Xeon W5580 | 3.2GHz      | 4     | 8MB      | 6.4 GT/s       | 1333MHz        | 130 W | Y      | Y                | $1600
Xeon X5570 | 2.93GHz     | 4     | 8MB      | 6.4 GT/s       | 1333MHz        | 95 W  | Y      | Y                | $1386
Xeon X5560 | 2.8GHz      | 4     | 8MB      | 6.4 GT/s       | 1333MHz        | 95 W  | Y      | Y                | $1172
Xeon X5550 | 2.66GHz     | 4     | 8MB      | 6.4 GT/s       | 1333MHz        | 95 W  | Y      | Y                | $958
Xeon E5540 | 2.53GHz     | 4     | 8MB      | 5.86 GT/s      | 1066MHz        | 80 W  | Y      | Y                | $744
Xeon E5530 | 2.4GHz      | 4     | 8MB      | 5.86 GT/s      | 1066MHz        | 80 W  | Y      | Y                | $530
Xeon E5520 | 2.26GHz     | 4     | 8MB      | 5.86 GT/s      | 1066MHz        | 80 W  | Y      | Y                | $373
Xeon L5520 | 2.26GHz     | 4     | 8MB      | 5.86 GT/s      | 1066MHz        | 60 W  | Y      | Y                | $530
Xeon E5506 | 2.13GHz     | 4     | 4MB      | 4.8 GT/s       | 800MHz         | 80 W  | N      | N                | $266
Xeon L5506 | 2.13GHz     | 4     | 4MB      | 4.8 GT/s       | 800MHz         | 60 W  | N      | N                | $422
Xeon E5504 | 2.00GHz     | 4     | 4MB      | 4.8 GT/s       | 800MHz         | 80 W  | N      | N                | $224
Xeon E5502 | 1.86GHz     | 2     | 4MB      | 4.8 GT/s       | 800MHz         | 80 W  | N      | N                | $188

Nehalem has a plethora of knobs and dials available for product differentiation, and Intel has apparently decided to twiddle with them all. Each of them impacts performance in its own way, so choosing the right processor for your needs may prove to be something less than straightforward.

On top of all of the possibilities you see in the table above, there’s the issue of L3 cache speed, a notable attribute that impacts performance, but one Intel hasn’t opted to document too clearly (as we learned with the Core i7). As I understand it, the uncore elements in Nehalem chips can be clocked independently of one another, so the speed of the memory controller or the QPI link doesn’t necessarily correspond to the frequency of the L3 cache. The pair of processors we have for this first review, of the decidedly ultra-high-end, workstation-oriented Xeon W5580 variety, have a 2.66GHz L3 cache. So does the Xeon X5570, the top server model.

The Opteron also edges forward

Unfortunately, we don’t have a direct Opteron competitor to test against the Xeon W5580, primarily because AMD doesn’t make a dual-socket CPU that expensive. We do, however, have a pair of new “Shanghai” Opterons, model 2389, with a 2.9GHz core clock frequency (and a 2.2GHz L3/north bridge clock). These are not “SE” parts; they deliver their higher clock speeds within the same power/thermal envelope as most mainstream Opterons, with a 75W ACP rating.

The bigger news here may be the addition, at last, of HyperTransport 3.0 support to these Opterons. HT3 essentially doubles the bandwidth of a HyperTransport link, and at 2.2GHz, the link between our Opteron 2389s should operate at 4.4 GT/s and provide a total of 17.6 GB/s of bandwidth, quite close to the 19.2 GB/s supplied by the 4.8 GT/s QPI link on mainstream Nehalem variants. Upgrading to HT3 was as simple as dropping these new Opterons into our existing test system. This system’s Nvidia core logic chipset doesn’t support HT3, but the socket-to-socket interconnect automatically came up with HT3 enabled.
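Those link-bandwidth figures follow directly from transfer rate, link width, and direction count; both HT and QPI carry 16 bits of data per direction. A quick sketch of the arithmetic (this is peak theoretical math, not a measurement):

```python
def link_bandwidth_gbs(gt_per_s, width_bits=16, directions=2):
    """Aggregate GB/s for a point-to-point link.
    GT/s already counts both edges of the double-pumped clock."""
    return gt_per_s * (width_bits / 8) * directions

print(link_bandwidth_gbs(4.4))  # 17.6 GB/s: HT3 at a 2.2GHz link clock
print(link_bandwidth_gbs(4.8))  # 19.2 GB/s: mainstream Nehalem QPI
print(link_bandwidth_gbs(6.4))  # 25.6 GB/s: top-end QPI
```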

The Opteron 2389 currently lists for $989, which makes it a direct competitor for the Xeon X5550. I’d certainly like to show you a performance comparison between these two chips, but unfortunately, time constraints and a minor flood in my office have prevented me from pursuing the matter. Perhaps soon.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Dual Xeon E5450 3.0GHz
- Motherboard: SuperMicro X7DB8+ (BIOS revision 6/23/2008)
- System interconnect: 1333 MT/s (333MHz) front-side bus
- North bridge: Intel 5000P MCH; south bridge: Intel 6321 ESB ICH
- Chipset drivers: INF Update 9.0.0.1008
- Memory: 16GB (eight 2048MB DDR2-800 FB-DIMMs) at 667MHz effective, CL 5, tRCD 5, tRP 5
- Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
- Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers
- Power supply: Ablecom PWS-702A-1R 700W

Dual Xeon X5492 3.4GHz
- Motherboard: SuperMicro X7DWA (BIOS revision 8/04/2008)
- System interconnect: 1600 MT/s (400MHz) front-side bus
- North bridge: Intel 5400 MCH; south bridge: Intel 6321 ESB ICH
- Chipset drivers: INF Update 9.0.0.1008
- Memory: 16GB (eight 2048MB DDR2-800 FB-DIMMs) at 800MHz, CL 5, tRCD 5, tRP 5
- Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
- Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers
- Power supply: Ablecom PWS-702A-1R 700W

Dual Xeon L5430 2.66GHz
- Motherboard: Asus RS160-E5 (BIOS revision 8/08/2008)
- System interconnect: 1333 MT/s (333MHz) front-side bus
- North bridge: Intel 5100 MCH; south bridge: Intel ICH9R
- Chipset drivers: INF Update 9.0.0.1008
- Memory: 6GB (six 1024MB registered ECC DDR2-667 DIMMs) at 667MHz, CL 5, tRCD 5, tRP 5
- Storage controller: Intel ICH9R with Matrix Storage Manager 8.6
- Graphics: Integrated XGI Volari Z9s with 1.09.10_ASUS drivers
- Power supply: FSP Group FSP460-701UG 460W

Dual Xeon W5580 3.2GHz
- Motherboard: SuperMicro X8DA3 (BIOS revision 2/20/2009)
- System interconnect: QPI at 6.4 GT/s (3.2GHz)
- North bridge: Intel 5520 IOH; south bridge: Intel ICH10R
- Chipset drivers: INF Update 8.9.0.1006
- Memory: 24GB (six 4096MB registered ECC DDR3-1333 DIMMs) at 1333MHz, CL 10, tRCD 9, tRP 9
- Storage controller: Intel ICH10R with Matrix Storage Manager 8.6
- Graphics: Nvidia GeForce 8400 GS with ForceWare 182.08 drivers
- Power supply: Ablecom PWS-702A-1R 700W

Dual Opteron 2347 HE 1.9GHz / Dual Opteron 2356 2.3GHz
- Motherboard: SuperMicro H8DMU+ (BIOS revision 3/25/08)
- System interconnect: HyperTransport at 2.0 GT/s (1.0GHz)
- Chipset: Nvidia nForce Pro 3600 (north and south bridge)
- Memory: 16GB (eight 2048MB registered ECC DDR2-800 DIMMs) at 667MHz, CL 5, tRCD 5, tRP 5
- Storage controller: Nvidia nForce Pro 3600
- Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers
- Power supply: Ablecom PWS-702A-1R 700W

Dual Opteron 2384 2.7GHz / Dual Opteron 2389 2.9GHz
- Motherboard: SuperMicro H8DMU+ (BIOS revision 10/15/08)
- System interconnect: HyperTransport at 2.0 GT/s (1.0GHz) for the 2384; 4.4 GT/s (2.2GHz) for the 2389
- Chipset: Nvidia nForce Pro 3600 (north and south bridge)
- Memory: 16GB (eight 2048MB registered ECC DDR2-800 DIMMs) at 800MHz, CL 6, tRCD 5, tRP 5
- Storage controller: LSI Logic Embedded MegaRAID with 8.9.518.2007 drivers
- Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers
- Power supply: Ablecom PWS-702A-1R 700W

All systems used a WD Caviar WD1600YD 160GB hard drive and ran Windows Server 2008 Enterprise x64 Edition with Service Pack 1.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

This test gives us a visual look at the throughput of the different levels of the memory hierarchy. Generally speaking, Intel’s caches seem to achieve higher bandwidth than AMD’s. Looking at the block sizes between 512KB and 16MB shows us that the Xeon W5580’s L2 caches appear to be quite a bit faster than the older Harpertown Xeons’, but the W5580’s throughput drops at 4MB, where its smaller L2 caches run out of space. The most striking result may be the new Xeons’ throughput once we spill into main memory. Let’s take a closer look at that data point.

The W5580’s main memory throughput nearly doubles that of the fastest Opterons and is just short of four times that of the fastest Harpertown Xeon, the X5492. That’s a staggering increase in measured bandwidth.

Memory access latencies are down dramatically, as well: the W5580’s round trip to main memory takes nearly half the time that the Xeon E5450’s does, and the W5580 is even quicker than the quickest Opteron. Incidentally, other than the fact that the 2389 has HT3 enabled, I’m unsure why the 2389’s memory performance isn’t quite as good as the 2384’s. They were tested in the same server and otherwise configured identically.

To rather gratuitously drive the point home, we can take a more complete look at memory access latencies in the charts below. Note that I’ve color-coded the block sizes that roughly correspond to the different caches on each of the processors. L1 data cache is yellow, L2 is light orange, L3’s darker orange, and main memory is brown.

Each stage of the new Xeon’s cache and memory hierarchy delivers shorter access times than the corresponding stage in the Shanghai Opteron’s, although the two certainly look similar, don’t they?

Bottom line: the Nehalem Xeon’s re-architected memory subsystem delivers the goods as advertised, with higher throughput and quicker access times than anything else we’ve tested.

SPECjbb 2005

SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We did not intend to challenge the best published scores with our results, but we did hope to achieve reasonably optimal tuning for our test systems. To that end, we used a fast JVM, the 64-bit version of Oracle’s JRockit JRE R27.6, and picked up some tweaks for tuning from recently published results. We used two JVM instances with the following command line options:

start /AFFINITY [0F, F0] java -Xms3700m -Xmx3700m -XXaggressive -XXlazyunlocking -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads=4 -Xns3200m -XXcallprofiling -XXtlasize:min=4k,preferred=512k -XXthroughputcompaction

Notice that we used the Windows “start” command to affinitize our threads on a per-socket basis. We also tried affinitizing on a per-chip basis for the Harpertown Xeon systems, but didn’t see any performance benefit from doing so. One exception to the command line options above was our Xeon L5430/San Clemente system. Since it had only 6GB of memory, we had to back the heap size down to 2200MB for it.

Also, in order to affinitize for the 16 hardware threads of the Xeon W5580 system, we used masks of FF00 and 00FF. Although our Xeon W5580 system has more memory than the rest of the systems (practically unavoidable in an optimal configuration because of its six DIMM channels), we did not raise the heap size to take advantage of the additional space. (Although we did experiment with doing so and found it not to bring a substantial advantage.) In order to follow the rules of SPECjbb to the letter, we tested the Xeon W5580 with one to 16 warehouses with two JVMs, topping out at twice the number of concurrent warehouses at which we expected performance to peak, thanks to Nehalem’s two hardware threads per core. (We also experimented with running four JVMs on this system, but as with the older Xeons, doing so didn’t improve throughput significantly.)
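Those affinity masks are simply per-socket bitmaps over the logical CPUs: with eight hardware threads per socket on the W5580 (four cores, two SMT threads each), socket 0 owns bits 0-7 and socket 1 owns bits 8-15. A quick sketch of how the masks fall out, assuming contiguous logical-CPU numbering per socket:

```python
def socket_mask(socket_index, threads_per_socket):
    """Bitmask selecting every logical CPU on one socket."""
    mask = (1 << threads_per_socket) - 1   # threads_per_socket ones
    return mask << (socket_index * threads_per_socket)

# Harpertown, no SMT: 4 threads per socket -> the 0F and F0 masks
print(f"{socket_mask(0, 4):02X}", f"{socket_mask(1, 4):02X}")  # 0F F0
# Nehalem with SMT: 8 threads per socket -> the 00FF and FF00 masks
print(f"{socket_mask(0, 8):04X}", f"{socket_mask(1, 8):04X}")  # 00FF FF00
```

Pinning each JVM to one socket this way keeps its heap accesses local under NUMA, which is presumably why the per-socket affinitization helps.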

The new Xeons’ prowess here is absolutely staggering. We’ve rarely, if ever, seen this sort of performance increase from one CPU generation to the next. One can’t help but think how ominous this looks for AMD upon seeing these results.

I should note that you may see published scores even higher than these. We’re testing with an older version of the JRockit JVM that’s not as well optimized for Nehalem (or Shanghai) as a newer version might be. Unfortunately, we haven’t yet been able to get our hands on a newer revision of this JVM, though I believe our present comparison should put the newer CPUs on relatively equal footing.

Before we move on, let’s take a quick look at power consumption during this test. SPECjbb 2005 is the basis for SPEC’s own power benchmark, which we had initially hoped to use in this review, but time constraints made that impractical. Nevertheless, we did capture power consumption for each system during a test run using our Extech 380803 power meter. All of the systems used the same model of Ablecom 700W power supply unit, with the exception of the Xeon L5430 server, which used an FSP Group 460W unit. Power management features (such as SpeedStep and Cool’n’Quiet) were enabled via Windows Server’s “Balanced” power policy.

Although it delivers much higher throughput, the Xeon W5580 system’s peak power draw isn’t appreciably higher than that of its direct predecessor, the Xeon X5492. This is a top-end workstation Xeon model with a generous 130W TDP; I’d expect more mainstream Nehalem Xeons to draw quite a bit less power within their 80 and 95W TDPs. Hopefully we can test one soon.

Still, have a look at what happens when we consider performance per watt.

On the strength of its amazing throughput, even this 130W version of the new Xeon edges out the most efficient Opteron in terms of power-efficient performance. Only a low-voltage version of the Harpertown Xeon, on the low-power San Clemente platform, proves more efficient.

Cinebench rendering

We can take a closer look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.

The performance leap with the new Xeons isn’t quite as stunning here as it is in SPECjbb, but it’s formidable nonetheless. The W5580 is nearly 50% faster than the Opteron 2389.

Once again, we measured power draw at the wall socket for each of our test systems across a set time period, during which we ran Cinebench’s multithreaded rendering test.

A quick look at the data tells us much of what we need to know. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

Despite all of the bandwidth and its tremendous performance, and despite the 130W peak TDP of our Xeon W5580 processors, our Nehalem test system draws even less power at idle than our Opteron 2389-based system. Finally free from the wattage penalty of FB-DIMMs, Intel is again competitive on the platform power front.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

The combination of a 130W TDP and smart power management gives the Xeon W5580 more dynamic range in terms of power draw than any other solution tested. As with SPECjbb, the Xeon W5580 system’s peak pull is slightly higher than the Xeon X5492 system’s.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
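Concretely, both of those figures come from summing power samples over time. A minimal sketch, assuming one power reading per second (the sample values below are invented purely for illustration):

```python
def energy_joules(power_watts, interval_s=1.0):
    """Total energy in joules: sum of (power sample x sample interval)."""
    return sum(p * interval_s for p in power_watts)

# Hypothetical trace: 10s of rendering at ~300W, then 5s idle at ~150W.
samples = [300.0] * 10 + [150.0] * 5
render_start, render_end = 0, 10  # sample indices bounding the render

print(energy_joules(samples))                           # whole test period
print(energy_joules(samples[render_start:render_end]))  # render period only
```

The second number is the one that folds performance into the efficiency measure: a faster system spends fewer seconds in the high-power rendering state, so its render-only joule count shrinks on both axes.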

Here, too, the new Xeon just outdoes the Opteron 2384 in perhaps our best measure of power-efficient performance, using less energy to complete the task at hand.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database. MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The Xeon W5580 completes this task in almost precisely half the time it takes the Xeon X5492. Another eye-popping performance.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

Here we are again with another stunning leap in performance. The Xeon W5580 performs this simulation at over twice the rate of the fastest Opteron. Nehalem seems to be especially well suited for bandwidth-intensive scientific computing and HPC applications like the two on this page.

For what it’s worth, I should note that at lower thread counts, we saw a striking amount of variability from run to run with the Xeon W5580 system. At two threads, for instance, the scores came in at 1.5Hz, 2.4Hz, and 1.8Hz. So I wouldn’t put too much stock into those non-peak results. Things seemed to even out once we got to higher thread counts. This variance at low thread counts could be the result of one or several facets of the Nehalem architecture, including Turbo mode, NUMA, and SMT, all of which can contribute some performance variability, especially when interacting with a non-NUMA/SMT-aware application and perhaps a less-than-optimal thread scheduler. For what it’s worth, we didn’t see any such variability on our Opteron test systems.

Folding@Home

Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day a CPU could earn for a Folding team member. The CD is a bootable ISO image that boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the total number of cores (or threads, in the case of SMT) in the system to estimate the total number of points per day that CPU might achieve.
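The estimation arithmetic described above boils down to a one-liner. Here it is as a sketch; the function name and sample scores are hypothetical, not notfred’s actual code:

```python
def estimate_total_ppd(wu_ppd, n_threads):
    """Average the per-WU points-per-day scores, then scale by the
    number of hardware threads, as the benchmark CD does."""
    per_thread_avg = sum(wu_ppd) / len(wu_ppd)
    return per_thread_avg * n_threads

# Hypothetical per-thread scores for the four WU types tested:
scores = [120.0, 180.0, 150.0, 150.0]
estimate_total_ppd(scores, 16)  # 8 cores x 2 SMT threads each
```

Note how SMT flatters this metric: each thread’s individual score may be modest, but multiplying by sixteen threads rather than eight cores can still produce a higher total.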

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

Because each of its cores is executing two threads at once, the Xeon W5580’s performance in the individual work unit tests is relatively lackluster. Once we reach the bottom line and look at total projected points per day, though, it achieves nearly a 50% gain over the Xeon X5492. Just another day at the office for Nehalem, I suppose.

3D modeling and rendering

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.

To put this performance in our chess2 scene into perspective, the Xeon W5580 box finishes in 30 seconds what used to take over 10 minutes to complete.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into its games.

The new Xeon’s dominance continues here.

x264 HD video encoding

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.
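As background on why there are two passes at all: the first pass measures how complex each frame is, and the second distributes the target bit budget using those statistics, so difficult frames get more bits than easy ones. Here’s a toy sketch of that idea, with hypothetical names; x264’s actual rate control is far more sophisticated:

```python
def two_pass_allocate(complexities, total_bits):
    # Pass 1 (conceptually): per-frame complexity scores, gathered by
    # a fast analysis encode of the whole file.
    # Pass 2: split the overall bit budget in proportion to those
    # scores, so the hard frames get a larger share.
    total_complexity = sum(complexities)
    return [total_bits * c / total_complexity for c in complexities]

# Three frames of varying difficulty sharing a 6000-bit budget:
two_pass_allocate([1.0, 2.0, 3.0], 6000)
```

This is also why the first pass is typically faster than the second: it only needs to analyze the video, not produce a final, fully optimized bitstream.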

I’m at a bit of a loss to express the reality of what we’re seeing. Across a broad mix of applications, the Xeon W5580 is, by far, the fastest processor we’ve ever tested. Yes, this is a very high-end part, but Intel’s new architecture is unquestionably effective.

Sandra Mandelbrot

We’ve included this final test largely just to satisfy our own curiosity about how the different CPU architectures handle SSE extensions and the like. SiSoft Sandra’s “multimedia” benchmark is intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm. The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU. The benchmark contains many versions (ALU, MMX, (Wireless) MMX, SSE, SSE2, SSSE3) that use integers to simulate floating point numbers, as well as many versions that use floating point numbers (FPU, SSE, SSE2, SSSE3). This illustrates the difference between ALU and FPU power. The SIMD versions compute 2/4/8 Mandelbrot point iterations at once – rather than one at a time – thus taking advantage of the SIMD instructions. Even so, 2/4/8x improvement cannot be expected (due to other overheads), generally a 2.5-3x improvement has been achieved. The ALU & FPU of 6/7 generation of processors are very advanced (e.g. 2+ execution units) thus bridging the gap as well.
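The column-interlaced threading scheme Sandra describes can be sketched like so. This is a plain scalar Python illustration with made-up names and a scaled-down resolution, not Sandra’s SIMD code: thread k takes columns k, k + n_threads, k + 2×n_threads, and so on.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 64, 48  # scaled down from Sandra's 640x480

def mandel_iters(c, max_iters=255):
    # Count iterations of z = z^2 + c until |z| escapes past 2.
    z = 0j
    for n in range(max_iters):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iters

def render_columns(thread_id, n_threads):
    # Interlaced split: thread k computes columns k, k+n, k+2n, ...
    cols = {}
    for x in range(thread_id, WIDTH, n_threads):
        re = -2.0 + 3.0 * x / WIDTH
        cols[x] = [mandel_iters(complex(re, -1.2 + 2.4 * y / HEIGHT))
                   for y in range(HEIGHT)]
    return cols

def render(n_threads=4):
    image = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for part in pool.map(render_columns, range(n_threads),
                             [n_threads] * n_threads):
            image.update(part)
    return [image[x] for x in range(WIDTH)]
```

The SIMD versions Sandra benchmarks speed up the inner loop by iterating 2, 4, or 8 of these points per instruction rather than one, which is why the quoted FAQ expects roughly a 2.5-3x gain rather than a full 4x or 8x.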

We’re using the 64-bit version of the Sandra executable, as well.

Well, OK, then.