Yesterday the embargoes lifted on reviews of AMD's six-core Opteron server processor, codenamed "Istanbul," but for all that's riding on Istanbul, there weren't that many reviews out there. The Tech Report's Scott Wasson and AnandTech's Johan De Gelas, however, came through with very thorough reviews of Istanbul that pit it against its predecessor (quad-core Shanghai) and members of Intel's Xeon line.

Istanbul is essentially a quad-core Shanghai part with two extra cores and a faster HyperTransport interface. So it has the same 6MB L3 cache and the same dual-channel DDR2 controller, a fact that has some fascinating implications for how the part performs in different system topologies.

The dual-socket picture

Scott's review compared the six-core Istanbul to its quad-core Opteron predecessors and to the quad-core Xeon X5550 (Nehalem). All of the tests were run on dual-processor machines, and neither Scott nor Johan tested a four-processor rig.

What the tests ultimately showed was that the quad-core Xeon's combination of higher per-thread performance and superior memory bandwidth (triple-channel DDR3 vs. Istanbul's dual-channel DDR2) gave it a definitive edge over its six-core competitor in most tests. The simple fact is that bandwidth and per-thread performance still rule in most workloads.

But there are a few places where Istanbul holds its own against Xeon. In particular, Johan's tests indicate that the processor is a good buy for lower VM-count servers that are focused on consolidating existing workloads onto a single physical box. And Scott's detailed platform-level power measurements give Istanbul the edge in performance per watt, especially on highly multithreaded, scalable workloads. In such cases, the Xeon's faster memory boosts the Intel platform's power consumption, so Istanbul is able to match it with slower memory and a higher physical core count per socket. These results indicate that Istanbul is still a good fit for HPC clusters.

So while the quad-core Xeon X5550 beats the similarly priced Istanbul six-core in price/performance and raw performance in 2P configs, Istanbul is a match for it on many workloads in performance per watt. And AMD's newest is enough of an improvement over Shanghai that it's a solid upgrade for any Socket F server, especially given that it's a drop-in replacement.

What's left unanswered by both reviews is how well Istanbul compares to Intel's six-core "Dunnington" Xeon, which is bandwidth-impaired due to its aging frontside bus architecture. If anyone knows of any reviews that covered this, please send a link my way.

The four-socket picture

Though neither reviewer tested a four-socket Istanbul, I suspect that this configuration is where the new processor will really shine. The secret sauce here is something that AMD calls "HT Assist," and it's basically a trick that lets AMD trade off 1MB of L3 cache space per processor for an alleged HT streaming bandwidth increase of 17GB/s (from 25GB/s with HT Assist disabled to 42GB/s with it enabled).
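
To put rough numbers on that tradeoff, here's a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above (6MB of L3, 1MB reserved for HT Assist, and AMD's claimed 25GB/s and 42GB/s); the function and key names are my own invention, and the bandwidth figures are AMD's claims rather than anything measured.

```python
def ht_assist_tradeoff(l3_total_mb=6, reserved_mb=1,
                       bw_without_gbps=25, bw_with_gbps=42):
    """Summarize what HT Assist gives up and what it is claimed to gain.

    Defaults are the figures quoted in the article; the bandwidth
    numbers are AMD's claims, not measurements.
    """
    gain_gbps = bw_with_gbps - bw_without_gbps
    return {
        # L3 left over for actual data once the index is carved out
        "usable_l3_mb": l3_total_mb - reserved_mb,
        # share of the L3 that is handed over to the index
        "l3_fraction_given_up": reserved_mb / l3_total_mb,
        # claimed absolute and relative streaming-bandwidth gains
        "bandwidth_gain_gbps": gain_gbps,
        "bandwidth_gain_pct": 100 * gain_gbps / bw_without_gbps,
    }


if __name__ == "__main__":
    for key, value in ht_assist_tradeoff().items():
        print(f"{key}: {value}")
```

In other words, if AMD's numbers hold up, giving away one-sixth of the L3 buys a claimed 68 percent increase in streaming bandwidth over the HT links.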

HT Assist works by using 1MB of L3 to store an index of the contents of the other processors' caches. So whenever a processor needs to check the other processors' caches for the latest version of a piece of data, instead of polling the other sockets with a broadcast request, it can check the local index to see which processor's cache is hosting a copy of the needed data and ping that processor directly. This scheme cuts down significantly on coherency-related bus traffic, leaving more HT link bandwidth available for actual data transfers.
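
A toy model makes the traffic savings concrete. The Python sketch below is not AMD's implementation. It assumes a heavily simplified system in which each cache line has at most one holder and the index is always accurate, and every name in it (`simulate_probes`, `owner`, and so on) is hypothetical. All it does is count probe messages under a broadcast scheme versus an index lookup.

```python
import random


def simulate_probes(num_sockets=4, num_lines=1024,
                    num_accesses=10_000, seed=0):
    """Count coherency probes: broadcast vs. an index of remote caches.

    Toy assumptions (not from AMD): one holder per cache line, a
    perfectly accurate index, and uniformly random accesses.
    Returns (broadcast_probes, directed_probes).
    """
    rng = random.Random(seed)
    owner = {}  # cache line -> socket currently holding it
    broadcast_probes = 0
    directed_probes = 0

    for _ in range(num_accesses):
        requester = rng.randrange(num_sockets)
        line = rng.randrange(num_lines)
        holder = owner.get(line)  # None means no socket has it cached

        if holder != requester:
            # Without an index, the requester must ask every other socket.
            broadcast_probes += num_sockets - 1
            # With an index, it asks only the one socket known to hold
            # the line, or nobody at all if the line is uncached.
            if holder is not None:
                directed_probes += 1

        # In this toy model the requester ends up holding the line.
        owner[line] = requester

    return broadcast_probes, directed_probes


if __name__ == "__main__":
    for sockets in (2, 4, 8):
        b, d = simulate_probes(num_sockets=sockets)
        print(f"{sockets} sockets: broadcast={b} directed={d}")
```

In this model, every remote lookup on a four-socket system costs three probes with broadcasting but at most one with the index, and the gap widens as sockets are added. Real workloads, shared lines, and index misses would all muddy the picture, so treat this as an illustration of the idea rather than a performance estimate.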

It seems likely that in a 2P system, Istanbul is hamstrung by its dual-channel DDR2 controller. But when you move to 4P and there's more inter-socket memory traffic, the memory bottleneck shifts from the DDR2 link to the HT link, and that's where HT Assist will help tremendously.

The real question is whether HT Assist can help even the score with Xeon in a 4P config, or whether this new trick is necessary just to keep Istanbul from falling even further behind in 4P and above. The answer to that will depend partly on how inefficient QPI is in handling cache coherency among four or more Xeon sockets; but even if QPI is relatively inefficient vs. HT, Xeon still has the following advantages in terms of cache coherency: 1) it has only four cores per socket emitting snoop traffic, and 2) it has more memory bandwidth per socket. So it may turn out to be the case that HT Assist is necessary just to keep the combination of added snoop traffic from the two extra cores per socket and lower per-socket memory bandwidth from murdering Istanbul's performance in 4P and above systems.

Following on this line of inquiry, it's worth noting that Intel didn't announce any special, HT Assist-style tricks with its recent Nehalem-EX reveal. That isn't to say that they won't eventually do something like this, but they just haven't announced it.

It's also worth noting that at some point, everyone trades cache transistors for bandwidth, and Intel had a very long history of doing that before QPI came to market. (Witness the ballooning cache sizes on Intel's processors until Nehalem launched.) HT Assist is just another, very specialized example of this classic tradeoff, which is often a very good one that lets a processor maker leverage Moore's Law directly to rein in system costs and power consumption.
