A UC Berkeley paper [PDF] recently submitted to the IEEE International Parallel and Distributed Processing Symposium manages to highlight two common and seemingly unrelated themes that have come up a number of times over the past few years in my reporting on the high-performance computing (HPC) space: 1) IBM's Cell is really good at HPC workloads when you invest the time to write custom code for it, and 2) Intel's Xeon platform is perennially bandwidth starved and not very power-efficient.

The researchers used a common HPC benchmark apparently known primarily by its initials, LBMHD (Lattice Boltzmann Magnetohydrodynamics), to test the following dual-socket, multicore platforms: Intel's "Clovertown" Xeon, AMD's Opteron X2, Sun's Niagara2, and IBM's Cell, along with a lone single-socket, single-core Itanium2. The main focus of the research was on uncovering the bottlenecks behind the LBMHD benchmark and on exploring the use of auto-tuners for optimizing it for different multicore systems. Along the way, though, the researchers uncovered some interesting details that support the two conclusions pointed out above.

The LBMHD benchmark is easily parallelizable and scales well with the number of threads. Because aggregate thread count matters far more to its performance than per-thread speed does, and because the code uses complex data structures with irregular memory access patterns, LBMHD was widely thought to be memory-bandwidth constrained.
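
To make "easily parallelizable but bandwidth-hungry" concrete, here's a minimal sketch in C—my own illustration, not the actual LBMHD kernel—of the general shape of a lattice-method update: every cell gathers several values, does a handful of flops, and writes one result, so each thread can work on its own slice of the grid independently. The grid sizes and direction count are hypothetical.

```c
/* Illustrative sketch only -- NOT the actual LBMHD kernel.
 * Compile with: cc -O2 -std=c99 -fopenmp sketch.c */
#include <stdio.h>
#include <stdlib.h>

#define NX   1024
#define NY   1024
#define DIRS 9          /* hypothetical number of lattice directions */

/* Every (x, y) cell is independent, so the loop parallelizes trivially
 * and scales with thread count -- right up until the memory pipes fill. */
static void update(const float *restrict src, float *restrict dst)
{
    #pragma omp parallel for
    for (int y = 0; y < NY; y++) {
        for (int x = 0; x < NX; x++) {
            float sum = 0.0f;
            /* Many scattered reads per cell, few flops, one write. */
            for (int d = 0; d < DIRS; d++)
                sum += src[(d * NY + y) * NX + x];
            dst[y * NX + x] = sum / DIRS;   /* stand-in "collision" step */
        }
    }
}

int main(void)
{
    float *src = calloc((size_t)DIRS * NY * NX, sizeof *src);
    float *dst = calloc((size_t)NY * NX, sizeof *dst);
    update(src, dst);
    printf("dst[0] = %f\n", dst[0]);
    free(src);
    free(dst);
    return 0;
}
```

The point is the ratio: many bytes moved per flop performed, which is why the memory system, rather than the ALUs, looks like the natural ceiling.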

In looking for bottlenecks for their auto-tuner to optimize, the researchers discovered that memory bandwidth was not, in fact, the main constraint holding back the benchmark on the platforms under examination—at least not initially. Rather, translation look-aside buffer (TLB) resources, cache bandwidth, memory latency, and code scheduling were holding the unoptimized versions of LBMHD back on the different platforms.
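
The auto-tuning idea itself is simple to sketch, even though the Berkeley implementation is far more sophisticated: generate variants of the kernel across a parameter space (unrolling depths, block sizes, and so on), time each one on the target machine, and keep the winner. Here's a toy version in C, with a hypothetical kernel and parameter set of my own invention:

```c
/* Toy auto-tuner sketch: time a parameterized kernel over candidate
 * block sizes and keep the fastest. Real auto-tuners (ATLAS, the
 * Berkeley work) search far richer spaces, but the loop is the same. */
#include <stdio.h>
#include <time.h>

enum { N = 1 << 24 };

/* Hypothetical kernel: process n floats in blocks of `block`. */
static void kernel(float *data, int n, int block)
{
    for (int i = 0; i < n; i += block)
        for (int j = i; j < i + block && j < n; j++)
            data[j] = data[j] * 2.0f + 1.0f;
}

static double time_kernel(float *data, int n, int block)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel(data, n, block);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    static float data[N];
    const int candidates[] = { 16, 64, 256, 1024, 4096 };
    const int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    int best = candidates[0];
    double best_t = 1e9;

    kernel(data, N, 64);  /* warm-up touch so timings compare fairly */
    for (int i = 0; i < ncand; i++) {
        double t = time_kernel(data, N, candidates[i]);
        printf("block %5d: %.4f s\n", candidates[i], t);
        if (t < best_t) { best_t = t; best = candidates[i]; }
    }
    printf("auto-tuned block size: %d\n", best);
    return 0;
}
```

The appeal for multicore HPC is obvious: instead of hand-tuning for every microarchitecture, you let the search rediscover the right parameters on each new machine.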

The researchers then optimized for the above factors on each platform and ran the benchmark again, only to find that, in the case of Intel's Clovertown, the bottleneck had shifted to the memory subsystem. Clovertown, it turns out, really is bandwidth-bound: the chipset can't move enough data from the single DRAM bus into both front-side buses (FSBs), a problem that seriously constrains Clovertown's ability to scale with the number of threads, despite the other optimizations.
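
Some illustrative back-of-the-envelope math (my numbers, not the paper's): a 1333MT/s front-side bus is 64 bits wide, so it can move at most 1333 million transfers/s × 8 bytes ≈ 10.7GB/s, and two fully loaded FSBs therefore demand roughly 21GB/s between them. All of that has to be fed by the chipset's lone memory subsystem, so if the DRAM side can't sustain the combined rate, adding threads just means more cores standing in line.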

In contrast, the Opteron X2's NUMA design is not bandwidth-bottlenecked, and the platform gets nearly linear scaling with the TLB optimization enabled.
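
For readers unfamiliar with why NUMA helps here: each Opteron socket has its own memory controller and its own local DRAM, so if every thread touches (and thereby physically places) its own slice of the data, each socket streams from its own memory and aggregate bandwidth scales with socket count. Here's a minimal sketch of the standard "first-touch" idiom on Linux; the paper's actual optimizations go well beyond this:

```c
/* First-touch NUMA placement sketch: on Linux, a page is physically
 * allocated on the NUMA node of the thread that first writes it. By
 * initializing the array with the same parallel loop structure used in
 * the compute phase, each socket ends up computing on local memory.
 * Compile with: cc -O2 -std=c99 -fopenmp numa.c */
#include <stdlib.h>

#define N (1L << 26)

int main(void)
{
    float *a = malloc(N * sizeof *a);   /* virtual only; no pages yet */

    /* Initialization and compute use the same static schedule, so
     * thread i touches the same slice both times. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0f;                    /* first touch: page lands on
                                           the touching thread's node */

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 2.0f + 1.0f;      /* mostly local-memory traffic */

    free(a);
    return 0;
}
```

The TLB half of the story is analogous in spirit: map the data with larger pages so that each TLB entry covers more of it, and the page-walk stalls the researchers identified fade away.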

At the top of the benchmark heap in terms of scaling and raw performance was IBM's Cell, which achieves near-perfect linear scaling for a few reasons. First, the LBMHD code had to be written especially for Cell, since the architecture is unusual enough that the researchers couldn't just compile the code and auto-tune it as they did for the other architectures. So Cell had an advantage there, in that its code was optimized from the get-go. An even bigger factor in the platform's superior performance on this benchmark was Cell's Rambus XDR-based memory subsystem, which feeds plenty of bandwidth directly into the Cell socket.
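
Here's what "written especially for Cell" means in practice: the SPEs have no caches, only 256KB local stores that software fills by issuing explicit DMA transfers, so the programmer overlaps communication with computation by hand. Below is a bare-bones, untested double-buffering skeleton for the SPE side, sketched against the Cell SDK's spu_mfcio.h intrinsics; the chunk size and the compute() callback are hypothetical.

```c
/* Double-buffered DMA skeleton for a Cell SPE (untested sketch).
 * Fetch chunk n+1 while computing on chunk n. */
#include <spu_mfcio.h>

#define CHUNK 4096  /* floats per DMA chunk (hypothetical; 16KB is the
                       per-transfer MFC limit) */

static volatile float buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(volatile float *chunk, int n);  /* the real kernel */

void process(unsigned long long ea, int nchunks)
{
    int cur = 0;
    /* Prime the pipeline: start fetching chunk 0, tagged with `cur`. */
    mfc_get(buf[cur], ea, sizeof buf[cur], cur, 0, 0);

    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)  /* kick off the next transfer early */
            mfc_get(buf[next], ea + (i + 1) * sizeof buf[next],
                    sizeof buf[next], next, 0, 0);

        /* Wait only on the current chunk's tag, then compute on it
         * while the next chunk streams in behind us. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute(buf[cur], CHUNK);

        cur = next;
    }
}
```

None of this is portable, and none of it comes for free from a compiler, which is exactly why the researchers had to hand-write the Cell port rather than auto-tune it.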

Clovertown < Harpertown < Nehalem

It bears pointing out that a lot of the blame for Clovertown's problems can be put on its chipset. The newer "Seaburg" chipset that accompanies Intel's latest "Harpertown" Xeons includes improvements specifically designed to address the cache- and memory-bandwidth problems highlighted in these benchmarks. In addition to FSB and memory bus speed boosts, Seaburg adds a 24MB snoop filter cache designed to cut down on cross-socket cache coherency traffic. All of these factors would add up to make a Harpertown/Seaburg system a more formidable contender in these benchmarks.

But of course, no one should take these remarks as a defense of Intel's current, dated system architecture. Nehalem, with its QuickPath interconnect, NUMA topology, SMT support, and greatly expanded two-level TLB, will take Intel's performance on these kinds of easily parallelizable, memory-intensive benchmarks even further. Indeed, because of the kinds of issues highlighted in this benchmark, Nehalem should mark a disruptive change in Intel's fortunes in the HPC market—an "inflection point," to use the popular buzzword. (Personally, I'm waiting for Nehalem before I pick up my first Mac Pro tower, because I'll need all of that bandwidth for two things: gaming, and editing HD video clips of junior when he/she arrives in the Fall.)
