One of the ongoing themes of my microprocessor coverage over the past few years has been the relationship between on-chip execution bandwidth and the "memory wall." So I was intrigued to learn of new research from Sandia National Labs that indicates that the severity of the memory wall problem may be much greater than the industry generally anticipates.

In a nutshell, the "memory wall" problem is straightforward, and it's by no means new to the multicore era. It arises when the execution bandwidth (i.e., aggregate instructions per second, either per thread or across multiple threads and programs) available in a single socket outstrips the memory bandwidth available to that socket. As execution bandwidth increases, either because clock speeds get faster or because the die contains more cores, memory bandwidth has to increase in order to keep up.
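The arithmetic behind this is easy to sketch. The following toy calculation (all numbers are hypothetical, chosen only for illustration, and not taken from the Sandia study) shows how aggregate bandwidth demand grows linearly with core count while the memory bus stays fixed, so per-core efficiency collapses past some threshold:

```python
# Back-of-envelope memory-wall illustration. Assumed numbers:
MEM_BW_GBS = 25.0          # hypothetical socket memory bandwidth, GB/s
PER_CORE_DEMAND_GBS = 4.0  # hypothetical bandwidth one busy core wants, GB/s

for cores in (1, 2, 4, 8, 16, 32, 64):
    demand = cores * PER_CORE_DEMAND_GBS
    # Sustained throughput is capped by the memory bus once demand exceeds it.
    sustained = min(demand, MEM_BW_GBS)
    efficiency = sustained / demand
    print(f"{cores:3d} cores: demand {demand:6.1f} GB/s, "
          f"sustained {sustained:5.1f} GB/s, per-core efficiency {efficiency:6.1%}")
```

With these made-up numbers, every core runs at full speed up to about six cores; at 16 cores each core sees under 40% of the bandwidth it wants, and at 64 cores under 10%. The real crossover point depends on the workload's bytes-per-instruction ratio, which is exactly what the Sandia work is probing.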

To put this in simple multicore terms, cramming a ton of processor cores onto a single die does you no good if you can't keep those cores fed with code and data.

But memory bandwidth isn't keeping up. Memory bus performance (both latency and throughput) hasn't scaled in proportion to Moore's Law, a fact that leaves processors starving for bytes. In this respect, the "memory wall" is a classic producer/consumer problem, and it's the reason that on-die cache sizes have ballooned in recent years. As the memory wall gets higher and higher, it takes more and more cache to get you over it. At this point, it would be fair to say that most modern server processors are really high-speed memories with some processor cores stuck on the die, rather than the other way around.

The memory wall is therefore an added barrier to the success of the many-core paradigm. I say "added," because the most famous barrier is the programming model. Massively multithreaded programming isn't just a "hard problem"—rather, it's a generation's worth of Ph.D. dissertations that have yet to be written.

The work from the Sandia team, at least as it's summarized in an IEEE Spectrum article that infuriatingly omits a link to the original research, seems to indicate that eight cores is the point where the memory wall causes performance to fall off on certain types of science and engineering workloads (informatics, to be specific). At 16 cores, performance is no better than it is for dual-core, and it drops off rapidly from there as the core count approaches 64.

The chart included in the report is striking, and I wish I had the appropriate background to interpret it. (Again, the lack of any link, DOI, report title, deck title, or other reference information is unbelievable.) Nonetheless, despite the lack of color from the source, I'm sure the many-core skeptics in the audience—and there are quite a few—will seize on it as further validation that the maximum worthwhile core count is well below 16.

It looks like Sandia is proposing that stacking memory chips on top of the processor is the solution to this bandwidth problem. If that is indeed their proposal, then they're in good company. Both Intel and IBM have touted advances in chip-stacking techniques, and Sun has published research in the area of high-bandwidth memory interconnects that involves placing dice edge-to-edge. But, to my knowledge, these die-stacking schemes are further down the road than the production of a mass-market processor with more than 16 cores.