At this time last year, the projection was that by 2020, ARM processors would be chewing on twenty percent of HPC workloads. In that short span, the skepticism many brought to that figure has eased with the arrival of some very attractive supercomputing options from ARM hardware makers.

Last winter, the big ARM news for HPC was mostly centered on the Mont Blanc project at the Barcelona Supercomputing Center. As the year unfolded, however, details on new projects with ARM at the core, including the Post-K supercomputer in Japan and the Isambard supercomputer in the UK, bolstered ARM’s profile in HPC. While the hardware options for ARM were increasing (and with impressive enough performance to put real pressure on established chipmakers in the high-end server market), the software ecosystem was also blossoming. That brings us to today, when even supercomputer maker Cray has decided to make a bold statement about the future of ARM in HPC, or at least so it would seem.

Cray went to the very top of its product line to find a home for Cavium ThunderX2 by adding it to its newest and highest-end XC50 line of supercomputers with the Cray Aries interconnect and tailored software environment. The company’s Fred Kohout tells The Next Platform that this is due to demand from some of their larger scale users in HPC and that they are considering adding ARM elsewhere in their product line, but could not be more specific. This system is at the top of Cray’s XC line and is the predecessor to the forthcoming Shasta architecture, which is expected to be Cray’s system to open the exascale era.

At the moment, Cavium has two different iterations of the ThunderX2 processor, and it is important to distinguish between them because they are aimed at different workloads, which matters as we consider Cray’s use of these chips for HPC. Cavium’s own ThunderX2 chip, which we detailed here back in June 2016, has 56 relatively modest homegrown 64-bit ARMv8 cores implemented in a 14 nanometer process. These cores, which have a design speed of 3 GHz, don’t have simultaneous multithreading, but the chip package has six memory controllers that can drive 3.2 GHz DDR4 main memory. The chips are tuned for throughput jobs that do not require L3 cache memory.
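For a rough sense of what those six memory controllers could mean on paper, here is a back-of-the-envelope sketch of theoretical peak memory bandwidth. It rests on assumptions that are ours, not Cavium’s: that each controller drives one standard 64-bit DDR4 channel, and that the 3.2 GHz figure is read as 3.2 billion transfers per second. The function name and parameters are purely illustrative, and sustained bandwidth on real workloads will be lower than this ceiling.

```python
def peak_bandwidth_gbs(channels, transfers_per_sec, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    Assumption: one independent channel per memory controller, each
    moving bus_width_bits of data on every transfer.
    """
    bytes_per_transfer = bus_width_bits / 8
    return channels * transfers_per_sec * bytes_per_transfer / 1e9


# Six controllers at an assumed 3.2 billion transfers per second each.
thunderx2_peak = peak_bandwidth_gbs(channels=6, transfers_per_sec=3.2e9)
print(f"Theoretical peak: {thunderx2_peak:.1f} GB/s")
```

Under these assumptions the ceiling scales linearly with channel count, which is why the number of memory controllers is the figure to watch for bandwidth-bound HPC codes.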

The second kind of ThunderX2 chip that Cavium is shipping is based on the “Vulcan” custom ARM design that the company inherited when Broadcom sold off its server chip intellectual property in the wake of its own acquisition by Avago Technologies. Broadcom was very secretive about the Vulcan designs, but we have heard that the initial designs had eight and then sixteen cores, featuring four threads per core and out-of-order execution; the chip was slated to be implemented in a 16 nanometer process and run at speeds in excess of 3 GHz. The word is that these Cray nodes are using a variant of the Broadcom Vulcan chip that sports 32 cores and delivers per-core performance that rivals Intel’s new “Skylake” Xeon SP processors announced this July. It is not clear how many memory controllers are on this Vulcan version of the ThunderX2, but we keep hearing that the number is substantial. We will be digging to find out more here at SC17.

While not the first to package the ThunderX2 for potential HPC use, Cray is sending a sharp signal to the existing processor ecosystem by rolling out its ARM story on its top-end system. We asked Kohout if the ARM showing was still, at least at this point, a bit of a science project to gauge interest at the high end before progressing. He and others from the technical team who are engaged in conversations with HPC sites about the possibilities of ARM assure us that demand is real, particularly in HPC areas where memory bandwidth is a bottleneck. That is set to appeal to areas like weather forecasting, where Cray has already carved out a thriving role. Other high-value HPC areas that need the memory bandwidth the ThunderX2 is delivering, including oil and gas and those that rely on CFD applications, are prime targets for the tightly coupled performance Cray systems strive for.

As Chris Lindahl from Cray’s supercomputing division tells us, “In the discussions we have been having, the customers are excited about the vectors, the additional memory and the broader ecosystem around the world. We are really responding to a movement.” We pointed to even newer entries to the ARM hardware lineup, including the most recent “Centriq” processor from Qualcomm, which provides some interesting performance and memory tradeoffs compared to the ThunderX2, depending on the application in question.

“There is real demand here. Key Department of Energy customers have expressed interest as well as in the broader market. We know it will take time to ramp and our expectations are well calibrated from a technology and release standpoint,” Kohout says.

Nothing is kept back on the ARM-based XC50 supercomputer; the systems have the full Cray software environment, including the Cray Linux Environment, the Cray Programming Environment, and tuned ARM compilers, libraries, and tools for running today’s supercomputing workloads. Cray says there have been enhancements made to its compiler and programming environment to achieve more performance out of the Cavium ThunderX2 processors. They claim that in “a head-to-head comparison of 135 standard HPC benchmarks, Cray’s compiler showed performance advantages in two-thirds of the benchmarks, and showed significant (more than 20 percent) performance advantage in one-third of the tests, versus other public domain ARMv8 compilers from LLVM and GNU.”

Cray says members of its software and systems teams are currently working with multiple supercomputing centers on the development of ARM-based HPC machines, including various labs in the United States Department of Energy and the GW4 alliance – a coalition of four leading, research-intensive universities in the UK. Through an alliance with Cray and the Met Office in the UK, GW4 is designing and building “Isambard,” an ARM-based Cray XC50 supercomputer. We talked with the lead on that system in depth about some recent benchmarks, which we will publish later today on this first full day of SC17.