AMD has picked up yet another big supercomputer win with the selection of its second-generation Epyc processors, aka Rome, as the compute engine for the ARCHER2 system to be installed at the University of Edinburgh next year. The UK Research and Innovation (UKRI) announced the selection earlier this week, along with additional details on the makeup of system hardware.

According to the announcement, when ARCHER2 is up and running in 2020, it will deliver a peak performance of around 28 petaflops, more than 10 times that of the UK’s current ARCHER supercomputer housed at EPCC, the University of Edinburgh’s supercomputing center. ARCHER, which stands for Advanced Research Computing High End Resource, has filled the role of the UK National Supercomputing Service since it came online in 2013.

The now six-year-old ARCHER is a Cray XC30 machine composed of 4,920 dual-socket nodes, powered by 12-core, 2.7 GHz Intel “Ivy Bridge” Xeon E5 v2 processors, yielding a total of 118,080 cores and rated at a peak theoretical performance of 2.55 petaflops across all those nodes. Most of the nodes are outfitted with 64 GB of memory, with a handful of large-memory nodes equipped with 128 GB, yielding a total capacity of 307.5 TB. Cray’s “Aries” XC interconnect, as the system name implies, is employed to lash together the nodes.

The upcoming ARCHER2 will also be a Cray (now owned by Hewlett Packard Enterprise) machine, in this case based on the company’s “Shasta” platform. It will consist of 5,848 nodes laced together with the 100 Gb/sec “Slingshot” HPC variant of Ethernet, which is based on Cray’s homegrown “Rosetta” switch ASIC and deployed in a 3D dragonfly topology.

Although that’s only about a thousand more nodes than its predecessor, each ARCHER2 node will be equipped with two AMD Rome 64-core CPUs running at 2.25 GHz, for a grand total of 748,544 cores. It looks like ARCHER2 is not using the new Epyc 7H12 HPC variant of the Rome chip, which was launched in September and which has a base clock of 2.6 GHz but a lower turbo boost speed of 3.3 GHz. That chip requires direct liquid cooling on the socket because it runs at 280 watts, and that much heat cannot be moved off the CPU quickly enough by fans blowing air through the server chassis.

Even though the ARCHER2 machine will only have about six times the core count, each of those Rome cores is nearly twice as powerful as the Ivy Bridge ones in ARCHER from a peak double precision flops perspective. That’s actually pretty remarkable when you consider that the nominal clock frequency on these particular Rome chips is 450 MHz slower than that of the Xeon E5 v2 counterparts in ARCHER. Having 6.3X the number of cores helps, and really, it is the only benefit we are getting out of Moore’s Law. The vector units in the Rome chips are 256-bits wide, while the AVX units in the Ivy Bridge Xeons are 128 bits wide, so this also accounts for some of the performance increase.
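The core-count and per-core comparison can be rechecked from the node counts and peak ratings quoted above. This is only a back-of-the-envelope sketch: the variable names are ours, and the per-core figures are simply peak flops divided by cores, not measured performance.

```python
# Figures quoted in the article; variable names are illustrative.
archer_nodes, archer_cores_per_node = 4920, 2 * 12     # dual-socket, 12-core Ivy Bridge
archer2_nodes, archer2_cores_per_node = 5848, 2 * 64   # dual-socket, 64-core Rome
archer_peak_pf, archer2_peak_pf = 2.55, 28.0            # peak petaflops

archer_cores = archer_nodes * archer_cores_per_node
archer2_cores = archer2_nodes * archer2_cores_per_node
core_ratio = archer2_cores / archer_cores

# Peak gigaflops per core: 1 petaflops = 1e6 gigaflops.
archer_gf_per_core = archer_peak_pf * 1e6 / archer_cores
archer2_gf_per_core = archer2_peak_pf * 1e6 / archer2_cores
per_core_ratio = archer2_gf_per_core / archer_gf_per_core

print(f"ARCHER cores:  {archer_cores:,}")
print(f"ARCHER2 cores: {archer2_cores:,}")
print(f"Core count ratio: {core_ratio:.1f}X")
print(f"Peak per core: {archer_gf_per_core:.1f} vs {archer2_gf_per_core:.1f} gigaflops "
      f"({per_core_ratio:.2f}X)")
```

On these inputs the core count grows by a bit more than 6X and peak flops per core by a bit more than 1.7X, which together account for the roughly 11X jump in peak performance.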

ARCHER2’s total system memory is 1.57 PB, which is more than five times larger than that of ARCHER, but given the roughly 11X peak performance discrepancy, the second-generation machine will have to manage with about half the number of bytes per double-precision flop. Fortunately, those bytes are moving a lot faster now, thanks to the eight-memory-controller design of the Epyc processors. The system also has a 1.1 PB all-flash Lustre burst buffer front ending a 14.5 PB Lustre parallel disk file system to keep the data moving steadily into and out of the system. All of this will be crammed into 23 Shasta cabinets, which have water cooling in the racks.

In fact, as we reported in August in our deep dive on the Rome architecture, these processors can deliver up to 410 GB/sec of memory bandwidth if all the DIMM slots are populated. That works out to about 45 percent more bandwidth than what can be achieved with Intel’s six-channel “Cascade Lake” Xeon SP, a processor that can deliver a comparable number of flops.
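The memory numbers in the last two paragraphs can be sanity-checked with some simple arithmetic. In this rough sketch, the capacity and peak figures come from the article, but the bandwidth inputs are our assumptions: a two-socket node, DDR4-3200 on Rome, DDR4-2933 on Cascade Lake, and 8 bytes per transfer per channel. The function name is hypothetical.

```python
# Memory capacity per peak flop, using figures from the article.
archer_mem_tb, archer_peak_pf = 307.5, 2.55
archer2_mem_tb, archer2_peak_pf = 1570.0, 28.0   # 1.57 PB expressed in TB

# TB divided by (PF * 1000) gives bytes per flop/sec of peak.
archer_bytes_per_flop = archer_mem_tb / (archer_peak_pf * 1000)
archer2_bytes_per_flop = archer2_mem_tb / (archer2_peak_pf * 1000)
balance_ratio = archer2_bytes_per_flop / archer_bytes_per_flop


def node_bandwidth_gbs(channels, mega_transfers, sockets=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth of a node in GB/sec (assumed inputs)."""
    return channels * mega_transfers * bytes_per_transfer * sockets / 1000


# Assumption: eight channels of DDR4-3200 per Rome socket,
# six channels of DDR4-2933 per Cascade Lake socket.
rome_gbs = node_bandwidth_gbs(channels=8, mega_transfers=3200)
cascade_lake_gbs = node_bandwidth_gbs(channels=6, mega_transfers=2933)
advantage = rome_gbs / cascade_lake_gbs - 1

print(f"Bytes per flop: {archer_bytes_per_flop:.3f} (ARCHER) vs "
      f"{archer2_bytes_per_flop:.3f} (ARCHER2), ratio {balance_ratio:.2f}")
print(f"Node bandwidth: {rome_gbs:.1f} vs {cascade_lake_gbs:.1f} GB/sec "
      f"({advantage:.0%} more)")
```

Under those assumptions the two-socket Rome figure lands at about 410 GB/sec and the gap over Cascade Lake at about 45 percent, matching the numbers above. Both are theoretical peaks, not measured results.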

The reason we are dwelling on this particular metric is that when we spoke with EPCC center director Mark Parsons in March, he specifically referenced memory bandwidth as an important criterion for the selection of the CPU that would be powering ARCHER2, telling us that “the better the balance between memory bandwidth and flops, the more attractive the processor is.”

Of course, none of these peak numbers matter much to users, who are more interested in real-world application performance. In that regard, ARCHER2 is expected to provide over 11X the application throughput of ARCHER, on average, based on five of the most heavily used codes at EPCC. Specifically, EPCC’s evaluation, presumably based on early hardware, revealed the following application speedups compared to the 2.5 petaflops ARCHER:

7X for CP2K, a quantum chemistry and solid state physics package

5X for OpenSBLI, a Navier-Stokes solver for shock-boundary layer interactions (SBLI)

3X for CASTEP, a materials modeling code

9X for GROMACS, a molecular dynamics package aimed at biological chemistry

18X for HadGEM3, the Hadley Centre Global Environmental Model

As the announcement pointed out, that level of performance puts ARCHER2 in the upper echelons of CPU-only supercomputers. (Currently, the top CPU-powered system is the 38.7 petaflops “Frontera” system at the Texas Advanced Computing Center.) It should be noted that ARCHER2 will, however, include a “collaboration platform” with four compute nodes containing a total of 16 AMD GPUs, so technically it’s not a pure CPU machine.

ARCHER2 will be installed in the same machine room at EPCC as ARCHER, so when they swap machines, there will be a period without HPC service. The plan is to pull the plug on ARCHER on February 18, 2020 and have ARCHER2 up and running on May 6. Subsequent to that, the new system will undergo a 30-day stress test, during which access may be limited.

This is all good news for AMD, of course, which has been capturing HPC business at a breakneck pace over the last several months. That’s largely been due to the attractive performance (and likely price-performance) offered by the Rome silicon compared to what Intel is currently offering.

Some recent notable AMD wins include a 24-petaflops supercomputer named Hawk, which is headed to the High-Performance Computing Center of the University of Stuttgart (HLRS) later this year, as well as a 7.5-petaflops system at the IT Center for Science, CSC, in Finland. Add to that a couple of large Rome-powered academic systems, including a 5.9-petaflops machine for the national Norwegian e-infrastructure provider Uninett Sigma2 and another system of the same size to be deployed at Indiana University. The US Department of Defense has jumped on the AMD bandwagon as well, with a trio of Rome-based supercomputers for the Air Force and Army.

All of these systems are expected to roll out in 2019 and 2020. And until Intel is able to counter the Rome juggernaut with its upcoming 10 nanometer Ice Lake Xeon processors in 2020, we fully expect to see AMD continue to rack up HPC wins at the expense of its larger competitor.

The ARCHER2 contract was worth £79 million, which translates to about $102 million at current exchange rates. The original ARCHER system cost £43 million, which converted to about $70 million at the time. So the ARCHER2 machine will cost about 1.46X as much and deliver 11X the peak theoretical performance, over an eight-year span of time. First of all, that is a very long time for an HPC center to wait for an upgrade, so clearly EPCC was waiting for a chance to get a really big jump in price/performance. And by the way, 28 petaflops is considerably higher than the 20 petaflops to 25 petaflops that EPCC was expecting back in March when the requisition was announced.

That original ARCHER system cost around $27,450 per peak teraflops back in 2012, which was on par with all-CPU systems but considerably more expensive, on a cost per teraflops basis, than the emerging accelerated systems of the time. (We did an analysis of the cost of the highest end, upper echelon supercomputers over time back in April 2018.) The ARCHER2 system is coming in at around $3,642 per teraflops, which is a huge improvement of 7.5X in bang for the buck, but the US Department of Energy is going to pay an order of magnitude less again – something on the order of $335 per teraflops – for the “Frontier” accelerated system at Oak Ridge National Laboratory and the “El Capitan” accelerated system at Lawrence Livermore National Laboratory when they are accepted in around 2022 and 2023. Both have AMD CPUs and Frontier will also use AMD GPUs for compute; the GPU for El Capitan has not yet been decided. The current “Summit” and “Sierra” systems at those very same labs, which mix IBM Power9 processors with Nvidia Tesla V100 GPU accelerators, cost a little more than $1,000 per teraflops.
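For readers who want to rerun the price/performance math, here is a minimal sketch using only the dollar figures and peak ratings quoted above. The variable names are ours, and the dollar amounts are the article's rounded conversions rather than contract values in sterling.

```python
# Dollar costs and peak performance as quoted in the article.
archer_cost_usd, archer_peak_tf = 70e6, 2_550       # 2.55 petaflops
archer2_cost_usd, archer2_peak_tf = 102e6, 28_000   # 28 petaflops

archer_usd_per_tf = archer_cost_usd / archer_peak_tf
archer2_usd_per_tf = archer2_cost_usd / archer2_peak_tf

cost_ratio = archer2_cost_usd / archer_cost_usd       # how much more ARCHER2 costs
improvement = archer_usd_per_tf / archer2_usd_per_tf  # bang for the buck gain

print(f"ARCHER:  ${archer_usd_per_tf:,.0f} per peak teraflops")
print(f"ARCHER2: ${archer2_usd_per_tf:,.0f} per peak teraflops")
print(f"Cost ratio: {cost_ratio:.2f}X, price/performance gain: {improvement:.1f}X")
```

The results differ from the quoted per-teraflops figures by about a dollar, depending on whether the division is rounded or truncated.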

Our point is, all-CPU systems are necessary, particularly for labs with diverse workloads, and they come at a premium compared to the accelerated systems at labs that have ported their codes to run on them.