The last few years have been unusually exciting for the data center: Intel introduced Xeon Scalable while AMD re-entered the market with EPYC. On the ARM side, we’ve seen Qualcomm announce Centriq, Ampere talk about eMAG, and Cavium launch ThunderX2.

Among the available ARM options, it is Cavium that is gaining the most traction. ThunderX-based prototype systems have been used in various European efforts, including the Mont-Blanc project, the Isambard supercomputer at the University of Bristol, and a number of other projects. While the ARM ecosystem continues to improve, more significant investment is required. This is where Astra comes in. Astra is the fourth prototype system built by Sandia National Laboratories as part of its Vanguard project, which aims to deliver a future exascale ARM machine. It is part of a larger initiative to evaluate the viability of non-x86 architectures for HPC.

All three prior prototypes, Hammer, Sullivan, and Mayer, were also ARM-based. Hammer was based on AppliedMicro’s first-generation X-Gene, while Sullivan used the original ThunderX processors. Mayer, built by HPE and Cavium last year, consisted of 47 nodes using pre-production ThunderX2 parts.

Compute Node

Astra uses HPE’s Apollo 70 system, a highly dense chassis architecture that fits four dual-socket nodes in just 2U.

Each node has two 1,600 W power supplies, a 1 Gbps Ethernet management port, and a Mellanox ConnectX-5 EDR InfiniBand link. Each node packs two Cavium ThunderX2 processors with 28 cores operating at 2 GHz. Presumably this is the ThunderX2 CN9975, but it could be an unannounced SKU.

We have recently discussed the ThunderX2 family. Those processors are based on the Vulcan microarchitecture and incorporate up to 32 cores. For Astra, Sandia is using 28-core parts operating at 2 GHz, likely because they represent a better performance/power design point. Each chip supports eight channels of DDR4 memory at rates of up to 2666 MT/s, along with 56 PCIe 3.0 lanes.

The ThunderX2 CN9975 supports two-way multiprocessing. Socket-to-socket communication is done over the second-generation Cavium Coherent Processor Interconnect (CCPI2), which provides 600 Gbps of aggregate bandwidth. For Astra, each node uses one 8 GiB dual-rank DDR4-2666 DIMM per channel, for a total of 64 GiB and 170.7 GB/s of aggregate memory bandwidth per socket. Each node also has a single Mellanox ConnectX-5 VPI card in the Open Compute Project (OCP) form factor providing the 100 Gb/s EDR InfiniBand link.
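As a sanity check, the per-socket bandwidth figure can be reproduced from the channel count and transfer rate. This is a back-of-envelope sketch, assuming the nominal DDR4-2666 rate of 2666.67 MT/s over a standard 64-bit channel:

```python
# Back-of-envelope DDR4 bandwidth check:
# transfer rate x bus width x channel count.
CHANNELS_PER_SOCKET = 8
TRANSFER_RATE_MT_S = 8000 / 3        # DDR4-2666 nominal rate, ~2666.67 MT/s
BUS_WIDTH_BYTES = 8                  # 64-bit data bus per channel

per_channel_gb_s = TRANSFER_RATE_MT_S * BUS_WIDTH_BYTES / 1000
per_socket_gb_s = per_channel_gb_s * CHANNELS_PER_SOCKET
per_node_gb_s = per_socket_gb_s * 2  # dual-socket node

print(f"{per_channel_gb_s:.2f} GB/s per channel")  # ~21.33
print(f"{per_socket_gb_s:.1f} GB/s per socket")    # ~170.7
```

The 341.33 GB/s per-node figure quoted later is simply twice the per-socket number.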

Full Node Capabilities

With one DIMM per channel (sixteen DIMMs per node), each node has 128 GiB of memory feeding 56 cores with a total bandwidth of 341.33 GB/s. Those cores operate at up to 2 GHz, each with two 128-bit NEON units, for a peak theoretical throughput of 8 double-precision FLOPS per cycle. This works out to 16 GFLOPS per core.
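The peak-FLOPS arithmetic can be sketched as follows, assuming (as described above) two 128-bit NEON FMA units per core, i.e. 2 units × 2 DP lanes × 2 operations per fused multiply-add:

```python
# Peak double-precision FLOPS for one Astra node.
FREQ_GHZ = 2.0
DP_FLOPS_PER_CYCLE = 2 * 2 * 2   # 2 NEON units x 2 DP lanes x (mul + add)
CORES_PER_SOCKET = 28
SOCKETS_PER_NODE = 2

gflops_per_core = FREQ_GHZ * DP_FLOPS_PER_CYCLE          # 16 GFLOPS
gflops_per_socket = gflops_per_core * CORES_PER_SOCKET   # 448 GFLOPS
gflops_per_node = gflops_per_socket * SOCKETS_PER_NODE   # 896 GFLOPS

# Single precision doubles the lane count per 128-bit unit,
# hence the 1,792 GFLOPS per-node SP figure in the table below.
sp_gflops_per_node = gflops_per_node * 2
```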

|            | Socket                   | Node                          |
|------------|--------------------------|-------------------------------|
| Processors | 1 × CPU                  | 2 × CPU                       |
| Cores      | 28 (112 threads)         | 56 (224 threads)              |
| FLOPS (SP) | 896 GFLOPS (28 × 32)     | 1,792 GFLOPS (2 × 28 × 32)    |
| FLOPS (DP) | 448 GFLOPS (28 × 16)     | 896 GFLOPS (2 × 28 × 16)      |
| Memory     | 64 GiB DDR4 (8 × 8 GiB)  | 128 GiB DDR4 (2 × 8 × 8 GiB)  |
| Bandwidth  | 170.7 GB/s (8 × 21.33)   | 341.33 GB/s (16 × 21.33)      |

Compute Rack

The HPE Apollo 70 compute rack contains 18 chassis for a total of 72 compute nodes along with 3 InfiniBand switches. There is a single 36-port L1 switch per 6 chassis.

With 72 nodes, there are 144 ThunderX2 processors per rack for a peak double-precision compute power of 64.5 teraFLOPS.

Full Rack Capabilities

|            | Node                          | Rack                              |
|------------|-------------------------------|-----------------------------------|
| Processors | 2 (2 × CPU)                   | 144 (72 × 2 × CPU)                |
| Cores      | 56 (224 threads)              | 4,032 (16,128 threads)            |
| FLOPS (SP) | 1,792 GFLOPS (2 × 28 × 32)    | 129 TFLOPS (72 × 2 × 28 × 32)     |
| FLOPS (DP) | 896 GFLOPS (2 × 28 × 16)      | 64.51 TFLOPS (72 × 2 × 28 × 16)   |
| Memory     | 128 GiB DDR4 (2 × 8 × 8 GiB)  | 9 TiB DDR4 (72 × 2 × 8 × 8 GiB)   |

Full System

Astra comprises 36 racks for a total of 2,592 compute nodes and 5,184 processors. The interconnect is a three-level fat tree, tapered 2:1 at L1. The 36 racks house 648 chassis and 108 L1 switches. At the top of the tree are three 540-port director switches. Each is built from 30 level-2 switches that expose 18 external ports apiece (540 in total), with each one's remaining 18 links going to the 18 level-3 spine switches, one link per spine.

Each 36-port L1 switch serves six chassis (24 node-facing ports), leaving 12 ports for uplinks into the L2 switches: four links to each of the three 540-port switches.
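The port accounting above can be checked with a quick sketch (derived from the numbers stated in this article, not from an official topology document):

```python
# Port accounting for the 2:1 tapered three-level fat tree.
NODES_PER_CHASSIS = 4
CHASSIS_PER_L1_SWITCH = 6
L1_PORTS = 36

down_ports = NODES_PER_CHASSIS * CHASSIS_PER_L1_SWITCH  # 24 node-facing ports
up_ports = L1_PORTS - down_ports                        # 12 uplinks
assert down_ports == 2 * up_ports                       # the 2:1 taper

L1_SWITCHES = 648 // CHASSIS_PER_L1_SWITCH              # 108 across 648 chassis
CORE_SWITCHES = 3
links_per_core = up_ports // CORE_SWITCHES              # 4 links per 540-port switch

total_uplinks = L1_SWITCHES * up_ports                  # 1,296 L1-to-core links
used_core_ports = L1_SWITCHES * links_per_core          # 432 of each switch's 540 ports
```

Note that each 540-port switch has ports to spare: only 432 of its 540 external ports are needed for the L1 uplinks.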

With over five thousand processors and more than 145,000 cores, Astra will have a peak theoretical performance of 2.322 petaFLOPS, making it by far the most powerful ARM supercomputer built to date. In addition to the 36 compute racks, the full system includes 3 networking racks, 2 storage racks, a utility rack, and 12 HPE MCS-200 fan-coil cooling units. The projected nominal power consumption under LINPACK is 1.36 MW, with a projected peak wall power of slightly over 1.6 MW.
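Scaling the per-node figures up to the full machine reproduces the headline numbers (a quick sketch):

```python
# Full-system totals derived from the per-node figures above.
RACKS = 36
NODES = RACKS * 72                   # 2,592 compute nodes
PROCESSORS = NODES * 2               # 5,184 ThunderX2 sockets
CORES = PROCESSORS * 28              # 145,152 cores

peak_dp_pflops = CORES * 16 / 1e6    # 16 DP GFLOPS per core
memory_tib = NODES * 128 / 1024      # 128 GiB per node

print(peak_dp_pflops, memory_tib)    # ~2.322 PFLOPS, 324 TiB
```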

Astra Capabilities

|            | Rack                              | System                                  |
|------------|-----------------------------------|-----------------------------------------|
| Processors | 144 (72 × 2 × CPU)                | 5,184 (36 × 72 × 2 × CPU)               |
| Cores      | 4,032 (16,128 threads)            | 145,152 (580,608 threads)               |
| FLOPS (SP) | 129 TFLOPS (72 × 2 × 28 × 32)     | 4.644 PFLOPS (36 × 72 × 2 × 28 × 32)    |
| FLOPS (DP) | 64.51 TFLOPS (72 × 2 × 28 × 16)   | 2.322 PFLOPS (36 × 72 × 2 × 28 × 16)    |
| Memory     | 9 TiB DDR4 (72 × 2 × 8 × 8 GiB)   | 324 TiB DDR4 (36 × 72 × 2 × 8 × 8 GiB)  |

Extended WikiChip Article: Astra