By: Michael Feldman

At SC18, Depei Qian delivered a talk where he revealed some of the beefier details of the three Chinese exascale prototype systems installed in 2018. The 45-minute session confirmed some of the speculation about these machine that we have reported on, but also offered a deeper dive into their design and underlying hardware elements.

Before he got into the prototype particulars, Qian, who is the chief scientist of the China’s national R&D project on high performance computing, presented an overview of the country’s exascale effort, specifically its goals and challenges. With regard to the former, he reiterated China’s commitment to making sure the technologies that would be used for these machines would be “self-controllable,” with the implication that most if not all of the hardware and software elements would be developed domestically. The nature of the three prototypes certainly reflects this strategy.

Qian also talked about more specific goals for these supercomputers. Specifically, a Chinese exascale system will provide a peak performance of one peak exaflop – so apparently ignoring the Linpack requirement that most other nations are adhering to); a minimum system memory capacity of 10 PB; an interconnect that offers HPC-style latency and scalability and delivers 500Gbps of node-to-node bandwidth, although most of these systems seem to topping out at 400Gbps; and a system-level energy efficiency of at least 30 gigaflops per watt.

That 30 gigaflops/watt figure works out to about 33 megawatts for an exaflop, which is slightly higher that the 20MW to 30MW being envisioned in exascale programs in the US, Japan, and the EU – and those are for Linpack exaflops. In fact, Qian said energy efficiency is their number one challenge, the lesser ones being application performance, programmability, and resilience.

As far as the prototypes go, Qian’s talk at SC18 was the first instance of a public presentation that revealed the hardware makeup of these systems. A fair amount of this was provided in a slide deck he presented last year in Japan, but since this was prior to the installation of the prototypes, some of that information is no longer accurate.

All three prototypes -- Sugon, Tianhe, and Sunway (ShenWei) – were deployed over the last 10 months, with the last one being unveiled just a month ago. In Qian’s description of their design and components, we now have a fairly good understanding for what the full exascale systems will look like when, although some critical details are still missing.

Sugon prototype

As we speculated in October, the Sugon prototype is indeed equipped with the AMD-licensed Hygon x86 processors. The advantage to this design for the supercomputing community in China is that it will maintain compatibility HPC software that’s already in production today.

The more interesting tidbit here is that the prototype will also use something called a “DCU” to act as an accelerator. Apparently, these chips are provided by Hygon as well and, according to Qian’s 2017 presentation will deliver 15 teraflops per chip in the full-blown exascale system. However, their performance to date appears to be just a fraction of that.

In the 512-node Sugon prototype, there are two Hygon x86 CPUs, plus two Hygon DCUs per node, but in the current test configuration, only half the DCUs are being used. And since the peak performance of the whole machine is 3.18 petaflops, that means the DCU in the protype is delivering something in the neighborhood of 6 teraflops – not bad, but they will need to more than double that over the next couple of years if they intend to meet their goals.

Sugon is aiming for the x86 CPU to deliver about a teraflop per chip in the exascale system, which either means Hygon has to bump up the performance in the implementation of its first-generation Zen CPU or is planning to license the Zen 2 or Zen 3 IP from AMD, either of which could easily supply the needed teraflop.

The Sugon prototype interconnect is a 6D Torus, based on 200Gbps technology of undetermined origin. It looks like they are aiming for about twice that bandwidth at some point, although that would be 100Gbps short of the generic 500Gbps exascale goal. Whatever it is, the interconnect relies on optical technology as part of its implementation.

The other interesting design feature of the Sugon machine is the use of an immersive cooling system. The prototype is employing something called Imm058, a coolant that boils at the relatively low temperature of 50C (122F). That makes it a good deal more effective than liquid cooling based on water, which boils at 100C (212F).

Tianhe prototype

Qian provided the least amount of detail for the Tianhe prototype, including the processor that will power it. As we have speculated in the past, we think this system will be based on a Chinese-designed Arm chip, which will likely be some version of Phytium’s Xiaomi platform.

In Qian’s SC18 presentation, as well as the one in 2017, the chip is only characterized as a new manycore processor that balances compute and memory, which frankly could be anything. But since China intends to build an Arm-based exascale supercomputer as one of its three options, by the process of elimination, this has to be it. Unless, of course, they have changed their minds.

As with the Sugon prototype, the Tianhe system is made up of 512 nodes, and delivers the nearly identical amount of performance: 3.14 petaflops. That suggests quite a powerful processor, something akin to the ShenWei manycore chip (see below), or perhaps a more modest processor that is suitable for a four-socket-per-node setup.

The network is a 3D butterfly design with a maximum of four hops. It is based on a high radix router chip that draws less than 200 watts of power. Optoelectrical technology will be used for the interconnect fabric, which in the final exascale system will provide 400 Gbps of bandwidth per node.

The design also emphasizes fault tolerance as a key design feature. This is implemented in the interconnect, as well as a new but undefined storage media.

Bottom line: This machine is still largely a mystery.

Sunway (Shenwei) prototype

This one uses the ShenWei 26010 (SW26010) processor, the 260-core processor that currently powers the number three-ranked TaihuLight supercomputer. Each prototype node has two of these processors, which together deliver about 6 peak teraflops. The entire 512-node machine offers 3.13 petaflops.

In its current configuration, each node provides 11 gigaflops per watt. Sunway engineers will have to nearly triple that to meet the stated target for exascale energy efficiency. Needless to say, that’s a lot of innovation that needs to occur in the two to three years of remaining time before the final system is expected to be deployed.

Unlike the Sunway TaihuLight supercomputer, which uses Mellanox InfiniBand as the basis of its interconnect fabric, the exascale prototype employs a home-grown network chip that provides 200Gbps of point-to-point bandwidth. Again, this is part of China’s strategy to bring all the exascale technology in-country. Along those the same lines, this prototype’s storage subsystem is based on a ShenWei storage box.

As with the other prototypes, the Sunway system uses a liquid cooling system, but in this case a more conventional one based on a copper cold plate design.

Final thoughts

It’s probably no accident that each of these prototypes were deployed with 512 nodes. The standard size will make it easier to evaluate these systems on a level playing field, while providing at least petascale performance for developing and running software. Despite that, these are not pre-exascale machines in the sense that they will serve as direct stepping stones to full-up exascale supercomputers.

These 3-petaflop prototypes are more like technology testbeds, and it will be a challenge to scale these designs over a single generation without an intervening pre-exascale platform. We may yet see such systems deployed in China over the next two or three years (in fact, it plausible to consider TaihuLight as such a machine), but time is not on their side. The stated goal of bringing up the first exascale system in 2020 seems less likely than it did two years ago, and even a 2021 deployment would be a significant accomplishment.

Furthermore, although China has made noteworthy strides in designing and developing high performance processors like ShenWei, as Qian admitted, the country is playing catchup in semiconductor manufacturing and packaging. That will slow development of a next generation of processors, network chips, and memory devices needed for their exascale machinery.

That said, China’s exascale efforts are poised to change the global supercomputing landscape, not just for these extreme-scale systems, but for everyday HPC. At a time when Moore’s Law is slowing down, and high performance computing is being redefined by applications in data analytics and machine learning, the global community will benefit from a greater diversity of designs and approaches. The emergence of these first exascale supercomputers may turn out to be the least interesting part of all of this.