One reason China has a good chance of hitting its ambitious goal to reach exascale computing in 2020 is that the government is funding three separate architectural paths to attain that milestone. This internal competition will pit the National University of Defense Technology (NUDT), the National Research Center of Parallel Computer (NRCPC) and Sugon (formerly Dawning) against one another to come up with the country’s (and perhaps the world’s) first exascale supercomputer.

As it stands today, each vendor has developed and deployed a 512-node prototype system based on what appears to be primarily pre-exascale componentry. Transforming these very modest prototypes into 100,000-node-plus exascale supercomputers is going to be quite a challenge, not only because it represents a huge leap in scale, but also because China is committed to powering these systems using relatively immature domestic processors. In a recent presentation, NUDT’s Ruibo Wang recapped the three prototypes that were deployed in 2018 and filled in some of the specifics on his organization’s plans for its exascale machine: Tianhe-3.

Let’s start with the NRCPC prototype, which, as a CPU-only machine, is probably the most conventional of the bunch. In fact, it’s the only non-accelerated architecture currently vying for exascale honors in China. Each of its nodes is equipped with two ShenWei 26010 (SW26010) processors, the same chip that powers the Sunway TaihuLight supercomputer. The 26010 has 260 cores and delivers about 3 teraflops of 64-bit floating-point performance. Presumably, Sunway has a more powerful ShenWei chip in the works for NRCPC’s future exascale system, although it hasn’t offered any indication of what that might look like. We would expect it to deliver something on the order of 10 teraflops.

The Sugon prototype is a heterogeneous machine whose nodes are each outfitted with two Hygon x86 CPUs and two DCUs, and hooked together by a 6D torus network. The CPU is a licensed clone of AMD’s first-generation EPYC processor, while the DCU is an accelerator built by Hygon. In a 2017 presentation, Depei Qian said the DCU in the full exascale system will deliver 15 teraflops, which certainly is not the case for the prototype system. One interesting facet of the Sugon machine is that it’s being cooled by a liquid immersion system, which might indicate that the DCU chip dissipates an enormous amount of heat.

The NUDT prototype is another heterogeneous architecture, in this case using CPUs of unknown parentage, plus the Matrix-2000+, a 128-core general-purpose DSP chip. The Matrix-2000+ is presumably the successor to the Matrix-2000, the accelerator used in the 100-petaflop Tianhe-2A supercomputer, which is currently the number four system on the TOP500 list. At peak, the Matrix-2000+ delivers two teraflops of performance and burns about 130 watts. At that rate, an exaflop machine built entirely from these DSPs would need 500,000 of them, and the chips alone would draw about 65 megawatts.
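The 65-megawatt figure follows directly from the per-chip numbers. Here is a back-of-the-envelope sketch in Python, assuming every flop comes from the DSPs running at their peak rate and ignoring CPUs, memory, network and cooling. The function and constant names are ours, chosen for illustration only.

```python
EXAFLOP = 1e18                   # target peak, in flops
MATRIX_2000_PLUS_FLOPS = 2e12    # quoted peak per chip (2 teraflops)
MATRIX_2000_PLUS_WATTS = 130.0   # quoted power per chip


def chips_for_target(target_flops: float, flops_per_chip: float) -> float:
    """Chips needed to reach target_flops if each runs at its peak rate."""
    return target_flops / flops_per_chip


def total_power_megawatts(chip_count: float, watts_per_chip: float) -> float:
    """Aggregate chip power in megawatts (chips only, no other components)."""
    return chip_count * watts_per_chip / 1e6


chips = chips_for_target(EXAFLOP, MATRIX_2000_PLUS_FLOPS)
power_mw = total_power_megawatts(chips, MATRIX_2000_PLUS_WATTS)
print(f"{chips:,.0f} chips, {power_mw:.0f} MW")
```

Since real machines never sustain peak on every chip and carry plenty of non-compute overhead, this is best read as a lower bound on the power such a design would need.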

However, for NUDT’s Tianhe-3 exascale system, the plan is to use the upcoming Matrix-3000 DSP and some future CPU. The DSP is expected to sport at least 96 cores and deliver more than 10 teraflops of performance, while the 64-core CPU will provide 2 teraflops. Each blade will be equipped with eight of these DSPs paired with eight CPUs, providing 96 teraflops per blade.

The entire system will comprise 100 cabinets, each containing 128 blades, for a quoted peak of 1.29 exaflops. (The round figure of 96 teraflops per blade multiplies out to about 1.23 exaflops across 12,800 blades, so the quoted total implies DSPs running somewhat above 10 teraflops.) Everything will be hooked together with a homegrown 400Gbps network, using a 3D butterfly topology. That will provide a maximum of five hops between any two nodes. Cooling will be provided by a hybrid air/water system, which is expected to deliver a PUE of less than 1.1.
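For readers who want to check the blade and system numbers, here is a short Python sketch of the arithmetic. It treats the DSP’s “more than 10 teraflops” as exactly 10, so the result is a floor, not a specification. The constant and function names are ours. With those round inputs the system comes to roughly 1.23 exaflops, a little under the 1.29 exaflops quoted; a DSP at about 10.6 teraflops would close the gap.

```python
DSP_TFLOPS = 10.0          # assumed floor: "more than 10 teraflops"
CPU_TFLOPS = 2.0           # quoted CPU peak
DSPS_PER_BLADE = 8
CPUS_PER_BLADE = 8
BLADES_PER_CABINET = 128
CABINETS = 100

TOTAL_BLADES = BLADES_PER_CABINET * CABINETS


def blade_peak_tflops(dsp_tflops: float) -> float:
    """Peak teraflops for one blade of eight DSPs and eight CPUs."""
    return DSPS_PER_BLADE * dsp_tflops + CPUS_PER_BLADE * CPU_TFLOPS


def system_peak_exaflops(dsp_tflops: float) -> float:
    """Peak exaflops for the full machine (1 exaflop = 1e6 teraflops)."""
    return blade_peak_tflops(dsp_tflops) * TOTAL_BLADES / 1e6


def dsp_tflops_for_target(target_exaflops: float) -> float:
    """DSP peak needed to hit a system target, with the CPU peak held fixed."""
    per_blade = target_exaflops * 1e6 / TOTAL_BLADES
    return (per_blade - CPUS_PER_BLADE * CPU_TFLOPS) / DSPS_PER_BLADE


print(f"blade peak:  {blade_peak_tflops(DSP_TFLOPS):.0f} teraflops")
print(f"system peak: {system_peak_exaflops(DSP_TFLOPS):.4f} exaflops")
print(f"DSP peak implied by 1.29 exaflops: {dsp_tflops_for_target(1.29):.1f} teraflops")
```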

The only big mystery remaining is the nature of Tianhe-3’s CPU. As we’ve speculated before, we’re guessing that it’s going to be some sort of Arm processor. That still makes a lot of sense, especially because China has hinted for some time that one of its exascale systems will be using this architecture. Given the processor’s two-teraflop performance goal, it may even end up being an Armv8-A implementation with the Scalable Vector Extension (SVE).

If NUDT decides to go down that route, one possible avenue would be to license Fujitsu’s A64FX design, the Arm SVE technology behind Japan’s Post-K exascale supercomputer. Not only do these processors deliver 2.7 teraflops of performance today, but Fujitsu has already developed a set of HPC libraries for them. As we reported just a couple of weeks ago, Fujitsu is looking to sell some of the technology it developed for Post-K, and the intellectual property behind its HPC Arm chip might be its most bankable product.

In any case, if the Tianhe-3 developers are on schedule, we’ll find out soon enough what they chose for their CPU design.