By: Michael Feldman

A new Chinese supercomputer, the Sunway TaihuLight, captured the number one spot on the latest TOP500 list of supercomputers released on Monday morning at the ISC High Performance conference (ISC) being held in Frankfurt, Germany. With a Linpack mark of 93 petaflops, the system outperforms the former TOP500 champ, Tianhe-2, by nearly a factor of three. The machine is powered by a new ShenWei processor and custom interconnect, both of which were developed locally, ending any remaining speculation that China would have to rely on Western technology to compete effectively in the upper echelons of supercomputing.

TaihuLight is currently up and running at the National Supercomputing Center in the city of Wuxi, a manufacturing and technology hub, a two-hour drive west of Shanghai. The system will be used for various research and engineering work, in areas such as climate, weather & earth systems modeling, life science research, advanced manufacturing, and data analytics. Center director Prof. Dr. Guangwen Yang will formally introduce the system on Tuesday afternoon in a session at ISC.

“As the first number one system of China that is completely based on homegrown processors, the Sunway TaihuLight system demonstrates the significant progress that China has made in the domain of designing and manufacturing large-scale computation systems,” Yang told TOP500 News.

Source: Jack Dongarra, Report on the Sunway TaihuLight System, June 2016

The supercomputer was developed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC), the same organization that designed TaihuLight’s predecessor, the Sunway BlueLight system, which is installed at the National Supercomputing Center in Jinan. BlueLight is a 796-teraflop supercomputer, which was deployed in 2011.

BlueLight is powered by an older version of the ShenWei processor, a third-generation 16-core chip, known as the SW1600, which tops out at about 140 gigaflops. In the five years since that system came online, NRCPC developed a much more powerful processor, the SW26010, a 260-core chip that can crank out just over 3 teraflops. TaihuLight has a single SW26010 in each of its 40,960 nodes, which adds up to 125 peak petaflops across the entire machine (more than 10 million cores). Linpack, of course, is going to leave some FLOPS on the table, but 93 petaflops represents a respectable 74 percent yield of peak performance.
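The arithmetic behind those figures can be checked with a short sketch. It uses only the rounded numbers quoted in this article; the variable names are illustrative, and the per-chip figure is derived from the rounded 125-petaflop peak rather than taken from a spec sheet.

```python
# Rough sanity check of the headline numbers, using the article's rounded figures.
NODES = 40_960            # one SW26010 per node
CORES_PER_CHIP = 260
PEAK_PFLOPS = 125.0       # rounded system peak quoted above
LINPACK_PFLOPS = 93.0     # rounded Linpack result quoted above

total_cores = NODES * CORES_PER_CHIP
per_chip_tflops = PEAK_PFLOPS * 1000 / NODES   # petaflops -> teraflops per chip
linpack_yield = LINPACK_PFLOPS / PEAK_PFLOPS

print(f"total cores:       {total_cores:,}")
print(f"per-chip peak:     {per_chip_tflops:.2f} teraflops")
print(f"Linpack yield:     {linpack_yield:.0%}")
```

The results line up with the text: a little over 10.6 million cores, just over 3 teraflops per chip, and a Linpack yield of about 74 percent.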

At 3 teraflops, the new ShenWei silicon is on par with Intel’s “Knights Landing” Xeon Phi, another manycore design, but one with a much more public history. In a bit of related irony, it was the US embargo of high-end processors, such as the Xeon Phi, imposed on a number of Chinese supercomputing centers in April 2015, which precipitated a more concerted effort in that country to develop and manufacture such chips domestically. The embargo probably didn’t impact the TaihuLight timeline, since it was already set to get the new ShenWei parts. But it was widely thought that Tianhe-2 was in line to get an upgrade using Xeon Phi processors, which would have likely raised its performance into 100-petaflop territory well before the Wuxi system came online.

Like its earlier incarnations, this latest ShenWei is a 64-bit RISC processor, with SIMD instruction support and out-of-order execution. Its underlying architecture is somewhat of a mystery, although it’s been speculated that the design was derived from the DEC Alpha architecture. The instruction set is specified simply as ShenWei-64.

The processor is divided into four core groups, each with 64 computing processing elements (CPE) and a management processing element (MPE), for a total of 260 cores. Each core group also includes a memory controller; together they deliver an aggregate memory bandwidth of 136.5 GB/second per socket. As one might expect of a manycore design, it runs at a relatively modest 1.45 GHz and supports just a single execution thread per core. The chip was manufactured at the National High Performance Integrated Circuit Design Center, in Shanghai. The process technology node has not been revealed.
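As a quick illustration of how the chip layout adds up, the sketch below derives the core count from the core-group structure and spreads the socket bandwidth across groups and cores. The even split of bandwidth is an assumption made here for illustration, not something stated in the source.

```python
# Core count and a naive bandwidth split for one SW26010, from the article's figures.
CORE_GROUPS = 4
CPES_PER_GROUP = 64       # computing processing elements
MPES_PER_GROUP = 1        # management processing element
SOCKET_BANDWIDTH_GBS = 136.5

cores_per_chip = CORE_GROUPS * (CPES_PER_GROUP + MPES_PER_GROUP)

# Assumption: bandwidth is shared evenly across core groups and cores.
bandwidth_per_group_gbs = SOCKET_BANDWIDTH_GBS / CORE_GROUPS
bandwidth_per_core_gbs = SOCKET_BANDWIDTH_GBS / cores_per_chip

print(f"cores per chip:       {cores_per_chip}")
print(f"bandwidth per group:  {bandwidth_per_group_gbs:.3f} GB/s")
print(f"bandwidth per core:   {bandwidth_per_core_gbs:.3f} GB/s")
```

Under that even-split assumption, each core sees roughly half a gigabyte per second of memory bandwidth.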

Memory-wise, each node contains 32 GB, adding up to a little over 1.3 PB for the whole machine. While that seems like a lot, it’s not much memory considering the number of cores it must feed. The much smaller 10-petaflop K supercomputer at RIKEN, for example, is outfitted with 1.4 PB of memory, and most of the other large systems on the TOP500 list have much better bytes-to-FLOPS ratios than that of TaihuLight. It also relies on the older DDR3 technology, which is slower and more power-hungry than the newer DDR4 memory.
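A rough version of that bytes-to-FLOPS comparison is sketched below. It uses decimal units and the rounded figures from this article, including the 125-petaflop peak for TaihuLight and the loose "10-petaflop" label for K, so the ratios are ballpark values only.

```python
# Ballpark memory-per-FLOPS comparison, using the article's rounded figures.
NODES = 40_960
MEMORY_PER_NODE_GB = 32
TAIHULIGHT_PEAK_PFLOPS = 125.0

K_MEMORY_PB = 1.4
K_PFLOPS = 10.0           # the article's round figure for the K supercomputer

taihulight_memory_pb = NODES * MEMORY_PER_NODE_GB / 1_000_000   # GB -> PB (decimal)

# Petabytes per petaflops reduces to bytes per FLOPS.
taihulight_bytes_per_flops = taihulight_memory_pb / TAIHULIGHT_PEAK_PFLOPS
k_bytes_per_flops = K_MEMORY_PB / K_PFLOPS
advantage = k_bytes_per_flops / taihulight_bytes_per_flops

print(f"TaihuLight memory:  {taihulight_memory_pb:.2f} PB")
print(f"TaihuLight ratio:   {taihulight_bytes_per_flops:.4f} bytes/FLOPS")
print(f"K ratio:            {k_bytes_per_flops:.4f} bytes/FLOPS")
print(f"K advantage:        {advantage:.1f}x")
```

On these numbers, K carries more than ten times as much memory per unit of compute as TaihuLight.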

The system is also rather light on cache. In fact, it really doesn’t have any in the L1-L2-L3 sense. Each core is allocated 12 KB of instruction cache, along with 64 KB of local scratchpad. And that’s it. The scratchpad can be used like a level 1 cache to some degree, but without the L2 and L3 levels to buttress it, there’s not a whole lot of capability to speed up memory accesses.

From a power standpoint though, TaihuLight is quite good. It draws 15.3 megawatts (MW) running Linpack, which, somewhat surprisingly, is less power than its 33-petaflop cousin, Tianhe-2, which uses 17.8 MW. TaihuLight’s energy efficiency of 6 gigaflops/watt is excellent and will certainly earn it a place in the upper reaches of the Green500 list. Keep in mind though, if the system had a more reasonable amount of memory for its size, it would draw significantly more power and its energy efficiency would suffer accordingly.
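The efficiency figure follows directly from Linpack performance divided by power draw. The sketch below applies that to both systems using the rounded numbers quoted here; the helper function name is made up for illustration.

```python
# Energy efficiency from the article's rounded Linpack and power figures.
def gigaflops_per_watt(linpack_pflops: float, power_mw: float) -> float:
    """Convert petaflops and megawatts into gigaflops per watt."""
    gigaflops = linpack_pflops * 1e6   # 1 petaflops = 1e6 gigaflops
    watts = power_mw * 1e6             # 1 MW = 1e6 W
    return gigaflops / watts

taihulight_eff = gigaflops_per_watt(93.0, 15.3)
tianhe2_eff = gigaflops_per_watt(33.0, 17.8)

print(f"TaihuLight: {taihulight_eff:.2f} gigaflops/watt")
print(f"Tianhe-2:   {tianhe2_eff:.2f} gigaflops/watt")
print(f"Ratio:      {taihulight_eff / tianhe2_eff:.1f}x")
```

By this measure TaihuLight delivers a bit over 6 gigaflops/watt, roughly three times the efficiency of Tianhe-2.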

The interconnect, simply known as the Sunway Network, is also a homegrown affair. It’s noteworthy that the older Sunway BlueLight machine employed QDR InfiniBand for the system network. The TaihuLight one, however, is based on PCIe 3.0 technology, and provides 16 GB/second of node-to-node peak bandwidth, with a latency of around 1 microsecond. Running MPI communications over it slows that down to about 12 GB/second. Such performance is pretty much on par with EDR InfiniBand or even 100G Ethernet, although the latency seems a tad high (it depends on exactly what’s being measured, of course). In any case, it looks like the design team opted for simplicity here, rather than breakneck speeds using exotic technology.

The same goes for the operating system. The Sunway Raise OS, as it’s called, uses standard Linux as the base, along with the necessary tweaks to make it work with the custom TaihuLight architecture. Other parts of the system software are also pretty standard: compilers for C/C++ and Fortran, along with the associated math libraries. All, of course, required ports to the custom ShenWei architecture and instruction set, but presumably much of that development work had already been done for the previous-generation processors.

According to TOP500 author Jack Dongarra, three scientific simulation codes run on TaihuLight have been chosen as Gordon Bell Prize finalists, two of which have managed to reach a sustained performance of 30 to 40 petaflops. The award is bestowed each year on the most noteworthy HPC application, based on “peak performance or special achievements in scalability and time-to-solution on important science and engineering problems.”

In a paper written by Dongarra and published on June 20, he describes these applications and also provides a deep dive into the TaihuLight architecture (upon which much of the information in this article was based). The paper also offers some interesting comparisons to other supercomputers. While Dongarra does have reservations about some elements of the new machine’s design, he concludes: “The fact that there are sizeable applications and Gordon Bell contender applications running on the system is impressive and shows that the system is capable of running real applications and [is] not just a stunt machine.”