Much to the surprise of the supercomputing community, which is gathered in Germany for the International Supercomputing Conference this morning, news arrived that a new system has dramatically topped the Top 500 list of the world’s fastest and largest machines. And like the last one that took this group by surprise a few years ago, the new system is also in China.

Recall that the reigning supercomputer in China, the Tianhe-2 machine, has stood firmly at the top of that list for three years, outpacing the U.S. “Titan” system at Oak Ridge National Laboratory. We have a more detailed analysis on that trend in particular here, but needless to say, this new system is remarkable architecturally, particularly in terms of its floating point per watt capabilities—as well as politically, as this marks yet another divergence from the standard American-driven X86 norm for other machines around the world.

The Sunway TaihuLight supercomputer, which was developed at the National Research Center of Parallel Computer Engineering and Technology (NRCPC) and is in full production running early-stage workloads at the National Supercomputing Center in Wuxi, China, features the custom-designed SW26010 processor. The ShenWei chips are said to bear a strong resemblance to the Digital Alpha chip, but according to Top 500 list co-founder and renowned HPC researcher, Dr. Jack Dongarra, it is not an Alpha variant—at least based on his questions for the center, which shared the details following recent benchmarking runs for both LINPACK (the Top 500 metric) and the newer data movement-focused HPCG benchmark (analysis of TaihuLight rankings here).

The SW26010 and the Sunway TaihuLight system have been engineered for super-efficient floating point performance. If one takes a look at efficiency in terms of floating point operations per watt, most of the top ten supercomputers on the planet hit around 2 gigaflops per watt. TaihuLight hits 6 gigaflops per watt, an impressive number, but of course still nowhere near the 50 gigaflops/watt required for exascale efficiency targets. Still, it is a move in the right direction.
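To put those efficiency figures in perspective, here is a rough back-of-the-envelope sketch of how much power a one-exaflop machine would draw at each level. It assumes an exaflop means 10^18 floating point operations per second and uses the approximate gigaflops-per-watt figures quoted above; the function name is our own, purely for illustration.

```python
# Rough power draw of a hypothetical one-exaflop machine at a given efficiency.
# The efficiency inputs are the approximate figures quoted in the article.
EXAFLOP = 1e18  # floating point operations per second


def megawatts_for_exaflop(gigaflops_per_watt):
    """Return the power (in MW) needed to sustain one exaflop."""
    watts = EXAFLOP / (gigaflops_per_watt * 1e9)
    return watts / 1e6


for label, gf_per_watt in [
    ("typical top-ten system", 2),
    ("TaihuLight", 6),
    ("exascale target", 50),
]:
    print(f"{label}: {megawatts_for_exaflop(gf_per_watt):.0f} MW")
```

By this arithmetic, an exaflop at roughly 6 gigaflops per watt would need on the order of 167 MW, against 20 MW at the 50 gigaflops/watt target.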

From the high-level view, there is nothing hugely complicated about the cache-free architecture; in fact, it is that simplicity that makes the system hum versus the power-hungry, dense heterogeneity of some other machines on the current and future Top 500. The entire system is built from the 1.45 GHz SW26010 processors. Each node is a single processor chip made up of four “core groups.” Each of these groups has 65 cores (one management core and 64 computing cores), with the management core also capable of handling compute. This creates a total of 260 cores per node, and the system is built up from there.

So, we have the 260-core node, and there are also “supernodes,” each made up of 256 nodes and occupying a quarter of a cabinet. Four supernodes go in a cabinet, and the full system stretches to forty cabinets in total. There is an interconnect built into the chip (referred to as the custom “network on a chip” interconnect) and another interconnect for hooking the nodes together to form a supernode.

There is also another level of the network that connects things at a cabinet level, and another that brings it all home at the system level across 40 cabinets.
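The hierarchy described above can be tallied with a few lines of arithmetic. This is a minimal sketch under one stated assumption: that each supernode contains 256 nodes, which is our reading of the description above. The constant names are ours.

```python
# Tally of the TaihuLight hierarchy as described in the article.
CORES_PER_CORE_GROUP = 65      # 1 management core + 64 computing cores
CORE_GROUPS_PER_NODE = 4       # one SW26010 chip per node
NODES_PER_SUPERNODE = 256      # assumption: our reading of the supernode size
SUPERNODES_PER_CABINET = 4
CABINETS = 40

cores_per_node = CORES_PER_CORE_GROUP * CORE_GROUPS_PER_NODE
total_nodes = NODES_PER_SUPERNODE * SUPERNODES_PER_CABINET * CABINETS
total_cores = cores_per_node * total_nodes

print(f"cores per node: {cores_per_node}")
print(f"total nodes:    {total_nodes}")
print(f"total cores:    {total_cores}")
```

Under that assumption the machine works out to 40,960 nodes and a little over 10.6 million cores.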

Does the high-level concept look at all familiar from other HPC systems, present and future? If not, take a look at Knights Landing and, soon, Knights Hill, which we will see at scale in the massive Aurora supercomputer in a couple of years. Take a look too at the projected performance (and performance per watt) of those machines and see that while this Sunway machine is big news now, the fat lady has not sung: the Top 500 top slot is not locked up for the next three or more years. A number of systems featuring Knights Landing will start to appear in November of this year and, in the 2018 timeframe, at least one massive supercomputer will sport next-generation “Knights Hill” parts, which are projected to have a similar profile in terms of gigaflops per watt and potential peak performance.

In part to put this in some Intel perspective and highlight the above point, Dongarra provided a chart comparing Knights Corner and Knights Landing to the metrics we have on TaihuLight below.

We have talked plenty about the processor and its potential, but all is lost without a solid interconnect. Despite digging, all we know is what Dongarra told us earlier: center officials tell him it is custom developed, but nothing more. “They are claiming a custom interconnect but it does look like InfiniBand and it could perhaps be coming from Mellanox,” he says.

“Sunway has built their own interconnect. Nodes are connected using PCIe 3.0 connections in what’s called a Sunway Network. Sunway’s custom network consists of three different levels, with central switching network at the top, the supernode network in the middle, and the resource sharing network at the bottom. The bi-section network bandwidth is 70 TB/s with a network diameter of 7.”

The processor and interconnect stories both feed into the scaling and efficiency stories, but the real standout feature of this machine is how many gigaflops it can fit into a single watt. As mentioned earlier, it is still not close to the 50 gigaflops/watt required for exascale, but compared to current systems on the Top 500 list, it does boast some remarkable efficiency. The efficiency figures below are for the LINPACK benchmark and count processor, memory, and the interconnect. The cooling system for TaihuLight uses a close-coupled chilled water outfit suited for 28 MW with a custom liquid cooling unit.

When drilling into that efficiency, one sees quickly the simple architecture designed for efficient FLOPs, but that low power consumption comes at a cost. The memory is very slow and while that seems like it would matter for real-world applications, there are clear indications that even with that memory handicap, the system can do remarkable things. In fact, the highly coveted Gordon Bell prize could very well be handed to this Chinese machine this year. There are three applications that made it to the final round of reviews before the award is handed out and according to Dongarra, there were several more submissions that were not selected to make it to that stage.

Going back to the memory handicap for a moment, recall that floating point metrics are no longer the only game in town. The LINPACK benchmark, the yardstick by which supercomputing might is most frequently (and publicly) measured, shows outstanding ratings for this machine. But the newer HPCG benchmark, which was put together by Jack Dongarra and colleagues to collect data movement metrics that better reflect the needs of real-world applications, shows this new system lagging far behind its companions in the top ten of the Top 500 supercomputer list.

In the results below, take a look at the percent of peak performance on HPCG. Other machines are getting around 2 percent, but this system gets only 0.3 percent, a very low rating that shows moving data through the hierarchy is very expensive and will limit performance.
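To make that gap concrete, here is a small sketch of what the two percent-of-peak figures would mean for sustained HPCG performance. The 100 petaflops peak is a hypothetical round number chosen only for illustration, not a measured figure for any system, and the function name is ours.

```python
# What a given percent-of-peak means in sustained HPCG performance.
# HYPOTHETICAL_PEAK_PF is an illustrative round number, not a real system's peak.
HYPOTHETICAL_PEAK_PF = 100.0


def sustained_petaflops(peak_petaflops, percent_of_peak):
    """Sustained performance implied by a percent-of-peak rating."""
    return peak_petaflops * percent_of_peak / 100.0


typical = sustained_petaflops(HYPOTHETICAL_PEAK_PF, 2.0)   # "around 2 percent"
low = sustained_petaflops(HYPOTHETICAL_PEAK_PF, 0.3)       # "only 0.3 percent"

print(f"at 2.0% of peak: {typical:.1f} PF")
print(f"at 0.3% of peak: {low:.1f} PF")
print(f"ratio: {typical / low:.1f}x")
```

Whatever the peak, a machine at 2 percent extracts between six and seven times more HPCG performance per peak flop than one at 0.3 percent.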

Oh, but those Gordon Bell prize submissions.

“The fact that they have three finalists for the Gordon Bell award is a big deal. It’s a high point for any system or application,” Dongarra tells The Next Platform. “Most applications that run at that level run close to ‘at scale’ using almost all the processors. And this is capable of running nearly at scale. It is not just a stunt machine and these results are impressive and should be taken seriously.”

In short, while there were many in the supercomputing set who claimed the Chinese “Tianhe-1” and its follow-on Tianhe-2 machines were “stunt” systems to some degree (designed to do a few things well application wise, but also to exploit sheer floating point potential), that same thing cannot be said of the new system.

One of the reasons the Tianhe-2 machine made even bigger news this time last year was that the future of the system was going to be affected by the restrictions that bar the Chinese from using Intel processors at certain supercomputing sites in the country. There is clearly no Intel inside this machine, and it is yet another stake in the ground for native Chinese development of architectures that can be grown and controlled in terms of production, cost, and ecosystem.

The system cost approximately $270 million, which includes all research, development, and production but does not include operational costs, which are not insignificant.