For the first time since the Top 500 rankings of the most powerful supercomputers in the world were started 23 years ago, the United States is not home to the largest number of machines on the list – and China, after decades of intense investment and engineering, is.

Supercomputing is not just an academic or government endeavor; it is an intensely nationalistic one, given the enormous sums required to create the components of these massive machines, write software for them, and keep them running until some new approach comes along. And given that the machines support the militaries and indigenous industries of their home countries, there are always politics in play. The competition between HPC centers in the United States, China, Europe, and Japan is to be expected and is, in a sense, a reflection of the larger economic reality on Earth.

China has had the most powerful machine on the list before, starting with the Tianhe-2 system at the National Supercomputer Center in Guangzhou, which was built by the National University of Defense Technology. That hybrid Xeon-Xeon Phi machine came in at the top of the list in June 2013 with a 33.86 petaflops rating on the Linpack HPC system performance benchmark. Tianhe-2 has held that top spot for the past three years, but today it has been knocked from its perch by a new – and unexpected – machine called Sunway TaihuLight, built for the National Supercomputing Center in Wuxi by the National Research Center of Parallel Computer Engineering and Technology. TaihuLight has a peak theoretical performance of 125.4 petaflops and delivered a sustained 93 petaflops on the Linpack test.

We have drilled down into the details of the Sunway TaihuLight system separately, but back when Tianhe-2 was announced in June 2013, there were many people who said that system was a stunt machine, something that the Chinese government had put together because it could afford to do so and wanted the top spot on the list.

As far as we can see, the Sunway TaihuLight system is a real machine with its own architecture, including an innovative processor design based on the Sunway (sometimes spelled ShenWei) RISC processor that has been in development for the past decade. There are many who speculate that the Sunway processor is based on the Alpha 21164, a RISC chip developed by Digital Equipment for its AlphaServer systems over two decades ago, but this has never been confirmed and has often been denied. As far as we know, there have been three prior generations of ShenWei processors, and there could have been more developed behind the scenes at Wuxi. The ShenWei SW-1 was a single-core chip from 2006 running at 900 MHz, the SW-2 had two cores topping out at 1.4 GHz, and the SW-3, also known as the SW1600, had sixteen cores running at 1.1 GHz. The Sunway TaihuLight supercomputer that was unveiled as the most powerful system on the June 2016 Top 500 rankings uses the SW26010 processor, which puts a total of 256 compute cores and another four management cores on a single die, running at 1.45 GHz. The system has a whopping 40,960 nodes with over 10.6 million cores among them, backed by over 1.31 PB of memory.
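The headline TaihuLight numbers hang together with a little back-of-the-envelope arithmetic; the per-node memory figure below is our own inference from the totals, not something stated above:

```python
# Back-of-the-envelope check of the Sunway TaihuLight figures cited above.
nodes = 40_960
cores_per_die = 256 + 4            # 256 compute cores plus 4 management cores

total_cores = nodes * cores_per_die
print(f"{total_cores:,} cores")    # 10,649,600 -> "over 10.6 million cores"

peak_pf, linpack_pf = 125.4, 93.0  # peak vs sustained Linpack petaflops
print(f"{linpack_pf / peak_pf:.1%} Linpack efficiency")   # 74.2%

mem_pb = 1.31                      # aggregate memory in petabytes
print(f"{mem_pb * 1e6 / nodes:.0f} GB per node")          # ~32 GB (inferred)
```

That roughly 74 percent Linpack efficiency is the ratio of sustained to peak performance, a figure worth watching as new architectures come onto the list.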

It stands to reason that the Sunway TaihuLight machine will be ranked number one on the Top 500 list until the “Summit” system being built by IBM, Nvidia, and Mellanox Technologies for the US Department of Energy’s Oak Ridge National Laboratory comes online in late 2017, if all goes well, perhaps with around 150 petaflops of peak performance across its 3,500 nodes. The Summit machine will comprise multiple 24-core Power9 processors, an unknown number of “Volta” GV100 Tesla coprocessors, and a 200 Gb/sec HDR InfiniBand interconnect. The expectation is that the “Aurora” system based on “Knights Hill” Xeon Phi processors from Intel, going into Argonne National Laboratory, will surpass Summit in 2018, with a rating of around 180 petaflops across more than 50,000 nodes.

That is a long time to wait for the United States to be back on top of the list in terms of performance, and unless some of the hyperscalers and cloud builders start running the Linpack benchmark on their biggest clusters, as myriad Chinese companies have done, then it seems unlikely that the United States will take the lead again in system count. This, ironically enough, could actually help spur more funding for supercomputers in the United States and foster more investment in Europe and Japan, too.

On the current list, China has 167 machines, up from 109 only six months ago, when a burst of activity first gave it that count. As the Chinese machines came onto the list – many as Internet companies ran Linpack on their gear – they knocked other systems off, and as a consequence only 165 systems on the June 2016 list call the United States home, down from 233 machines a year ago and 256 two years ago. Not since the early 2000s has the United States held fewer than half the machines on the list, and its share now is lower still, at about a third. There are 105 machines on the June 2016 rankings on European soil (about the same as last year), with Germany having the most at 26 systems, followed by France with 18 machines and the United Kingdom with 12. China is pulling up the Asian class average on the Top 500, but Japan saw its count drop from 37 machines six months ago to 29 on this ranking.

The most powerful system in the United States that has run the Linpack test is the “Titan” supercomputer at Oak Ridge, built by Cray with a mix of 16-core AMD Opteron 6274 processors and Nvidia Tesla K20X GPU accelerators linked by its “Gemini” interconnect, which delivers 17.6 petaflops of sustained performance on the Linpack benchmark. Titan was the number one ranked machine when it was first tested for the November 2012 Top 500 list; seven months later it was knocked to the number two position by Tianhe-2, and now it sits in the number three spot behind Sunway TaihuLight.

The “Sequoia” BlueGene/Q system at the US Department of Energy’s Lawrence Livermore National Laboratory, which represents the culmination of the BlueGene massively parallel design by IBM, has 1.57 million cores running at 1.6 GHz and a custom 3D torus interconnect to deliver 17.2 petaflops of Linpack oomph, and its Mira sister, installed at Argonne, has about half as many cores and about half the performance at 8.59 petaflops. Sequoia was partially built and ran Linpack for the November 2011 list and came in first for only one ranking, back in June 2012. It has been dropping as Titan, Tianhe-2, and Sunway TaihuLight were installed and tested, and Mira has mirrored its moves downwards on the list.

Between these two machines is the K supercomputer made by Fujitsu for the Riken Advanced Institute for Computational Science in Japan, which is rated at 10.5 petaflops and uses a Sparc64-VIIIfx processor with eight cores that run at 2 GHz and the custom Tofu 6D interconnect. This machine, by the way, remains one of the most efficient machines to run the Linpack test, and it was installed more than five years ago.

The remaining four machines in the Top 10 of the Top 500 are all built by Cray and all use its current “Aries” interconnect. The “Trinity” system shared by Los Alamos National Laboratory and Sandia National Laboratories is a Cray XC40 that uses 16-core “Haswell” Xeon E5-2698 v3 processors and delivers 8.1 petaflops, while “Piz Daint” is an XC30 that mixes eight-core Xeon E5-2670 processors and Nvidia Tesla K20X GPU accelerators to hit 6.27 petaflops on the Linpack test. The “Hazel Hen” system at Höchstleistungsrechenzentrum Stuttgart (HLRS) in Germany is ranked ninth on the list, with 5.64 petaflops of sustained Linpack oomph, and uses twelve-core Xeon E5-2680 v3 processors to do its computations. The Shaheen II supercomputer at King Abdullah University of Science and Technology in Saudi Arabia is configured with 16-core Haswell Xeons like Trinity, but has fewer of them and comes in at 5.54 petaflops.

The capability-class machines that have been on the list for four or five years might be a little long in the tooth considering that supercomputing centers like to have a new machine every three or four years, but they are showing longevity and continued usefulness. And considering that the largest machines tend to cost on the order of $200 million a pop, wanting to squeeze extra work out of them seems only natural. (Upgrade deferrals are common everywhere in IT, except perhaps at the hyperscalers, which seem to have nearly infinite money to invest in new infrastructure.) But at some point, HPC centers, which by definition exist to push the boundaries of system architecture, need new machinery and new software investment to make it hum.

This is why the performance of the largest machines on the Top 500 list, and the aggregate oomph of the machines that comprise the entire list, keeps growing. But as we have pointed out before, the pace of that growth is slowing, and we think one of the reasons is that the cost of a unit of performance is not coming down as fast as performance is going up. Do the math and you find that, over the past four decades, the performance of the biggest supercomputers has increased four times faster than the bang for the buck of these machines has.

The number of petaflops-class machines continues to grow, thanks to Moore’s Law advances in compute engines and interconnects, and there are now 95 machines with more than 1 petaflops of sustained Linpack performance, up from 81 six months ago, 68 machines a year ago, 37 machines two years ago, and 26 machines three years ago. Putting together a petaflops-scale system has certainly gotten both easier and more affordable, but we wonder whether demand is more elastic than it looks: if the price were dropped, would demand increase at a faster pace? The mitigating factor, of course, is the cost of building facilities and paying for the electricity to power and cool these behemoths. If energy were free, or even close to it, this market would look very different indeed. And, by the way, so would enterprise datacenters and hyperscalers.
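Using just the June-list counts cited above, the year-over-year growth in petaflops-class systems works out as follows (a quick sketch; the year labels are ours, mapping the "three years ago" through "now" counts onto 2013 through 2016):

```python
# Machines with more than 1 petaflops of sustained Linpack performance,
# per the counts cited above for successive June Top 500 lists.
counts = {2013: 26, 2014: 37, 2015: 68, 2016: 95}
years = sorted(counts)
for prev, curr in zip(years, years[1:]):
    growth = counts[curr] / counts[prev] - 1
    print(f"{prev} -> {curr}: {growth:+.0%}")
# 2013 -> 2014: +42%
# 2014 -> 2015: +84%
# 2015 -> 2016: +40%
```

So membership in the petaflops club has been growing at somewhere between 40 percent and 84 percent a year over this stretch.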

The thing is, however, that the performance of the aggregate Top 500 list continues to grow, even if it is at a slower pace over the long haul. On the June 2016 list, all of the machines added up to a combined 566.7 petaflops, which is up from 363 petaflops a year ago, 274 petaflops two years ago, and 223 petaflops three years ago. The concurrency in the systems – the number of cores spread across clusters and MPPs and such – keeps rising, too, hitting an average of 81,995 this time around compared to an average of 50,495 a year ago, 43,301 two years ago, and 38,700 three years ago. The big machines near the top of the list, which have millions of cores and soon will have tens of millions, are pulling up the class average, and will continue to do so.
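For what it is worth, the aggregate figures above imply that the list total grew faster this year than in the prior two, and a quick calculation shows how much of that latest jump TaihuLight alone accounts for (again, the year labels are ours for the "three years ago" through "now" figures):

```python
# Aggregate Linpack performance of the whole Top 500 list, in petaflops,
# per the figures cited above for successive June lists.
aggregate_pf = {2013: 223, 2014: 274, 2015: 363, 2016: 566.7}
years = sorted(aggregate_pf)
for prev, curr in zip(years, years[1:]):
    growth = aggregate_pf[curr] / aggregate_pf[prev] - 1
    print(f"{prev} -> {curr}: {growth:+.0%}")
# 2013 -> 2014: +23%
# 2014 -> 2015: +32%
# 2015 -> 2016: +56%

# Sunway TaihuLight's 93 petaflops is a big chunk of the latest gain.
gain = aggregate_pf[2016] - aggregate_pf[2015]   # 203.7 petaflops added
print(f"TaihuLight share of the 2015 -> 2016 gain: {93.0 / gain:.0%}")
```

In other words, nearly half of this year's increase in the list total comes from a single machine.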

The other interesting thing to note is that the number of accelerated machines dropped slightly, to 93 systems from 104 machines on the November 2015 list. The Nvidia Tesla business unit has been growing sales quite rapidly, so it is a bit perplexing that the Top 500 list doesn’t reflect this. Revenues are not shipments, we realize, and the hyperscalers doing machine learning are driving a lot of that Tesla business – and perhaps so are a lot of smaller clusters that can’t make the Top 500 list. It will be interesting as well to see what the split will be between Xeon and Xeon Phi systems, and whether many of them will end up being hybrid by mixing node types within a single cluster instead of hybrid by mixing CPU and GPU inside of cluster nodes. (We report on Nvidia’s latest “Pascal” Tesla coprocessors separately today, which plug into regular PCI-Express slots but which do not support NVLink interconnects.)

There are no doubt a bunch of machines that never run the Linpack test and therefore never make the list – the “Blue Waters” hybrid CPU-GPU supercomputer built by Cray for the National Center for Supercomputing Applications, which has a peak performance of over 13 petaflops, was the first big machine we heard of that did not run Linpack. So while the list is illustrative of technology trends and helps drive competition among the vendors that make systems and the countries that invest in them, it is not a complete ranking of HPC systems. And it would be better for everyone if it were.