There have always been "wow that's fast!" computers, but we didn't start calling them supercomputers until 1964.

"It is very important... that there should be one case in a hundred where... [we] write specifications simply calling for the most advanced machine which is possible in the present state of the art." – John von Neumann, 1954.

For generations, top computer hardware designers have pushed the limits of technology to make the fastest possible computers. There’s always some computational problem that stresses existing equipment and requires high-performance computing (HPC). In the early 1960s, it was the Department of Defense modeling nuclear explosions. Today, it’s weather and climate modeling, unique search and rescue efforts, scientific space exploration, and looking ahead, traveling to Mars.

It wasn't until 1964, when Seymour Cray designed the Control Data Corporation (CDC) 6600, that we started to call the fastest-of-the-fast machines "supercomputers." Cray believed there would always be a need for a machine "a hundred times more powerful than anything available today," a goal that still works well as a definition of the category.

It was Cray’s efforts to reach new records that led to the creation of supercomputers. And for several years, Cray and supercomputing were nearly synonymous.

It was never easy. Cray had to threaten to leave CDC before the company allowed his team to build the CDC 6600. With 400,000 transistors, more than 100 miles of hand-wiring, and Freon cooling, the CDC 6600 ran at a top clock speed of 40 MHz and delivered 3 megaFLOPS (3 million floating-point operations per second). This left the previous fastest computer, the IBM 7030 Stretch, eating its dust.

CDC 6600: In the beginning, 1964, there was the Control Data Corporation 6600, the first supercomputer.

Image credit: Wikimedia

Today, that's painfully slow. The first Raspberry Pi, with its 700-MHz ARM1176JZF-S processor, runs at 42 megaFLOPS.

But supercomputing has always been about straining forward. For its day, the CDC 6600 was the fastest imaginable, and it would remain so until Cray and CDC followed it up in 1969 with the CDC 7600.

The other computer manufacturers of the '60s were caught flat-footed. In a famous memo, IBM CEO Thomas Watson Jr. said, "Last week, Control Data... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers. Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer." To this day, IBM remains a serious supercomputer competitor.

The birth of Cray

In the meantime, Cray and CDC were not getting along. His designs, while both technically powerful and commercially successful, were expensive. CDC saw Cray as a perfectionist, while Cray saw CDC as a bunch of clueless middle managers. So, when the project for his next-generation CDC 8600 was running over budget and behind schedule, CDC elected to support another high-performance computing machine instead: the Star-100.

Cray left CDC to form his own company, Cray Research. There, free of corporate management oversight and fueled with ample funds from Wall Street, in 1976 he built the first of his eponymous supercomputers: the Cray-1. The 80-MHz Cray-1 used integrated circuits to achieve performance rates as high as 136 megaFLOPS.

The Cray-1's remarkable speed came in part from its unusual "C" shape, which wasn't created for its science-fiction appearance but because it gave the most speed-dependent circuit boards shorter, hence faster, circuit lengths. This attention to every last design detail is a distinguishing mark of Cray's work. Every element of Cray's design was intended to be as fast as possible.

Cray and the Cray 1: Seymour Cray, the father of supercomputing, and the first of his famous Cray computers.

Image credit: Wikimedia

The Cray-1 was also one of the first supercomputers to use vector processing. Previous supercomputers simply used the fastest possible components. Now, supercomputing design moved to the chip level.

Vector processors operate on vectors—linear arrays of 64-bit floating-point numbers—to obtain results quickly. Compared with scalar code, vector codes could minimize pipelining hazards by as much as 90 percent. The Cray-1 was also the first computer to use transistor memory instead of high-latency magnetic core memory. With these new forms of memory and processing, the Cray-1, and its descendants, became the poster child for late-'70s and early '80s supercomputing.
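To make the scalar-versus-vector contrast concrete, here is a minimal Python sketch. It is only an illustration, not a model of real Cray hardware: the function names are hypothetical, and the chunk size of 64 assumes the Cray-1's 64-element vector registers. The idea it shows is "strip-mining," where a long array is cut into register-sized pieces and each piece is handled by one conceptual vector instruction.

```python
VECTOR_LENGTH = 64  # assumed register size: 64 elements per vector register


def scalar_add(a, b):
    """Scalar style: one addition is issued per loop iteration."""
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])
    return result


def vector_add(a, b, vlen=VECTOR_LENGTH):
    """Vector style: add the arrays in chunks of up to vlen elements.

    Returns the summed list and the number of conceptual vector
    instructions that were needed.
    """
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result = []
    instructions = 0
    for start in range(0, len(a), vlen):
        chunk_a = a[start:start + vlen]
        chunk_b = b[start:start + vlen]
        # One conceptual vector instruction adds the whole chunk at once.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
total, issued = vector_add(a, b)
print(issued)  # 16 vector instructions instead of 1,000 scalar additions
```

Python still performs every addition one at a time here, so the sketch shows only the bookkeeping: far fewer instructions to fetch and decode, which is part of where a vector machine gets its speed.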

Seymour Cray wasn't done leading the way in supercomputing. The Cray-1, like the machines Cray had designed before it, used a single main processor. With the Cray X-MP, introduced in 1982, the company put four processors into the Cray-1's signature C body. With the X-MP's 105-MHz processors and a more than 200 percent improvement in memory bandwidth, a maxed-out X-MP could deliver 800 megaFLOPS of performance.

The next step forward would come with the Cray-2 in 1985. This model came with eight processors: a "foreground processor," which managed storage, memory, and I/O, and "background processors," which did the actual computations. The Cray-2 also was the first liquid-cooled supercomputer. And, unlike its predecessors, you could use it with a general-purpose operating system: UNICOS, a Cray-specific Unix System V with additional BSD features.

But the good times didn’t last. With the Cold War easing off, the government—which had used multimillion-dollar supercomputers for nuclear explosion simulations and code-cracking—was no longer spending as much on high-end computing.

MPP arrives

While Cray still loved his vector architectures, those processors were very expensive. Other companies explored a different route, massively parallel processing (MPP), which puts many processors to work in a single computer. The Connection Machine, for example, had tens of thousands of simple single-bit processors working with global memory.

MPP machines made supercomputing more affordable, but Cray resisted it. As he said, "If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens?"

Instead, Cray refocused on creating faster vector processors using the untried gallium arsenide semiconductors. That proved to be a mistake.

Cray, the company, went bankrupt in 1995. Cray, the architect, wasn't done. He founded a new company, SRC Computers, to work on a machine that combined the best features of his approach and MPP. Unfortunately, he died as a result of a car accident in 1996, before he could put his new take on supercomputing to the test.

Cray’s ideas lived on in Japan. There, companies such as NEC, Fujitsu, and Hitachi built vector-based supercomputers. From 1993 to 1996, Fujitsu's Numerical Wind Tunnel was the world's fastest supercomputer, with speeds of up to 600 gigaFLOPS. One gigaFLOPS is 1 billion FLOPS.

These machines relied on vector processing, dedicated chips using one-dimensional arrays of data. They also used multibuses to make more of MPP. This is the ancestor of the multiple instruction, multiple data (MIMD) approach that enables today's CPUs to use multiple cores.

Supercomputing for the masses: Beowulf

While Intel was spending millions of dollars developing ASCI Red, some underfunded contractors at NASA's Goddard Space Flight Center built their own "supercomputer" using commercial off-the-shelf (COTS) hardware. Using 16 486DX processors with a 10 Mbps Ethernet cable for the "bus," in 1994 NASA contractors Don Becker and Thomas Sterling created Beowulf.

Beowulf computers now consist of a cluster of computers linked by high-speed interconnects, such as InfiniBand, 10 Gigabit Ethernet, or Omni-Path. Software problems are then attacked by thousands or tens of thousands of processors working in harmony, running massively parallel distributed programs.
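The pattern behind cluster computing can be sketched in a few lines of Python. This is a toy under stated assumptions, not how a real Beowulf job is written: a thread pool stands in for the cluster's nodes, and the function names (`scatter`, `local_sum_of_squares`, `cluster_sum_of_squares`) are hypothetical. A real cluster would spread the work over separate machines that exchange messages across the interconnect.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, workers):
    """Split data into `workers` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), workers)
    chunks, start = [], 0
    for rank in range(workers):
        # The first `extra` workers each take one additional element.
        end = start + size + (1 if rank < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def local_sum_of_squares(chunk):
    """The work each 'node' does on its own slice of the problem."""
    return sum(x * x for x in chunk)


def cluster_sum_of_squares(data, workers=4):
    """Scatter the data, compute partial results, then reduce them."""
    chunks = scatter(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_sum_of_squares, chunks))
    return sum(partials)  # the "reduce" step combines the partial answers


print(cluster_sum_of_squares(list(range(1, 101))))  # 338350
```

The shape is the important part: split the problem, let every worker compute independently, and combine the partial answers at the end. The less the workers need to talk to each other, the better this scales.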

Little did they know that in creating the first Beowulf cluster, Becker and Sterling were creating the ancestor of today's most popular supercomputer design: Linux-powered Beowulf-cluster supercomputers. While the first Beowulf could hit only single-digit gigaFLOPS speeds, Beowulf showed that supercomputing was within almost anyone's reach. You can even build a Beowulf "supercomputer" from Raspberry Pi boards!

The model eventually took over supercomputing. One reason was that, with Cray out of the market, there were only Japanese vector supercomputing manufacturers. So, the U.S. Department of Energy's Accelerated Strategic Computing Initiative (ASCI), which dealt with nuclear weapon simulations, funded the first teraFLOPS computer: IBM and Intel's ASCI Red.

Intel had been watching from the sidelines. The company considered that MIMD and Beowulf clustering might enable it to create more affordable supercomputers without specialized vector processors. For IBM, this was its shot to get back into supercomputing. In 1996, both companies proved successful.

ASCI Red: IBM and Intel created the first teraFLOPS supercomputer, ASCI Red.

Image credit: Wikimedia

ASCI Red used over 6,000 200-MHz Pentium Pros to break the 1 teraFLOPS (1 trillion FLOPS) barrier in 1997. For years, ASCI Red would be both the fastest and most reliable supercomputer in the world. As Microsoft researcher Gordon Bell observed, "The ASCI program was the crystallizing event that made the transition to multicomputers happen by providing a clear market, linking design with use and a strong focus on software."

The 1990s and 2000s saw attempts to find new ways to speed up computing. One assist was the creation of the TOP500 list in 1993. This gave supercomputer vendors and customers a standardized ranking based on the Linpack Benchmark. Before this, comparing supercomputers' performance was almost impossible.
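For a feel of what a Linpack-style score measures, here is a small pure-Python sketch. It is not the real benchmark code: it solves a random dense system of linear equations with textbook Gaussian elimination, times the solve, and divides a nominal operation count by the elapsed time. The count of (2/3)n^3 + 2n^2 operations is the conventional Linpack figure as best I recall it, so treat it as an assumption, and the function names are hypothetical.

```python
import random
import time


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def nominal_flops(n):
    """Assumed nominal operation count for an n-by-n solve."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def benchmark(n=100, seed=0):
    """Time one random n-by-n solve and return a megaFLOPS estimate."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1, 1) for _ in range(n)]
    start = time.perf_counter()
    solve(a, b)
    elapsed = time.perf_counter() - start
    return nominal_flops(n) / elapsed / 1e6


print(f"{benchmark():.2f} megaFLOPS")
```

Interpreted Python will report a very modest number, which is rather the point: the same arithmetic, run by tuned libraries on thousands of processors, is what produces the figures on the list.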

Next, after years of supercomputing software architects constantly reinventing the wheel, developers agreed to use the Message Passing Interface (MPI). The standard's first version was accepted in 1994. The second version, MPI-2, adopted in 1997, defined parallel I/O operations, which proved useful for today's MPP-based supercomputers.

During this period, while MPP was moving up, other companies tried—and failed—to create their own custom supercomputers. These included Encore, Flexible, and Floating Point Systems. Others tried creating parallel microprocessors.

One such effort, DASH, at Stanford, eventually led to the SGI architecture. SGI, best known in the '90s for its MIPS chips, ironically saw its greatest supercomputer success with the Intel Itanium-powered Altix. While a cluster-based Altix was briefly the world's fastest supercomputer in 2006, it primarily used a shared massive non-uniform memory access (NUMA) memory model instead of networking to achieve its speed results. Beowulf MPP models, however, proved more flexible, and SGI gave up on its NUMA approach in 2007.

Other companies, such as Cydrome, Elexsi, and Multiflow, introduced very long instruction word (VLIW) chip architectures. VLIW, which brings process parallelism down to the chip instruction level, lives on in Itanium chips, but it never delivered its hoped-for performance.

Supercomputing today

Designers started using multiple processor types, which enabled a new set of innovations. For example, in 2014, Tianhe-2, or MilkyWay-2, used both Intel Xeon Ivy Bridge processors and Xeon Phi coprocessors to become the fastest supercomputer of its day. The Xeon Phi is a many-core accelerator that plays much the same role as a graphics processing unit (GPU). Both kinds of chips excel at floating-point calculations.


GPUs are good for far more than adding punch to 3D games. A GPU has a massively parallel architecture made up of thousands of cores designed for handling multiple tasks simultaneously. A CPU, in comparison, has a few cores optimized for sequential processing. When you put them together, you get much faster supercomputers.

Supercomputer designers have found that GPUs are great at large-scale neural network training and other numeric-intensive calculations.

For example, in 2015, Steve Oberlin, responsible for Nvidia's Tesla GPU line, said of GPUs that they'd "taken a Navier-Stokes simulation to generate the training set to train a random forest algorithm to do particle-based fluid simulations. They went from taking 30 minutes or so to generate 30 seconds of video to what is now near real-time interactions with fluids."

This new style of combining two types of COTS processors is becoming more common. In the November 2017 TOP500 supercomputer list, the vast majority of the fastest systems used floating-point accelerators such as the Xeon Phi, Nvidia Tesla P100, and PEZY-SC2.

Quite a few vendors are engaged in supercomputing today and sell equipment for HPC applications. Currently, the TOP500 supercomputer list identifies Hewlett Packard Enterprise as having the most installed supercomputers with 122 systems, followed by Lenovo with 81 computers, and Inspur in third place with 56 systems.

China's Sunway TaihuLight system is still on top with a score of 93.01 petaFLOPS on the High-Performance Linpack (HPL) test. One petaFLOPS is a quadrillion FLOPS, for those of you calculating at home.

Supercomputing is never static. As the race to ever-faster computing continues, the next goal is the exaFLOPS (or 1,000 petaFLOPS) supercomputer. We hope to see the first exaFLOPS machine appear by 2018. The Chinese are currently the world leaders in supercomputing, but the U.S. isn't out of it yet. IBM’s new Summit supercomputer, based on the IBM Power9 architecture with Nvidia GPUs, should be operational at Oak Ridge National Laboratory in Tennessee in 2018.
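For those who do want to calculate at home, the unit prefixes used throughout this story are easy to line up in a few lines of Python. This is just an arithmetic sketch: the helper names are made up for illustration, and the two machine figures are the ones quoted in this article (3 megaFLOPS for the CDC 6600 and 93.01 petaFLOPS for Sunway TaihuLight).

```python
# FLOPS per unit for each prefix used in the article.
PREFIXES = {
    "mega": 10**6,
    "giga": 10**9,
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
    "zetta": 10**21,
}


def to_flops(value, prefix):
    """Convert a figure such as 93.01 petaFLOPS into plain FLOPS."""
    return value * PREFIXES[prefix]


def speedup(fast, slow):
    """How many times faster one machine is than another."""
    return fast / slow


cdc_6600 = to_flops(3, "mega")        # figure quoted for the CDC 6600
taihulight = to_flops(93.01, "peta")  # figure quoted for Sunway TaihuLight

print(f"{speedup(taihulight, cdc_6600):.3g}")           # 3.1e+10
print(f"{speedup(to_flops(1, 'exa'), taihulight):.1f}")  # 10.8
```

In other words, today's leader is roughly 31 billion times faster than the first supercomputer, and an exaFLOPS machine would still need to be nearly 11 times faster again.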

After that, is zettaFLOPS computing even possible? Thomas Sterling, professor of computing at Indiana University and co-inventor of Beowulf, doesn't think so. "I think we will never reach zettaFLOPS, at least not by doing discrete floating point operations." But, Sterling continues, "Of course, I anticipate something else will be devised that is beyond my imagination, perhaps something akin to quantum computing, metaphoric computing, or biological computing. But whatever it is, it won’t be what we’ve been doing for the last seven decades."

So, the supercomputing race will continue even after we crack the exaFLOPS barrier. Strap yourself in; the race for faster computing is never-ending.

Did I miss any important steps in the supercomputer evolution? Let us know via Twitter at @enterprisenxt and @sjvn.