At the Lawrence Livermore National Laboratory in California, a supercomputer named "Sequoia" puts nearly every other computer on the planet to shame. With 1.6 million processor cores (16 per CPU) across 96 racks, Sequoia can perform 16 thousand trillion calculations per second, or 16.32 petaflops.

Who would need such horsepower? The IBM Blue Gene/Q-based system was built for the Department of Energy for simulations designed to extend the lifespan of nuclear weapons. But for a limited time, the machine is being made available to outside researchers to perform all sorts of tests, a few hours at a time.

One of the first to take advantage of this opportunity was Stanford University's Center for Turbulence Research—and it wasn't hesitant about seeing what this machine is really capable of. For three hours on Tuesday of last week, researchers from the center remotely logged in to Sequoia to run a computational fluid dynamics (CFD) simulation on a million cores at once—1,048,576 cores, to be exact.

It's part of a project to test noise generated by supersonic jet engines and help design engines that are a bit quieter. The work is sponsored in part by the US Navy, which is concerned about "hearing loss that sailors on aircraft carrier decks encounter because of jet noise," Research Associate Joseph Nichols of the Center for Turbulence Research told Ars.

Using giant supercomputers to solve complex scientific problems is by no means unique these days. A larger number of cores doesn't necessarily translate into the fastest machine, either, because of differences between processor types and supercomputer designs. The million-core run is intriguing, but it also poses extreme challenges in trying to use all those cores at once without anything going wrong.

Believe it or not, three hours with a million cores wasn't enough to make a real dent in the jet noise project. Despite preparation work aimed at cutting out bottlenecks, it was just enough time to make sure the code ran properly and to get a sense of the possibilities that million-core computers can offer.

"This is really to show what we can do in the future," Nichols said. "The simulations take some time to boot up and pass through initialization. We did tune the I/O for the Blue Gene architecture, but it is still slower than the blindingly fast computation and communication speeds. Depending on how much data gets written, the I/O can add an extra chunk of time to the overhead."

The problems are all over the place, including "How do you write into one file from a million processors that are all trying to step on each other?" Nichols said. "That was an interesting thing. It kind of depended on the interconnect, too. Only some of the processors are connected to the disk, so you have to rearrange data to get the right performance."
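The fix Nichols hints at is a classic two-phase (or "collective buffering") I/O pattern: funnel each group's data to a designated aggregator node that actually has a path to disk, then let those few nodes issue large contiguous writes. Here's a minimal sketch, simulated serially in Python; the rank counts, group size, and chunk size are made up for illustration and are far smaller than a real Blue Gene/Q run:

```python
# Sketch of two-phase ("collective buffering") I/O, simulated in plain Python.
# Assumption for illustration: only every 8th rank is an "I/O rank" with a
# disk link, mirroring the idea that only some nodes reach the file system.

NUM_RANKS = 32
AGG_STRIDE = 8          # one aggregator per 8 ranks (hypothetical grouping)
CHUNK = 4               # bytes of simulation output per rank

# Each rank's local data: rank i owns bytes [i*CHUNK, (i+1)*CHUNK) of the file.
local_data = {r: bytes([r] * CHUNK) for r in range(NUM_RANKS)}

# Phase 1: gather each group's chunks onto its aggregator (network traffic).
aggregated = {}
for agg in range(0, NUM_RANKS, AGG_STRIDE):
    group = range(agg, agg + AGG_STRIDE)
    aggregated[agg] = b"".join(local_data[r] for r in group)

# Phase 2: each aggregator writes one large contiguous region at its offset,
# instead of every rank issuing tiny interleaved writes to the same file.
output = bytearray(NUM_RANKS * CHUNK)
for agg, buf in aggregated.items():
    offset = agg * CHUNK
    output[offset:offset + len(buf)] = buf
```

In a real MPI code this rearrangement is what collective I/O routines do under the hood; the payoff is that the file system sees a handful of big sequential writes rather than a million competing small ones.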

More cores, or better cores?

Sequoia was named the world's fastest supercomputer in June 2012. It later fell to second place behind a 17.59-petaflop system at Oak Ridge National Laboratory, but it is still the only system on the Top 500 supercomputers list with a million or more cores.

Of course, going for "more cores" isn't necessarily the best way to tackle a supercomputing problem. That Oak Ridge computer, named Titan, hit 17.59 petaflops using "only" 560,640 cores. The Titan system was built by the supercomputer manufacturer Cray, and it uses Nvidia graphics processing units in addition to traditional CPUs to gain dramatic increases in speed.

There are pros and cons to different approaches. Dave Turek, IBM vice president of high performance computing, told Ars last year that GPUs are more difficult to program for, and the GPU-less Sequoia is for "real science." Titan, it should be noted, does plenty of real science, tackling problems related to climate change, astrophysics, and more. With both CPUs and GPUs in a system, Titan's CPUs guide the simulations but hand off work to the GPUs, which can handle many more calculations at once despite using only slightly more electricity.

Throwing more cores at a problem doesn't necessarily result in performance gains. Code has to be carefully prepared to account for the bottlenecks that arise when information is passed from one core to another (unless the application is so parallel that each core can work on separate calculations without ever talking to each other).

Previously, Nichols' biggest calculation was performed for about 100 hours on 131,072 cores on a Blue Gene/P system. The same calculation could be done on a million Sequoia cores in about 8 to 12 hours, he said.
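The arithmetic behind that estimate is straightforward: with eight times the cores and perfectly linear scaling, the same work would finish in one-eighth the time. A quick illustrative check:

```python
# Back-of-the-envelope check on the quoted runtimes (illustrative arithmetic).
old_cores, old_hours = 131_072, 100     # the Blue Gene/P run
new_cores = 1_048_576                   # the Sequoia run

# With perfect (linear) scaling, 8x the cores finish the same work 8x faster:
ideal_hours = old_hours * old_cores / new_cores
print(ideal_hours)   # 12.5 hours
```

That 12.5-hour figure sits near the top of the quoted 8-to-12-hour range, which is what you'd expect given real-world scaling losses are offset by Sequoia's faster per-node hardware.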

Besides speeding up lengthy calculations, more cores and faster supercomputers will let scientists tackle even more complicated problems.

"Having more cores, either you reduce the time to solution or you solve more complex problems, problems involving flow that involve, say, chemical reactions, combustion," said Parviz Moin, director of the Center for Turbulence Research. "These are grand challenge problems that involve many, many more equations and a lot of work that will be distributed among the processors."

Getting code ready for the million-core run

Nichols and his team use a code named CharLES, which solves the Navier-Stokes equations, for the jet noise simulations (the "LES" stands for large eddy simulation). They're using the same code in separate projects to research scramjets (supersonic combustion ramjets), which could travel at 10 times the speed of sound.

We've written before about large supercomputing runs—for example, one using 50,000 cores on the Amazon Elastic Compute Cloud. That one was "embarrassingly parallel," meaning the calculations were all independent of each other. As a result, the speed of the interconnect (the connections between processing cores) didn't really matter.

The million-core run was not only much bigger but also more complicated. "It is parallel but there is communication [between processors] involved," Moin said. "Each core is not independent."
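The difference can be sketched with a toy 1-D domain decomposition: each simulated "core" owns a slice of the flow field and must fetch edge values from its neighbors (a "halo exchange") before every update. Everything below—the core counts, the 3-point averaging stencil—is illustrative, not the actual CharLES scheme:

```python
# Sketch: why CFD subdomains must talk every step, unlike embarrassingly
# parallel work. Each "core" owns a slice of a 1-D field arranged in a ring;
# updating a slice requires its neighbors' edge values (a halo exchange).

N_CORES = 4
POINTS_PER_CORE = 5
field = [[float(c * POINTS_PER_CORE + i) for i in range(POINTS_PER_CORE)]
         for c in range(N_CORES)]

def step(field):
    new = []
    for c, local in enumerate(field):
        # Halo exchange: fetch one edge value from each neighbor (wraps around).
        left = field[(c - 1) % N_CORES][-1]    # neighbor's rightmost point
        right = field[(c + 1) % N_CORES][0]    # neighbor's leftmost point
        padded = [left] + local + [right]
        # Simple 3-point average, standing in for a real finite-difference stencil.
        new.append([(padded[i - 1] + padded[i] + padded[i + 1]) / 3
                    for i in range(1, len(padded) - 1)])
    return new

field = step(field)
```

On a real machine each of those neighbor lookups is a network message, which is why interconnect latency dominates once the per-core workload shrinks.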

As we wrote in August 2011, Blue Gene/Q uses "transactional memory" to solve many of the problems that make highly scalable parallel programming so difficult. But prep work by humans is still required.

Nichols worked with the Lawrence Livermore folks to optimize the code for Sequoia, avoiding slowdowns in I/O performance and minimizing the communication needed in each step.

Sequoia uses IBM's proprietary five-dimensional torus interconnect. Nichols explains:

Each compute node is connected to 10 of its nearest neighbors. There are five dimensions and it goes forward and backward along each dimension. That's 5x2 and you get 10 connections with these optical links. You can communicate with processors that are further away, but it has a higher latency. Latency for the nearest neighbor is 80 nanoseconds, which is totally incredible. The calculation uses the interconnect in such a way that communication overhead was very minimal even at a million processors.
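The wrap-around neighbor structure Nichols describes can be sketched in a few lines. The torus dimensions below are hypothetical (real Blue Gene/Q partition shapes vary); the point is that every node gets exactly 5 x 2 = 10 nearest neighbors, with coordinates wrapping at the edges:

```python
# Sketch: enumerating a node's 10 nearest neighbors on a 5-D torus.
DIMS = (4, 4, 4, 4, 2)   # hypothetical 5-D partition of 512 nodes

def coords(rank):
    """Map a linear rank to 5-D torus coordinates (row-major order)."""
    c = []
    for d in reversed(DIMS):
        c.append(rank % d)
        rank //= d
    return tuple(reversed(c))

def rank_of(c):
    """Map coordinates back to a rank, wrapping each axis (the torus property)."""
    r = 0
    for size, x in zip(DIMS, c):
        r = r * size + (x % size)
    return r

def neighbors(rank):
    """One step forward and backward along each of 5 dimensions: 5 x 2 = 10 links."""
    c = coords(rank)
    out = []
    for dim in range(5):
        for step in (+1, -1):
            n = list(c)
            n[dim] += step
            out.append(rank_of(n))
    return out

print(len(neighbors(0)))   # 10
```

Mapping a CFD grid so that adjacent subdomains land on adjacent torus nodes is what keeps most messages on those 80-nanosecond nearest-neighbor links.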

Performance scaled at nearly a one-to-one ratio with the increase in cores, reaching 83 percent efficiency. As Nichols explains, going from 131,072 cores to 1,048,576 multiplies the core count by eight, so ideal scaling would deliver an eightfold speed-up.

The real speed-up was 6.6, which is "83 percent of the ideal speed-up we would like to see. It means that as we're adding more cores the code is getting faster and faster, even at the million level. This is amazing for a CFD simulation, because in CFD simulations each of these subdomains has to communicate with its neighbors at every time step to share wave information."
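To make the numbers concrete (the 6.6 and the factor of eight come from the article; the script just does the division):

```python
# Scaling arithmetic behind the quoted efficiency figure.
cores_small = 131_072        # earlier Blue Gene/P run
cores_large = 1_048_576      # Sequoia run
ideal_speedup = cores_large / cores_small     # exactly 8x more cores
measured_speedup = 6.6                        # reported by Nichols
efficiency = measured_speedup / ideal_speedup
print(round(efficiency * 100, 1))   # 82.5 -- the article rounds to 83 percent
```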

Modeling jet engines is complex, as Stanford notes in a description of the project: "These complex simulations allow scientists to peer inside and measure processes occurring within the harsh exhaust environment that is otherwise inaccessible to experimental equipment."

The simulations allow Stanford to test how changes to the engine nozzle and chevrons affect noise. With supercomputers, these tests can be done without building physical models or running experiments in wind tunnels.

"That's a really complicated problem because you have shock waves in engines which are very thin scales compared to the length of the combustor," Nichols said. "And then you have combustion as well as turbulence in everything. The idea is that we can predict the behavior. If given enough resolution we can predict what will happen with different types of designs."

The study of scramjets and the conditions under which they might fail involves similar complexity. "NASA is very much interested in such vehicles for access to space," Moin said. "These would be air-breathing vehicles as opposed to rockets. [Rockets] carry their own oxygen to orbit; they're heavier because of that. They have to carry liquid oxygen."

For now, the researchers will have to continue this work with paltry sub-million-core supercomputers. But perhaps it won't be long before supercomputers as powerful as Sequoia are the standard for such research.

As a high school student in 1994, Nichols attended a summer program at Lawrence Livermore and worked on the Cray Y-MP. At the time, it was one of the fastest machines in the world.

"Now Sequoia is 10 million times stronger than that machine," Nichols said. "This is giving us a glimpse into the future, that it's really possible to run on a million cores."