SC10 Although everybody seems to be excited about GPU-goosed supercomputing these days, Big Blue is sticking to its Power-based, many-cored BlueGene and Blue Waters massively parallel supers, and revving them up to bust into the 20-petaflops zone.

The Blue Waters massively parallel Power7-based supercomputer, with its funky switching and interconnect and very dense packaging, was the big iron of last year's SC09 event in Portland, Oregon, which El Reg told you all about here. And we've covered the GPU additions to the iDataPlex bladish-rackish custom servers IBM builds, as well as the forthcoming GPU expansion blade for Big Blue's BladeCenter blade servers, which is due in December and which is also a special-bid product.

But the BlueGene/Q super — made of fleets of embedded PowerPC processor cores — is still, in terms of aggregate number-crunching power, the biggest and baddest HPC box on the horizon from IBM for the next two years.

IBM lip-smackingly announced the sale of the "Sequoia" BlueGene/Q supercomputer to the US Department of Energy back in February 2009, just as the current BlueGene/P machines were ramping up production. But the company did not provide many details about the architecture, except that it would pack 1.6 million cores into the machine, would have 1.6PB of storage, a peak performance of 20 petaflops, and burn 6.6 megawatts of juice. The machine will be installed at Lawrence Livermore National Laboratory, which bought the first experimental BlueGene/L super.

This week IBM yanked a compute node and an I/O node out of the prototype portion of the future BlueGene/Q super that's installed at its Watson Research Center in New York and showcased them at the SC10 supercomputing show, the first outing of the BlueGene/Q system components.

To understand BlueGene/Q, you have to compare it to the prior BlueGene machines and their predecessors to see how far the design has come and why IBM still believes that the BlueGene approach — small cores, and lots of them — provides the best bang for the watt.

The original BlueGene/L machine was based on some early parallel-computing design work done in the early 1990s by IBM in conjunction with Columbia University, Brookhaven National Laboratory, and RIKEN (the big Japanese government-sponsored super lab) to make a massively parallel machine called QCDSP to do quantum chromodynamics calculations using digital signal processors.

A follow-on machine called QCDOC replaced the DSPs with embedded PowerPC processors, putting 64 compute nodes on a single board that interconnected with a proprietary backplane.

In December 1999, IBM ponied up $100m of its own dough to create the original BlueGene/L machine, aiming the box at massive protein-folding problems. Two years later, LLNL saw that such a machine could be used for nuclear weapons simulations and placed the first order for the prototype.

By the fall of 2004, a prototype of the BlueGene/L machine became the fastest supercomputer in the world, using eight BlueGene/L cabinets, each with 1,024 compute nodes, for a sustained performance of 36 teraflops. That machine has been upgraded many times, and has now reached its full system configuration, which includes 65,536 compute nodes and 1,024 I/O nodes (both based on 32-bit PowerPC processors).

BlueGene/L held the top spot on the Top 500 ranking of supercomputers, which is based on the Linpack Fortran benchmark test, for four years. The machine is based on 32-bit PowerPC 440 cores that spin at 700MHz and are packed two to a die with a shared L2 and L3 cache. Each core has two floating-point units, and the chip also carries memory controllers, Gigabit Ethernet interfaces, and the proprietary 3D torus interconnect (derived from the Columbia University machines), over which the Message Passing Interface (MPI) clustering protocol lashes the nodes together like oxen pulling a cart.

The BlueGene/L machine at LLNL, which was first installed in 2005 and which has been upgraded a number of times, has 131,072 cores, 32TB of aggregate main memory, a peak performance of 367 teraflops, a sustained performance of 280.6 teraflops on the Linpack test, and burns around 1.2 megawatts. The machine is air-cooled.
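For the curious, the peak figure above can be roughly reproduced from the core count and clock speed. The sketch below assumes that each of the two floating-point units per core completes one fused multiply-add (two flops) per cycle; that factor is our assumption and is not stated in the article, and the variable names are purely illustrative.

```python
# Figures quoted above for the LLNL BlueGene/L machine.
CORES = 131_072
NODES = 65_536
CLOCK_HZ = 700e6
LINPACK_TFLOPS = 280.6
MEMORY_TB = 32

# Assumption (not from the article): each of the two floating-point units
# per core retires one fused multiply-add, i.e. 2 flops, per clock cycle.
FLOPS_PER_CYCLE = 2 * 2


def peak_teraflops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak: every core issuing flops on every cycle."""
    return cores * clock_hz * flops_per_cycle / 1e12


peak = peak_teraflops(CORES, CLOCK_HZ, FLOPS_PER_CYCLE)
linpack_efficiency = LINPACK_TFLOPS / peak
memory_per_node_mb = MEMORY_TB * 1024 * 1024 / NODES

print(f"peak: {peak:.1f} teraflops")
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
print(f"memory per node: {memory_per_node_mb:.0f}MB")
```

Under that assumption the peak lands at about 367 teraflops, and the Linpack run extracts roughly three quarters of it.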

IBM's current massively parallel box is the BlueGene/P, which puts four 850MHz PowerPC 450 cores on a chip along with the memory controllers, floating point unit, and BlueGene interconnect, as well as a beefed-up 10 Gigabit Ethernet controller and the old Gigabit Ethernet port. Those PowerPC 450 cores are still 32-bit units, by the way.

Each BlueGene/P node can support 2GB of main memory (512MB for each core), and the 3D torus has 5.1GB/sec of bandwidth and somewhere between 160 nanoseconds and 1.3 microseconds of MPI point-to-point latency between its nearest peers in a single node. Compared with BlueGene/L, that's a factor of 2.4 more bandwidth and about 20 per cent lower latency.

The BlueGene/P collective network that brings the nodes together has 1.7GB/sec of bandwidth per port (2.4 times that of the BlueGene/L machine) and there are three ports per node that have a 2.5 microsecond latency talking to other nodes. In a worst-case scenario, where a node has to make 68 hops across 72 racks in the 3D torus to reach another node to get data, the latency is 5 microseconds, a big improvement over BlueGene/L, which took 7 microseconds to make the same hops.
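The 68-hop worst case falls out of the torus geometry. Here is a minimal sketch, assuming the 72 racks of 1,024 nodes are wired as a 72 x 32 x 32 torus; those dimensions are our assumption and are not given in the article, and the function names are illustrative. Because every dimension wraps around, the furthest node is half way round each ring.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Shortest hop count between nodes a and b on a wraparound torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def torus_diameter(dims):
    """Worst-case shortest path: half way round every dimension."""
    return sum(d // 2 for d in dims)


# Assumption (not from the article): 72 racks x 1,024 nodes as 72 x 32 x 32.
DIMS = (72, 32, 32)
node_count = DIMS[0] * DIMS[1] * DIMS[2]
worst_case = torus_diameter(DIMS)

# Brute-force cross-check of the diameter formula on a small torus.
small = (4, 6, 3)
brute = max(
    torus_hops((0, 0, 0), node, small)
    for node in product(*(range(d) for d in small))
)
assert brute == torus_diameter(small)

print(f"{node_count} nodes, worst case {worst_case} hops")
```

With those dimensions, the sketch gives 36 + 16 + 16 = 68 hops, matching the figure quoted above.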

An optical 10 Gigabit Ethernet network links the BlueGene/P nodes to the outside world and there is a Gigabit Ethernet network for controlling the system. The BlueGene/P system puts 1,024 compute nodes in a rack and from 8 to 64 I/O nodes (which plug into the same physical boards as the compute nodes) per rack. The machine delivers 13.9 teraflops per rack and can scale up to 256 racks, for 3.56 petaflops of peak (not sustained) number-crunching performance across more than 1 million cores.
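The per-rack and full-system numbers can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes each PowerPC 450 core peaks at four flops per cycle; that factor is our assumption and is not stated in the article.

```python
# Figures quoted above for BlueGene/P.
NODES_PER_RACK = 1_024
CORES_PER_NODE = 4
CLOCK_HZ = 850e6
MAX_RACKS = 256

# Assumption (not from the article): 4 flops per core per clock cycle.
FLOPS_PER_CYCLE = 4

cores_per_rack = NODES_PER_RACK * CORES_PER_NODE
rack_teraflops = cores_per_rack * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
system_petaflops = rack_teraflops * MAX_RACKS / 1e3
system_cores = cores_per_rack * MAX_RACKS

print(f"per rack: {rack_teraflops:.1f} teraflops")
print(f"256 racks: {system_petaflops:.3f} petaflops across {system_cores:,} cores")
```

Under that assumption a rack peaks at about 13.9 teraflops, and 256 racks hold 1,048,576 cores, which squares with the "more than 1 million" above.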

The BlueGene/P nodes, like their BlueGene/L predecessors, were air-cooled and put compute and I/O nodes on the same node boards. The BlueGene/P machines crammed twice as many cores onto a chip module (four cores instead of two) and twice as many compute nodes (32 instead of 16) onto a single compute drawer, basically quadrupling the cores and nearly quintupling floating-point performance.
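The "nearly quintupling" claim is just the core multiplier times the clock bump. A quick sketch, assuming flops per core per cycle did not change between the two generations (our assumption, not the article's):

```python
# Per-drawer scaling from BlueGene/L to BlueGene/P, using the figures above.
CORES_PER_CHIP_L, CORES_PER_CHIP_P = 2, 4
NODES_PER_DRAWER_L, NODES_PER_DRAWER_P = 16, 32
CLOCK_MHZ_L, CLOCK_MHZ_P = 700, 850

core_ratio = (CORES_PER_CHIP_P * NODES_PER_DRAWER_P) / (
    CORES_PER_CHIP_L * NODES_PER_DRAWER_L
)

# Assumption (not from the article): flops per core per cycle is unchanged,
# so peak floating-point performance scales with cores x clock speed.
flops_ratio = core_ratio * CLOCK_MHZ_P / CLOCK_MHZ_L

print(f"cores per drawer: x{core_ratio:.0f}")
print(f"peak flops per drawer: x{flops_ratio:.2f}")
```

That works out to four times the cores and a little under five times the peak flops per drawer.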

The power drain on BlueGene/P also went up by a factor of 1.5, with a petaflops of peak oomph burning about 2.9 megawatts. But the performance per watt increased by 9 per cent, so it was a net gain on all fronts: performance and energy efficiency.

With the BlueGene/Q designs, IBM is doing a number of different things to boost the performance and energy efficiency of the massively parallel supers. First, the BlueGene/Q processors, called BGQ for short at IBM, bear some resemblance to IBM's Power7 chip used in its commercial servers, and an even stronger resemblance to the Power A2 "wire-speed" processors, which El Reg discussed in detail this year when they were announced.

Like these two commercial chips, the BlueGene/Q processor is a 64-bit chip with four threads per core. The BlueGene/Q processor module is a bit funky in that it has 17 cores on it, according to Brian Smith, a software engineer for the product who was demonstrating the compute and I/O modules at the SC10 expo. On that BGQ processor, one of the cores will run a Linux kernel and the other 16 are used for calculations, according to Smith.

The cores used in the BlueGene/Q prototype run at 1.6GHz, compared to the 2.3GHz speed of the sixteen-core Power A2 wire-speed processor. (The cores could be the same or very similar on both chips.) With the BlueGene/Q super, not only is the BGQ chip moving to 64 bits, but it also has four threads per core to increase its efficiency.