Japan’s newest supercomputer, an 802-teraflop GPU-accelerated Appro cluster, went into production last week at the University of Tsukuba, just north of Tokyo. The machine represents the lynchpin of the university’s HA-PACS project, a three-year effort that will attempt to push the envelope on GPU-pumped supercomputing.

HA-PACS, which stands for Highly Accelerated Parallel Advanced system for Computational Sciences, is just the latest in a series of “PACS” systems at Tsukuba. The original system, known as PACS-9, was installed in 1978 and delivered 7 kiloflops (yes, kiloflops!). Every two to four years thereafter, the university’s Center for Computational Sciences upgraded to a new system. The last one, PACS-CS, was deployed in 2006 and topped out at 14.3 teraflops.

The new Appro cluster represents the eighth-generation supercomputer at Tsukuba and is the first to be accelerated by GPUs. As you might suspect, the vast majority of the 802 teraflops is provided by the graphics units, in this case the latest NVIDIA Tesla GPU part, the M2090. Each cluster node pairs four of them with two 8-core Xeon E5 (“Sandy Bridge”) CPUs from Intel.

In aggregate, the 268-node HA-PACS machine will house 1,072 GPUs and 536 CPUs, as well as a total of 34 terabytes of memory on the CPU side and an additional 6.4 terabytes for the GPUs. External storage amounts to just over half a petabyte, based on DataDirect Networks’ SFA10000 gear. As a result of the high computational density afforded by the graphics chips, the entire cluster fits into just 26 racks and draws a little over 400 kW of power.

Using the top-of-the-line CPUs and GPUs makes for a dense and powerful cluster, with each node delivering just shy of 3 teraflops of peak performance. And even though most of the flops are GPU-derived (665 gigaflops per M2090), each Xeon E5 chips in with a respectable 166 gigaflops, thanks to the addition of the new Advanced Vector Extensions (AVX) instructions.
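The per-node and whole-cluster peak figures can be reproduced from the numbers quoted above. The following is a minimal sketch of that arithmetic, using only the node count, chips per node and per-chip peak ratings given in the article; the constant and function names are illustrative, not part of any HA-PACS software.

```python
# Back-of-the-envelope check of the HA-PACS peak numbers.
# All inputs are the figures quoted in the article; names are illustrative.
NODES = 268
GPUS_PER_NODE = 4
CPUS_PER_NODE = 2
GPU_PEAK_GFLOPS = 665  # per Tesla M2090, as quoted
CPU_PEAK_GFLOPS = 166  # per Xeon E5, as quoted


def node_peak_gflops():
    """Peak gigaflops of one node: four GPUs plus two CPUs."""
    return GPUS_PER_NODE * GPU_PEAK_GFLOPS + CPUS_PER_NODE * CPU_PEAK_GFLOPS


def cluster_summary():
    """Aggregate chip counts and peak performance for the whole cluster."""
    node = node_peak_gflops()
    total_gflops = NODES * node
    gpu_gflops = NODES * GPUS_PER_NODE * GPU_PEAK_GFLOPS
    return {
        "gpus": NODES * GPUS_PER_NODE,
        "cpus": NODES * CPUS_PER_NODE,
        "node_peak_gflops": node,
        "cluster_peak_tflops": total_gflops / 1000,
        "gpu_share": gpu_gflops / total_gflops,
    }


if __name__ == "__main__":
    for key, value in cluster_summary().items():
        print(f"{key}: {value}")
```

Under these inputs a node comes to 2,992 gigaflops and the cluster to roughly 802 teraflops, with the GPUs contributing close to 89 percent of the total.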

This is Appro’s second big system deployment at Tsukuba; the company delivered the 95-teraflop T2K Open Supercomputer there in 2009. That machine used AMD’s quad-core Opterons and no GPUs.

Appro, by the way, is one of the few server vendors offering systems equipped with Xeon E5 CPUs these days, and already claims four such systems on the TOP500 list: “Zin” (961 teraflops) at Lawrence Livermore National Lab, “Luna” (293 teraflops) at Los Alamos National Lab, “Gordon” (262 teraflops) at the San Diego Supercomputer Center and “Chama” at Sandia National Labs. That’s a nice accomplishment, considering Intel has yet to officially release the E5 chips into the wild.

CPUs aside, the main focus for HA-PACS is to draw the most performance from the GPU hardware. The project has a two-pronged mission in this regard: to bring more big science codes to the GPU and to develop a tightly coupled parallel computing acceleration mechanism in order to “further optimize the utility of the graphics hardware.”

On the application side, HA-PACS will be porting codes to the GPU in the areas of subatomic particles, life sciences, astrophysics, nuclear physics and environmental science. For example, astrophysics applications that deal with radiation transfer can take advantage of ray tracing methods, which modern GPUs are tailor-made for. Likewise, for elementary particle physics, GPUs can be used to great advantage to accelerate dense matrix computations.

On the computational research side, the HA-PACS team is in the process of developing custom hardware to support direct communications between the GPUs. The idea is to enable the graphics processors to quickly shuffle data between themselves without the overhead involved in going through the CPU.

This custom hardware, known as the Tightly Coupled Accelerator (TCA), will be distinct from the HA-PACS base cluster from Appro, but will eventually be integrated with it, says Taisuke Boku, deputy director of the Center for Computational Sciences at the University of Tsukuba. According to him, TCA will use PCIe as the communication channel between the GPUs and employ FPGA technology to facilitate this.

The FPGA will be based on an existing implementation developed at Tsukuba called PEACH, which stands for PCI Express Adaptive Communication Hub. The idea is to provide a controller that enables PCIe devices to directly communicate with one another on a peer-to-peer basis, rather than as slave devices.

To make this work for TCA, an upgraded implementation of the FPGA, known as PEACH2, will be developed. It will incorporate NVIDIA’s GPU-Direct communication protocols to facilitate data transfers between the Tesla parts. Bandwidth will also be improved from the original PEACH version, which used four ports of PCIe Gen2 x4 as the communication link. For PEACH2, four ports of PCIe Gen2 x8 will be supported, doubling throughput.
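To put rough numbers on that doubling, here is a short sketch of the theoretical link bandwidth for the two port configurations. It assumes the standard PCIe Gen2 signaling rate of 5 gigatransfers per second per lane with 8b/10b encoding, and it ignores packet and protocol overhead, so real transfers would come in lower. The function and constant names are illustrative and do not come from the PEACH design.

```python
# Theoretical PCIe Gen2 bandwidth for the PEACH and PEACH2 port layouts.
# Assumptions (not from the article): 5 GT/s per lane, 8b/10b encoding,
# no packet or protocol overhead. Names are illustrative.
GEN2_GT_PER_S = 5.0   # raw signaling rate per lane, gigatransfers/s
ENCODED_BITS = 10     # bits on the wire per 8 data bits (8b/10b)
DATA_BITS = 8
PORTS = 4             # both PEACH versions are described with four ports


def port_bandwidth_gbytes(lanes):
    """One-direction theoretical bandwidth of a single Gen2 port, in GB/s."""
    usable_gbits = GEN2_GT_PER_S * lanes * DATA_BITS / ENCODED_BITS
    return usable_gbits / 8  # bits to bytes


def aggregate_bandwidth_gbytes(lanes):
    """One-direction theoretical bandwidth summed over all four ports."""
    return PORTS * port_bandwidth_gbytes(lanes)


if __name__ == "__main__":
    for name, lanes in (("PEACH (x4)", 4), ("PEACH2 (x8)", 8)):
        print(f"{name}: {port_bandwidth_gbytes(lanes):.1f} GB/s per port, "
              f"{aggregate_bandwidth_gbytes(lanes):.1f} GB/s over four ports")
```

With these assumptions, an x4 port works out to 2 GB/s in each direction and an x8 port to 4 GB/s, so the doubling holds whether it is counted per port or across all four.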

The first prototype of the TCA is under development now. The plan is to incorporate the technology into a second cluster, which will be glued to the Appro base cluster by early 2013. The TCA cluster will bring another 200-plus teraflops into production, taking the integrated HA-PACS system to over a petaflop.

The HA-PACS work will be a precursor to future exascale systems already in the minds of Boku and his team at Tsukuba. He believes future exascale systems will require some level of accelerated computing technology due to its inherent advantages in performance and energy efficiency.

“The largest issue on the accelerated computing is how to fill the gap between its powerful internal computation performance and relatively poor external communication performance,” says Boku. “In some applications, we may need a paradigm shift toward a new generation of algorithms. HA-PACS will be the testbed for developing these algorithms.”