GPU Technology Conference Intel and NVIDIA are battling for the hearts and minds of developers in massively parallel computing.

Intel has been saying for years that concurrency rather than clock speed is the future of high performance computing, yet it has been slow to provide the mass of low-power, high-efficiency CPU cores needed to take full advantage of that insight.

Another angle on this is that GPUs are already designed for power-efficient massively parallel computing, and back in 2006 NVIDIA exploited its potential for general-purpose computing with its CUDA architecture, adding shared memory and other features to the GPU and providing supporting libraries and the CUDA SDK. CUDA is primarily a set of extensions to C, though there are wrappers for other languages.

Huang's Tesla K20 will serve intense computing

At NVIDIA’s GPU Technology Conference in San Jose, California, last week, the company announced new editions of its Tesla GPU accelerator boards based on its “Kepler” architecture. These boards are designed for accelerating general-purpose computing rather than for driving displays. The Tesla K10, available now, has two Kepler GK104 GPUs, 3,072 cores in total, and performs at up to 4,577 gigaflops (2,288 gigaflops per GPU).

The Tesla K20, expected in the fourth quarter of 2012, uses two of the forthcoming Kepler GK110 GPU, which promises over 1,000 gigaflops double precision. “It’s intended for applications like computational fluid dynamics, finite element analysis, computational finance, physics, quantum chemistry, and so on,” explained chief executive Jen-Hsun Huang in his keynote speech.

Power efficiency, which is the true limitation on supercomputer performance, has also been a focus, and NVIDIA states a three times improvement in performance per watt, compared to the previous “Fermi” generation.

The not-yet-available K20 is really the one you want, and not only because of its better performance. Although both the GK104 and the GK110 are called Kepler, there are several key advances that only appear in the GK110. A Grid Management Unit in the GK110 enables a feature called Dynamic Parallelism, which means that the GPU can schedule its own work. Previously only the CPU could schedule work on the GPU. Dynamic Parallelism means that more code can run entirely on the GPU, for greater efficiency and simplified code.

Another GK110 advance is Hyper-Q, which provides 32 simultaneous connections between CPU and GPU, compared to just one in Fermi. The result is that multiple CPUs can launch work on the GPU simultaneously, greatly improving utilisation.

NVIDIA now projects that by 2014, 75 per cent of HPC customers will use GPUs for general purpose computing.

The rise of GPU computing must be troubling to Intel, especially as the focus on power efficiency raises interest in combining ARM CPUs with GPUs, though implementation is unlikely until we have 64-bit ARM on the market. Intel’s response is an initiative called Many Integrated Core (MIC, pronounced Mike). It has similarities with GPU computing, in that MIC boards are accelerator boards with their own memory, and developers need to understand that parts of an application will execute on the CPU, parts on MIC, and that data has to be copied between them.

Prototype Knights

Knights Ferry is the MIC prototype, available now to some Intel partners, and has 32 cores and up to 128 threads (four Hyper Threads per core). Knights Corner will be the production MIC and has more than 50 cores and over 200 threads. The processor in Knights Ferry, codenamed Aubrey Isle, is based on an older Pentium design for power efficiency, but includes over 100 additional x86 instructions including a Vector Processing Unit, important for many HPC applications. Knights Corner is expected in late 2012 or early 2013.

Intel is supporting MIC with its existing suite of tools for concurrent programming: Parallel Studio XE and Cluster Studio XE. Key components are Threading Building Blocks (TBB), a C++ template library, and Cilk Plus which extends C/C++ with keywords for task parallelism. Intel is also supporting OpenMP, a standardised set of directives for parallel programming, on MIC, though in doing so it is getting ahead of the standard since OpenMP does not yet support accelerators. Intel’s Math Kernel Library (MKL) will also be available for C and Fortran. OpenCL, a standard language for programming accelerators, will also be supported on MIC.