I spent a while this evening reading through the documentation for the beta release of NVIDIA’s CUDA GPGPU system. My motivation for this was that nvcc, the CUDA compiler, is based on a code drop of the EkoPath compiler, which I’ve worked on intermittently over the past few years.

The programming model that these GPUs enforce is incredibly complex. It’s more than a little reminiscent of the Connection Machine (for a blast from the past, see a collection of scanned CM-5 documentation).

The idea is that the GPU executes a “kernel” of compute-intensive, highly parallelisable code on behalf of the CPU. Data is transferred to the GPU when a kernel starts to execute, and back to the host when it completes. The GPU may execute multiple kernels simultaneously, if the hardware is capable of it.

A kernel is executed as a collection of threads, with threads organised into “blocks”. The blocks are further organised into “grids”. Each thread has thread-local storage, and a block of threads shares a chunk of read-write memory. This is how the threads of a block communicate and synchronise with each other. Threads can also access the GPU’s global memory, although blocks of threads cannot synchronise with each other via global memory.
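To make the model concrete, here’s a sketch of a kernel in which the threads of a block cooperate through block-shared memory (this is my own illustration, not code from NVIDIA’s documentation; it assumes blocks of 256 threads):

```cuda
// Each thread loads one element into block-shared memory; the block
// then cooperatively reduces them to a single sum.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float buf[256];           // shared by all threads in this block

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                     // wait until every thread has loaded

    // Tree reduction within the block; note the barrier on every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];        // one result per block
}
```

Notice that the barriers only coordinate threads within one block; there’s no corresponding way for two blocks to synchronise with each other.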

Global memory has a pile of constraints. There’s not one single kind of global memory; instead, there are three. Unlike the thread-local and block-shared memories, global memories can be accessed by the host (although accesses are expensive).

Plain old read-write memory. Accesses to this memory are not cached. When a thread block executes and touches global memory, its performance is heavily penalised unless its threads access global memory in a suitable “coalesced” pattern (see section 6.1.2.1 of the Programming Guide).

“Constant” memory is read-only, and reads are cached. Again, there are restrictions on access patterns to constant memory that affect performance.

“Texture” memory is also read-only, and cached. Cached texture data is laid out to give best performance for 2D access patterns.
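In device code, the three flavours are distinguished by how variables are declared. A sketch, based on my reading of the beta documentation (the details of the texture API in particular may shift before release):

```cuda
// Where data lives is spelled out in its declaration.
__device__   float results[1024];  // plain global memory: read-write, uncached
__constant__ float coeffs[16];     // constant memory: read-only on device, cached
texture<float, 2> img;             // 2D texture: read-only, cached for 2D locality

__global__ void shade(float *out, int w)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Texture reads go through a dedicated fetch function.
    out[y * w + x] = tex2D(img, x, y) * coeffs[0];
}
```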

Block-shared memory isn’t immune to peculiar performance constraints, either. It’s organised into “banks”, to allow parallel access to multiple banks in a single operation. In other words, n threads can access n banks in one cycle. However, concurrent accesses to a single bank are forcibly serialised, so n threads hitting a single bank will take n cycles.

So at every level of the GPU memory hierarchy, there are multiple factors that you have to keep in mind if you want to achieve good performance.

The extensions that NVIDIA has made to the C language are fairly minimal. There are a few new keywords that control data layout, determining where variables live in the memory hierarchy. Not surprisingly, there are also new vector types that look very much like OpenGL’s vector types.

A few similar keywords control function placement. Functions can be marked as device-accessible only, resident on the device but callable from the host (i.e. a kernel), or host-only. A weird piece of syntax, “<<< Dg, Db, Ns >>>”, is required when calling a kernel, to control its execution parameters: the grid dimensions, the block dimensions (and hence the number of threads), and the amount of block-shared memory to allocate dynamically.
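Putting the qualifiers and the launch syntax together, a sketch (my own example, not NVIDIA’s; it assumes n is a multiple of the block size):

```cuda
__device__ float scale(float x) { return 2.0f * x; }  // callable only from the GPU

__global__ void doubler(float *data)                  // a kernel: runs on the GPU,
{                                                     // launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = scale(data[i]);
}

__host__ void run(float *host_data, int n)            // ordinary host code
{
    float *dev_data;
    cudaMalloc((void **)&dev_data, n * sizeof(float));
    cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Dg = grid size in blocks, Db = threads per block,
    // Ns = bytes of dynamically allocated block-shared memory.
    dim3 Dg(n / 256), Db(256);
    doubler<<<Dg, Db, 0>>>(dev_data);

    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}
```

The explicit copies in and out are the data-transfer step mentioned earlier; nothing moves between host and device unless the programmer asks for it.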

With hardware of this complexity, a good optimising compiler can make a substantial difference to application performance. The heritage of NVCC is unimpeachable. It’s based on the EkoPath compiler, which for several years has been the compiler most frequently used in SPEC benchmark submissions by AMD64 hardware vendors. The EkoPath compiler was in turn based on Open64, a compiler that SGI GPLed when it dropped out of the compiler business.

Among the memory-hierarchy-related optimisations in the Open64 compiler family that are available to NVCC are the following (this is just a short list of highlights; the full list is much longer). I don’t know how many of these are enabled in NVCC, nor do I know enough about the individual cache sizes or miss penalties to have a clue as to which ones are likely to be effective.

Loop nest optimisation. For a set of nested loops, this can change the order in which the inner and outer loops are executed, to improve the pattern of access to memory.

Vectorised intrinsics. If application code is, for example, computing sin(x[i]) for all i in a vector, the compiler can replace this with a single call to a highly optimised sin specialised for vectors.

Cache blocking. Replace a single loop over a large vector with smaller loops that operate on cache-sized chunks of the vector.

Given the degree of manual control over variable placement that NVIDIA’s C extensions seem to enforce, it’s not clear to me that their compiler team has had a chance to automate any of the transfer of data between levels of the memory hierarchy yet.

I also find it telling that the Programming Guide includes specific guidelines on how to avoid bank conflicts in block-shared memory, when in at least some of those cases it’s clear that the compiler could automate the job.

It’s worth reading the Programming Guide in its entirety to get a sense of just how complex CUDA is, and how many different constraints the determined application programmer will have to keep in mind at a time. There’s not a lot of abstraction going on here (vendor-provided BLAS and FFT libraries notwithstanding).

We have plenty of previous examples of hardware that failed to live up to its early marketing promise, from the i860 to the PS3. CUDA looks set to follow in their footsteps: I expect that it will take vast amounts of work for programmers to get halfway decent performance out of a CUDA application, and that few will achieve more than 10% of theoretical peak performance.

People with the expertise, persistence, and bloody-mindedness to keep slogging away will undoubtedly see phenomenal speedups for some application kernels. I’m sure that the DOE and NSA, in particular, are drooling over this stuff, as are the quants on Wall Street. But those groups have an unusually high tolerance for pain. This technology is a long way from anything like real accessibility, even to programmers already versed in parallel environments like MPI or OpenMP. Still, it’s a great first step.

“Now, imagine a Beowulf cluster of these!”