What is PyCUDA being used for?

[[!toc ]]

This is a Wiki--please add your application below! (sorted alphabetically)

Agent-based Models

Simulation of spiking neural networks

Romain Brette and Dan Goodman have developed a spiking neural network simulator in Python called Brian. Brian can use PyCUDA to generate run-time GPU code for the integration of differential equations provided by the user in a Python script. It is planned to move more elements of the simulator to the GPU to obtain very fast simulations (see the development group for Brian on GPU).

Computational Visual Neuroscience

[[!table header="no" class="mointable" data=""" [[!img cvn.png] """]]

The study of biological vision and the creation of artificial vision systems are naturally intertwined as they represent simultaneous efforts to forward and reverse engineer systems with similar goals. However, while neuroscience has provided inspiration for some of the "broad-stroke" properties of the visual system, much is still unknown. To pave a way forward, we have developed a high-throughput approach to more expansively explore the possible range of brain-inspired models (see figure), including models of larger, more realistic scale, leveraging recent advances in commodity stream processing hardware. In analogy to high-throughput screening approaches in molecular biology, we generate and train thousands of potential model instantiations, and "screen" their visual representations using an object recognition task. From these candidate models, the most promising are selected for further analysis. We have shown that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature, and that it can offer insight into which computational ideas are most important for achieving this performance.

The brain itself is a highly parallel statistical supercomputer, and thus algorithms inspired by its function are well suited to the computational advantages offered by GPUs. However, this power naturally comes at the cost of increased complexity for the developer. In the last three years, we have experienced three different paradigms (i.e. programming GPUs with graphics primitives in 2006, programming the PlayStation 3 using low-level Cell intrinsics in 2007 and programming GPUs with compute primitives in 2008). To overcome the challenge of optimizing each architecture, we applied RTCG to auto-tune the core operations by instrumentalizing low-level code and manipulating it with a Python template engine (see figure). We implemented common optimization strategies (e.g. loop unrolling, pre-fetching and software pipelining, alleviation of register pressure using spilling, communication and computation load distribution, etc.) and achieved comfortable speed-ups with a simple auto-tuning method (i.e. random search on a coarse grid). In the future, we plan to investigate the use of machine learning techniques for auto-tuning, an approach recently undertaken by IBM's Milepost GCC.

Using RTCG toolkits like PyCUDA, we were thus able to combine the flexibility and ease-of-use of a high-level language for "outer loop" control and auto-tuning, with the raw performance of highly optimized "close-to-the-metal" GPU or Cell code to achieve hundred-fold speedups over conventional MATLAB/MEX CPU implementations (the standard in the fields of computational neuroscience and computer vision; see the article below for more details). We argue that the combination of these qualities enables a new kind of exploration of ideas in biological and computational vision, where scale is matched with the fluid ability to experiment new ideas.

As the scale of available computational power continues to expand, and more RTCG tools like PyCUDA emerge, we believe that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinning of biological vision.

Discontinuous Galerkin Finite Element PDE Solvers

[[!table header="no" class="mointable" data=""" [[!img ka6d-arrows.png] """]]

Discontinuous Galerkin finite element methods (DG-FEM) for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. In addition to their favorable numerical properties, DG schemes combine high arithmetic intensity, local memory access and locally dense linear algebra. They are therefore computationally well-suited for implementation on GPUs. However, DG-FEM also face significant challenges for GPU implementation: While GPUs prefer granularities of 2n for some n, DG's granularities are typically different. Further, the current GPU generation's on-chip memory is smaller than would be optimal for DG methods, and therefore the fetch schedule needs to be tuned carefully. Finally, loop slicing has a large impact on DG workloads. Using GPU run-time code generation strategies within PyCUDA, we were able to accelerate a solver for Maxwell's equations on a general 3D unstructured grid by a factor of around 50 relative to a serial computation on a current-generation CPU, using a single Nvidia GTX280 GPU.

Estimating the Entropy of Natural Scenes

Characterizing the statistics of natural scenes is an important area of vision research. The entropy of images provides a measure of the information content available to the visual system and as such quantifies the demands placed on neural information processing mechanisms. From an applications perspective, entropy is the theoretical limit of compression--the lower bound on any compression scheme. Recently, Chandler et al. used an entropy estimation algorithm to binlessly estimate the entropy of small patches of natural images from the distribution of nearest-neighbor (NN) distances. This approach is limited by requiring NN calculations of an exponentially growing set. We overcome this limitation by porting the parallel brute force NN search to the GPU. This enables us to perform more extensive entropy and fractal dimensionality analyses on the entire database of about 4000 thousand natural images, only a few dozen of which were used in for the previous work by van Hateren et al. One 8800GTX card performs 30 times faster than a compiler optimized C version, 53 times faster on a GTX 295. Additionally, because our implementation uses PyCUDA, we can easily optimize the parameters of the implementation for newer cards, and extend the parallelism to multiple cards. Such computational capabilities will enable us to analyze and compare previously unimaginable large classes of images in a reasonable amount of time. (P. Ivanov)

Facial Image Database Search

Ian Cullinan and the SAFE Advanced Surveillance group at NICTA have developed a web-based interface for forensic searches of facial image databases. There are two very time-consuming steps in running such a search: computing the "fingerprint" for the query image and then computing distances from that to a large number of images in the database. Performance numbers (on a modern dual-core CPU) are ~200ms for calculating a single histogram (single-threaded), and about 1 sec for distance computation over 10,000 images (multi-threaded). The distance calculation is not math-intensive, but it is data-parallel: the distance for each image in the dataset can be calculated independently, making it well suited to acceleration with CUDA.

The web interface is written in Python using Django, so PyCUDA was an obvious choice for CUDA acceleration. It was found to be very easy to learn and it integrated well with our exiting Python codebase. We developed the CUDA implementation of the distance calculations in PyCUDA. Even without accelerating the "fingerprint" calculation, this has approximately halved the time it takes to run a search (on an Intel E8400 with a GTX280 GPU).

Filtered Backprojection for Radar Imaging

Tomographic systems as diverse as x-ray CT and synthetic aperture radar all use a sensor that projects a three-dimensional function (such as x-ray absorption and electromagnetic reflectivity, respectively) onto one-dimensional range profiles. Reconstruction of the original function from a collection of these line projections can be accomplished by the filtered backprojection algorithm, where each voxel must query each line projection for its contribution to that voxel, which can be done in $\mathcal{O}(N^3)$ time. Leveraging the large number of cores and the linear interpolation hardware available on modern GPUs, a mildly-optimized CUDA implementation performs SAR backprojection over 50 times faster on a C1060 Tesla than a single-threaded implementation on a modern CPU. Anecdotes about hyperoptimized industrial implementations support the quality of this comparison where, e.g., a Tesla implementation is claimed to be roughly 10 times faster than a multi-threaded 4-core Intel Xeon implementation.

An all-C CUDA implementation can either have a kernel that can accept several arguments for radar- and geometry-specific parameters, or these parameters can be pre-compiled as constants. The latter increases the conciseness of the kernel code, but requires separate binary executables for different imaging scenarios---and neither precludes the risk of errors in programmer-controlled memory management. PyCUDA makes it trivial to automatically generate kernels with pre-compiled constants, allowing the CUDA kernels to be much simpler. This also allows the experimentation that is inherent in tuning a CUDA implementation to be done rapidly in a command-line environment. Thus, in our experience, both these advantages combine to dramatically decrease both code size and tuning time when using PyCUDA. (A. Fasih)

LINGO Chemical Similarities

SIML: a fast SIMD implementation of LINGO chemical similarities. SIML ("Single-Instruction, Multiple-LINGO") is a library containing implementations of a fast SIMD algorithm for calculating the LINGO chemical similarity metric (described in an upcoming publication). This method, currently implemented for x86 CPUs (non-vectorized) and NVIDIA GPUs, is several times faster than existing LINGO implementations for the CPU, and two orders of magnitude faster when run on a GPU.

Recurrence Diagrams

Tomasz Rybak at Bialystok Technical University is applying GPU computing to the generation of recurrence diagrams for time series analysis. Using PyCUDA for his analyses, he was able to achieve an 85-fold speedup of his computations. He is using code generation strategies to achieve even greater speeds in cases when data set characteristics allows for using faster memory.

Sailfish: Lattice Boltzmann Fluid Dynamics

(Michał Januszewski)

Sailfish is an Open Source fluid dynamics solver for CUDA devices. It uses the Lattice Boltzmann method to simulate incompressible and weakly compressible flows in 2D and 3D. The project uses Python to maintain a rapid pace of development and high level of code readability. pycuda and numpy allow it to achieve high performance without compromising the flexibility offered by Python.

Selective Embedded Just In Time Specialization

[[!table header="no" class="mointable" data=""" [[!img sejits-schematic.png] """]]

We have also used PyCUDA as the foundation of higher level programming tools, performing Selective Embedded Just In Time Specialization \cite{sejits_2009}. The idea behind SEJITS is to provide highly productive environments for parallel programming through the use of specialized runtime code-generation. We create domain specific modules, called specializers, which use metaprogramming to analyze a high level description of a particular computation, and then perform JIT code generation for that particular computation. In this case, we express our computations in Python, and use Python function decorators to intercept procedure calls which should be specialized. Python's introspection facilities allow us access to the source of a procedure under specialization, which we then analyze and manipulate to generate CUDA source code. Using PyCUDA, we move data back and forth between the Python interpreter and the GPU, as well as execute the specialized CUDA code.

The figure above outlines this approach. Because the specialization machinery and the domain specific modules are embedded in a scripting language, it is easy for programmers who understand efficient implementation to incrementally add specializers for new domain abstractions, which then can be exported for use by those less familiar with the details of efficient parallel implementation. Additionally, embedding the code to be specialized in a scripting language allows us to fall back to execution by the high-level interpreter, if a particular idiom is not supported by a specializer. Finally, SEJITS allows for the incorporation of autotuners to generate multiple variations of a particular computation, which is very useful when attempting to provide good performance on diverse target architectures.

We prototyped a set of specializers for image processing applications, providing some abstract stencil and category reduction primitives to allow the implementation of image processing routines including k-means clustering and edge detection, taken from a high-end image contour detection algorithm \cite{iccv_2009}. On simpler types of code, such as image convolution, our SEJITS system ran only about 3x slower than our hand-optimized convolution routines. Due to naive code-generation in our specializers, on more complicated types of code, such as the k-means clustering routines, our system was about 10x slower than hand-optimized CUDA code, although we believe the code generators can still be substantially improved, which is ongoing work. In summary, RTCG with PyCUDA has enabled research into higher-level programming models and compilers for parallel platforms, by bridging the gap between a high-level language, Python, and a highly-parallel platform, the GPU. (Y. Lee and B. Catanzaro)

B. Catanzaro, S. Kamil, Y. Lee, K. Asanovi?, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox. SEJITS: Getting Productivity and Performance With Selective Embedded JIT Specialization. In PMEA '09: Programming Models for Emerging Architectures, 2009.

B. Catanzaro, B.-Y. Su, N. Sundaram, Y. Lee, M. Murphy, and K. Keutzer. E?cient, High-Quality Image Contour Detection. In ICCV '09: The International Conference on Computer Vision, 2009.

Simulation of spiking neural networks

Romain Brette and Dan Goodman have developed a spiking neural network simulator in Python called Brian. Brian can use PyCUDA to generate run-time GPU code for the integration of differential equations provided by the user in a Python script. It is planned to move more elements of the simulator to the GPU to obtain very fast simulations (see the development group for Brian on GPU).

Time Encoding and Decoding Toolkit

Time Encoding Machines (TEMs) are asynchronous signal processors that encode analog information in the time domain. Traditional communication circuits such as the Asynchronous Sigma-Delta Modulator and neuron models such as the Integrate-and-Fire neuron and Hodgkin-Huxley neuron can be shown to be time encoding machines. Signals can also be time-encoded by ensembles of modulators or populations of neurons. Under appropriate conditions, TEMs can faithfully represent stimuli such that the time-encoded information can be recovered using a Time Decoding Machine (TDM). Time encoding therefore provides a key insight into how information may be represented by communication and neural circuits in the time domain. The Bionet Group at Columbia University is currently using PyCUDA to encode, decode, and process audio and video signals with populations of up to 10^5 neurons on multiple NVIDIA GPUs; the Time Encoding and Decoding Toolkit (TED) contains sample implementations of various TEMs and TDMs in Python and PyCUDA. To facilitate continued development of new time domain algorithms, members of the group have also developed scikits.cuda, a package of high-level numerical routines and wrapper functions that provides a Python interface to GPU-powered libraries such as CUBLAS and CUFFT from NVIDIA and CULA from E.M. Photonics. TED and scikits.cuda can be respectively downloaded from the following sites.

Copenhagen CT toolbox

A Collection of open source CT tools developed mainly at the University of Copenhagen.

We have implemented CPU and GPU versions of a few popular reconstruction algorithms for common scan patterns. The code is released under the open source GPL v2 license so you can download and try it out for yourself. The release includes a README with background and install notes and user instructions with complete examples.

We use the toolbox ourselves for education, research and in the industrial collaborations we participate in. The DYI CT scanner that one of our students built from laid-off parts is just one example of the educational applications of the toolbox.