It is hard to imagine performing research without the help of scientific computing. The days of scientists working only at a lab bench or poring over equations are rapidly fading. Today, experiments can be planned based on output from computer simulations, and experimental results are confirmed using computational methods.

For example, the Materials Genome Project is currently plowing through the periodic table looking for structures and chemistries that may lead to enhanced materials for energy applications. By allowing a computer to perform most of the work, researchers can concentrate their valuable time on synthesizing and characterizing a small subset of interesting compounds identified by the search algorithm.

As the scope of scientific research has become more complex, so have the computational methods and hardware required to provide answers to scientific questions. This increasing complexity results in expensive, highly specialized scientific computing equipment that must be shared across multiple departments and research units, and the queue to access the equipment can be unacceptably long. For smaller labs, it can be nearly impossible to get adequate, timely access to critically important computing resources. Sure, there are national user access facilities or toll services, but they can take extraordinarily long times to access or be prohibitively expensive for prolonged projects. In short, high performance scientific computing is largely restricted to large and wealthy research labs.

With these issues in mind, a research team in the Laboratoire de Chimie de la Matière Condensée de Paris (LCMCP) at Chimie ParisTech, led by research engineer Yann Le Du and graduate student Mariem El Afrit, has been building a high performance computational cluster using only commercially available, "gamer" grade hardware. In a series of three articles, Ars will take an in-depth look at the GPU-based cluster being built at the LCMCP. This article will discuss the benefits of GPU-based processing, as well as hardware selection and benchmarking of the cluster. Two future articles will focus on software choices/performance and the parallel processing/neural network algorithms used on the system.

The most bang for the buck: GPU-based processing

The cluster, known as HPU4Science, began with a $40,000 budget and a goal of building a viable scientific computation system with performance roughly equivalent to that of a $400,000 "turnkey" system from NVIDIA, Dell, or IBM. The crux of the project was identifying hardware that provided enough computational power to be useful in an advanced academic setting while staying within the relatively modest budget. The system also needed to be modular and scalable so that it could be easily expanded as results came in and the budget (hopefully) grew.

The ideals of the project team dictated that the system use open source software wherever possible and that it be built only from hardware that is available to the average consumer. The project budget was, of course, an order of magnitude more than the average consumer could afford. In principle, however, anyone should be able to assemble a similar, albeit scaled-down, system and freely use the software and code developed by the HPU4Science team to perform high-end scientific computing.

With most clock speeds topping out at around 3 GHz, achieving the computational capacity necessary to attack complex scientific problems means increasing the number of processors in the system—but which processors? CPUs (x86, x86_64, POWER, etc.) are flexible and popular (the vast majority of the top 500 supercomputers are CPU-based), but they come at a heavy price: $50 to $250 per core, depending on the architecture.

Alternatively, GPUs like NVIDIA's GTX 580 pack 512 computational units (more precisely, "shaders") into a package that retails for around $500 (less than $1.00 per shader). Each computational unit is significantly simpler (fewer transistors) and less flexible than a CPU core, but, dollar for dollar, the processing power of these chips is unparalleled. The catch is that problems must be recast as simple, highly parallel operations before a GPU can solve them. A few problems that are tractable on CPU-based systems may be intractable on a GPU system, but most scientific problems can be largely translated into linear algebraic operations, so that intractable subset is small.
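To make "translated to simple, linear operations" concrete, here is a minimal sketch (the function name and values are illustrative, not from the HPU4Science code) of a SAXPY-style operation, a classic example of a computation where every element is independent and could be handed to a separate shader:

```python
import numpy as np

# Illustrative sketch: a SAXPY-style operation (y = a*x + y) is
# "embarrassingly parallel" -- each element's multiply-add is
# independent of every other, so on a GPU each one could run on
# its own computational unit. NumPy stands in for the GPU here.
def saxpy(a, x, y):
    return a * x + y

x = np.arange(4, dtype=np.float64)   # [0., 1., 2., 3.]
y = np.ones(4)
print(saxpy(2.0, x, y))              # [1. 3. 5. 7.]
```

Operations like this one, chained into larger matrix and vector computations, are the building blocks that map well onto hundreds of simple shaders.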

Another advantage of working with massively parallelized GPU processing is the ability to train neural networks. Artificial Neural Networks (ANNs) are, at their core, a series of independent multiplications and sums. These are simple operations that can be carried out on simple processors like GPUs. With the right choice of neural network methodology, the parallel and solitary cores that make up a GPU can be efficiently put to use on extraordinarily complex problems. The key is translating the scientific understanding of the problem into an appropriate algorithm.
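The "multiplications and sums" at the heart of an ANN can be seen in the forward pass of a single layer. This sketch is generic (the layer shape, `tanh` nonlinearity, and random weights are illustrative assumptions, not details of the LCMCP networks):

```python
import numpy as np

# Illustrative sketch: one layer of an artificial neural network.
# Each output neuron computes sum_i(w[i] * x[i]) + b -- exactly the
# kind of independent multiply-and-sum a GPU shader handles well --
# followed by a nonlinearity (tanh, here, as an example choice).
def layer_forward(weights, biases, inputs):
    return np.tanh(weights @ inputs + biases)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))   # 3 neurons, 5 inputs each
b = np.zeros(3)
x = rng.standard_normal(5)
print(layer_forward(W, b, x))     # 3 activations, each in (-1, 1)
```

Every row of the weight matrix can be processed independently, which is why large networks parallelize so naturally across a GPU's cores.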

GPU cluster architecture: master and worker

Considering both the cost and ability to host neural networks, the team at LCMCP decided to move forward with GPU-based processing, which would initially be used to provide better answers to some questions that arise in Magnetic Resonance Imaging. As shown in the diagram, the system is set up in a Master-Worker configuration.

The cluster runs two different categories of algorithms: a standard Master-Worker relationship and a more complex set of neural network algorithms. In the simple Master-Worker mode, the master dispatches specific problem sets and algorithms to the workers. The workers then simply churn through the computations and report the results back to the master, where they are compiled and assembled into a final report.
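The simple Master-Worker mode can be sketched in a few lines. This is a conceptual toy, not the HPU4Science code: the worker's computation is a stand-in, and on the real cluster the workers run concurrently on separate GPU machines rather than in a loop:

```python
# Hypothetical sketch of the simple Master-Worker mode: the master
# splits a problem into independent chunks, the workers churn through
# their chunks, and the master compiles the results.
def worker(chunk):
    # Stand-in computation; on the cluster this would be a GPU job.
    return sum(n * n for n in chunk)

def master(data, n_workers=4):
    # Dispatch: deal the data out to the workers round-robin.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    # On the real cluster the workers compute in parallel on separate
    # machines; here they are simply called in turn.
    partial_results = [worker(c) for c in chunks]
    # Compile the partial results into the final report.
    return sum(partial_results)

print(master(list(range(10))))  # 285 = 0^2 + 1^2 + ... + 9^2
```

The essential property is that the chunks are independent: no worker needs another worker's result, so adding machines scales the throughput almost linearly.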

When using neural network algorithms, the master describes the problem parameters to the workers and selects the specific neural network algorithm to be used. This information is dispatched to workers and each worker independently explores its own set of possible solutions to the problem using the provided methodology. The master collects the results, combines the individual results to see if a more optimal hybrid solution exists, and finally reports the best results to the user. The neural network algorithm used on the HPU4Science cluster will be discussed in detail in Part 3 of this series.

While the starting budget for the system was $40,000, not all of that money has been spent. The hardware configuration shown below for the master and five workers cost $30,000, which translates to less than $1.80 per GFLOPS of real computing power. That figure covers the total price of the system (storage, power supplies, cases, network components, etc.), not just the GPUs.