1. Understanding the Original Code

The original code can be split into three phases:

initialization with random values (lines 20-25)

evolution (lines 29-75, a temporal loop with nested spatial loops)

output (lines 79-83, counting the number of living cells)

What's interesting is that the original code (shown on the right) handles the boundary cases separately: cells on the edges are copied over to their counterparts on the opposite edges. This creates a torus topology: the left and right boundaries can be imagined as being glued together, which transforms the 2D simulation plane into a pipe. Gluing together the top and bottom edges then transforms the pipe (topologically) into a doughnut.

Fig. 1: Torus Topology

2. Porting It

The basic idea behind LibGeoDecomp is that it takes over grid storage and controls which cells get updated when. That's not much, yet it allows the library to add features such as multi-node parallelization and parallel IO. We can map the three phases of the sequential code mentioned above to components of LibGeoDecomp:

An Initializer sets up the grid before the simulation starts.

The Simulator controls the temporal and spatial loops.

A Writer handles output; it is periodically notified by the Simulator.

If you've checked out the repository then you'll have noticed that the parallel version comes with a couple of additional files. These files mainly act as the glue between the Fortran kernel and the C++ library. Fig. 2 illustrates the workflow of the ported code. Here is how it works:

We can't instantiate the components directly in the Fortran program -- but we can call a C function which will do this for us. This function is called simulate() and lives in wrapper.cpp.

The Initializer and Simulator need a way to call back the relevant portions of the Fortran code, which is why I've packaged those in two subroutines: initialize(), which sets the initial state of a single cell, and update(), which is responsible for the evolution of a row of cells.

update() in turn needs to access data stored within LibGeoDecomp. I've added macros named LGD_GET() and LGD_SET() for reading/writing cell data.

The code in wrapper.cpp and libgeodecomp.F90 is fairly generic and could be moved within the library, should there be more interest in using it with Fortran apps. This would leave decomposing the Fortran code into appropriate functions as the only task left to the user.

Fig. 2: Simulation Callgraph

3. How Does the Wrapper Work?

The wrapper was meant to be generic, but since this is the first time I've written such a piece of code, it isn't. A quick examination might still be interesting for those wishing to reuse it. To facilitate the two-way callback from C++ to Fortran (for the update) and back to C++ (for reading/writing data), I've defined a class Cell as shown below.

    class Cell
    {
    public:
        class API :
            public APITraits::HasTorusTopology<2>,
            public APITraits::HasUpdateLineX,
            public APITraits::HasFixedCoordsOnlyUpdate
        {};

        inline Cell(int alive = 0) :
            alive(alive)
        {}

        template<typename COORD_MAP>
        void update(const COORD_MAP& hood, const unsigned& nanoStep)
        { ... }

        template<typename NEIGHBORHOOD>
        static void updateLineX(
            Cell *targetLine, long *index, long indexEnd,
            const NEIGHBORHOOD& hood, const unsigned& nanoStep)
        { ... }

        int alive;
    };

The class Cell::API inherits from a couple of classes defined in LibGeoDecomp's APITraits. These type traits act as flags which let the library discover which features the user code supports. The code above tells the library that Cell requests a 2D torus topology: accesses to neighboring cells will automatically be wrapped around the edges of the grid, which makes user code much shorter and easier to read and maintain. It also specifies that Cell has a function updateLineX() which updates a whole row of cells -- as opposed to update(), which operates on a single cell. Updating multiple cells in one go is more efficient, and it doesn't add much complexity here since the original code was capable of this already.

Finally, the API specifies that the update routines will only use a certain type (FixedCoord) to address neighbors. At this point it's not really important how this works; what is important is that it gives us access to some compile-time optimizations. Many of the performance benefits in LibGeoDecomp come from using templates as a code generator. And that's the downside of using Fortran: templates don't cross the language barrier. Without templates within the kernel, LibGeoDecomp can't apply all of its optimizations.

4. Performance Evaluation

Game of Life is probably not the kernel that comes first to mind when thinking about HPC benchmarks. But LibGeoDecomp is an HPC library after all, so let's take a look. I ran both programs on my notebook, which has an Intel Core i7-3610QM (Ivy Bridge quad-core at 2.3 GHz) at its heart. The table below gives the measured run times for a grid of 2000x2000 cells and 500 time steps:

    Cores  Program                Time  Speedup
    1      game_of_life-serial    8.0s  1.0
    1      game_of_life-parallel  6.9s  1.2
    2      game_of_life-parallel  4.9s  1.6
    3      game_of_life-parallel  3.8s  2.1
    4      game_of_life-parallel  2.9s  2.8

What can we observe? First of all: yay, we get a speedup -- even with just one core. Why is that? The original code copied the new state back to the old grid (line 73). That may be convenient in Fortran, but it's also needlessly slow; LibGeoDecomp simply swaps pointers internally. What's more interesting is that adding more cores (e.g. by running mpirun -np 4 ./game_of_life-parallel) further reduces the run time. The code by no means scales perfectly -- 4 cores don't give us a speedup of 4. This is because the matrix is really small, and we would probably have gotten better results using OpenMP instead of MPI. So why did I choose MPI in the first place? Because it's usually much harder to embrace and requires more modifications to the code. With LibGeoDecomp it doesn't matter which types of parallelism the Simulator exploits: it's (almost) all hidden in the library. If we wanted to run this example with really huge matrices (e.g. 1M x 1M cells) on a cluster, we could go right ahead, no further code modifications required.

5. Summary

In this longer-than-expected post I've shown how you can use LibGeoDecomp to parallelize legacy Fortran codes. What's the gist of it?

shorter, faster code

scales on clusters (supercomputers even)

convenient to add IO

right now: no CUDA, sadly

Chances are that the port will benefit performance even on notebooks and workstations, but what's better is that it will also run on much larger systems. Apart from that, it's interesting to note that the resulting Fortran program is actually shorter than the original code (I'm discounting the wrapper code, as this should be moved into the library). You also get access to LibGeoDecomp's IO infrastructure: with just a couple of lines of code I've added the PPMWriter to the simulation (in wrapper.cpp), which can render the matrix to PPM image files.

P.S.: Why no CUDA?

What's a bit sad is that we can't yet interface with CUDA. In C++ a user would simply choose a CUDA-capable Simulator and flag his update function with __host__ __device__ to make nvcc compile the code for both the CPU and the GPU. For Fortran there is no counterpart to nvcc. PGI has CUDA Fortran; we could probably build something on top of that to bring Fortran kernels together with LibGeoDecomp on NVIDIA GPUs, but right now only a few people have access to the PGI compiler suite. Perhaps this will change in the aftermath of NVIDIA's acquisition of PGI. Another avenue would be to convert Fortran to C, either via f2c or with LLVM's C backend, and to paste the result into a C++ class.

I've also toyed with this idea, but in my trials f2c could not parse my Fortran code, and LLVM lacks a Fortran frontend. Any ideas are welcome.