I’m working on a modular random level generator, the essence of which uses similar logic to that of a roguelike. Last month I benchmarked the effectiveness of various programming languages at a simplified level generation algorithm that roughly mimics the demands of my actual algorithm, using single-threaded code. I’m now running the same benchmark with concurrent code, to examine how easy the languages tested are to parallelise for this task.

The code is available here. Any improvements to the code are most welcome. Most of the running time is spent on the following check (written in Haskell*, as it's easy to read):

roomHitRoom Room {rPos=(x,y), rw=w, rh=h} Room {rPos=(x2,y2), rw=w2, rh=h2}
    | (x2 + w2 + 1) < x || x2 > (x + w + 1) = False
    | (y2 + h2 + 1) < y || y2 > (y + h + 1) = False
    | otherwise = True

This checks a newly generated room against a list of previously placed rooms, discarding the new room if it collides with any of them (it's a brute-force level generation technique; our actual engine is a bit more sophisticated, but still relies on the same principle). Much of the rest of the time is spent on random number generation, with all languages using a similar PRNG algorithm to ensure a fair comparison. Note that the program is embarrassingly parallel, a trait I aim for in algorithm selection, so how easy the implementations are to parallelise for this simple task is not necessarily representative of how effective the languages are at more complex parallel tasks. I.e., these benchmarks are relevant to my goal, not necessarily yours. I enclose the generalisations made later in this post within the context of that statement.

Edit: A couple of months after the benchmark was originally run, somebody submitted a D implementation that takes advantage of LLVM optimisation particularly well and is significantly faster than the other entries (also quite concise, using the D parallel foreach). It’s included at the top of the table. An impressive achievement, although it’s certainly possible that someone quite familiar with optimising C or C++ could produce an equally fast result.

The results are as follows:

Language    Compiler   Speed (s)   % Fastest   Resident Mem Use (KiB)
D           ldc2       0.812       116.38%     26,536
C++         clang++    0.945       100.00%     25,552
D******     ldc2       0.955        98.95%     26,536
Neat*****   fcc        0.958        98.64%     26,762
Nimrod      clang      0.980        96.43%     25,932
C++***      g++        1.025        92.20%     25,532
Rust        rustc      1.109        85.21%     47,708
Go          6g         1.184        79.81%     30,768
C           clang      1.199        78.82%     25,796
Scala       scala      1.228        76.95%     72,960
Nimrod      gcc        1.376        68.68%     26,120
C****       gcc        1.467        64.42%     25,800
D           dmd        2.103        44.94%     26,508
Go          gccgo      2.710        34.87%     69,120

Language   SLOC   SLOC to Parallelise
Nimrod     109    -*
Neat       117    6
Rust       123    20**
Scala       99    20
Go         131    0
C++        142    15
D           83    -24**
C          172    32

*Haskell was excluded from this version of the benchmark, as there seems to be a space leak of some sort in the algorithm that neither I nor anyone who's examined it so far has been able to overcome. Nimrod was added to the benchmark instead; since it has no single-threaded version to compare against, it has no 'SLOC to Parallelise' measure. Nimrod is a language with a whitespace-based syntax, like Python, which compiles to C for optimal speed.



**I parallelised the D and Rust programs myself, hence the code is probably not idiomatic and still has plenty of room for improvement. D, for instance, has a parallel for loop; I couldn't get it working, but if someone did, it would significantly reduce the size of the code. Edit: the D version has now been made more idiomatic, and uses the parallel foreach.

***Somebody submitted a C++ version that runs twice as fast (in around 0.550 s with GCC), using an occlusion buffer for collision testing between rooms. I'm not including it in the benchmark numbers as the algorithm is different, but anyone who's interested can view it here.

****It turns out the reason the C version runs slower than the C++ one is that the PRNG seeds for each thread are all stored together in one array, forcing the hardware threads to compete for access to them and slowing the program down. Having each thread work on a copy of its original seed from the array brings the speed up to that of the C++ implementation.

*****A Redditor submitted a version in a language they're working on called Neat, currently built on top of LLVM and inspired by D; the compiler is here. I was impressed by how a new language can take advantage of LLVM like that to achieve the same level of performance as much more mature languages.

******This is the D version from the time of the benchmarks, not the faster more recent submission.

Speed:

Generally, the relative speeds for the concurrent benchmark were the same as those for the single-threaded one, with the LLVM-compiled D, C++ and Neat running fastest, along with the new entry, Nimrod. I was surprised, however, that C was slower than C++ and D; I imagined this might have been due to my naive C implementation not giving the compiler sufficient hints, or something related to aliasing. (Edit: the actual reason C is slower than C++ is described at **** above.) C is certainly capable of reaching those speeds: the Nimrod compiler compiles to C, and impressively fast C at that. The gap between gccgo's speed on this problem and 6g's is surprising; it goes to show that gccgo isn't always the best choice for speed. Note that the GCC C implementation was slower than the LLVM one because GCC missed an optimisation that turns a jump in the GenRands function into a conditional move, so the GCC build suffered more branch misses.

Memory Use:

Memory use was mostly as expected. I was impressed that Go, and particularly D, didn't use much more memory than the C and C++ versions in spite of being garbage collected, but Rust's memory use was surprisingly high. This is likely just down to the immaturity of the language, as there's no reason it should need so much memory for such a task. Scala used an expectedly large amount for a JVM language.

Concision:

I was quite impressed by Nimrod's concision. Presumably it has such a low SLOC count because it uses whitespace rather than {}, like Python, or because it uses a very concise parallel for loop courtesy of OpenMP:

for i in 0 || <NumThreads:
    makeLevs(i.int32)

Note, however, that Nimrod was written completely idiomatically. Go, Scala and C++ are similar, but for the D and Rust implementations only the single-threaded portion was written idiomatically (correction: the Rust one was, but the D one was written hackishly for speed); the parallelisation of that code was done naively by myself. Note also that the Scala version's 'SLOC to parallelise' measure includes optimisations made for speed; it was possible to parallelise the algorithm much more simply, but this had inferior speed characteristics.

Subjective experience:

Working on the C implementation I had one of those rare moments where it feels like the language is laughing at me. I was encountering weird bugs in my implementation of the for loop:

for (i = 0; i < NumThreads; i++) {
    pthread_create(&threads[i], NULL, MakeLevs, (void *)&i);
    …
}

Can you spot the problem in that code? I was passing 'i' into the newly created thread by reference, to represent the thread number (pthread_create only takes a void* pointer as an argument, normally to a struct, but in this case a single integer was all that was needed). What happened was that by the time the thread tried to access 'i', even if it was the first thing that thread did, 'i' had often already been incremented by the for loop in the original thread, so the thread would get the wrong number; there might be two threads with thread number 2, and they'd both do exactly the same calculations, writing to the same part of the global array and corrupting the output. I fixed this using the rather cumbersome method of filling an array with the numbers 1 to NumThreads and passing a pointer to a value from there. It could have been done more concisely by just casting 'i' to void* and back to an integer, but I feel it's bad practice to cast values to pointers. It's potentially unportable: if 'i' were a large 64-bit integer, converting it to void* and back would work on a 64-bit system but not on a 32-bit one, as pointers on the latter are only 32 bits wide, and a large 64-bit integer can't be stored in a 32-bit pointer (although that problem would be unlikely to surface here unless one somehow had a machine with over 2^32 cores…).

The D implementation also surprised me, although I think this was largely because I was unfamiliar with D’s memory model. I originally tried having each thread write to part of a global array, like in the C implementation, but, after they had completed their task and the time came to read the data, the array was empty, possibly having been garbage collected. The solution was to mark the array as shared, which required propagating the chain of ‘shared’ typing to the objects being written to the array in order to ensure safety. This was an interesting difference from Rust’s memory safety model, where the type of pointers rather than objects determines how they can be shared, but I’m not familiar enough with D’s memory model to comment on their relative effectiveness. I liked the use of messages in D, which allow the sending of data between threads using Thread IDs for addressing, rather than channels, and I imagine this would be particularly useful for applications running on multiple processes across a network once messaging between processes is supported (currently it’s only supported between in-process threads). Note that D offers support for multiple methods of parallelising code in its standard library, so the method I used may not necessarily be the neatest or most idiomatic.

(Edit: It turns out that all memory in D is thread local unless specified otherwise, which seems like an effective way of ensuring memory safety.)

The Rust implementation proved to be relatively straightforward, apart from some slight difficulties I had with the syntax. Declaring a shared channel (one that can have multiple senders) required:

let (port, chan) = stream();

let chan = comm::SharedChan::new(chan);

I imagine this process will be more concise by the time version 1.0 is reached. Using ‘.clone()’ for explicit capture of a variable into a closure also took a bit of getting used to, but it makes sense in light of Rust’s memory model (no direct sharing of memory between tasks). I think there may be a more concise way to parallelise parts of the problem in Rust, using something like (from the Rust docs):

let result = ports.iter().fold(0, |accum, port| accum + port.recv() );

I wasn't, however, familiar enough with the language and the current iterator syntax to implement it myself. Using a future might also be more concise.

Go felt the easiest to parallelise, although to a degree this was because it didn't enforce the same kind of memory safety as Rust or D. This made it more enjoyable for a project like this, but I imagine that on a much larger project the stricter nature of Rust and D's memory models might come in useful, especially if one were working with programmers of dubious competence who couldn't be trusted to write memory-safe code. I didn't write the Scala, Nimrod or C++ versions, so I have no comments on the experience of doing so.

Notes for optimising:

To anyone wanting to optimise the code: note that the code must produce consistent results for any given seed, but needn't produce the same results as the other implementations. This is because different seeding behaviours will produce different results: the seed for each thread/task is calculated from the original random seed multiplied by the square of that thread/task's number, which means a different number of threads may produce a different result. Currently D, C, C++ and Rust will all produce the same result for any given seed, but their results vary with the number of cores; Go, on the other hand, will produce the same result for any particular seed no matter how many cores are used. Also note that optimisations that change the fundamental logic, the number of calls to GenRands, or the number of checks done are not allowed. Finally, remember to change the NumCores variable, or its equivalent, to however many cores your machine has.

Timing is done using Bash's 'time' function, as I've found it to be the simplest accurate method for timing threaded applications. The fastest result of 20 trials is taken, with all trials using the same seed. Resident memory use is obtained by running "command time -f 'max resident:\t%M KiB' filename seed" in Bash.

The moral of the story:

For a task like this, compiler choice has as significant an effect on speed as language choice.