I recently came across an interesting blog post in which Java performed slower than I’d have thought reasonable, around 4 times slower than C++, so I decided to try my luck at making it faster. The problem? Imagine a forest of goats, wolves and lions. “Wolves can devour goats, and lions can devour wolves and goats. (‘The stronger animal eats the weaker one’.) As this is a magic forest, a wolve, after having devoured a goat, is transmuted into a lion; a lion, after having devoured a goat, is transmuted into a wolve; and a lion having devoured a wolve becomes a goat.” A new world is generated for each possible outcome (a forest (2,2,2) would generate (3,1,1), (1,3,1) and (1,1,3)), with the same then happening in each of these new forests recursively, and so on until a ‘stable’ forest is found: one with only one kind of animal remaining.
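To make the rules concrete, here's a minimal sketch of the state transitions in Java (the class and method names `MagicForest`, `meal` and `isStable` are just mine for illustration; this isn't any of the benchmarked implementations):

```java
import java.util.ArrayList;
import java.util.List;

public class MagicForest {
    // A forest state: counts of goats, wolves and lions.
    record Forest(int goats, int wolves, int lions) {}

    // A forest is stable when at most one species remains,
    // so no further meal is possible.
    static boolean isStable(Forest f) {
        return (f.goats() == 0 && f.wolves() == 0)
            || (f.goats() == 0 && f.lions() == 0)
            || (f.wolves() == 0 && f.lions() == 0);
    }

    // The three possible meals from a single forest; outcomes with a
    // negative count are impossible and are filtered out.
    static List<Forest> meal(Forest f) {
        List<Forest> next = new ArrayList<>(3);
        add(next, f.goats() + 1, f.wolves() - 1, f.lions() - 1); // lion eats wolf -> goat
        add(next, f.goats() - 1, f.wolves() + 1, f.lions() - 1); // lion eats goat -> wolf
        add(next, f.goats() - 1, f.wolves() - 1, f.lions() + 1); // wolf eats goat -> lion
        return next;
    }

    private static void add(List<Forest> out, int g, int w, int l) {
        if (g >= 0 && w >= 0 && l >= 0) out.add(new Forest(g, w, l));
    }

    public static void main(String[] args) {
        System.out.println(meal(new Forest(2, 2, 2)));
    }
}
```

Running `meal` on (2,2,2) yields exactly the three successor forests (3,1,1), (1,3,1) and (1,1,3) from the example above.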

Note: some new results have been found since this post was originally made, and are listed at the bottom.

My first impulse was to rewrite the functional Java implementation in imperative code, assuming it would be faster. Big mistake! My implementation ran in around 1.5 seconds for an input of (100,100,100), compared to 65ms for C++; an order of magnitude slower, and much slower than the functional version. To try to understand what went wrong, I translated the Java implementation into C, and it was equally slow! C is generally not (barring a really insidious compiler) an order of magnitude slower than C++, so I realised I must have made a mistake in the translation from functional to imperative. I hence decided to translate the functional code into a less-verbose language in order to understand it better, so I grabbed the D language and got busy:

forest_t[] meal(forest_t[] forests) {
    return map!(a => [forest_t(-1, -1, +1) + a,
                      forest_t(-1, +1, -1) + a,
                      forest_t(+1, -1, -1) + a])(forests)
        .join
        .partition!(forest_invalid)
        .sort
        .uniq
        .array;
}

The D turned out to be rather elegant compared to the original C++, which is a little harder on the eyes:

std::vector<forest_t> meal(const std::vector<forest_t>& forests) {
    static const std::array<forest_t, 3> possible_meals = {{
        forest_t{{-1, -1, +1}},
        forest_t{{-1, +1, -1}},
        forest_t{{+1, -1, -1}}
    }};
    std::vector<forest_t> next_forests;
    next_forests.reserve(forests.size() * possible_meals.size());
    for (auto meal : possible_meals) {
        forest_t next_forest;
        std::transform(std::begin(forests), std::end(forests),
                       std::back_inserter(next_forests),
                       [&](const forest_t& forest) {
            std::transform(std::begin(forest), std::end(forest),
                           std::begin(meal), std::begin(next_forest),
                           std::plus<int>());
            return next_forest;
        });
    }
    auto valid_end = std::remove_if(std::begin(next_forests),
                                    std::end(next_forests), forest_invalid);
    std::stable_sort(std::begin(next_forests), valid_end);
    next_forests.erase(std::unique(std::begin(next_forests), valid_end),
                       std::end(next_forests));
    return next_forests;
}

Unfortunately, the D was also noticeably slower; around 200ms for input (100,100,100), compared to 65ms for C++. Assuming, for no good reason, that creating and populating the array of new forests was what was consuming the most time, I next decided to vectorise it using D’s delightful inline assembly support:

forest_t[] meal(forest_t[] forests) {
    auto forestsAddr = forests.ptr;
    size_t forLen = forests.length;
    forest_t[] newFors = uninitializedArray!(forest_t[])(forLen * 3);
    auto newForsAddr = newFors.ptr;
    size_t bytesToSimd = forLen * 3 * 4;
    int[12] potentialMeals = [-1, -1, 1, 0, -1, 1, -1, 0, 1, -1, -1, 0];
    asm {
        movupd XMM0, [potentialMeals];
        movupd XMM1, [potentialMeals + 16];
        movupd XMM2, [potentialMeals + 32];
        mov R8, forestsAddr;
        mov R9, forestsAddr;
        add R9, bytesToSimd;
        mov RCX, newForsAddr;
    loop:;
        movupd XMM3, [R8];
        movupd XMM4, [R8];
        movupd XMM5, [R8];
        paddd XMM3, XMM0;
        paddd XMM4, XMM1;
        paddd XMM5, XMM2;
        movupd [RCX], XMM3;
        movupd [RCX + 12], XMM4;
        movupd [RCX + 24], XMM5;
        add RCX, 36;
        add R8, 12;
        cmp R8, R9;
        jne loop;
    }
    return newFors;
}

Which, unfortunately, had absolutely no effect on performance whatsoever. I was considering rewriting it to use only aligned access, but it finally occurred to me that I should probably check that this allocation actually was a bottleneck before proceeding. I ran the program through Valgrind’s callgrind tool, and lo and behold, turns out I’d been chasing waterfalls; it actually represented an extremely insignificant proportion of the runtime. The sort was taking the most time, and upon a little much-needed reflection, I realised that made perfect sense, as sorting is O(n log n), while allocating a large array is just O(n). Thanks to a clever suggestion from someone on the D forums, I then replaced the .sort.uniq with sticking everything into a hashtable and taking it out again (which should be only O(n)), and got the runtime down to only around 100ms, quite competitive with C++’s 65ms.

I figured that, as the O(n) hashtable approach beats the O(n log n) sort+std::unique algorithm that C++ was using for removing duplicates, the D version might actually be faster for higher inputs. Unfortunately, this was not the case; at (500,500,500), while the C++ implementation ran in around 7 seconds, the D version took over 25. I assumed this was due to D’s hashmap implementation generating garbage, which gave me an oh-so-brilliant idea. Java’s got the best garbage collector in the world, right? So if I translated the D implementation into Java, the hashmap garbage shouldn’t cause trouble, and the O(n) uniquifying would allow it to beat C++ (oh, and around this time I also realised why my original C and Java implementations were so slow: they were using a naive O(n*n) method of removing duplicates; yes, I’m an idiot). Anyway, I fired up Java, and got an implementation going using java.util.HashSet.

static Forest[] meal(Forest[] forests) {
    HashSet<Forest> uniqFors = new HashSet<Forest>(forests.length * 3);
    for (int i = 0; i < forests.length; i++) {
        uniqFors.add(new Forest(forests[i].goats + 1, forests[i].wolves - 1, forests[i].lions - 1));
        uniqFors.add(new Forest(forests[i].goats - 1, forests[i].wolves + 1, forests[i].lions - 1));
        uniqFors.add(new Forest(forests[i].goats - 1, forests[i].wolves - 1, forests[i].lions + 1));
    }
    return uniqFors.toArray(forests);
}

How’d it go? Put it this way. Imagine the pain of giving birth while being repeatedly punched in the testicles, stung by extremely venomous spiders and debugging a Boost library, convert that pain into ineffectiveness, multiply it by a thousand, and it’d still be a few orders of magnitude more effective than using HashSet to remove duplicates in Java. It literally failed with a java.lang.OutOfMemoryError on an input of around (30,30,30), while the D implementation could handle (300, 300, 300). Needless to say, it was time to go back to the proverbial drawing board.

Okay, turns out I’m even more of a shenjingbing than first imagined; I didn’t realise it, but Java defaults to reference-based equality and identity hashing when a class doesn’t override equals() and hashCode(). This means the HashSet wasn’t working as intended; it wasn’t removing duplicates, as it didn’t recognise duplicates as duplicates: they were separate objects, even if their goats, wolves and lions values were the same. Properly implementing equals() and hashCode() makes the Java HashSet implementation slightly faster than the C++ for high inputs, and much faster than the original functional implementation, so in that sense: yaay, I did it. Thanks to JingJing23 on reddit for pointing out my mistake.
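For concreteness, the fix looks roughly like this (a hedged sketch, not the exact class from my implementation): without these two overrides, HashSet falls back to Object’s identity-based equals() and hashCode(), so two Forest objects with identical counts are treated as distinct.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Forest {
    final int goats, wolves, lions;

    Forest(int goats, int wolves, int lions) {
        this.goats = goats;
        this.wolves = wolves;
        this.lions = lions;
    }

    // Value-based equality: two forests are equal iff all three counts match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Forest)) return false;
        Forest f = (Forest) o;
        return goats == f.goats && wolves == f.wolves && lions == f.lions;
    }

    // Must be consistent with equals(), or HashSet still misbehaves.
    @Override
    public int hashCode() {
        return Objects.hash(goats, wolves, lions);
    }

    public static void main(String[] args) {
        Set<Forest> set = new HashSet<>();
        set.add(new Forest(1, 1, 3));
        set.add(new Forest(1, 1, 3)); // duplicate, now recognised as such
        System.out.println(set.size()); // prints 1
    }
}
```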

They say when you don’t know what to do, do what you know. So I translated the original D implementation (the 200ms one using .sort.uniq) into C, albeit using a naive quicksort instead of whatever D uses, and it ran in around 90ms, quite competitive with the C++. What’s more, its performance didn’t degrade with higher input like the D hashmap implementation; (500,500,500) took 7s in C++, and only 9s in C. On a hunch, I decided to try changing the D version to use quicksort too, via the following elegant implementation (don’t worry, I promise this is the last D I’ll show! It’s actually not even mine; I found it on the internet).

forest_t[] quickSort(forest_t[] fors) pure nothrow {
    if (fors.length >= 2) {
        auto parts = partition3!(forLessThan)(fors, fors[$ / 2]);
        parts[0].quickSort;
        parts[2].quickSort;
    }
    return fors;
}

And voila, it was as fast as the C, and around 80% of the C++. Presumably the default sort in D’s std.algorithm is just particularly unsuited to this case; sorting algorithms are often designed to guarantee good performance on worst-case input (unlike quicksort, whose worst case is O(n*n)), but that guarantee introduces a constant factor which can make them slower on favourable input than the appropriately named quicksort.

Armed with this knowledge, I was certain I’d finally be able to give the Java implementation the speedup I’d been searching for. I thus set about translating the C implementation into Java, and by translating I mean transliterating. I capitalised on the fact that an array of struct{int, int, int} is equivalent to an array of ints, in order to get past Java’s lack of value types, and changed various functions to take and reuse buffers, in the end managing to remove ‘new’ from the main loop completely. My miraculously ugly Java quicksort:

public void qsFors(int start, int len) {
    if (len < 2) {
        return;
    }
    int p = len / 2 + start;
    forests[(this.cap - 1) * 3] = this.gAt(p);
    forests[(this.cap - 1) * 3 + 1] = this.wAt(p);
    forests[(this.cap - 1) * 3 + 2] = this.lAt(p);
    int left = start;
    int right = start + len - 1;
    while (left <= right) {
        if (forLessThan(left, this.cap - 1)) {
            left++;
            continue;
        }
        if (forLessThan(this.cap - 1, right)) {
            right--;
            continue;
        }
        swapFors(left, right);
        left++;
        right--;
    }
    qsFors(start, right - start + 1);
    qsFors(left, start + len - left);
}

Note how it uses integers like pointers, as the array of forests is actually just an array of ints. Also note how it stores the pivot goat, wolf and lion values at the final 3 spots in the array, as the ‘forLessThan’ function can only compare two parts of the same array. If you’re thinking this could lead to problems if the final 3 spots in the array contained a valid forest, you’re right, but at that point there’d be an extremely high probability of overflowing the array and getting an ArrayIndexOutOfBoundsException anyway.

How’d it go? In spite of having essentially written C in Java, it was actually slightly slower than the original functional implementation from the original blog! So a big congratulations to the writers of java.util.stream. I’ve no idea how distinct() works, but it certainly kicked my ass.

For fun, I also wrote a few other implementations. One in Haskell, which was surprisingly fast considering that for removing duplicates I just stuck everything into a Data.Set and took it out again. It ran in around 350ms for input (100,100,100), however it slowed down at higher inputs, presumably due to garbage or hash collisions.

meals :: V.Vector Forest -> V.Vector Forest
meals fors = V.fromList $ S.toList $ S.fromList $ V.toList
           $ snd $ V.unstablePartition forInvalid $ V.concatMap meal fors

I’d be interested to see how fast a version using sort and removing consecutive duplicates would be; I attempted one myself, however it was rather slow, as I lacked the monad-fu to write an in-place partitioning function for removing duplicates.

Minor gripe: why doesn’t Haskell support generic functions on collections? Why do I have to use Set.fromList for a set but Vector.fromList for a vector? Why isn’t there a generic map function? D seems to handle this fine via ranges, and even C++ manages it via std::algorithm.

Edit: Turns out it can, via type classes, but Haskell programmers don’t always write it this way.

I also wrote a Go version, but it’s slower than I’d have expected, presumably because the Go standard library sorting algorithm uses heapsort “if depth of 2*ceil(lg(n+1)) is reached”. It also generates more garbage than the D version, as D’s creator Walter Bright recently did some much-needed work on removing allocations from D’s std.algorithm. So a big thank you to Walter for ironing out that wart on the language!

I wrote a Common Lisp version too, but it’s rather slow as I was too lazy to write my own unique function, and the one in the standard library (‘remove-duplicates’) uses the naive O(n*n) algorithm for some reason. I’d be quite excited to see how fast it is if someone wants to write up an improved version!

Finally, I attempted a Rust version, and although I wasn’t able to make it functional as the iterator syntax got the better of me, I did get an imperative version working, and it ran at around the same speed as the C and D versions.

I had one more idea for optimising the Java version. If I could create an ideal hashset, with O(n) uniquifying time and minimal garbage generation, I should be able to beat the C++ implementation’s O(n log n). A simple way to create a perfect hash (unique for any forest) is by using the first n bits of the goats, wolves and lions values of a forest, as follows:

ulong hash = 0;
hash |= this.goats;
hash <<= BITS_FOR_HASH;
hash |= this.wolves;
hash <<= BITS_FOR_HASH;
hash |= this.lions;

This produces a maximum value MAX of 1<<(BITS_FOR_HASH*3), meaning if BITS_FOR_HASH is less than eleven it will fit in a uint (just to clarify for any Haskell programmers reading this, that ‘<<=’ is to bitshift left as ‘+=’ is to ‘+’; it’s not a bind). To guarantee O(1) insertion and removal (no collisions), a simple if super space-inefficient solution is to use a bitset of size MAX, as follows:

forest_t[] uniquify(forest_t[] fors) {
    auto newFors = uninitializedArray!(forest_t[])(fors.length);
    size_t i = 0;
    foreach (f; fors) {
        ulong hash = f.Hash();
        if (this.bits[hash] == false) {
            this.bits[hash] = true;
            newFors[i] = f;
            i++;
        }
    }
    newFors.length = i;
    return newFors;
}

While incredibly slow for small inputs (as it’s allocating and deallocating hundreds of megabytes every call), for an input of (500,500,500) it only took around 20 seconds, just slightly less than half the speed of the quickSort+uniq implementation. I suspect that with even higher input it could well prove to be faster, however input greater than (512,512,512) is not possible, as the maximum goat/wolf/lion value that can be hashed with BITS_FOR_HASH < 11 is 1023. Using BITS_FOR_HASH of 11 or larger is not possible, as it wouldn’t translate into the Java version: the maximum hash value would be greater than 2^32, and there’s no way to index into a Java BitSet with a long. The only option would be to write my own Java implementation of a long-indexable bitset, and ain’t nobody got time for that, so I admit defeat. java.util.stream: 1. Logicchains: 0.
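For what it’s worth, here’s a sketch of what the Java version might have looked like for BITS_FOR_HASH = 10, where the hash does fit in an int and java.util.BitSet suffices (the class name `BitsetUniq` and the flat-int-array layout are my own; this is the technique from the D code above, not code from my benchmarks):

```java
import java.util.Arrays;
import java.util.BitSet;

public class BitsetUniq {
    static final int BITS_FOR_HASH = 10; // supports counts up to 1023 per species

    // Perfect hash: pack the three counts into one int. Unique for any
    // forest whose counts all fit in BITS_FOR_HASH bits.
    static int hash(int goats, int wolves, int lions) {
        return (((goats << BITS_FOR_HASH) | wolves) << BITS_FOR_HASH) | lions;
    }

    // fors is a flat array of (goats, wolves, lions) triples; returns a
    // flat array with duplicate forests removed, preserving order.
    // Pre-sizing the BitSet to MAX bits is the "super space-inefficient"
    // part: 2^30 bits is over a hundred megabytes per call.
    static int[] uniquify(int[] fors) {
        BitSet seen = new BitSet(1 << (3 * BITS_FOR_HASH));
        int[] out = new int[fors.length];
        int n = 0;
        for (int i = 0; i < fors.length; i += 3) {
            int h = hash(fors[i], fors[i + 1], fors[i + 2]);
            if (!seen.get(h)) {
                seen.set(h);
                out[n++] = fors[i];
                out[n++] = fors[i + 1];
                out[n++] = fors[i + 2];
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // (3,1,1) appears twice; the second copy is dropped.
        int[] r = uniquify(new int[]{3, 1, 1, 1, 3, 1, 3, 1, 1});
        System.out.println(Arrays.toString(r));
    }
}
```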

Moral of the story: profile before attempting to optimise, don’t assume functional programming necessarily performs poorly, and give the D language a try sometime; while it may have some wrinkles, it can nevertheless be quite powerful and elegant.

Also, sometimes it pays to implement your own sorting function if you know the input won’t be worst-case, as demonstrated here by quicksort being twice as fast as D’s standard sort. (Or, conversely, maybe D needs to improve its default sorting algorithm for this kind of input, as C++’s and Rust’s weren’t so slow.)

Edit: The Java HashSet implementation of the problem I was using was buggy in that I didn’t realise I had to manually implement hashCode() and equals() on the Forest class. Fixing that made it actually faster than the C++ at higher inputs, so yay, optimisation complete!

Also, changing the D to use the default stable sort (like the C++) actually made it as fast as the C++ for higher inputs (such as 500,500,500). Stable sorting is faster in this case than unstable sorting, possibly because the input is already partially sorted (forests generate similar forests nearby to themselves in the array).

Generic functions on collections are possible in Haskell, via type classes; they’re just not shown in the tutorials, as they can produce error messages that are less comprehensible to beginners.

Finally, there’s a superior algorithmic implementation that runs in linear time, described here.