A while back I posted about solving Project Euler's problem 14 (a.k.a. the Collatz conjecture) in JavaScript. For those just joining us, in lieu of a recap I'll let Randall Munroe explain things:

[xkcd: "Collatz Conjecture"]

The object of the exercise is to find the number with the longest Collatz orbit under a certain limit; in other words, the number that takes the most iterations to get to 1.
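The code below calls a collatz step helper (and later an orbit function) without ever showing them; here's a minimal sketch of what they presumably look like. The exact implementation is my assumption, but it matches the counting convention used throughout (an orbit of 1 has length 1).

```cpp
#include <cassert>

// One Collatz step: halve even numbers, map odd n to 3n + 1.
// (Assumed helper; the post uses it without defining it.)
unsigned long long collatz(unsigned long long n) {
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

// Orbit length: how many numbers we visit, counting both n and the final 1.
int orbit(unsigned long long n) {
    int len = 1;
    for (; n != 1; ++len)
        n = collatz(n);
    return len;
}
```

With this convention orbit(1) is 1, and 27 (which famously takes 111 steps to reach 1) has an orbit length of 112.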

I started using this question when interviewing, and it got me thinking about how fast I could get it to run. This post is the result: basically a study of how much time I can waste optimizing a very small toy problem, as well as a blatant violation of the two rules of optimization:

1. Don't optimize.
2. (Experts only) Don't optimize yet.

Along the way I’ll try to find some general lessons we can learn.

Finding the asymptotic complexity of the maximal Collatz orbit is beyond my meagre mathematical/CS abilities, but I take comfort in knowing that (for the moment at least) I'm in good company. The best case for finding the Collatz orbit length of a number N is when N is a power of two: we just halve the number until we get to one, in exactly log₂(N) steps. However, I seem to remember from uni that not all numbers are powers of two. As for the worst case, it is unknown; if the Collatz conjecture is wrong then there exists a number which takes ∞ steps, which takes us out of the field of complexity and into halting-problem land. Since the problem at hand requires finding the Collatz orbits of all numbers up to a given N, we must multiply by N to get a minimum of N·log(N).

Failing to mathematically deduce the complexity leaves us with the empirical way: if we count how many times we call the collatz function, we'll know the complexity. This is charted as "operations" in the following graph, and we can see that it falls between N·log(N) and N² (which means that the orbit length of each given number is O(N)).
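The "operations" series in the chart can be reproduced with a trivially instrumented version of the step function. A sketch (the counter and the count_ops name are mine, not from the original code):

```cpp
// Global counter of collatz invocations (for measurement only).
unsigned long long calls = 0;

unsigned long long collatz(unsigned long long n) {
    ++calls;
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

// Total collatz calls needed to walk every orbit from 2 up to limit,
// with no caching at all.
unsigned long long count_ops(int limit) {
    calls = 0;
    for (int i = 2; i <= limit; ++i)
        for (unsigned long long n = i; n != 1; )
            n = collatz(n);
    return calls;
}
```

Plotting count_ops for growing limits against N·log(N) and N² gives the comparison described above.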

The other series in the chart show how many operations take place when memoization is used (as described in my previous post) and the space complexity (cache size) of said optimization.

Setting the baseline

The problem with C++ is that it's too fast: the original problem required us to find the number with the longest Collatz orbit under one million, in less than a minute. The most basic solution takes 0.734 seconds, so I did all my measurements on 10,000,000, which takes 8.263 seconds for the iterative solution; this number will serve as our baseline[1].

```cpp
int iterative(int limit) {
    std::pair<int, int> max = std::make_pair(0, 0);
    for (int i = 2; i <= limit; ++i) {
        unsigned long long curr = i;
        int len = 1;
        for (; curr != 1; ++len)
            curr = collatz(curr);
        if (len > max.first)
            max = std::make_pair(len, i);
    }
    return max.second;
}
```

Iterative to recursive

The problem with memoizing the values of the numbers we pass through is that we don't know what a value will be before we run into it, which makes a recursive solution much more natural than the iterative one. I made the function receive the cache as a template argument so that the compiler can inline all the function calls for the different types of caches.

```cpp
template <class T>
int length(unsigned long long n, T& cache) {
    if (n == 1) return 1;
    int ret = cache.get(n);
    if (ret) return ret;
    ret = 1 + length(collatz(n), cache);
    cache.set(n, ret);
    return ret;
}
```

Now all we have to do is plug in the caching methods. First we'll use no caching at all, to verify that the algorithm is correct.

```cpp
struct no_cache {
    void set(unsigned long long, int) { }
    int get(unsigned long long) const { return 0; }
};
```

Unfortunately the recursive method is much slower (all those function calls aren't free): we're now clocking in at 20.83 seconds (252% of our baseline). Let's see if we can shave this time off.
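The post never shows the outer loop that drives length() (the analogue of iterative()), so here is my reconstruction of what it plausibly looks like, together with the pieces it needs:

```cpp
#include <utility>

// Assumed step function (see above).
unsigned long long collatz(unsigned long long n) {
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

struct no_cache {
    void set(unsigned long long, int) { }
    int get(unsigned long long) const { return 0; }
};

template <class T>
int length(unsigned long long n, T& cache) {
    if (n == 1) return 1;
    int ret = cache.get(n);
    if (ret) return ret;
    ret = 1 + length(collatz(n), cache);
    cache.set(n, ret);
    return ret;
}

// Reconstructed driver: find the number <= limit with the longest orbit.
template <class T>
int recursive(int limit, T& cache) {
    std::pair<int, int> max = std::make_pair(0, 0);
    for (int i = 2; i <= limit; ++i) {
        int len = length(i, cache);
        if (len > max.first)
            max = std::make_pair(len, i);
    }
    return max.second;
}
```

For small limits the well-known answers come out: 97 has the longest orbit below 100, and 871 below 1000.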

Simple cache

The simplest cache to use is std::map, the bread and butter of any C++ programmer. Here's an adapter class so we can plug in different types of maps.

```cpp
template <class T>
class map_cache {
    T map;
public:
    int get(unsigned long long n) {
        typename T::iterator i = map.find(n);
        if (i == map.end()) return 0;
        return i->second;
    }
    void set(unsigned long long n, int i) {
        map.insert(std::make_pair(n, i));
    }
};
```

This clocks in at 20 seconds, a small improvement. But wait, why does it feel so slow? Putting in another timer I see that it's actually taking 25.3 seconds (306%); the extra time (after the solution was found) goes into destroying the map.

Hash map

std::map has logarithmic complexity, so a no-brainer would be to use a constant-time cache. Luckily Visual Studio 9 supplies std::tr1::unordered_map; that's bound to improve things.

Nope, 26.22 seconds; things are just going from bad to worse (317%). How can a constant-time algorithm do worse than a logarithmic one? Well, it all depends on your constants. It's easy to forget that as programmers we mostly use bounded numbers; even logarithmic complexity on a 32-bit number is constant if you take the big view. Anyway, things aren't that bad: if we use boost::unordered_map instead of Visual Studio's implementation we take "only" 14.57 seconds (176%).
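Because the adapter is templated, swapping map implementations is a one-line change at the instantiation site. A sketch using the standard (post-TR1) std::unordered_map name; the typedef names are mine:

```cpp
#include <map>
#include <unordered_map>

template <class T>
class map_cache {
    T map;
public:
    int get(unsigned long long n) {
        typename T::iterator i = map.find(n);
        if (i == map.end()) return 0;
        return i->second;
    }
    void set(unsigned long long n, int i) {
        map.insert(std::make_pair(n, i));
    }
};

// Same adapter, two different underlying maps:
typedef map_cache<std::map<unsigned long long, int> > tree_cache;
typedef map_cache<std::unordered_map<unsigned long long, int> > hash_cache;
```

Which one wins depends on the data set and on the quality of the implementation, as the timings above show.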

Lesson 1: I wouldn't go so far as to say that MS's implementation of unordered_map is inferior, but you should remember that you depend on the quality of the implementation you're using, and some implementations will work better with some data sets.

Vectoring in on a solution

One of my interviewees suggested using a vector rather than a map, at least for the first N elements. The reasoning is that we know we'll be accessing all of these numbers, so there's no need for a data structure that supports sparsity; this way we gain the natural advantages of std::vector, namely low overhead and high data locality.

```cpp
template <class T>
class vec_map {
protected:
    const unsigned long long limit;
    std::vector<int> vec;
    T map;
public:
    explicit vec_map(int limit)
        : limit(limit)
        , vec(limit)
    { }
    int get(unsigned long long n) const {
        if (n < limit)
            return vec[static_cast<int>(n)];
        typename T::const_iterator i = map.find(n);
        if (i != map.end())
            return i->second;
        return 0;
    }
    void set(unsigned long long n, int i) {
        if (n < limit)
            vec[static_cast<int>(n)] = i;
        else
            map.insert(std::make_pair(n, i));
    }
};
```

That made quite a difference: we're now down to 8.32 seconds (less than 101% of the iterative method, but still above 100%). What else can we squeeze out?

Lesson 2: "Invisible" considerations such as data locality and cache friendliness[2] can have a very profound effect.

Being Picky

Our cache grows to be rather big: for the 10M range we have 21,730,848 elements stored. Perhaps we can reduce that number. Let's think about when we can encounter a number:

1. When checking this specific number (relevant only for numbers ≤ N)
2. When reaching the number from above (coming from 2N)
3. When reaching the number from below (coming from (N-1)/3)

So if we only consider the numbers greater than our initial range, any value stored in the cache will be used at most one time! That seems rather wasteful; it means that once we use a number we can remove it from the cache.

It also means that some cached numbers are never used, since they cannot be reached from below (when the number minus 1 is not divisible by 3).
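The claim that 2N and (N-1)/3 are the only ways into a number can be sanity-checked by brute force. A sketch (the function name and scan range are mine):

```cpp
#include <vector>

unsigned long long collatz(unsigned long long n) {
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

// Find every m in [2, scan] with collatz(m) == n. Per the reasoning
// above, the result can only ever contain 2n and, when n-1 is
// divisible by 3, (n-1)/3.
std::vector<unsigned long long> predecessors(unsigned long long n,
                                             unsigned long long scan) {
    std::vector<unsigned long long> pred;
    for (unsigned long long m = 2; m <= scan; ++m)
        if (collatz(m) == n)
            pred.push_back(m);
    return pred;
}
```

For example, 16 is reached from 5 (3·5+1) and from 32 (halving), while 8 is reached only from 16, since 8-1 = 7 is not divisible by 3.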

```cpp
template <class T>
class selective_vec_map {
protected:
    const unsigned long long limit;
    std::vector<int> vec;
    T map;
public:
    explicit selective_vec_map(int limit)
        : limit(limit)
        , vec(limit)
    { }
    int get(unsigned long long n) {
        if (n < limit)
            return vec[static_cast<int>(n)];
        if (((n - 1) % 3) != 0)
            return 0;  // not reachable from below, never cached
        typename T::iterator i = map.find(n);
        if (i != map.end()) {
            int ret = i->second;
            map.erase(i);  // the value can only be used once
            return ret;
        }
        return 0;
    }
    void set(unsigned long long n, int i) {
        if (n < limit)
            vec[static_cast<int>(n)] = i;
        else if (((n - 1) % 3) == 0)  // only cache values reachable from below
            map.insert(std::make_pair(n, i));
    }
};
```

Great, now we're down to 6.095 seconds, 73.76% of the iterative solution; all the hard work was worth something! <pat self on back> But wait, why are we paying for the destruction of the cache? We don't need to reclaim the memory (the process is exiting; the OS will take care of that), and we know that int and unsigned long long don't have side effects during destruction.

This goes against a C++ programmer's training, but leaking memory during exit is not a big deal; all we need to do is add a malicious destructor.

```cpp
// Deliberately leak the map's contents so they are never destroyed.
~selective_vec_map() {
    T* leak = new T();
    map.swap(*leak);
}
```

Now we’re down to 4.77 seconds just 57.8 percent of our original time.

Lesson 3: If you know that destroying your objects has no needed side effects, it may improve the user's experience to forgo unneeded cleanup (I personally hate it when an application I close takes its time shutting down).

Here’s a graph to recap.

The other fork in the road

Another obvious optimization (at least it's obvious in retrospect; I only thought of it after writing most of this post) is to look at only the second half of the numbers in the range. The number with the longest orbit can't be less than limit/2, since for every number N ≤ limit/2, 2N is also in our range and orbit(2N) = orbit(N) + 1. When looking at only the second half of the numbers in the range, our iterative solution takes 4.1 seconds (just over 50% of the time, as one would expect) and there is no detectable change in the caching methods (we have to cache the same number of elements).
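The change to the iterative baseline is essentially one token. For clarity, a sketch of the halved loop (the collatz helper is assumed, as before):

```cpp
#include <utility>

unsigned long long collatz(unsigned long long n) {
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

// Only the top half of the range can hold the answer: for any
// n <= limit/2, 2n is also in range and has a strictly longer orbit.
int iterative_half(int limit) {
    std::pair<int, int> max = std::make_pair(0, 0);
    for (int i = limit / 2; i <= limit; ++i) {
        unsigned long long curr = i;
        int len = 1;
        for (; curr != 1; ++len)
            curr = collatz(curr);
        if (len > max.first)
            max = std::make_pair(len, i);
    }
    return max.second;
}
```

The known answers (97 below 100, 871 below 1000) all lie in the top half of their ranges, as the argument predicts.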

All that hard work got us down to 58%, and simply changing 2 to limit/2 does better; it's almost enough to make me cry.

Lesson 4: Don't get attached to your code; if something better comes along, don't hesitate to throw out the fruit of your toil.

The Elephant in the room

Some of you may have noticed that this is an embarrassingly parallel problem; if you haven't noticed this fact, it's time for a paradigm shift. My computer today is dual core (pretty low end, I know) and I don't expect to see a single-core computer (barring embedded devices) in the future. By using Microsoft's Parallel Patterns Library (available in VS10) I can write code that will scale to use all available CPUs on the target machine.

```cpp
#include "ppl.h"

int parallel(int limit) {
    auto max = std::make_pair(0, 0);
    Concurrency::concurrent_vector<decltype(max)> convec;
    Concurrency::parallel_for(limit / 2, limit + 1,  // +1 to include limit
        [&convec](int i) {
            convec.push_back(std::make_pair(i, orbit(i)));
        });
    std::for_each(convec.begin(), convec.end(),
        [&max](decltype(max) curr) {
            if (curr.second > max.second)
                max = curr;
        });
    return max.first;
}
```

This example uses some of the C++0x features available in VS10 as well as the PPL library.

Using PPL takes about 58% of the time on a dual-core machine; add that to the 50% improvement supplied by limit/2 and we have a winner, even on a measly dual-core machine.

Running the same program on a 16-core machine takes only 11.9% of the time, with no code changes (or even recompilation).
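PPL is Microsoft-specific; for readers on other platforms, the same pattern of per-thread local bests merged at the end (which is what combinable automates, see below) can be sketched portably with C++11's std::thread. The chunking scheme and names here are my own:

```cpp
#include <thread>
#include <utility>
#include <vector>

unsigned long long collatz(unsigned long long n) {
    return (n % 2 == 0) ? n / 2 : 3 * n + 1;
}

int orbit(unsigned long long n) {
    int len = 1;
    for (; n != 1; ++len)
        n = collatz(n);
    return len;
}

// Each thread scans a contiguous chunk of [limit/2, limit] and keeps
// its own local best; the bests are merged after all threads join.
int parallel_threads(int limit, int nthreads) {
    std::vector<std::pair<int, int> > best(nthreads, std::make_pair(0, 0));
    std::vector<std::thread> threads;
    int first = limit / 2;
    int count = limit - first + 1;
    for (int t = 0; t < nthreads; ++t) {
        int lo = first + t * count / nthreads;
        int hi = first + (t + 1) * count / nthreads;
        threads.push_back(std::thread([lo, hi, t, &best] {
            for (int i = lo; i < hi; ++i) {
                int len = orbit(i);
                if (len > best[t].second)
                    best[t] = std::make_pair(i, len);
            }
        }));
    }
    for (int t = 0; t < nthreads; ++t)
        threads[t].join();
    std::pair<int, int> max = std::make_pair(0, 0);
    for (int t = 0; t < nthreads; ++t)
        if (best[t].second > max.second)
            max = best[t];
    return max.first;
}
```

Because each thread writes only to its own slot of best, no locking is needed until the final single-threaded merge.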

Lesson 5: Technology is changing; keep up with it.

Edit: I can't believe I didn't know about combinable, which makes the parallel code even faster (it keeps one value per thread in TLS).

```cpp
#include "ppl.h"

int parallel(int limit) {
    typedef std::pair<int, int> pint;
    Concurrency::combinable<pint> comb;
    Concurrency::parallel_for(limit / 2, limit + 1,
        [&comb](int i) {
            auto p = std::make_pair(i, orbit(i));
            if (p.second > comb.local().second)
                comb.local() = p;
        });
    auto max = comb.combine([](pint a, pint b) {
        return a.second < b.second ? b : a;
    });
    return max.first;
}
```

I guess I haven’t been keeping up with technology 😦

[1] I’m using unsigned long long since int isn’t big enough.↩

[2] One of the best talks I've seen online was Herb Sutter's Machine Architecture: Things Your Programming Language Never Told You (video, slides). It's a bit long (almost two hours) but well worth watching! I gained a whole new perspective on how computers work and what to keep in mind while coding.↩