I’m planning a post on how to maximize build parallelism in VC++, but first I needed to create a simple project that was slow to compile. Creating such a program was an interesting exercise in its own right and I decided that it deserved a separate blog post.

I started with a recursive algorithm that ran in exponential time. When I made it a compile-time algorithm I found that it compiled in linear time – too fast to be useful for my peculiar purposes. In order to get a slow-to-compile file I had to understand and then prevent the optimization that was allowing my result to be calculated so efficiently at compile time.

Is it wrong that I get so much satisfaction from defeating compilers’ optimizations?

Fibonacci, calculated slowly

Calculating a Fibonacci number using recursion is a great way to waste some CPU time. Here’s what a simple recursive Fibonacci calculation of F(n) looks like:

```cpp
int fib(int n) {
    if (n <= 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}
```

Most calls to fib(n) result in two more recursive calls, so one might guess that the expected run-time cost is O(2^n). In fact the cost of fib(n) is O(fib(n)) – the run-time is proportional to the result – which works out to about O(1.618^n). This is still exponential, which means that relatively small values of n can cause fib(n) to take a long time to run.

Yes, I know that this is a foolish way to implement fib(n). It is trivial to structure the code so that it is O(n). I’m trying to make something slow.

Executing fib(45) with all optimizations on takes about 4 – 5 s on my laptop for the 2.3 billion function calls.

Pop quiz: if we assume 32-bit ints then what, according to the C++ standard, happens if we calculate fib(47)? (F(46) = 1,836,311,903 still fits; F(47) = 2,971,215,073 does not.)

Better living through templates

My goal, however, was to calculate F(n) at compile time, in order to get exponential compile times for my parallel compilation tests. So I translated the recursive function calls into recursive template metaprogramming:

```cpp
template <int N> struct Fib_t {
    enum { value = Fib_t<N - 1>::value + Fib_t<N - 2>::value };
};

// Explicitly specialized for N==2
template <> struct Fib_t<2> {
    enum { value = 1 };
};

// Explicitly specialized for N==1
template <> struct Fib_t<1> {
    enum { value = 1 };
};

...
printf("Fib_t<45> is %d.\n", Fib_t<45>::value);
```

This is a simple, albeit not particularly useful, example of template metaprogramming. Types are used for recursion, and the explicit template specializations halt the recursion. On the face of it this should give the same exponential performance as fib(int n), only at compile time instead of at run-time. I expected it to be slower than the run-time solution because creating a new type is more expensive than a simple function call.

But it wasn’t.

Fib_t<45>::value compiled with no perceptible delay. After turning off warnings about overflows in constant integer arithmetic I compiled Fib_t<500>::value in a fraction of a second. I could not measure any cost for this template madness. Clearly the exponential cost was being completely avoided.
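One way to convince yourself that the value really is computed entirely at compile time is a static_assert (C++11 or later). This is my own check, not part of the original code – if the translation unit compiles, the values were already known to the compiler:

```cpp
// Same metaprogram as above, with compile-time checks added.
template <int N> struct Fib_t {
    enum { value = Fib_t<N - 1>::value + Fib_t<N - 2>::value };
};
template <> struct Fib_t<2> { enum { value = 1 }; };
template <> struct Fib_t<1> { enum { value = 1 }; };

// These fire at compile time, proving no run-time work is involved.
static_assert(Fib_t<10>::value == 55, "Fib_t<10> should be F(10)");
static_assert(Fib_t<45>::value == 1134903170, "Fib_t<45> should be F(45)");
```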

Recursion in linear time

Before analyzing how the compiler is able to handle this case so efficiently, let's try optimizing our recursive fib(n) function while still keeping it recursive.

The key observation is that fib(n) gets called exponentially many times, but only actually calculates n different values. If we cache the results of fib(n) for each value of n then we can avoid the replicated calculations and the exponential explosion. Here’s some simple code that caches the results:

```cpp
#include <map>

int fibfast(int n) {
    // Non-zero entries in this map contain the
    // value of fib(n). Not thread safe!
    static std::map<int, int> s_fibValues;

    // Check the cache.
    int& result = s_fibValues[n];
    if (result == 0) {
        // Calculate fib(n) and store it in the cache.
        result = 1;
        if (n > 2)
            result = fibfast(n - 1) + fibfast(n - 2);
    }
    return result;
}
```

The code simply maintains a map of results and it consults this map before doing any calculations. This technique is known as memoization.

Below I have a diagram showing the process of calculating fib(7) using the original fib(n) function. Each arrow represents a function call, radiating out from the bottom left corner. We can see that fib(2) is called eight times to calculate fib(7), and this gets exponentially worse as n increases.

Without caching we visit each diamond. If we cache values, as shown in fibfast(n), then we only need to visit each column once. The cells with the blue backgrounds are the only ones we have to visit, and the cost drops from exponential to linear. It almost feels magical.

The exact set of cells visited depends on expression evaluation order and may include more than just the blue cells, but we still do only O(n) function calls and n calculations.

Again, I’m not recommending this way of calculating Fibonacci numbers. Starting from F(1) and building up is much simpler. Recursion simply doesn’t make sense unless you are trying to be slow.
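For reference, the sensible bottom-up version mentioned above might look like this (the function name is mine):

```cpp
// Iterative O(n) Fibonacci: start from F(1) and F(2) and build up.
int fib_iter(int n) {
    int prev = 1, cur = 1;  // F(1), F(2)
    for (int i = 3; i <= n; ++i) {
        int next = prev + cur;
        prev = cur;
        cur = next;
    }
    return cur;
}
```

No recursion, no cache, just two integers of state – which is exactly why it's no use when the goal is to be slow.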

Compile-time caching

The first time you use a class template the compiler instantiates it – it stamps out a class of the appropriate type. If you then use that template again with the same template parameters, the compiler typically doesn't instantiate it again. That would be wasteful. Instead the compiler typically just reuses the previous instantiation.

This means that the compiler must be keeping a record of what types it has instantiated – a cache if you will. This cache behaves exactly like the cache in fibfast(n) and gives exactly the same performance benefit – it makes our exponential algorithm run in linear time.

Cool!

Defeating optimizations for fun and profit

Now that we know why the compiler is working so quickly we can figure out a way to defeat its optimizations. The basic idea is to avoid reuse of types. We can do this by adding another template parameter whose only purpose is to make the types unique. At each level of recursion we can simply pass this value along when going on the top branch, and set a bit (corresponding to our distance from the leaf nodes) when going down the bottom branch. Simple!

Here’s the code, with the explicit instantiations omitted for simplicity:

```cpp
template <int TreePos, int N> struct FibSlow_t {
    enum {
        value = FibSlow_t<TreePos, N - 1>::value +
                FibSlow_t<TreePos + (1 << N), N - 2>::value,
    };
};
```

TreePos is the unique number, and when going down the top branch we just pass it along – it will be zero all along the top of our diagram. When going down the bottom branch (the N – 2 branch) we set a bit. The combination of TreePos and N ensures that we never instantiate the same type twice. Optimization defeated!
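The omitted pieces that halt the recursion presumably look something like this – partial specializations on N that leave TreePos free, so the recursion terminates for every tree position (this is my reconstruction; the original post omits them):

```cpp
template <int TreePos, int N> struct FibSlow_t {
    enum {
        value = FibSlow_t<TreePos, N - 1>::value +
                FibSlow_t<TreePos + (1 << N), N - 2>::value,
    };
};

// Partial specializations halt the recursion for any TreePos
// (my reconstruction -- the original post omits these).
template <int TreePos> struct FibSlow_t<TreePos, 2> { enum { value = 1 }; };
template <int TreePos> struct FibSlow_t<TreePos, 1> { enum { value = 1 }; };
```

Note that `value` never actually depends on TreePos – the extra parameter exists purely to make every instantiation unique, so the computed result is still the ordinary Fibonacci number.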

How slow can we go?

With Fib_t I could measure no slowdown, but with FibSlow_t I finally got the exponential compile times that I so desperately wanted. Compiling FibSlow_t<0,18> takes about 1.7 seconds, and each increase in N multiplies this by about 2.8. I'm not sure why the growth factor is larger than the roughly 1.618 that the Fibonacci recursion predicts, but I'm just happy to have slain the efficient-compilation dragon. FibSlow_t<0,21> takes about 40 seconds to compile, which is more than slow enough for my purposes, and the memory usage (about 356 MB for FibSlow_t<0,21>) is modest enough to allow lots of compiler parallelism.

Look for this code being put to ‘practical’ use in a future post that will cover how to get maximum parallelism from the VC++ compiler.

Thanks to STL expert Stephan T. Lavavej for reviewing this and suggesting many improvements.

Reddit discussion is here and here.