Template Haskell for faster linear interpolation

An application I’m writing requires a lot of linear interpolation of a function. There’s so much of it, in fact, that it was the biggest single contributor to the running time. In this article I show how I used Template Haskell to make it much faster and cut the overall running time of the application by about a third.

To be clear about the problem, we have a list of points (which I’ll call the “table”) like this:

These points are input-output pairs for some function f, and we want to define the points in the middle by joining the dots with straight lines. For example, since f(0.5) = 0.4 and f(1.0) = 0.84, we want to have that f(0.75) = 0.66 because it’s halfway between those two points.

So, for a given input value x, we have to find the two points (a,av) and (b,bv) such that x is between a and b, and the result should be calculated like this:

All we need to do is find those two neighbouring points. For any values that are outside of the range of the table, we just use the nearest value; e.g. for the sample table above, f(100) = -0.54.

Linear lookup

A lot of the time, you might write a simple linear traversal of the table like this and call it a day:

When I was writing the application in question, I was struck by the feeling that this was not “Haskelly” enough – there’s direct recursion, and it doesn’t even have any monoids. So I wrote this instead:

It’s the same linear-traversal strategy, but written so that it’s harder to understand. (It occurred to me in writing this article that it might be clearer to use a fold directly instead of the First monoid, but I didn’t try this.) The only thing you could say in its defense is that the “selection” part (called “f” there) is separated from the rest of the algorithm.

Map lookup

A linear traversal is simple, but has the drawback that to get to values near the end of the table we have to traverse the whole table. Profiling my application suggested that this monoidal lookup was spectacularly slow. Maybe we can use a Map instead?

The first argument to mapLookup is just the table converted to a Map ; e.g. Data.Map.fromList table . This approach reduces the worst case from O(N) to O(log N), but hurts the best case and introduces some overhead in general. It also seems a bit inefficient to do two lookups (the lookupLE and lookupGE ) when they are both really looking for the same spot in the tree.

In my application, I used this mapLookup function for a long time, until improvements in other parts of the program meant that interpolation was again the bottleneck.

My next approach was to encode the lookup as a series of <= / > comparisons, essentially the same as how a Map would work internally. The table is encoded as a tree, where each entry in the table becomes a node with two children, or a leaf.

The makeTree function turns a table into a tree, evenly distributing the table entries to the left and right. Doing the lookup is similar to a standard binary search, with the addition of keeping track of the most recent times we took left and right branches.

Static Compilation

The point of the treeLookup function above is that we can easily “unroll” the loop, turning it into a series of nested comparisons. For example, for the simple table [(1,1), (2,4), (3,9)] , we’d like to get as a result a function like this:

The nice thing is that via Template Haskell we can do this with only some small modification of the treeLookup function above. The structure of the function stays essentially the same; the main work is in annotating which parts should be evaluated statically, and which parts are in the resulting “unrolled” function.

One hiccup is that Template Haskell won’t automatically lift a Double value to an expression — apparently, only integers and rationals are allowed. As a consequence I used the little function ld to turn a Double into a rational literal.

You can also unroll the linear lookup version:

Performance Comparison

We could debate which of these methods is the most elegant, but at least we can easily address the question of which one is the fastest.

(If you want to see the full source code to reproduce these results, or just for fun, see this repository on GitHub.)

I measured each method, looking up a value at the start, middle and end of the table; the table has 21 entries. Here are the results, courtesy of the criterion library:

(Click on the image above to get the full report and all the exciting details.) As we expected, linear lookups are fast when the target is at the front, but get slower as the target gets towards the end. Also, the binary lookups take about the same time no matter where the target is.

The monoidal lookup is impressively inefficient. The statically unrolled lookups are noticeably faster than their regular versions, so it’s a worthwhile transformation if the table is known at compile time.

For a bigger table (101 entries) the results are basically the same, but with the end-of-the-table-linear-lookup effect magnified. Note that the scale on the horizontal axis has changed.

Finally, I measured the effect that actually matters: how much faster my application runs. Here is a short sample run using the mapLookup function with some GHC and timing statistics:

2,743,929,608 bytes allocated in the heap [...] MUT time 2.55s ( 2.64s elapsed) GC time 0.33s ( 0.26s elapsed) [...] Total time 2.88s ( 2.91s elapsed) [...] ./SimpleMain newp02-bin +RTS -s 2.89s user 0.02s system 99% cpu 2.912 total

And using the statically unrolled tree lookup:

974,367,480 bytes allocated in the heap [...] MUT time 1.70s ( 1.72s elapsed) GC time 0.20s ( 0.20s elapsed) [...] Total time 1.90s ( 1.92s elapsed) [...] ./SimpleMain newp02-bin +RTS -s 1.90s user 0.02s system 99% cpu 1.922 total

There’s a reduction in allocation, but more importantly the running time of the program as a whole is reduced by about one third.

Conclusion

In the end this was a worthwhile exercise: I got a faster program and learnt about Template Haskell along the way.

Whether the same trick is useful in other places depends on the circumstances. The static compilation of the lookup was only useful because it is called so very often – maybe the algorithm could be changed at a higher level to avoid making so many lookups.

Christopher Mears, 9 July 2014