Vastly outperforming LAPACK with C++ metaprogramming

This is kind of a sequel to Outperforming LAPACK with C metaprogramming. The tricks described there were effective but rather ugly, and not quite scalable. What I propose here is the same idea (making compiler produce reducible deterministic code against its best judgment) but implemented in C++ template metaprogramming.

I will also show a quick win you can gain with this type of code, and point out a thing that may stand in a way. The solution I propose solves a linear system of 5 equations 10 times faster than LAPACK does.

Making the compiler unroll recursion

C++ supports functions with integer templates which work in a remarkably straightforward way. The compiler creates an instance of a function for every template's value it knows the function should be instantiated for. This means that you can't simply make a recursive template function even with the most explicit exit condition, because the compiler does not resolve conditions on early stages. If you make a usual factorial function into an integer template and, for instance, run it as factorial<3>(), the compiler would have to create factorial<3>(), and factorial<2>(), and factorial<1>(), and factorial<0>(), and factorial<-1>(), and so on. I'm not sure if it stops eventually due to integer overflow, and I'm not even willing to know.

The trick here is to patch a recursive call to do an extra instantiation with the exit condition for the case of that very extra condition. Factorial of 1 should call a factorial of 1 (not 0) in the dead branch, and the compiler will have to stop spawning instantiations, since it already has a factorial of 1. This might seem like a thing that would prevent inlining by creating natural recursion, but the instantiation would be eliminated from the dead branch eventually, so the inlining would remain possible.

Step 1. C++ source code #include <iostream> template <int N> int factorial(){ return N == 1 ? 1 : N*factorial<N - 1 + (N == 1)>(); } int main(){ return factorial<3>(); } #include template int factorial(){ return N == 1 ? 1 : N*factorial (); } int main(){ return factorial<3>(); } Step 2. C++ pseudo-code with all the instantiations int factorial<3>(){ return N == 1 ? 1 : N*factorial<2 + 0>(); } int factorial<2>(){ return N == 1 ? 1 : N*factorial<1 + 0>(); } int factorial<1>(){ return N == 1 ? 1 : N*factorial<0 + 1>(); } int main(){ return factorial<3>(); } int factorial<3>(){ return N == 1 ? 1 : N*factorial<2 + 0>(); } int factorial<2>(){ return N == 1 ? 1 : N*factorial<1 + 0>(); } int factorial<1>(){ return N == 1 ? 1 : N*factorial<0>(); } int main(){ return factorial<3>(); } Step 3. Pseudo-code after constant propagation int factorial<3>(){ return 3*factorial<2>(); } int factorial<2>(){ return 2*factorial<1>(); } int factorial<1>(){ return 1; } int main(){ return factorial<3>(); } int factorial<3>(){ return 3*factorial<2>(); } int factorial<2>(){ return 2*factorial<1>(); } int factorial<1>(){ return 1; } int main(){ return factorial<3>(); } Step 4. Pseudo-code after inlining int main(){ return 3*2*1; } int main(){ return 3*2*1; } Step 5. Assembly main: .LFB1497: .cfi_startproc movl $6, %eax ret .cfi_endproc main: .LFB1497: .cfi_startproc movl $6, %eax ret .cfi_endproc

Next step

The trick is as simple, as it is effective. Static computation is a byproduct here, the important thing is — we can make the compiler unroll any kind of recursion with no limitations on depth and branching. Well, no limitations except for the compilation time, of course.

Just like before I propose a recursive solution that is expected to unroll and reduce itself in the compile time. The input data is explicitly made volatile to prevent static computation.

#include <array> #include <iostream> namespace linear_equations { namespace semi_static { template <int I, int J, int K, int N> inline static double aij( const std::array<std::array<double, N>, N>& a) { if(K == N) return a[I][J]; return aij<I, J, K+(K<N), N>(a) * aij<K, K, K+(K<N), N>(a) - aij<I, K, K+(K<N), N>(a) * aij<K, J, K+(K<N), N>(a); } template <int I, int K, int N> inline static double bi( const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b) { if(K == N) return b[I]; return aij<K, K, K+(K<N), N>(a) * bi<I, K+(K<N), N>(a, b) - aij<I, K, K+(K<N), N>(a) * bi<K, K+(K<N), N>(a, b); } template <int J, int I, int N> inline static void d_for( double& d, const std::array<std::array<double, N>, N>& a, std::array<double, N>& x) { if(J < I) { d -= aij<I, J, I+(J<I), N>(a) * x[J]; d_for<J+(J<I), I, N>(d, a, x); } } template <int I, int N> inline static double di( const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b, std::array<double, N>& x) { double d = bi<I, I+1, N>(a, b); d_for<0, I, N>(d, a, x); return d; } template <int I, int N> inline static void x_for( const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b, std::array<double, N>& x) { if(I < N) { double d = di<I, N>(a, b, x); double aiji = aij<I, I, I+1, N>(a); x[I] = d / aiji; x_for<I+(I<N), N>(a, b, x); } } template <int N> inline static void solve( const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b, std::array<double, N>& x) { x_for<0, N>(a, b, x); } } } int main() { auto sum_x = std::array<double, 5> {0., 0., 0., 0., 0.}; for(auto i = 0u; i < 1000 * 1000; ++i){ auto v_a = std::array<std::array<volatile double, 5>, 5>{{ { 6.80, -6.05, -0.45, 8.32, -9.67}, {-2.11, -3.30, 2.58, 2.71, -5.14}, { 5.66, 5.36, -2.70, 4.35, -7.26}, { 5.97, -4.44, 0.27, -7.17, 6.08}, { 8.23, 1.08, 9.04, 2.14, -6.87} }}; auto v_b = std::array<volatile double, 5> {4.02, 6.19, -8.22, -7.57, -3.03}; const auto a = std::array<std::array<double, 5>, 5>{{ {v_a[0][0], v_a[0][1], v_a[0][2], v_a[0][3], v_a[0][4]}, {v_a[1][0], v_a[1][1], v_a[1][2], v_a[1][3], v_a[1][4]}, {v_a[2][0], v_a[2][1], v_a[2][2], v_a[2][3], v_a[2][4]}, {v_a[3][0], v_a[3][1], v_a[3][2], v_a[3][3], v_a[3][4]}, {v_a[4][0], v_a[4][1], v_a[4][2], v_a[4][3], v_a[4][4]} }}; const auto b = std::array<double, 5> {v_b[0], v_b[1], v_b[2], v_b[3], v_b[4]}; auto x = std::array<double, 5> {}; linear_equations::semi_static::solve<5>(a, b, x); for(auto j = 0u; j < 5; ++j) sum_x[j] += x[j]; } std::cout << sum_x[0] << ' ' << sum_x[1] << ' ' << sum_x[2] << ' ' << sum_x[3] << ' ' << sum_x[4] << '

'; return 0; }

Unlike the C solution, this is completely scalable. It doesn't require macros or case misuse. The downside is, it runs twice as long when being built with inlines off.

Two things to consider

1. In C++ inline specifier doesn't necessary specify inlining. It's up to the compiler and the build parameters to decide. In our case we know we will eventually benefit from it, so it might be a good idea to demand inlining explicitly. GCC has a special attribute __attribute__((always_inline)) for it.

Generally it is ok to let the compiler decide, as excessive inlining may be bad for instruction cache usage, but it all depends on the code and the hardware.

2. The code we will get after the inlining will mostly consist of arithmetic operations. There will be little memory access, no branching and no stack operations. This type of code may benefit drastically from superscalarization. Modern processors have the capabilities of processing several values as one big superscalar, and the more recent the model, the larger the superscalars are.

The compiler always has some target platform in mind. Usually, the default one is a little conservative to make sure your binary will run on other machines. This means that unless you select it explicitly, the compiler would only use the benefits of 10 years old processor.

For GCC you can select the target architecture with -march option. My choice for the evening will be -march=native since the only purpose of the binary it will build is to run on my machine a couple of times.

Measurements

I want to measure several solutions. The one with inlines prohibited by the compiler option, and the same one rebuilt for the native architecture. Then a pair with forced inlines both with default and native architectures. Then the solution with a triangular matrix to make sure we still may benefit from partial constantness with this approach. These numbers then should be compared to the LAPACK and the recursive solutions from the C piece.

The numbers will seem more precise than before, but let it not deceive you, we still have quite an error in our measurements. It's just that the difference between the most and the least effective solutions is so great, we can't really put the numbers on the same scale. The whole runtime of the most efficient version fits in the error of the least efficient one.

Anyway, the numbers are as follows.

No inlines, default architecture. Runtime: 4400 ms, assembly size 123 KB. No inlines, native architecture. Runtime: 4000 ms, assembly size 123 KB. Forced inlines, default architecture. Runtime: 120 ms, assembly size 10.2 KB. Forced inlines, native architecture. Runtime: 62 ms, assembly size 9.7 KB. Forced inlines, native architecture, triangular matrix. Runtime: 56 ms, assembly size 8.1 KB. LAPACK from before: Runtime 650 ms. Recursive solution from before: Runtime 2100 ms.

I should point out that the whole point of the experiment is not to show that LAPACK is somehow bad or obsolete. It is to show that knowing how the compiler works, you can make it make the most efficient code for you.

On 5 equations LAPACK's LU decomposition roughly matches the Cramer's rule in complexity, but on higher numbers the complexity of the latter will get overwhelming really fast. You can make the compiler make you an unrolled UV decomposition as well, but then the pivoting will get in a way, so you wouldn't avoid branching. Long story short, if you want linear algebra, you should better stick with well established libraries. Such as LAPACK or Eigen.

But if you want performance, you should know your compiler.

Summary

The “unroll recursion” trick may be extremely effective in the right context. In the wrong context it may be as well harmful. We can make the compiler produce code 30 times more effective that the original recursive solution, but we may just as easy lose 2 times if the compiler fails to unroll recursive functions.

Superscalarization does wonders for the unrolled code. Without inlines there is only 10% speedup, but for the fully unrolled code it is a fair 2x. But even with that 10% gain it is still something worth consideration. If your code runs in a cloud or on your own server, or in some other form of predictable environment, you might consider rebuilding it for the native architecture for a cheap speedup.

The unrolled code is a subject of extremely efficient constant propagation. The computation on triangular matrix was another 10% speedup for no cost at all. You too can benefit from specializing the code to do your specific task without any additional effort.

All the measurements conducted on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, the code is compiled with g++ 5.4.0 -std=c++11 -O2. The source code for the benchmark is available on Github.