Outperforming everything with anything

Python? Sure, why not?

This continues Outperforming LAPACK with C metaprogramming, and Vastly outperforming LAPACK with C++ metaprogramming. These two posts describe language tricks to make compilers generate highly-performant code for you. But do you need compilers at all?

The alternative would be writing programs in plain assembly. Although the horrors of it are vastly exaggerated, this approach has two major flaws.

Assembly code is not portable. Although it became much easier with modern tools, assembly programming still requires a lot of tedious work.

Thankfully, we all live in the XXI century, and both of these problems have already been addressed.

The solution to the first one would be LLVM. Initially, it meant “Low Level Virtual Machine” and that is exactly what we want to ensure portability. Simply put, it takes some code written in very low-level hardware agnostic language and returns some highly optimized native code for the specific hardware platform. With LLVM we have both the power of low-level programming and the automation of hardware-oriented micro-optimizations.

And the solution to the second problem is any “scripting” language you want. Scheme, Python, Perl, even bash or AWK will do. They are all meant to eliminate tedious work. Everything you use daily for automation you can use to generate highly-performant code.

The plan

Let's repeat the same experiment I did with C and C++ in the posts from before. Let's generate a fully inlined fully unrolled solution and embed it into the benchmarking code. The plan then is as follows.

Use Clang to generate LLVM intermediate code for the benchmark that is supposed to measure yet non-existent function named solve_5 Make Python generate a linear solver code in LLVM. Hijack a benchmark with a Python script replacing solve_5 call with the generated solver. Use the LLVM static compiler to turn the intermediate code into the machine code. Use the GNU assembler and the Clang's linker to make the machine code into an executable binary.

That's how it looks in a Makefile:

all: generate_llvm generate_s generate_elf test .PHONY: generate_llvm generate_llvm: clang benchmark.c -march=native -S -emit-llvm -o benchmark.bc python substitute_solver_call.py benchmark.bc .PHONY: generate_s generate_s: llc -O2 benchmark.bc -o benchmark.s .PHONY: generate_elf generate_elf: as benchmark.s -o benchmark.o clang -o benchmark benchmark.o

The Python part

We need a linear solver in Python just like we had with C and C++. Here it is:

# this generates n-solver in LLVM code with LLVMCode objects. # No LLVM stuff yet, just completely Pythonic solution def solve_linear_system(a_array, b_array, x_array, n_value): def a(i, j, n): if n == n_value: return a_array[i * n_value + j] return a(i, j, n+1)*a(n, n, n+1) - a(i, n, n+1)*a(n, j, n+1) def b(i, n): if n == n_value: return b_array[i] return a(n, n, n+1)*b(i, n+1) - a(i, n, n+1)*b(n, n+1) def x(i): d = b(i,i+1) for j in range(i): d -= a(i, j, i+1) * x_array[j] return d / a(i, i, i+1) for k in range(n_value): x_array[k] = x(k) return x_array

When we run this on numbers, we have numbers. But we want the code. So let's make an object that pretends to be numbers only to spy on the algorithm. The object that writes down every operation the algorithm wants it to perform in ready to assemble LLVM intermediate language.

# this is basically the whole LLVM layer I = 0 STACK = [] class LLVMCode: # the only constructor for now is by double* instruction def __init__(self, io, code = ''): self.io = io self.code = code def __getitem__(self, i): global I, STACK copy_code = "%" + str(I+1) copy_code += " = getelementptr inbounds double, double* " copy_code += self.io +", i64 " + str(i) + "

" copy_code += "%" + str(I+2) copy_code += " = load double, double* %" + str(I+1) copy_code += ", align 8

" I += 2 STACK += [I] return LLVMCode(self.io, copy_code) def __setitem__(self, i, other_llvcode): global I, STACK self.code += other_llvcode.code self.code += "%" + str(I+1) self.code += " = getelementptr inbounds double, double* " self.code += self.io +", i64 " + str(i) + "

" self.code += "store double %" + str(I) self.code += ", double* %" + str(I+1) + ", align 8

" I += 1 STACK = STACK[:-1] return self def general_arithmetics(self, operator, other_llvcode): global I, STACK self.code += other_llvcode.code; self.code += "%" + str(I+1) + " = f" + operator self.code += " double %" + str(STACK[-2]) + ", %" self.code += str(STACK[-1]) + "

"; I += 1 STACK = STACK[:-2] + [I] return self def __add__(self, other_llvcode): return self.general_arithmetics('add', other_llvcode) def __sub__(self, other_llvcode): return self.general_arithmetics('sub', other_llvcode) def __mul__(self, other_llvcode): return self.general_arithmetics('mul', other_llvcode) def __div__(self, other_llvcode): return self.general_arithmetics('div', other_llvcode)

Now when we run the solver with this kind of objects, we get a fully functional linear solver written in LLVM intermediate language. Then we put it into the benchmark code to see how fast it is.

Instruction in LLVM are numbered and we want to preserve this enumeration, so the function that inserts our code into our benchmark is not trivial. But it's not very complicated either.

# this replaces the function call # and updates all the instructions' indices def replace_call(text, line, params): global I, STACK # '%12 ' -> 12 I = int(''.join( [xi for xi in params[2] if xi.isdigit()] )) first_instruction_to_replace = I + 1 STACK = [] replacement = solve_linear_system( LLVMCode(params[0]), LLVMCode(params[1]), LLVMCode(params[2]), 5).code delta_instruction = I - first_instruction_to_replace + 1 for i in xrange(first_instruction_to_replace, sys.maxint): not_found = sum( [text.find('%' + str(i) + c) == -1 for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH] ) if not_found == 4: # the last instruction has already been substituted break new_i = i + delta_instruction for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH: # substitute instruction number text = text.replace('%' + str(i) + c, '%' + str(new_i) + c) return text.replace(line, replacement)

The whole piece of code that implements the solver, provides the Python-to-LLVM layer, and does the code insertion is only 100 lines long! You can see it in one piece on GitHub.

The benchmark

The benchmark itself is in C. When we run the Makefile, its call for solve_5 gets replaced by the LLVM code generated from Python by Python.

Step 1. Benchmark C source code #include "stdio.h" extern void solve_5(double* a, double* b, double* x); // fake int main() { double sum_x[5] = {0., 0., 0., 0., 0.}; for(int i = 0; i < 1000000; ++i) { double a[5*5] = { 6.80, -6.05, -0.45, 8.32, -9.67, -2.11, -3.30, 2.58, 2.71, -5.14, 5.66, 5.36, -2.70, 4.35, -7.26, 5.97, -4.44, 0.27, -7.17, 6.08, 8.23, 1.08, 9.04, 2.14, -6.87 }; double b[5] = { 4.02, 6.19, -8.22, -7.57, -3.03, }; double x[5]; solve_5(a, b, x); // this will get replaced later for(int j = 0; j < 5; ++j){ sum_x[j] += x[j]; } } printf("%f, %f, %f, %f, %f

", sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]); return 0; } #include "stdio.h"// fake int main() { double sum_x[5] = {0., 0., 0., 0., 0.}; for(int i = 0; i < 1000000; ++i) { double a[5*5] = { 6.80, -6.05, -0.45, 8.32, -9.67, -2.11, -3.30, 2.58, 2.71, -5.14, 5.66, 5.36, -2.70, 4.35, -7.26, 5.97, -4.44, 0.27, -7.17, 6.08, 8.23, 1.08, 9.04, 2.14, -6.87 }; double b[5] = { 4.02, 6.19, -8.22, -7.57, -3.03, }; double x[5];// this will get replaced later for(int j = 0; j < 5; ++j){ sum_x[j] += x[j]; } } printf("%f, %f, %f, %f, %f

", sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]); return 0; } Step 2. LLVM assembly language ... 27 lines before... ; <label>:6 ; preds = %3 %7 = bitcast [25 x double]* %a to i8* call void @llvm.memcpy.p0i8.p0i8.i32 (i8* %7, i8* bitcast ([25 x double]* @main.a to i8*), i32 200, i32 8, i1 false) %8 = bitcast [5 x double]* %b to i8* call void @llvm.memset.p0i8.i32(i8* %8, i8 0, i32 40, i32 8, i1 false) %9 = bitcast i8* %8 to [5 x double]* %10 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 0 store double 4.020000e+00, double* %10 %11 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 1 store double 6.190000e+00, double* %11 %12 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 2 store double -8.220000e+00, double* %12 %13 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 3 store double -7.570000e+00, double* %13 %14 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 4 store double -3.030000e+00, double* %14 %15 = getelementptr inbounds [25 x double], [25 x double]* %a, i32 0, i32 0 %16 = getelementptr inbounds [5 x double], [5 x double]* %b, i32 0, i32 0 %17 = getelementptr inbounds [5 x double], [5 x double]* %x, i32 0, i32 0 call void @solve_5(double* %15, double* %16, double* %17) ; to replace store i32 0, i32* %j, align 4 br label %18 ; to update ; <label>:18 ; preds = %29, %6 %19 = load i32, i32* %j, align 4 %20 = icmp slt i32 %19, 5 br i1 %20, label %21, label %32 ... 58 lines after... ... 27 lines before...... 58 lines after... Step 3. LLVM after the call replacement ... 44 lines before ... %15 = getelementptr inbounds [25 x double], [25 x double]* %a, i32 0, i32 0 %16 = getelementptr inbounds [5 x double], [5 x double]* %b, i32 0, i32 0 %17 = getelementptr inbounds [5 x double], [5 x double]* %x, i32 0, i32 0 %18 = getelementptr inbounds double, double* %15, i64 6 ; generated %19 = load double, double* %18, align 8 ; by %20 = getelementptr inbounds double, double* %15, i64 24 ; the %21 = load double, double* %20, align 8 ; Python %22 = fmul double %19, %21 ; script ... 2407 more lines of auto-generated code ... %2422 = getelementptr inbounds double, double* %17, i64 3 %2423 = load double, double* %2422, align 8 %2424 = fmul double %2421, %2423 %2425 = fsub double %2419, %2424 %2426 = getelementptr inbounds double, double* %15, i64 24 %2427 = load double, double* %2426, align 8 %2428 = fdiv double %2425, %2427 %2429 = getelementptr inbounds double, double* %17, i64 4 store double %2428, double* %2429, align 8 store i32 0, i32* %j, align 4 br label %2430 ; instructions updated by the same script ; :18 ; preds = %2441, %6 %2431 = load i32, i32* %j, align 4 %2432 = icmp slt i32 %2431, 5 br i1 %2432, label %2433, label %2444 ... still 58 lines after... ... 44 lines before ...... 2407 more lines of auto-generated code ...... still 58 lines after... Step 4. Native optimized assembly ... 139 lines of assembly ... vmovsd 352(%esp), %xmm0 # xmm0 = mem[0],zero vmulsd 256(%esp), %xmm2, %xmm6 vmovsd .LCPI0_0, %xmm4 # xmm4 = mem[0],zero vmulsd %xmm4, %xmm0, %xmm7 vsubsd %xmm7, %xmm6, %xmm6 vmulsd 128(%esp), %xmm0, %xmm5 # 8-byte Folded Reload vmovapd %xmm0, %xmm3 vmulsd 344(%esp), %xmm2, %xmm7 vsubsd %xmm5, %xmm7, %xmm1 vmovsd %xmm1, 128(%esp) # 8-byte Spill vmulsd 280(%esp), %xmm2, %xmm7 vmulsd 192(%esp), %xmm4, %xmm5 # 8-byte Folded Reload vsubsd %xmm5, %xmm7, %xmm7 vmovsd %xmm7, 120(%esp) # 8-byte Spill vmovsd 104(%esp), %xmm0 # 8-byte Reload vmulsd %xmm6, %xmm0, %xmm5 vmulsd %xmm7, %xmm1, %xmm6 vsubsd %xmm6, %xmm5, %xmm5 vmovsd %xmm5, 64(%esp) # 8-byte Spill vmovapd %xmm3, %xmm7 vmulsd 88(%esp), %xmm7, %xmm3 # 8-byte Folded Reload vmulsd 336(%esp), %xmm2, %xmm6 vsubsd %xmm3, %xmm6, %xmm3 vmulsd %xmm0, %xmm3, %xmm3 vmulsd 80(%esp), %xmm1, %xmm6 # 8-byte Folded Reload vsubsd %xmm6, %xmm3, %xmm1 vmovsd %xmm1, 88(%esp) # 8-byte Spill ... only 400 more lines of assembly ... ... 139 lines of assembly ... vmovsd 352(%esp), %xmm0 # xmm0 = mem[0],zero vmulsd 256(%esp), %xmm2, %xmm6 vmovsd .LCPI0_0, %xmm4 # xmm4 = mem[0],zero vmulsd %xmm4, %xmm0, %xmm7 vsubsd %xmm7, %xmm6, %xmm6 vmulsd 128(%esp), %xmm0, %xmm5 # 8-byte Folded Reload vmovapd %xmm0, %xmm3 vmulsd 344(%esp), %xmm2, %xmm7 vsubsd %xmm5, %xmm7, %xmm1 vmovsd %xmm1, 128(%esp) # 8-byte Spill vmulsd 280(%esp), %xmm2, %xmm7 vmulsd 192(%esp), %xmm4, %xmm5 # 8-byte Folded Reload vsubsd %xmm5, %xmm7, %xmm7 vmovsd %xmm7, 120(%esp) # 8-byte Spill vmovsd 104(%esp), %xmm0 # 8-byte Reload vmulsd %xmm6, %xmm0, %xmm5 vmulsd %xmm7, %xmm1, %xmm6 vsubsd %xmm6, %xmm5, %xmm5 vmovsd %xmm5, 64(%esp) # 8-byte Spill vmovapd %xmm3, %xmm7 vmulsd 88(%esp), %xmm7, %xmm3 # 8-byte Folded Reload vmulsd 336(%esp), %xmm2, %xmm6 vsubsd %xmm3, %xmm6, %xmm3 vmulsd %xmm0, %xmm3, %xmm3 vmulsd 80(%esp), %xmm1, %xmm6 # 8-byte Folded Reload vsubsd %xmm6, %xmm3, %xmm1 vmovsd %xmm1, 88(%esp) # 8-byte Spill ... only 400 more lines of assembly ...

Next step

The most noteworthy is how the ultra-verbose intermediate code generated by the Python script turns into some very compact and very effective code for the hardware. It is highly super-scalarized too. But is it good enough to compete with C and C++ solutions?

Here are some approximate numbers for the three cases: C with tricks, C++ with a trick, and Python with LLVM.

The tricks for C don't quite work for Clang, so GCC version is measured. It runs for roughly 70 ms on average. The C++ version was built with Clang and it runs for 60 ms. The Python version, the one described here, runs only for 55 ms.

Of course, this speed-up is not something you would kill for. But it shows that you can write programs in Python that outperform their counterparts written in C or C++. You don't have to learn some special language to create high-performant applications or libraries.

Conclusion

I think that the dichotomy between fast compiling languages and slow scripting languages is a bluff. Native code generation may very well be not a core feature but something like a pluggable option. Like a library or an embedded language. Like Numba for Python or Terra for Lua. Think about the benefits: you can do your research and rapid prototyping in one language, and then generate highly-performant code... in the same language.

High-performance computing has no reason to remain a privilege of compiling languages. A compiler is just a soft machine for code generation. You can generate code in any language you want. I believe you can teach Matlab to generate ultra-fast LLVM code if you want to.

All the measurements conducted on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, the code is compiled with clang version 3.8.0-2ubuntu4 and g++ 5.4.0 with -march=native -O2. The source code for the benchmark is available on Github.