How is it even possible? Meet the LLVM

Basically the high performance exhibited by Julia is made possible by the LLVM project. LLVM is a technology suite that can be used for building optimizing compilers, and is notably used in Clang, a C and C++ compiler. Clang is like any other C compiler such as GCC, icc, Turbo C or MSVC. All of these can exhibit great performance, provided all suitable and architecture-specific optimizations are performed during the production of the final machine-level code. And the same is true for any other comparable languages like Fortran, Pascal, Basic, Ada and Algol 68.

What makes LLVM peculiar is that while most compilers are only built with a single language implementation in mind, LLVM was created to have a flexible

LLVM has a badass logo

front-end. Adapting a regular compiler to make use of its low-level optimization and code generation capabilities to implement new language features, or potentially a completely different language, can be quite difficult. In fact, I cannot really cite a single example, it’s easier to see languages that output C language as an intermediate representation (e.g. Cython).

The LLVM changes that by offering a reusable back-end for creating compilers of potentially very different languages. Clang is just a different front-end for LLVM. This really changed the programming language landscape after 2003. And there was a demand. You can see there were many attempts to create new compilers or interpreters based on the LLVM for languages created before that time, Python and Lua being examples. But these after-thought ports don’t seem to have gained much traction, although Numba seems reasonably successful. The story is different with Swift and Rust, two other new languages that show the power of LLVM apart from Clang and Julia.

LLVM enables language designers in a way similar to what happened to the JVM in the past decade, supporting the creation of many new languages different from Java such as Scala, Clojure and more recently Kotlin, not to mention the same attempts to re-implement 90s languages (e.g. JRuby, Jython). There are important differences, though: the JVM is a lot more opinionated, and the LLVM is not even a virtual machine. The similarity lies mostly in the fact that a sort of intermediate representation (IR) is created in both cases, over which the compiler can perform all the optimizations we desire, either in just-in-time (JIT) or ahead-of-time (AOT) fashion.

Julia, created in 2012, is perhaps the first language designed with numerical processing in mind conceived in a post-LLVM world, and that is the reason it can be so good. Unlike Rust or Swift, or Python for that matter, it was conceived with numerical applications in mind. And Julia is also unlike Python, Matlab, R or anything from the 90s when most tools for optimizing low-level and IR code were trapped inside silo-projects, difficult to reuse. LLVM changed that, and we are now reaping the benefits.

My benchmark proposal

Let’s start with a word about the experiment problem I picked for this benchmark. When going for a test like this some people, maybe under the pressure of time or to make their code generally accessible, often end up trying over-simple things like the Fibonacci sequence. Or even a silly “Hello world”. These are not a very natural tests for numerical processing benchmarks in my opinion, and will tell you very little about the languages you are testing.

Of course the ideal benchmark is just your actual application, but a more proper general numerical benchmark for me is something more complicated like LULESH, or whatever. At the very least you should have some large numerical expressions. Really a lot of non-trivial arithmetic. This is what is going to challenge a Numpy vectorized approach, for instance, because it

conflicts with the technique of storing lots of intermediate results in memory instead of doing a single pass through the input data. You should also put in some nested function calls to test for function inlining. These are the two main things I had in mind here.

One way to create such a complex expression is to just pick up an approximation of a transcendental function to evaluate. Good old exp(x). This is just what I did, I went all the way through the code of math.h on glibc, and picked up a piece of code used to calculate the exponential function. The code below includes the expression, and is the full program I ran for my benchmark in C.

The code in Julia reads much more like a “script”, with very few keywords or type annotations, even though they can be strictly enforced. It also includes nice handy things like the @elapsed macro for measuring the time for the benchmark.

The Python code relied on the ipython magic %time for the time measurement. I just didn’t find anything else to be as good, but it is clearly bad not being something from the real language… It kind of illustrates Python’s limitations here. The big expression can take a single float, but also works with Numpy arrays, much in the spirit of how you are supposed to use that library. Notice the Numba version is pretty much the same from this, but it has a lot of @jit decorators everywhere.

And finally we have the Scala code. Notice the clear use of map and reduce at lines 40 and 43! A thing of beauty.

To recap, what we are doing is calculating an approximation of the exponential function, then comparing it to the standard library exp, for many numbers between 0.4 and 1.0, where this approximation is valid, and then keeping the largest absolute error. We test from as few as 10³ points all the way to 10⁹. The resulting error in practice is something negligible, around 1e-14.

Results

Now let’s just jump into what you’re all really interested in seeing: the benchmark results! This graph shows the speed (quantity-per-time of input points calculated) for different input lengths and language settings. Here we just want to illustrate how weird cases compare to the core of results with reasonable performance, and also how some methods cannot actually cope with large inputs because at some point they produced large concrete arrays storing intermediate values unnecessarily. These are the ones that stop at 10⁷, except for the pure Python one where I just stopped because I was tired of waiting.

Benchmark performance for different languages and input size. In some cases, the subject could not even handle a very large input. Native Python shows a drastically sluggish performance at the bottom and a buggy C code that did not really calculate anything at the very top.

Now this bar chart finally shows a direct comparison of the cases that matter the most. We can see that Julia really attained a performance on par with C, with a small, daresay negligible three-percent advantage to the Jurassic contender.