Sometimes people ask: why does Julia need to be a new language? What about Julia is truly different from tools like Cython and Numba? The purpose of this blog post is to describe how Julia's design gives a very different package development experience than something like Cython, and how that can lead to many more optimizations. What I really want to show is:

Julia's compilation setup is built for specialization of labor which is required for scientific progress Composition of Julia codes can utilize the compilation process to build new programs which are greater than the sum of the parts

I will also explain some of the engineering tradeoffs which were made to make this happen. There are many state-of-the-art scientific computing and data science packages in Julia and what I want to describe is how these are using the more "hardcore language-level features" to get there.

Point 1: Any sufficiently competent IR generator will hit the optimization limit on small examples, but that doesn't necessarily generalize to full package solutions

You will find benchmarks on small scripts (microbenchmarks) where Numba and Cython do as well as Julia. Can these tools optimize as well as Julia? I will give you even more than that. I will go as far as saying that any sufficient competent IR generating mechanism will be efficient on a single function (or in Julia terminology, method). It's actually not that difficult of a problem in 2018. If you build an LLVM IR with concrete types then LLVM will compile it well, and pretty much any decent representation of that IR will get you to around the same spot. Cython, Numba, etc. all will work just as well as Julia if you have a single function with known input types and all of that because there's just a limit to how optimal you can make your computations and most of the optimizations are just standard LLVM passes on the IR. It's not hard to get a loop written in terms of simple basics (well okay, it is quite hard but I mean "not hard" as in "give enough programmers a bunch of time and they'll get it right"). Cool, we're all the same. And microbenchmark comparisons between Cython/Numba/Julia will point out 5% gains here and 2% losses here and try to extrapolate to how that means entire package ecosystems will collapse into chaos, when in reality most of this is likely due to different compiler versions and go away as the compilers themselves update. So let's not waste time on these single functions.

Let's talk about large code bases. Scaling code to real applications is what matters. Here's the issue: LLVM cannot optimize a Python interpreter which sits in the middle between two of your optimized function calls, and this can HURT. Take a look at this example which introduces the DifferentialEquations.jl bindings in Python and R. In it we show how JIT compiling a function with Numba only moderately helps the ODE solver (i.e. 50% performance gain).

import numpy as np from scipy. integrate import odeint import timeit import numba def f ( u , t , sigma , rho , beta ) : x , y , z = u return [ sigma * ( y - x ) , x * ( rho - z ) - y , x * y - beta * z ] u0 = [ 1.0 , 0.0 , 0.0 ] tspan = ( 0 . , 100 . ) t = np. linspace ( 0 , 100 , 1001 ) sol = odeint ( f , u0 , t , args = ( 10.0 , 28.0 , 8 / 3 ) ) def time_func ( ) : odeint ( f , u0 , t , args = ( 10.0 , 28.0 , 8 / 3 ) , rtol = 1e-8 , atol = 1e-8 ) time_func ( ) timeit . Timer ( time_func ) . timeit ( number = 100 ) / 100 # 0.07502969299999997 seconds numba_f = numba. jit ( f , nopython = True ) def time_func ( ) : odeint ( numba_f , u0 , t , args = ( 10.0 , 28.0 , 8 / 3 ) , rtol = 1e-8 , atol = 1e-8 ) time_func ( ) timeit . Timer ( time_func ) . timeit ( number = 100 ) / 100 # 0.04437951900000001 seconds

But if you allow the function to be a Julia code it can optimize it a lot further:

from julia import Main from diffeqpy import de import numpy as np jul_f = Main. eval ( """ function f(du,u,p,t) x, y, z = u[1], u[2], u[3] sigma, rho, beta = p[1], p[2], p[3] du[1] = sigma * (y - x) du[2] = x * (rho - z) - y du[3] = x * y - beta * z end""" ) u0 = [ 1.0 , 0.0 , 0.0 ] tspan = ( 0 . , 100 . ) t = np. linspace ( 0 , 100 , 1001 ) p = [ 10.0 , 28.0 , 8 / 3 ] prob = de. ODEProblem ( jul_f , u0 , tspan , p ) sol = de. solve ( prob , saveat = t , abstol = 1e-8 , reltol = 1e-8 ) def time_func ( ) : sol = de. solve ( prob , saveat = t , abstol = 1e-8 , reltol = 1e-8 ) time_func ( ) timeit . Timer ( time_func ) . timeit ( number = 100 ) / 100 # 0.0033111710000000016 seconds

with the "Julia called from Python" solution which is about 13x faster than the SciPy+Numba code, which was really just Fortran+Numba vs a full Julia solution. The main issue is that Fortran+Numba still has Python context switches in there because the two pieces were independently compiled and it's this which becomes the remaining bottleneck that cannot be erased. And it's clear why this fact will not be of influence in simple one-function microbenchmarks but it is an extremely important difference when trying to build an optimized full ecosystem of scientific computing tools. I don't care about these little 5% compiler version differences when there's a 13x difference the moment I throw a real problem at it, and this is the kind of technical issue which you see requires a completely different architectural structure to fully solve.

To further showcase the difference, let's now solve the differential equation in pure Julia:

using DifferentialEquations, BenchmarkTools function lorenz ! ( du,u,p,t ) du [ 1 ] = p [ 1 ] * ( u [ 2 ] -u [ 1 ] ) du [ 2 ] = u [ 1 ] * ( p [ 2 ] -u [ 3 ] ) - u [ 2 ] du [ 3 ] = u [ 1 ] * u [ 2 ] - p [ 3 ] * u [ 3 ] end u0 = [ 1.0 ; 0.0 ; 0.0 ] p = [ 10.0 , 28.0 , 8 / 3 ] tspan = ( 0.0 , 100.0 ) prob = ODEProblem ( lorenz ! ,u0,tspan,p ) @btime solve ( prob,saveat= 0.1 ,reltol=1e-8,abstol=1e-8 ) # 2.467 ms (13436 allocations: 1.00 MiB) @btime solve ( prob,Tsit5 ( ) ,saveat= 0.1 ,reltol=1e-8,abstol=1e-8 ) # 2.904 ms (1081 allocations: 155.70 KiB)

That's now close to 2.5 miliseconds, or about 18x faster than SciPy+Numba. That's just with the simplest way of solving the equations! Now if we follow the code optimization tutorial we get:

using StaticArrays function lorenz_static ( u,p,t ) @inbounds begin dx = p [ 1 ] * ( u [ 2 ] -u [ 1 ] ) dy = u [ 1 ] * ( p [ 2 ] -u [ 3 ] ) - u [ 2 ] dz = u [ 1 ] * u [ 2 ] - p [ 3 ] * u [ 3 ] end @SVector [ dx,dy,dz ] end u0 = @SVector [ 1.0 , 0.0 , 0.0 ] p = @SVector [ 10.0 , 28.0 , 8 / 3 ] tspan = ( 0.0 , 100.0 ) prob = ODEProblem ( lorenz_static,u0,tspan,p ) @btime solve ( prob,saveat= 0.1 ,reltol=1e-8,abstol=1e-8 ) # 2.102 ms (13714 allocations: 12.97 MiB) @btime solve ( prob,Tsit5 ( ) ,saveat= 0.1 ,reltol=1e-8,abstol=1e-8 ) # 2.010 ms (56 allocations: 60.19 KiB)

That's right: in a real application we see the Julia solution 22x faster than SciPy+Numba. Sure, when things are just a straight loop, Numba is okay. But when packages and real applications are timed, there's a not so subtle difference. Please go run it on your computer and see what you get. You can just copy and paste it into the REPL and see for yourself. As a computational scientist, this matters a lot more to me than than a microbenchmark.

More Cross Language ODE Solver Benchmarks

For more in-depth cross language ODE solver benchmarks which compare speed and accuracy together, see the ODE Solver Multi-Language Wrapper Package Work-Precision Benchmarks (MATLAB, SciPy, Julia, deSolve (R)).

Can you avoid these performance issues in Cython?

You can work past this with Cython. You can design the entire package yourself as one monolithic code base. You write the whole thing in Cython and don't use person X's C++ nonlinear solver library or person Y's Numba nonlinear optimization tool and don't use person Z's CUDA kernel because you cannot optimize them together, oh and you don't use person W's Cython code without modification because you needed your Cython compilation to be aware of the existence of their Cython-able object before you do the compilation.

The problem is that monolithic architectures become maintenance nightmares which decrease programming productivity since you're having to repeat the coding of many algorithms which have already been done. But there's an even larger issue in the context of scientific computing: it is the intersection of high performance computational utilities with complex mathematical algorithms that gives you strong performance. Write a modern EPIRK ODE integrator with Krylov exponential approximation (one of the state-of-the-art stiff ODE solvers with few implementations) in pure Python using objects to describe your scientific model and your problem will be bogged down due to the computational structures that are used. Write a fast and simple Runge-Kutta order 4 integrator in Cython and your simulation will be bogged down since the choice of mathematical algorithm is unoptimized and will require many more function calls than necessary. It's the intersection: good algorithm plus efficient data structures and compiled code, that produces efficient large-scale scientific software.

Monolithic programming structures are antithetical to the knowledge specialization that's required in higher level mathematics. It's simply impossible for someone to be an expert in all areas of numerical mathematics, let alone have the know-how and time to create optimized implementations of the newest algorithms from every discipline. In fact, there are very few people in each mathematical discipline that can even know the state-of-the-art in any detail! I still haven't seen a fixed-leading coefficient Nordsieck BDF formulation in Python at all so native Python is behind ecosystems like SUNDIALS algorithmically, and that's not even Cython, and that's not using GPUs, etc. Thinking you can quickly/productively write all of this yourself in one giant codebase is hubris.

Rewriting every single difficult algorithm from scratch in order to utilize the most modern methods with the most efficient structures is not a style of programming that scales well. This is 2018: we're past this. We have packages and package managers now, and we want the ability to have separate packages work fully together. This solution already exists: it's Julia.

Point 2: Julia uses fully-dependent compilation

Julia avoids this through its generic algorithms and dependent compilation: write the integrator once and recompile it to new data structures via multiple dispatch. Let's explain this approach in some detail. If you're new to the Julia-sphere and you haven't read this explanation of the Julia compilation process, you may want to do a little pre-reading.

The idea is that if you know all of the code in its original un-compiled form then you can see you have literals (constants) on the top and propagate them throughout this code as true constants at compile time. You'd have to have no dynamicness in the middle, no interpreter, etc. since otherwise you wouldn't be able to guarantee const-ness. You'd have to compile the entire package together. In order to propagate these constants and inline function calls into another package, you'd even need to compile different package calls together at the same. This is what Julia does and it's why it's able to do full interprodcedural optimization in a way that combines functions from different packages.

Take for example ForwardDiff.jl which can take arbitrary pure Julia code and do forward-mode automatic differentiation to it (even Julia's Base library). When using ForwardDiff.jl inside of a large package like DifferentialEquations.jl, it doesn't use a compiled version of ForwardDiff.jl operations inside of DifferentialEquations.jl, but instead it generates on-demand the full functions it needs to compile so that way the function's value and its derivative are computed simultaneously. While doing this it inlines the ForwardDiff.jl arithmetic operations (along with any small first class functions which were passed into the routines). Julia then compiles the whole call together. And it's not just packages: since Julia's Base library and its standard library are written in Julia as well, it takes all of the Base library code as well to build a single typed script and can compile all of this together to a fully optimized form. The result is you get a full LLVM IR where ideas like "dual numbers" are abstractions which can be completely eliminated, and from there LLVM passes like common subexpression eliminated (CSE) can reduce the repeated arithmetic.

I say "can" because Julia's compiler uses a cost model to decide what to keep separate as purely a function call, and then inserts the function call if it determines that the function call wouldn't significantly change the runtime. By using a function call, it can use a separately compiled version of the function in order to reduce the amount of compilation time. This separately compiled version though is still a type-dependent compiled form. Note though that by using `@inline` you can bump the weight in the inlining heuristic to almost force it to happen (i.e. copy/paste the code in and compile in full instead of compiling the separate function).

This highlights a key difference between the Cython approach and the Julia approach, and it highlights the tradeoff. In Cython you have separately compiled functions and packages, much like static compilation to shared libraries in C++, and then you put function calls between them. In a few cases where you compile parts together it can inline, but generally you have separate packages/modules/etc. compile separately. This cuts down on compile time and makes it easier to generate a static binary but adds runtime costs.

In contrast, with Julia you have fully dependent compilation. Packages which call other packages can take control of the full code before compilation and then choices have to be made at how to separate it in a meaningful way. Yes, this means that managing compilation times is much more difficult in Julia as seen by the use of inlining cost models and "no-specialization" catches and tricks. If you go through the latency tag you can see that core developers are finding ways to automatically reduce specializations without cutting runtimes. Also, this means that static compilation is much more difficult than in other languages though Julia developers have already made large headway into making it a reality. There are package-level examples of this as well. In this SO post I describe how an ODE solver call can take a 2-3 second compilation to compile a version of the entire integrator program specifically to your ODE model which can reduce the runtime cost by about 4x-5x (which really matters in parameter estimation!), and how we have developed high level ways to turn this off to give users a choice to remove this extra specialization.

You can see that once we have fully dependent compilation, we want language level features in order to fully utilize and optimize this process. This is my next point.

Point 3: Compile-time control, code generation, and optimization require language-level compilation control features

Once you have the dependent compilation process as a large feature, you need/want language level features to be able to control this process. This begs the question: other than behind-the-scenes optimizations that this can add, are there any extra optimizations that can be had by allowing programmers control during this dependent compilation process?

Being able to fully optimize dependently JIT compiled code depends on having strong language level tools and utilizes to control the compilation process. You cannot even discuss these optimization opportunities from a Cython/Numba-perspective because you have a "Sapir-Whorf hypothesis problem": Python does not have this dependent compilation setup so it doesn't have the language or language-level tools for handling behavior at this phase. So let's look at what you gain from dependent compilation and how Julia's language level features interact with it.

This is where things like macros come in handy. Macros and metaprogramming apply at compile-time, so before functions get compiled you can modify what function you will be compilation based on an expression. But another huge fact is that "function" doesn't even mean the same thing as in Python/Cython. In Julia, a function is a collection of methods and the method chosen to be used in the final compiled code is dependent on the input types. Since input types matter, you may want to specialize an algorithm for arrays of 64-bit floating point numbers separately from how you'd treat arrays of 32-bit floating point numbers. This is where Julia's parametric typing comes in: you can specify Vector{Float64} vs Vector{Float32} and dispatch to separate algorithms which are optimal in each of the cases. You can even put these ideas together with generated functions which are "functions" where you can programmatically build a function expression during the compilation stages depending on the input types that you see. In fact, since the entire function compilation is at your control, you can build tools like Cassette.jl which allows you to take control of anyone else's function and "overdub" it in the compilation process to change what it's doing.

So what optimizations can you do with these tools? First of all, since you can see the entire code of a function, you can use the dependent compilation to build alternative output functions at compile-time. A type-based dispatch approach utilizes a wrapper type and have it re-write the internal function calls using Julia's multiple dispatch. Once again, ForwardDiff.jl is a great example to point out. It uses a wrapper into Dual numbers and then on-demand compiles new versions of functions in a way that does automatic differentiation. For example, you can make an entire ODE solver be reconfigured at compile-time to be essentially two parallel ODE solvers which then computes the derivative simultaneous to the equation, with the parallel derivative part generated automatically by the compile-time controls. It's not even that you can, it's that it takes no more than 2 lines for a user to set it up. Actually, nobody had to write this functionality for it to exist in Julia since it's a result of the compilation process! Let me repeat that: nobody had to implement this ability to efficiently autodifferentiate through the ODE solvers: this all happens due to Julia's compilation process and ForwardDiff's function overloads. The result is that packages compose nicely and efficiently.

This is that interaction of efficient data structures and difficult algorithms. The ODE solvers in DifferentialEquations.jl are generically-typed algorithms written independently by differential equation solving experts and include many optimized versions of the newest methods. ForwardDiff.jl was created by autodiff experts. The combination, fast autodifferentiation through Runge-Kutta-Chebyshev, EPIRK, Nordsieck BDF, etc. integrators is something that no one could create and optimize on their own, but it's a combination that Julia can create and optimize! Autodifferentiation is much less work than numerical differentiation and is more accurate, so this is a good combination algorithm to have! Yes, in some cases like fully continuous ODEs you can improve upon this via forward/adjoint sensitivity analysis, but the code of this type-based compilation approach applies to ODEs/SDEs/DAEs/DDEs/SDAEs/mixed Gillespie + ODE/SDE, etc. and many of these cases the sensitivity analysis derivation has never been done and would be a bear to implement. And again, ForwardDiff.jl utilizes value-types (i.e. it's pointer-free unlike objects) for its Dual numbers and inlines the small function calls, so it builds a very efficient AD runtime in other libraries. Separation of labor without overhead leads to huge productivity and efficiency gains!

So that's a nice optimization which you may have missed because there's no code to look at and see how this combination works in full. It works as the free result of compilation controls: the ODE solvers written generically and the autodifferentiation library reconfiguring algorithms using their number type overloads.

Composition of Julia codes gives new free and efficiently implemented features!

This is far from the only example. While Python has an uncertainties package with a type that calculates uncertainties, you cannot make a Cython kernel with NumPy linear algebra and throw this into SciPy ODE solvers and get an a fully optimized function call with uncertainty propagation. You would need to (A) recompile your Cython code to take into account this object (possible, but not automatic and it won't automatically do this through all dependent packages even if they were Cython), (B) recompile the NumPy linear algebra kernels to use this object in their Fortran code (good luck), and (C) recompile the SciPy ODE solvers to utilize all of this type information internally to propagate it through all of the internal linear combinations. I have never seen cross-package type-dependent optimized auto-recompilation in Python, let alone one that crosses language barriers into the C/C++/Fortran code. But what about Julia? How much programming was needed to be done to solve this enormous problem? A user notified me in a Discourse post that it worked without having to do anything. Cool, that's literally infinitely less work! Oh, and once again Julia's dependent compilation process optimizes it.

I can keep going. Just see Cassette's recent video for a bunch of compile-time optimizations and context-dependent compilation to take control of other people's packages/code and recompile it to be distributed/gpu/etc.

Another interesting case of this is Julia's broadcast system. You probably think of "broadcast" as synonymous with "vectorization", but let me describe it in a very different way to highlight how it can be used to a much greater effect. Broadcast defines full expressions for "element-wise" generically using a lazy type-building system. It does this through a type promotion system plus function overloading. If a package uses broadcast for its element-wise kernels, this means that broadcast overloads allow you to essentially overdub these kernels. I describe in a fair bit of detail how this is used within the DifferentialEquations.jl architecture to allow for heterogeneous scientific models being solved on heterogeneous architecture (GPUs/multithreaded/distributed) to get specialized compilation for the DiffEq solvers. Notice though that this composition is greater than adding GPU functions to ODE solver calls. Instead, by overloading broadcast, the array type's implementation takes control of the internal loops of the ODE solver and reconfigures/recompiles them to be OpenCL/CUDA kernels on the GPU. Every internal operation is now GPUitized, not just ones from the user passed in functions. And nothing in the DiffEq code was written for GPUs to make this work: this is all context and compilation controls.

Again, in Cython you could write an ODE solver which directly utilizes the structure of your mathematical model and puts pieces on the GPU as necessary, but this is far different than having an efficient combination built automatically from generic algorithms by the dependent compilation of different interacting parts of existing packages!

If you never thought about doing this, if you always believed you had to write code to build a software to solve your problem, then this is very Sapir-Whorf. While I will agree with you that these tools may not be what the vast majority of Julia users are using, this is the Julia that core developers of the base language and its package ecosystem are using to build the tools which are unrivaled in Python/Cython/Numba.

So what is the identity of Julia?

But Julia showcases itself as a simple language for R/Python/MATLAB users, and broadcast is described as a tool for vectorization?! Yes, this is because using any of these compilation control features is optional. I would probably say that most Julia programmers are not using it in full, and that's fine. Even if the end user of your package doesn't make use of all of these tools, someone else's package can still optimize over your code if you have written Julia code. Remember, since we can dependent compile entire function calls, if all of the code is in Julia then we can do all of our powerful stuff like AD through your code even if you don't know how these compilation controls work. This accumulation effect is in some sense like how as Google generates more data it gets more accurate predictions. In Julia, as the package ecosystem replaces ccall/PyCall packages with native Julia packages, the amount of compilation control and the available optimizations on larger scientific projects grows. So Julia does well by presenting a simple form of the language to users who want a "scripting language but faster", i.e. it looks like it's a Cython thing, to get everyone writing pure Julia code.

Julia is really not a simple language if you take the time to utilize all of the language's features together. I would instead think about it like this. The FFTW package is well-known as the fastest open-source fast Fourier Transform library out there. It does this not by writing an FFT code in C, but with OCaml code which generates C code based on aspects of the problem you are trying to FFT. Because of the overwhelming success of code generation for the purpose of optimizing code, you can ask the question: what if we built a scripting language that is designed so we can do these kinds of things on a user's scientific computing code? I don't think it's a surprise that the author of FFTW is now one of the core contributors of Julia. Regardless of first intentions or how Julia has been marketed, Julia has become a language of generic programming and compile-time optimization to its hardcore users and there is an entire rabbit hole of dynamic compilation control features to explore.

Conclusion

What Julia offers is different because of a full language level solution, which has its own tradeoffs that the Julia developers are working on. Julia as a language has parametric typing to make it multiple dispatch mechanism more powerful and easier to use because controlling compilation through different type structures is a way to hijack downstream/upstream code/packages in order to make the code more optimized as a whole. The language-level features like macros, generated functions, and the newer tools like Cassette.jl (which needed compiler changes to work in full) allow you to utilize the entire code and do modifications dynamically in order to take CPU code and optimize it or throw parts out to GPUs/TPUs/distributed, even if you don't "own" the code. Dependent compilation fully eliminates the overhead that exists when different libraries are separately compiled. Broadcast overrides allow you to dictate how internal structures of scientific computing codes should be implemented and optimized on your specific model. These are all unnecessary if you could write your algorithm as a simple script, but this is necessary if you want to have packages play nicely together and automatically build optimized combinations of features. This is a whole different world than "write fast code". Instead, you may never need to write the best features of your package, and they will still come out optimized. This is not something in the realm of Cython/Numba.

There is a tradeoff that I alluded to here and it's compile-time. For this system to be used interactively, you have to take a step back and find out where you want to stop specializing and where to put up artificial walls. The core Julia developers have made a lot of headway just in the latest part of the Julia v0.7 release candidate in terms of latency, and this is what's seen in the pretty new latency label on the Julialang/julia Github page. Parts like the Julia REPL can be compiled separately and more controls can be added. There's a ton to do here. Also, getting full static compilation of libraries is just beginning to show up. Makie.jl is Julia's next generation plotting library and it statically compiles. Again, there's no such thing as "fully statically compiling" in Julia because everything is so extendable: you'd have to compile new functions dependent on the input types if these are "new types from packages" (example: think back to the numbers with uncertainties). This is not something that a Python plotting package would deal with (if you create a new primitive type in Cython and throw it to matplotlib it'll give you a weird look and say "this Float64 doesn't look right!"). Since Makie.jl is a Julia library, it has extension tie-ins via recipes which allow taking control of the internal function dispatching to change the datatype conversions, but this requires recompilation of specific internals for the new types and so mixing cached native precompilation and static compilation with this dependent compilation is an engineering challenge. But again, Julia and its packages are already making great headway into solving these issues, so I see a very bright future ahead of us.

This might not be the Julia most users are seeing, but if you want a programming language that gives you a massive rabbit hole to explore then this is just a peek at what's available.

Edit: 10/20/2018

I wanted to add a link to this paper on Zygote.jl which explains how it uses Julia's dependent compilation process to build and compile a derivative function for other functions, including those in separate packages, using the Julia Abstract Syntax Tree (AST) that's presented to it at the first call (compilation-time). Unlike approaches like Flux.jl's Tracker or Python's Autograd, this does not take a tracing approach and instead autodifferentiates all branches, giving something that can be optimized and compiled once, while directly supporting language features like loops and conditionals. These are the kinds of tools that are available when compilation of functions gets to see the full code below it, with of course the engineering trade-off that a higher level function needs to compile it and many times its internal calls on its first call.

Edit: 12/4/2019

Updated timings and added pure Julia timings. Also, by popular demand I benchmarked torchdiffeq. From the timings, torchdiffeq is 30,000x slower than DifferentialEquations.jl. An investigation of this can be found here, which includes a bunch of other equations like 27-body equations and PDE discretizations to show orders of magnitude performance differences across the board. Please run the scripts yourself and see the difference in action.

Edit: 8/10/2020

A similar demonstration of a 100x performance difference in training neural ODEs is demonstrated here. Additionally, over 1000x performance advantage for Julia over PyTorch in SDEs is demonstrated here.