Something that has been bothering me about discussions about microbenchmarks is that people seem to ignore that the benchmarks are highly application-dependent. The easiest way to judge whether the benchmark really matters to a particular application is the operation overhead of the largest and most common calls. If you have a single operation dominating all of your runtime 99.9%, making everything else 100x faster still won't do anything to your real runtime performance. But at the same time, if your bottleneck is some small operation that's in a tight loop, then that operation may be your bottleneck. This is a classic chart to keep in the back of your mind when considering optimizations.

Here is a very brief overview on what to think about when optimizing code and how to figure out when to stop.

Function Call Overhead

When dealing with scalar operations, the cost of basic operations matters a lot. And one basic operation that can bite you in higher level languages is the function call overhead. Notice that 10's of basic arithmetic (more with SIMD) can happen in the time that a single function can be called. Slower languages, like Python and R, have a higher function call overhead than something like C++, Fortran, of Julia for a few reasons. First of all it's calling the function dynamically so there's a lookup involved (and in Julia, if types are unstable there is a lookup as well, this is the cost of a "dynamic dispatch"). But secondly, in these compiled languages, of a function is determined to be sufficiently cheap, the function call can be completely gotten rid of by inlining the function (this is essentially copy-pasting the code in, instead of making it call the function). The heuristics for deciding whether a function is costly enough to inline is then very in-depth, since inlining a function means that a separately compiled function cannot be called so it's a trade-off between run-time and compile-time speed. If the function costs too much, then inlining essentially is not helpful so you'd prefer to just compile faster, while if the function is cheap then it's necessary to inline since otherwise the dominating cost is the function call.

In some domains this matters a lot. Ordinary differential equations for example are defined by huge swaths of scalar operations (this is one example of a common stiff ODE benchmark). If you had high operation overhead there your performance would be destroyed, which is why this domain has not only traditionally been dominated by compiled languages, but this domain is still dominated by compiled languages (counting Julia as one). In domains like optimization, ODEs, and nonlinear solving, this is particularly an issue for interpreted languages because these mathematical problems are higher order functions. This means that the API requires a function that is defined by the user, but the repeated calls to the interpreted function is the mostly costly part of the calculation, and so performance can be very hurt in this scenario. Even when accelerators like Numba are used, the context switch between the interpreted and compiled code can still hurt performance. The only way around this of course is to make an interface where you never accept a function from the user and instead you build it for them. This is why Pyomo is a Python optimization library where you use a modeling object. In both cases, a compiled function is built behind the scenes so that no interpreters get in the way.

But as you go up in computational cost per function, the performance lost by the function overhead cost goes down. This is why in languages like Python or R vectorization is encouraged because the 350ns or so function call cost is then dominated by the cost of the function calculation itself (and the cost is high enough that inlining doesn't even play into the picture). Just for comparison, that's taking scalar operations like + and * from 5ns-20ns to 350ns... so now you know where that performance went. With C++, Fortran, or Julia, it doesn't matter whether you vectorize or loop through scalar operations because the vectorized call and the loop are the same under the hood, dropping down to optimally low function call cost or inlining it to remove the cost entirely (whereas the interpreted vectorized call is essentially the same as the compiled loop).

When do heap allocations matter?

Another major issue that comes up is heap allocations. The idea in brief is, allocations have a high floor and scale ~O(n). Let's demonstrate this with some vectorized functions. First, let's do element-wise multiplication.

using LinearAlgebra, BenchmarkTools function alloc_timer ( n ) A = rand ( n,n ) B = rand ( n,n ) C = rand ( n,n ) t1 = @belapsed $A . * $B t2 = @belapsed ( $C .= $A . * $B ) t1,t2 end ns = 2 .^ ( 2 : 11 ) res = [ alloc_timer ( n ) for n in ns ] alloc = [ x [ 1 ] for x in res ] noalloc = [ x [ 2 ] for x in res ] using Plots plot ( ns,alloc,label= "=" ,xscale=:log10,yscale=:log10,legend=:bottomright, title= "Micro-optimizations matter for BLAS1" ) plot ! ( ns,noalloc,label= ".=" ) savefig ( "microopts_blas1.png" )

In this case we see that there's almost a 3x-5x across the board advantage (remember the graph is log-scale) to using allocation-free interfaces because the allocation of the output array scales with the size of the array, and it has a high overhead. However, when we get to matrix multiplications, the scaling of the calculation comes into play:

using LinearAlgebra, BenchmarkTools function alloc_timer ( n ) A = rand ( n,n ) B = rand ( n,n ) C = rand ( n,n ) t1 = @belapsed $A * $B t2 = @belapsed mul ! ( $C,$A,$B ) t1,t2 end ns = 2 .^ ( 2 : 9 ) res = [ alloc_timer ( n ) for n in ns ] alloc = [ x [ 1 ] for x in res ] noalloc = [ x [ 2 ] for x in res ] using Plots plot ( ns,alloc,label= "*" ,xscale=:log10,yscale=:log10,legend=:bottomright, title= "Micro-optimizations only matter for small matmuls" ) plot ! ( ns,noalloc,label= "mul!" ) savefig ( "microopts.png" )

Notice that when the matrix multiplications are small, the non-allocating `mul!` form has an advantage because of the high overhead of an allocation. However, as the size of the matrices increase, the O(n^3) scaling of matrix multiplication becomes the dominant force, and so the cost of the allocation is negligible.

Optimizing Machine Learning and Partial Differential Equations

So how much do different disciplines require lower level features? It turns out that same aspects of technical computing really require fast function calls and scalar operations, like solving nonlinear differential equations, and thus something like C++, Fortran, or Julia is really required for top notch speeds. In other disciplines, the time is completely dominated by large matrix multiplication or convolutional kernels. Machine learning is a prime example of this, where >99% of the computation time is spent inside of the large matrix multiplication of dense layers or convolution calls of a convolutional neural network. For these functions, pretty much every language, (R, Python, Julia, MATLAB, Octave, etc.), is calling into a BLAS implementation. One popular one is OpenBLAS, while another is Intel's MKL. When things move to the GPU, essentially everyone is calling NVIDIA's CuBLAS and cuDNN. With large matrix operations taking on the order of minutes (to even hours as it gets large and distributed), the 350ns function call overhead of a slow language just doesn't even matter anymore.

The same tends to happen with PDE solving, where the dominant costs are sparse matrix multiplications and sparse factorizations, usually handled by SuiteSparse or libraries like PETSc when you start to get parallel. Here, factorizations taking an hour isn't even uncommon, so I think you could sparse a few milliseconds. At this point, what matters more than a micro-optimization is the strategy that is used: whoever uses the last matrix multiplications or matrix factorizations wins the game.

This doesn't mean that the underlying libraries can be written in a slow language. No, these pieces, like SuiteSparse, OpenBLAS, etc. need to be in a fast language like C++ or Fortran (or even Julia is starting to see performant BLAS implementations). But calling and using libraries in this domain tends to have negligible overhead.

I will just end by noting that big data analysis with pre-written kernels is another area that falls into this performance domain, while the moment you allow users to write their own kernels ("give me a function and I'll do a map reduce with it!") you suddenly need to compile that function in a fast way.

Optimizing Scientific Machine Learning and Differentiable Programming

When you get to scientific machine learning and differentiable programming, things get tricky. If you haven't heard of scientific machine learning, read this overview of the tools of the domain, and for differentiable programming this paper is a good overview of things people (we) are trying to target with the technique. The amount of micro-optimizations that you need really depends on where you fall in this spectrum from numerical ODEs to machine learning. Generally, if your matrices are "small enough", you need to use mutation and make sure scalar usage is fast, while once the matrices are large enough then all of the time is taken in GPU BLAS kernels so the name of the game is to use a better strategy.

There is a minor caveat here because tracing-based ADs, like Flux's Tracker, PyTorch, or TensorFlow Eager, have a really really high op overhead. PyTorch announced that its overhead cost has gotten to 1us. Again, when matrix multiplications are clearly larger than seconds or minutes in any machine learning task, this overhead is so low that it does not make an impact on the total time itself, so we should congratulate the machine learning libraries for essentially eliminating overhead for their standard use case. But remember that if you compare that to the 5ns-20ns of basic scalar cost for things like `+` and `*`, you can see that, for example, tracing-based reverse mode automatic differentiation through the definition of a nonlinear ordinary differential equation will be horrifyingly slow. This means that tracing-based ADs do not do well in more general-purpose codes, making it difficult to use for things like differentiable programming in scientific machine learning where neural networks are mixed with large nonlinear kernels.

There is a way through though. Zygote has showcased operation overheads of around 500ns. The way this is done is through source-to-source automatic differentiation. This isn't new, and the very old ADIFOR Fortran AD was a source-to-source AD that first noticed major per-op overhead advantages. The reason is because a tracing-based reverse-mode AD needs to build a tape at every single step so that way it knows the operations and the values. This tape needs to be heap allocated in any sensibly large calculation. This is why x[1] + x[2]*x[3] turns into a whopping multi-microsecond calculation when being traced while only being a ~20ns fma call if just scalar! A source-to-source AD doesn't need to build a trace because it has already built and compiled an entire adjoint code. It does need to still store some values (since values from the forward pass need to be used in the backpass), and there are multiple ways that one can go about doing this. Zygote heap allocates forward values by making closures, this is where most of its overhead seems to come from. ADIFOR mixes forward mode in when it needs to compute values, which is another valid strategy. There are ways to also precompute some values of the fly while going in reverse, or mixing different strategies such as checkpointing. I think there's a lot that can done to better optimize reverse-mode ADs like Zygote for the case of scalar operations (in fact, there seem to be quite a few compiler optimizations that it's currently missing), and so if you plan to reverse-mode a bunch of scalar codes this may be an area of study to track in more detail.

Conclusion

The art of optimizing code is to only optimize what you need to. If you have small function calls (such as having lots of scalar operations), use a fast language and tools with low operation overhead. If you have really costly operations, then it really doesn't matter what you do: optimize the costly operations. If you don't know what regime your code lives in... then profile it. And remember that allocation costs scale almost linearly, so don't worry about them if you're doing O(n^3) operations with large n. You can use these ideas to know if using a feature from a higher level language like Python or R is perfectly fine performance-wise for the application. That said, this is also a good explanation as to why the libraries for those languages are not developed within the language and instead drop down to something else.