Using C extensions to improve the performance of mathematical code

I recently had the pleasure of working on a highly iterative algorithm in Python. The algorithm was a major performance bottleneck in our code so naturally I wanted to see if there were easy ways to speed it up.

One of the larger bottlenecks was in the standard deviation calculation. This method was called (potentially) millions of times. The code went something like this:

Ah, of course this was an easy fix! Why manually calculate the standard deviation when NumPy has an ultra efficient C based version? So happily I replaced the above with:

np.std(lst)

I ran the numbers to see how much of a gain we got with this switch. I ran each method 10,000 times with random sets of data. The results were surprising:

Python: 0.06 seconds

NumPy: 0.39 seconds

The NumPy code was 6.5 times slower. What on earth was happening?

NumPy and Array Size

The key comes in the data set this algorithm used. It had to calculate the standard deviation of very small chunks of data at a time, roughly 25 elements.

In searching around I found several references claiming that NumPy is only optimized for larger arrays. So in this case we’re below that threshold and slow, interpreted Python actually beats NumPy.

So at what array size does NumPy start to beat a Python implementation? I timed 10k runs of each method on varying array sizes to find out:

So at least on my machine pure Python will outperform NumPy by a decent margin for anything less than roughly 150 elements. In our case, the Python implementation was bad and NumPy was even worse.

Speeding it up with C++

The algorithm was built on calculating the standard deviation of small arrays in a loop. There wasn’t a way to refactor to work with larger arrays at a time. That coupled with the fact that this calculation could happen several million times meant seemingly no way to optimize.

Python however lets us add C extensions. We can write our standard deviation in C++ and call it from within our Python code. In theory, this should give us a performance bump if we can optimize for smaller arrays better than NumPy.

We can use STD vectors to write a standard deviation function pretty easily:

We can then use the Python C API to convert a Python list to a vector and call our C++ implementation:

Then we use distutils to package this as a module and call it from Python directly. Comparing this to our original run the performance gain is quite apparent:

Our C++ implementation beats Python by a decent amount for smaller arrays. For n = 25 we get roughly a 8x speedup over Python and close to 50x speedup over NumPy. Even for larger arrays we still beat NumPy significantly.

So does NumPy ever beat our simple C++ implementation? Let’s do the same timing test but with bigger arrays:

It turns out our C++ implementation does in fact start to lose out to NumPy’s optimizations at roughly 15k elements. However below that we still outperform NumPy by a decent margin.

Conclusion

I eventually ported over this entire algorithm to C++. Given its highly iterative nature, overall performance sped up over 100 times.

NumPy is an amazing numerical library, however it has its limitations. It is optimized to work on larger data sets, but sometimes all you have is small data. Sometimes you have iterative algorithms that don’t nicely translate into the NumPy way of doing things.

C extensions are one among many ways to deal with this issue. Different interpreters exist, as well as JIT compilation libraries such as Numba. In this case C extensions provided the path of least resistance to optimize and provided a drastic speedup.

Source Code

Source of the C extension and plots is available at: