First, the spec: I run this on an (old-ish but still powerful) i7-4790K desktop with 32GB of RAM. The OS is Arch Linux, with the latest Intel MKL and OpenBLAS installed globally. NumPy is provided by pip (Conda on Arch is a mess, and I leave Docker, Kubernetes, and whatnot, to other people). Sadly, intel-numpy doesn't seem to be maintained that well, and it refuses to install through Arch's pip. That leaves NumPy with OpenBLAS, which should not be an issue, since OpenBLAS is very fast, and I expect it to be within a few percent of Intel's MKL. (You might cry foul, but, hey, grab NumPy & MKL and check for yourself.)

I enjoy working in Clojure's REPL, so I'll use it for calling NumPy via libpython-clj. You may object again, since that might be a cause of NumPy's troubles, but I claim it's not, and I do it on purpose, to motivate you to fire up your favorite Python dev tool and try to prove me wrong.

We create a NumPy array.

(def x (-> (numpy/linspace 0 2 100000000 :dtype "float32")
           (numpy/reshape [1000 100000])))

I made sure that we use single-precision floats, so we can do a fair comparison with the GPU, which supports fast computations with float32, but is crippled for float64. I reshaped it to \(1000\times{100000}\), a size that you might not need every day, but that is nothing unusually big. It represents a data set with \(1000\) variables and \(100,000\) observations.
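If you'd rather follow along in Python directly, the same array can be built with plain NumPy (assuming a stock NumPy install; the names below mirror the Clojure call above):

```python
import numpy as np

# 100 million evenly spaced float32 values between 0 and 2,
# reshaped into 1000 variables x 100,000 observations.
x = np.linspace(0, 2, 100_000_000, dtype=np.float32).reshape(1000, 100_000)

print(x.shape)  # (1000, 100000)
print(x.dtype)  # float32
```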

Next, we call corrcoef to check whether it does what we expect it to.

(numpy/corrcoef x)

[[1.         1.         1.         ... 1.         1.         1.        ]
 [1.         1.         1.         ... 1.         1.         1.        ]
 [1.         1.         1.         ... 1.         1.         1.        ]
 ...
 [1.         1.         1.         ... 1.         1.         0.99999999]
 [1.         1.         1.         ... 1.         1.         1.        ]
 [1.         1.         1.         ... 0.99999999 1.         1.        ]]
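The wall of ones is expected: every row of the reshaped linspace is an affine function of the column index, and the Pearson correlation between two increasing affine functions of the same variable is exactly 1 (up to float rounding). A toy-sized version of the same construction makes that easy to verify:

```python
import numpy as np

# Same construction as above, just tiny: 4 rows, 5 columns.
# Each row grows linearly with the column index, so all pairwise
# correlations should come out as (numerically) 1.
small = np.linspace(0, 2, 20, dtype=np.float32).reshape(4, 5)
c = np.corrcoef(small)

print(np.allclose(c, 1.0))  # True
```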

We measure wall-clock time. I don't see a need for a more sophisticated benchmark here, involving samples, standard deviations, etc., since this is a long-running computation implemented by a native backend. There are no caches or JITs to warm up here, just brutal number crunching. Try it many times, or put it in a loop, and see. The timings will vary, naturally, but within a few percent. Or, if you really want to be sure, please do whatever advanced benchmark you can think of and let me know the results.
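The Python analogue of Clojure's (time ...) is a plain perf_counter measurement. This sketch uses a smaller array so it runs quickly; scale the element count up to 100,000,000 to reproduce the measurement below:

```python
import time
import numpy as np

# Smaller stand-in for the 1000 x 100000 array in the text.
x = np.linspace(0, 2, 1_000_000, dtype=np.float32).reshape(100, 10_000)

start = time.perf_counter()
r = np.corrcoef(x)
elapsed = time.perf_counter() - start

print(f"Elapsed time: {elapsed * 1000:.3f} msecs")
```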

(time (numpy/corrcoef x))

"Elapsed time: 946.143499 msecs"

Just to be clear, this is a great result! This is fast, much faster than it would be with interpreted Python, hand-written C++, or pure Java. Nothing is wrong or deficient here. This is good enough.