More fun with NumPy, CuPy, Clojure and GPU acceleration. Hold my Cider 2!

April 14, 2020

Please share: Twitter.

New books are available for subscription.

NumPy and CuPy converted my computations to float64 despite being told to work with float32 , and that prompted some readers to discard my previous article as faulty. My article doesn't make CuPy and NumPy more or less perfect, though, and that forceful conversion stings everyone who is using Nvidia's GTX gaming GPUs. I don't have grants that would cover $10,000 for uncrippled float64 support on Tesla GPU. So, if you're poor like me, that article probably offered valuable information.

I've received some nice suggestions, too. Some readers were inspired by my article and re-implemented that corrcoef function in their favorite Python tools (Numpy, CuPy, PyTorch) and found out that the advice from my book Numerical Linear Algebra for Programmers is not constrained to Clojure, and that it is fast in NumPy and CuPy, too. Of course! Most of my book applies to any language, and is written for all programmers, not just Clojure programmers. Seriously - check it out, and while you're there, maybe you'll also be interested in Deep Learning for Programmers.

Unfortunately, these readers haven't sent me the code (yet), and I didn't have the opportunity to compare it to Clojure on my machine, but I trust them that they were satisfied with the result.

There is an interesting way to compare my unambitious textbook implementation with NumPy and CuPy without that float64 conversion impact. Simply, use float64 computations in Neanderthal ( ), too, just for the sake of comparison. In this article, I compare implementation of Pearson's correlation coefficient (a widely used statistical function) on a range of large and small matrices.

TL;DR The following two tables contain the summary of the micro-benchmarks that I run on my CPU and GPU. Since NumPy and CuPy forcefully do float64 computation regardless of data type, there is one column for them. Neanderthal ( ) follows what it's told, so I include both float64 ( double in Javaspeak) for direct comparison, and float32 ( float ) as a reference. NumPy and CuPy code is run from the interactive Python console, while Clojure uses interactive REPL. Note that my code is not a port of NumPy, nor a drop-in replacement; this is simply a head-to-head comparison of two different implementations. Python has its strengths and weaknesses, but JVM has its own constraints, too, so don't blame me for either of them, nor expect that I care about fixing them. CPU i7 4790K All time is reported in milliseconds (lower is better). size NumPy Neanderthal double Neanderthal float 1K×100K 949 686 350 1K×10K 113 111 56.2 100×100K 61 21 10.5 100×10K 5.47 2.22 1.19 10×10K 0.29 0.16 0.16 10×1K 0.085 0.026 0.028 10×100 0.057 0.011 0.012 10×1M 43 18 15 10K×10K 11476 8079 4902 GPU Nvidia GTX 1080Ti All time is reported in milliseconds (lower is better). size CuPy Neanderthal double Neanderthal float 1K×100K 686 401 20 1K×10K 70 42 2.60 100×100K 9.9 9.2 2.25 100×10K 1.38 1.56 0.3 10×10K 0.21 0.22 0.14 10×1K 0.18 0.67 0.1 10×100 0.18 0.1 0.1 10×1M 6.32 6.57 0.26 10K×10K 5078 2597 117 Honestly, if the several lines of code from my textbook implementation reached within an order of magnitude of professional libraries such as NumPy an CuPy, that would still be rather nice. I can't say that these results do not make me satisfied :)

The Python code To prevent folks bugging me about why I called NumPy from Clojure, I wrote a Python script this time. Micro benchmarks in Python code that I see on the Internet are clunky at best, and more sophisticated tools are equally ridiculous. Having to call my functions in a string- embedded DSL? Thanks, but no. So I did the lazy thing, and used a simple time_ns() based measurement of a loop that calls the corrcoef function enough times to mitigate the imprecision of the timestamp that Python gets. I believe that's a good approximation of how would a user experience this function when calling it from the code. It's good enough considering that we are calling exactly one function that delegates to native code anyway. On the CPU, it looks like this: def numpy_corrcoef (m, n, reps): a = numpy.random.random(m * n).reshape(m,n) s = time.time_ns() for x in range (reps): numpy.corrcoef(a) e = time.time_ns() print ((e-s)/reps/(10**6), " milliseconds" ) numpy_corrcoef(1000, 100000, 1) numpy_corrcoef(10, 10000, 1000) # etc. On the GPU, it takes care of synchronization: def cupy_corrcoef (m, n, reps): a = cupy.random.random(m * n).reshape(m,n) cupy.cuda.Stream.null.synchronize() s = time.time_ns() for x in range (reps): cupy.corrcoef(a) cupy.cuda.Stream.null.synchronize() e = time.time_ns() print ((e-s)/reps/(10**6), " milliseconds" ) cupy_corrcoef(1000, 100000, 1) # etc.

The Clojure code In Clojure, we have sophisticate benchmarking capabilities right there in the REPL through the Criterium library. Although would like to use its quick-bench , as it gives reliable and precise measurements with one simple line of code, I opted to follow the Python way, and use time and the appropriate loop, to make it as similar as the Python code as possible. ( defn bench-neanderthal-cpu [ m n factory reps ] ( with-release [ x ( rand-uniform! ( ge factory m n ) ) ] ( time ( dotimes [ i reps ] ( matrix-correlation! x ) ) ) ) ) ( defn bench-neanderthal-gpu [ m n factory reps ] ( with-default ( cuda /with-engine factory default-stream ( with-release [ x ( rand-uniform! ( cuge m n ) ) ] ( synchronize! ) ( time ( do ( dotimes [ i reps ] ( release ( matrix-correlation! x ) ) ) ( synchronize! ) ) ) ) ) ) ) In the previous article you can find the code for matrix-correlation! . That implementation did a full general matrix multiplication, although the correlation matrix is symmetric, since for most matrix sizes it is faster to do twice as many flops as necessary instead of using logic to avoid duplicate work. Especially on the GPU, the matrix should have to be unrealistically large for the symmetric matrix multiply to pay off. Also, consider what corrcoef computes: correlation between different variables. There will typically be much fewer variables than observations, so the matrix will be a long rectangle. There, my old implementation will be optimal, even for large matrices. However, I confirmed by measurement and comparison that NumPy uses symmetric multiply for large matrices. This cannot be seen directly from the code, but is 100% certain since it takes NumPy 1.5 second to do the full matrix multiply of this matrix, and around 1 sec for the whole corrcoef . That's why I added this option to Neanderthal ( ), and made a simple heuristic to use symmetric multiply for huge fat matrices (the mmt! function, and GEMM for everything else. That made the code less elegant, and several lines longer, but, hey, everything for performance! :) ( defn matrix-correlation! [ a! ] ( with-release [ ones ( entry! ( vctr a! ( ncols a! ) ) 1.0 ) work ( mv a! ones ) ] ( let [ a-mean ( rk! ( / -1.0 ( dim ones ) ) work ones a! ) ] ( let-release [ large ( and ( < 512 ( mrows a! ) ) ( < 8196 ( ncols a! ) ) ) sigma ( sy a! ( mrows a! ) ) ] ( if large ( mmt! a-mean sigma ) ( mm! 1.0 a-mean ( trans a-mean ) ( view-ge sigma ) ) ) ( sqrt! ( copy! ( dia sigma ) work ) ) ( with-release [ sigma-x*y ( rk work ) ] ( div! ( view-ge sigma ) ( view-ge sigma-x*y ) ) ) sigma ) ) ) )