Neanderthal vs ND4J - vol 5 - Why are native map and reduce up to 100x faster in Clojure?

November 14, 2018



See Parts 1, 2, 3, and 4 of this series for comparisons of matrix operations on the CPU, and of broadcasting on the CPU and GPU. Now I demonstrate another common area where Neanderthal is much faster than Nd4j, despite both libraries using optimized native routines on the same hardware.

When I wrote the last article, I had too much material, so I left the comparison of mapping and reduction functions for later. Several months passed without me finding the time to write a proper narrative. But now I think there is no need for much explanation. Here is the code, along with the results I get on my old Intel i7-4790K processor for different sizes of vectors and matrices. What speaks better than code?

As examples, I typically use functions that I found in the Nd4j documentation and tutorial.

TL;DR: Neanderthal is typically an order of magnitude faster; for small data, even more. For very large data, Nd4j catches up.

```clojure
(require '[uncomplicate.commons.core :refer [with-release release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [dot raw row col entry! mv! sum mrows
                         asum axpy! vctr rk! iamax]]
           [native :refer [fge fv]]
           [math :refer [sqrt]]]
         '[criterium.core :refer [quick-bench]])
```

```clojure
(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        java.util.SplittableRandom)

(let [splittable-random (SplittableRandom.)]
  (defn random ^double [^double _]
    (.nextDouble ^SplittableRandom splittable-random)))
```

## Distance (dot product)

Calculating the distance between two vectors or matrices involves both mapping, which is trivially parallelizable, and reduction.

### Nd4j

```clojure
(defn bench-nd4j-distance
  ([n]
   (let [x (Nd4j/rand (int-array [n]))
         y (Nd4j/rand (int-array [n]))]
     (quick-bench (.distance2 ^INDArray x ^INDArray y))))
  ([m n]
   (let [x (Nd4j/rand \f (int-array [m n]))
         y (Nd4j/rand \f (int-array [m n]))]
     (quick-bench (.distance2 ^INDArray x ^INDArray y)))))
```

### Neanderthal

```clojure
(defn bench-neanderthal-distance
  ([n]
   (with-release [x (fmap! random (fv n))
                  y (fmap! random (fv n))]
     (quick-bench (sqrt (dot x y)))))
  ([m n]
   (with-release [x (fmap! random (fge m n))
                  y (fmap! random (fge m n))]
     (quick-bench (sqrt (dot x y))))))
```

### Speed

Distance between vectors

| dimension | Nd4j | Neanderthal | Neanderthal speedup |
|---|---|---|---|
| 2 | 2.865009 µs | 64.005593 ns | × 45 |
| 10 | 3.055880 µs | 62.042329 ns | × 49 |
| 100 | 3.104767 µs | 66.816429 ns | × 47 |
| 1,000 | 3.712104 µs | 92.034678 ns | × 40 |
| 10,000 | 7.369057 µs | 908.968553 ns | × 8 |
| 1 M | 369.138225 µs | 68.781036 µs | × 5 |
| 100 M | 39.677566 ms | 25.768197 ms | × 1.5 |

Distance between matrices

| dimensions | Nd4j | Neanderthal | Neanderthal speedup |
|---|---|---|---|
| 10 × 15 | 3.313088 µs | 278.192235 ns | × 12 |
| 100 × 100 | 7.109816 µs | 1.115164 µs | × 6 |
| 1,000 × 1,000 | 357.002435 µs | 82.225425 µs | × 4 |
| 100 × 10,000 | 423.156695 µs | 76.504216 µs | × 5 |
| 10,000 × 100 | 335.000853 µs | 81.241423 µs | × 4 |
| 10,000 × 10,000 | 40.107896 ms | 27.167233 ms | × 1.5 |
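Both calls above boil down to the same kind of multiply-accumulate reduction. As a rough illustration of *what* is being computed (not how the vectorized native routines compute it), here is a plain-Clojure sketch; `dot-product` and `euclidean-distance` are hypothetical helper names for illustration, not part of either library:

```clojure
;; Plain-Clojure sketch of the reduction being benchmarked.
;; Illustration only: the real work in both libraries happens in
;; optimized native routines, not in a lazy-seq reduce like this.
(defn dot-product [xs ys]
  (reduce + 0.0 (map * xs ys)))

(defn euclidean-distance [xs ys]
  ;; Nd4j's distance2 is the Euclidean (L2) distance:
  ;; ||x - y|| = sqrt((x - y) . (x - y))
  (let [d (mapv - xs ys)]
    (Math/sqrt (dot-product d d))))
```

For example, `(euclidean-distance [0.0 0.0] [3.0 4.0])` evaluates to `5.0`.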

## Sum

### Nd4j

```clojure
(defn bench-nd4j-sum
  ([n]
   (let [x (Nd4j/rand (int-array [n]))]
     (quick-bench (.sumNumber ^INDArray x))))
  ([m n]
   (let [a (Nd4j/rand \f (int-array [m n]))
         d (int-array [0])]
     (quick-bench (.sum ^INDArray a d)))))
```

### Neanderthal

```clojure
(defn bench-neanderthal-sum
  ([n]
   (with-release [x (fmap! random (fv n))]
     (quick-bench (sum x))))
  ([m n]
   (with-release [a (fmap! random (fge m n))]
     (quick-bench (sum a)))))
```

### Speed

You'll notice that this is the only function on the CPU where Nd4j can be faster than Neanderthal.

Vector sum

| dimension | Nd4j | Neanderthal | Neanderthal speedup |
|---|---|---|---|
| 2 | 4.214011 µs | 71.435637 ns | × 59 |
| 10 | 4.232423 µs | 70.562578 ns | × 60 |
| 100 | 4.299567 µs | 84.437461 ns | × 51 |
| 1,000 | 4.627851 µs | 120.773336 ns | × 39 |
| 10,000 | 14.572367 µs | 1.225827 µs | × 12 |
| 1 M | 219.302064 µs | 130.245611 µs | × 1.7 |
| 100 M | 22.361447 ms | 28.147435 ms | × 0.8 |

Matrix sum

| dimensions | Nd4j | Neanderthal | Neanderthal speedup |
|---|---|---|---|
| 10 × 15 | 5.904680 µs | 84.925045 ns | × 69 |
| 100 × 100 | 21.092154 µs | 1.254015 µs | × 17 |
| 1,000 × 1,000 | 52.358064 µs | 131.412269 µs | × 0.4 |
| 100 × 10,000 | 103.955403 µs | 147.992786 µs | × 0.7 |
| 10,000 × 100 | 64.909655 µs | 129.061267 µs | × 0.5 |
| 10,000 × 10,000 | 15.360876 ms | 27.866048 ms | × 0.5 |
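For reference, here is the summing semantics sketched in plain Clojure (`plain-sum` and `col-sums` are hypothetical names for illustration, not library functions). One detail worth noting: if I read the Nd4j API correctly, the matrix benchmark above sums *along dimension 0*, collapsing the rows into one sum per column, while Neanderthal's `sum` folds the whole matrix into a single number.

```clojure
;; Plain-Clojure sketch of the summing semantics; illustration only.
(defn plain-sum [xs]
  (reduce + 0.0 xs))

;; Summing along dimension 0 collapses the rows,
;; producing one sum per column:
(defn col-sums [rows]
  (apply mapv + rows))
```

For example, `(col-sums [[1 2] [3 4]])` gives `[4 6]`, while a full fold of the same matrix gives `10`.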

## Absolute sum

OK, I am guilty of not making `sum` as fast as I should have. However, I can offer the `asum` function, which covers most of `sum`'s use cases. See how fast that is in Neanderthal.

### Nd4j

Absolute sum is cumbersome to call in Nd4j, so I'll skip that code and use the `sum` results from Nd4j for comparison. `asum` has more work to do anyway, and in my measurements, Nd4j's `sum` was never slower than its `asum`.

### Neanderthal

```clojure
(defn bench-neanderthal-asum
  ([n]
   (with-release [x (fmap! random (fv n))]
     (quick-bench (asum x))))
  ([m n]
   (with-release [a (fmap! random (fge m n))]
     (quick-bench (asum a)))))
```

### Speed

Vector asum

| dimension | Nd4j sum | Neanderthal asum | Neanderthal speedup |
|---|---|---|---|
| 2 | 4.214011 µs | 50.302131 ns | × 84 |
| 10 | 4.232423 µs | 45.395512 ns | × 94 |
| 100 | 4.299567 µs | 49.682850 ns | × 86 |
| 1,000 | 4.627851 µs | 76.036001 ns | × 61 |
| 10,000 | 14.572367 µs | 573.208513 ns | × 25 |
| 1 M | 219.302064 µs | 32.646665 µs | × 6.5 |
| 100 M | 22.361447 ms | 13.238496 ms | × 1.7 |

Matrix asum

| dimensions | Nd4j sum | Neanderthal asum | Neanderthal speedup |
|---|---|---|---|
| 10 × 15 | 5.904680 µs | 50.816077 ns | × 116 |
| 100 × 100 | 21.092154 µs | 466.800109 ns | × 45 |
| 1,000 × 1,000 | 52.358064 µs | 31.799570 µs | × 1.7 |
| 100 × 10,000 | 103.955403 µs | 33.643302 µs | × 3 |
| 10,000 × 100 | 64.909655 µs | 32.994745 µs | × 2 |
| 10,000 × 10,000 | 15.360876 ms | 13.070991 ms | × 1.15 |
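Why does `asum` cover most of `sum`'s use cases? Because for non-negative data (probabilities, counts, squared errors, pixel intensities), the sum of absolute values *is* the sum. A plain-Clojure sketch of the relationship, with `abs-sum` as a hypothetical stand-in for `asum` (not a library function):

```clojure
;; abs-sum: hypothetical plain-Clojure stand-in for BLAS asum,
;; the sum of absolute values.
(defn abs-sum [xs]
  (reduce + 0.0 (map #(Math/abs (double %)) xs)))
```

On `[1.0 2.0 3.0]` it returns `6.0`, exactly what a plain sum gives; only data with mixed signs, such as `[-1.0 2.0]`, makes the two diverge (`3.0` vs `1.0`).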

## Finding the position of the maximum absolute value

Often we need to know which element is the largest or the smallest.

### Nd4j

In the last section, I was joking that calling absolute sum in Nd4j is too cumbersome to show you the code, to make you write some code and measure some of that stuff yourself. Rather than being *too* cumbersome, it is just cumbersome, as in the case of `iamax`, which I show here:

```clojure
(defn bench-nd4j-iamax [n]
  (let [x (Nd4j/rand (int-array [n]))
        exec (Nd4j/getExecutioner)
        iamax (org.nd4j.linalg.api.ops.impl.indexaccum.IAMax. x)]
    (quick-bench (.getFinalResult (.execAndReturn exec iamax)))))
```

Notice how you need to set up the executioner and keep it around. I do not measure the executioner creation time here. But beware: wrapping lots of similar Nd4j code in a simpler API might seem like low-hanging fruit, but it can be a trap if you don't find a way to properly handle the executioner life-cycle.

### Neanderthal

```clojure
(defn bench-neanderthal-iamax [n]
  (with-release [x (fmap! random (fv n))]
    (quick-bench (iamax x))))
```

### Speed

Vector iamax

| dimension | Nd4j | Neanderthal | Neanderthal speedup |
|---|---|---|---|
| 2 | 1.212382 µs | 75.503383 ns | × 16 |
| 10 | 1.234349 µs | 66.782601 ns | × 19 |
| 100 | 1.346079 µs | 63.711496 ns | × 21 |
| 1,000 | 2.080715 µs | 305.853934 ns | × 7 |
| 10,000 | 10.239278 µs | 2.540639 µs | × 4 |
| 1 M | 209.913386 µs | 116.195962 µs | × 1.8 |
| 100 M | 20.508549 ms | 15.084416 ms | × 1.4 |
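For completeness, here is what `iamax` computes, sketched in plain Clojure (`iamax*` is a hypothetical name for illustration; the BLAS routine returns the index of the *first* entry with the largest absolute value):

```clojure
;; Hypothetical plain-Clojure stand-in for iamax: the index of the
;; entry with the largest absolute value (first one wins on ties).
(defn iamax* [xs]
  (let [abs #(Math/abs (double %))]
    (first
     (reduce (fn [[_ bv :as best] [i v :as cur]]
               (if (> (abs v) (abs bv)) cur best))
             (map-indexed vector xs)))))
```

For example, `(iamax* [1.0 -7.0 3.0])` returns `1`, since `-7.0` has the largest magnitude.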