Neanderthal vs ND4J - vol 3 - Clojure Beyond the Fast Native MKL Backend

June 29, 2018



Some free time fell into my lap, and I decided to continue helping the DL4J team at Skymind improve ND4J, free of charge, since they are generous folks who contribute DL4J as free software so anyone can use it for learning, fun, and profit. In Part 2 of this series, I demonstrated just one of the strengths of Neanderthal (the library I develop), and here I analyze the code and measurements the ND4J developers wrote in response.

Part 1 of this series contains the backstory and the initial comparisons of these two fast scientific computing libraries for the Java platform.

First of all, let me thank the ND4J guys, especially Paul Dubs, for handling this thread very politely. I know these kinds of discussions can heat up, so I apologize in advance if I cause the heat to go up more than necessary from time to time. I respect them and their work. I hope we'll continue these purely technical comparisons cool-headedly.

ND4J team's implementation

Instead of the ND4J code that I wrote in Part 2 and used as a strawman, the ND4J guys supplied a better implementation. First, the imports:

```clojure
(require '[uncomplicate.commons.core :refer [with-release release]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [mm! mm]]
           [native :refer [dge fge]]]
         '[criterium.core :refer [quick-bench]])

(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        org.nd4j.linalg.cpu.nativecpu.NDArray
        java.util.SplittableRandom)
```

The new ND4J benchmark method:

```clojure
(defn bench-nd4j-mm-forwards [m k1 k2 k3 k4 k5 n]
  (let [m1 (Nd4j/rand ^int m ^int k1)
        m2 (Nd4j/rand ^int k1 ^int k2)
        m3 (Nd4j/rand ^int k2 ^int k3)
        m4 (Nd4j/rand ^int k3 ^int k4)
        m5 (Nd4j/rand ^int k4 ^int k5)
        m6 (Nd4j/rand ^int k5 ^int n)]
    (quick-bench
     (do (-> ^INDArray m1
             (.mmul ^INDArray m2)
             (.mmul ^INDArray m3)
             (.mmul ^INDArray m4)
             (.mmul ^INDArray m5)
             (.mmul ^INDArray m6))
         true))))
```

You'll notice that this mmul chain is called from left to right instead of from right to left.
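To see why the direction matters so much for these particular dimensions, we can estimate the work with the standard approximation that multiplying an m×k matrix by a k×n matrix costs about 2·m·k·n floating-point operations. The helper names below are my own, not from either library; this is just back-of-the-envelope arithmetic in plain Clojure:

```clojure
;; Rough cost model: an m×k by k×n product takes ~2*m*k*n FLOPs.
(defn mm-cost ^double [m k n]
  (* 2.0 m k n))

;; Cost of folding a chain left to right. dims is [m k1 k2 ... n];
;; the running product always keeps m rows, so each step costs
;; 2*m*ki*ki+1.
(defn chain-cost-ltr [dims]
  (let [m (first dims)]
    (reduce (fn [acc [k n]] (+ acc (mm-cost m k n)))
            0.0
            (partition 2 1 (rest dims)))))

;; Folding right to left is the same computation on the reversed
;; chain (the running product keeps n columns instead).
(defn chain-cost-rtl [dims]
  (chain-cost-ltr (reverse dims)))

(def dims [13 13456 2300 125 7700 810 100000])

(chain-cost-ltr dims) ;; => 3.1053308E9  (~3 GFLOPs: the 13-row result stays tiny)
(chain-cost-rtl dims) ;; => 7.7221456E12 (~7.7 TFLOPs: 100000 columns dominate every step)
```

A ratio of three orders of magnitude in raw FLOPs is consistent with the measurements below: milliseconds one way, tens of seconds the other.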

Trying out the improved ND4J code

Trying it with the same data as in the previous part:

```clojure
(bench-nd4j-mm-forwards 13 13456 2300 125 7700 810 100000)
```

```
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 57.653097 ms
Execution time std-deviation : 14.216856 ms
Execution time lower quantile : 44.730686 ms ( 2.5%)
Execution time upper quantile : 71.051185 ms (97.5%)
Overhead used : 1.640938 ns
```

As Paul also measured, that's only about twice as slow as Neanderthal's 25 milliseconds, and nowhere near 25 seconds! Paul also correctly noted that the new implementation is not really any better than the one I provided in my previous post: it does not handle arbitrary dimensions any better, it is merely optimized for the opposite direction. Let's do the most trivial thing and reverse those arguments:

```clojure
(bench-nd4j-mm-forwards 100000 810 7700 125 2300 13456 13)
```

Don't do this at home. You'll have to wait for two hours. Use the equivalent time-nd4j-mm-forwards, since ND4J has the same problem again: 26 seconds.

```
"Elapsed time: 25971.33278 msecs"
```

In almost everything Paul said in his response, he is right, and I agree (I might nitpick here and there, but it doesn't matter). However, he missed the main point: neither of these directions specifically favors Neanderthal. In the general case, which is not amenable to either direction, you'll get more than a 2× difference.

First, ND4J, from both directions

Paul complains that the dimension 13 at the left and 100000 at the right stacks this computation favorably for left-to-right evaluation, while I intentionally implemented the ND4J mmul chain to compute from the right. I'll remove that obstacle and make both sides equal: 10000 (100000×100000 can't fit into memory).

```clojure
(time-nd4j-mm-forwards 10000 13456 2300 125 7700 810 10000)
```

```
"Elapsed time: 2570.365788 msecs"
```

The right-sided ND4J call from the previous article:

```clojure
(time-nd4j-mmul 10000 13456 2300 125 7700 810 10000)
```

```
"Elapsed time: 10013.720258 msecs"
```

Neanderthal's mm

Let's see Neanderthal:

```clojure
;; random is the SplittableRandom-backed helper from the previous
;; parts of this series.
(defn bench-neanderthal-mm [m k1 k2 k3 k4 k5 n]
  (with-release [m1 (fmap! random (fge m k1))
                 m2 (fmap! random (fge k1 k2))
                 m3 (fmap! random (fge k2 k3))
                 m4 (fmap! random (fge k3 k4))
                 m5 (fmap! random (fge k4 k5))
                 m6 (fmap! random (fge k5 n))]
    (quick-bench
     (release (mm m1 m2 m3 m4 m5 m6)))))

(bench-neanderthal-mm 10000 13456 2300 125 7700 810 10000)
```

```
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 308.434844 ms
Execution time std-deviation : 63.680791 ms
Execution time lower quantile : 242.705198 ms ( 2.5%)
Execution time upper quantile : 376.655029 ms (97.5%)
Overhead used : 1.640938 ns
```

That's roughly 8 times faster than the faster of the two directions computed with ND4J.
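The reason mm can win on balanced dimensions is that a variadic multiply is free to choose its association order instead of committing to one direction. I won't claim here that Neanderthal uses exactly this algorithm internally, but the textbook solution to the matrix-chain-ordering problem is a small dynamic program, sketched below in plain Clojure (helper names are mine):

```clojure
;; dims [d0 d1 ... dn] describes a chain of n matrices, where
;; matrix i (1-based) has dimensions d(i-1) × di. Returns the
;; minimal number of scalar multiplications for the whole product.
(defn min-chain-cost [dims]
  (let [p (vec dims)
        n (dec (count p))
        memo (atom {})]
    (letfn [(cost [i j]
              (if (= i j)
                0  ; a single matrix costs nothing
                (or (get @memo [i j])
                    (let [best (apply min
                                 (for [k (range i j)]
                                   ;; split the chain as (i..k)(k+1..j)
                                   (+ (cost i k)
                                      (cost (inc k) j)
                                      (* (p (dec i)) (p k) (p j)))))]
                      (swap! memo assoc [i j] best)
                      best))))]
      (cost 1 n))))

;; Classic textbook example: A1 10×30, A2 30×5, A3 5×60.
;; ((A1 A2) A3) costs 10*30*5 + 10*5*60 = 4500.
(min-chain-cost [10 30 5 60]) ;; => 4500
```

Finding the optimal order for a handful of matrices costs microseconds, which is negligible next to the multiplications themselves, so a chained multiply that plans ahead loses nothing even when one fixed direction happens to be optimal.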

It's not as rare as one may think

And the implementation of chained multiplication that I used in the previous post is not even a fictional strawman: it's exactly what third-party ND4J wrappers use. See the implementation of chained mmul in core.matrix, a library for which the DL4J guys (semi)officially provide an ND4J backend. That library is used by many Clojure projects, and anyone using this functionality will be bitten by it. It's not ND4J's fault, but it's relevant. I will eat my hat (don't worry, it's a hat made of fine Belgian chocolate) if it is not true that virtually every user of ND4J would come up with exactly the same implementation that Paul and I used here, and would assume it's as fast as it can go, since ND4J is fast and uses MKL under the hood! That specific functionality may not be used by many people very often, I admit, but still - there are many places that accumulate similar sorts of problems…
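The wrapper pattern in question is essentially a plain fold over the chain. A minimal self-contained stand-in (a toy multiply on vectors of vectors; helper names are mine, and real wrappers would call .mmul instead) makes it explicit that such a fold fixes the association order up front, whatever the dimensions:

```clojure
;; Toy matrix product on vectors-of-vectors, just to make the
;; shape of the wrapper code concrete.
(defn mat* [a b]
  (let [bt (apply mapv vector b)]  ; transpose b to walk its columns
    (mapv (fn [row]
            (mapv #(reduce + (map * row %)) bt))
          a)))

;; The typical wrapper's chained multiply: a left fold. The order
;; of association is baked in, regardless of the matrix sizes.
(defn chain-mmul [& ms]
  (reduce mat* ms))

(chain-mmul [[1 2] [3 4]] [[5 6] [7 8]] [[1 0] [0 1]])
;; => [[19 22] [43 50]]
```

Nothing is wrong with this code for two matrices; the trouble only appears when the chain is long and the dimensions are skewed, which is exactly the case the benchmarks above exercise.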