Comparisons of Neanderthal and ND4J started with ND4J's creator claiming that ND4J was considerably faster than Neanderthal in the tests that Skymind (the company that develops ND4J) conducted. Intrigued, I ran the benchmarks they suggested. I found the contrary: Neanderthal was multiple times faster for small matrices, a couple of times faster for mid-sized matrices, and a few dozen percent faster for huge matrices. That was better than I expected! After checking the ND4J documentation and confirming that I had followed its recommendations, I called it a day and handed the ball over to the ND4J guys. This has been described in part 1 of this series, Neanderthal vs ND4J - vol 1 - Native Performance, Java, and CPU.

The ND4J guys investigated what happened, and found out that:

While my measurements were correct, ND4J's performance is not optimal with the default memory layout. To get the best performance, the resulting matrix has to be in \f (column-major) order, while the default is \c (row-major). They also dug out and fixed some bugs in ND4J.

With those fixed, ND4J improved. Using \f instead of \c order makes ND4J only a handful of times slower for small matrices, a few dozen percent slower for mid-sized ones, and almost the same speed as Neanderthal with large matrices. Paul published the process and the result on his blog.

I might have a minor issue with his changing the test to use the low-level static Nd4j/gemm method instead of mmuli, the method suggested by the documentation that a user would normally call, but since Neanderthal's high-level polymorphic mm! function still keeps the lead, I won't split hairs and cry foul here. The results he reports are valid. Paul also reasoned that mm! is, after all, only a GEMM call with a few additional checks. I argue that it does more than that.
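For illustration, here is roughly what the two call styles look like side by side. This is a sketch, not taken from Paul's post; it assumes INDArray matrices m1, m2, and result like the ones created in the benchmark below, and it assumes ND4J's mmuli overload that takes an explicit result matrix:

```clojure
;; High-level instance method a user would normally reach for:
;; multiplies m1 by m2, writing into result.
(.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result)

;; Low-level static call used in the updated benchmark:
;; result = 1.0 * (m1 x m2) + 0.0 * result, with no transposes.
(Nd4j/gemm ^INDArray m1 ^INDArray m2 ^INDArray result false false 1.0 0.0)
```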

As a warm-up, I'll rerun some benchmarks with \f order, leaving you to run various dimensions yourself.

The updated project is here.

First, the imports:

(require '[uncomplicate.commons.core :refer [with-release release double-fn]]
         '[uncomplicate.fluokitten.core :refer [fmap!]]
         '[uncomplicate.neanderthal
           [core :refer [mm! mm]]
           [native :refer [dge fge]]]
         '[criterium.core :refer [quick-bench]])

(import org.nd4j.linalg.factory.Nd4j
        org.nd4j.linalg.api.ndarray.INDArray
        org.nd4j.linalg.cpu.nativecpu.NDArray
        java.util.SplittableRandom)

We'll use the \f layout and Nd4j/gemm:

(defn bench-nd4j-gemm-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized (int-array [m n]) \f)]
    (quick-bench
     (do (Nd4j/gemm ^INDArray m1 ^INDArray m2 ^INDArray result
                    false false 1.0 0.0)
         true))))

(bench-nd4j-gemm-float 128 128 128)

Evaluation count : 34482 in 6 samples of 5747 calls.
             Execution time mean : 19.596878 µs
    Execution time std-deviation : 3.621484 µs
   Execution time lower quantile : 17.475822 µs ( 2.5%)
   Execution time upper quantile : 24.146490 µs (97.5%)
                   Overhead used : 1.140086 ns

Much better than previously, but still behind Neanderthal's timing, and similar to what the ND4J guys got.

May I mention that Neanderthal handles all memory layouts equally well? It doesn't care whether it is CCC, FFF, CCF, or CFC (in ND4J's terminology); Neanderthal will figure out how to get the same top speed from the underlying backend. Check that out by providing {:layout :row} (ND4J's C) as an optional argument to fge: (fge m n {:layout :row}).
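To try it yourself, here is a minimal benchmark sketch using the imports above. The name bench-neanderthal-mm-float is mine, and the random fill via fmap! follows the approach from part 1; the only thing that changes across layout experiments is the :layout option:

```clojure
(defn bench-neanderthal-mm-float [^long m ^long k ^long n]
  (let [splittable-random (SplittableRandom.)
        rand-entry (fn ^double [^double _] (.nextDouble splittable-random))]
    ;; :layout :row is ND4J's C order; swap it for :column (the default)
    ;; in any of the three matrices and the timing stays the same.
    (with-release [m1 (fmap! rand-entry (fge m k {:layout :row}))
                   m2 (fmap! rand-entry (fge k n {:layout :row}))
                   result (fge m n {:layout :row})]
      (quick-bench (mm! 1.0 m1 m2 0.0 result)))))
```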

I can conclude that Neanderthal is indeed the faster library, at least when it comes to matrix multiplication:

- when the layout is optimal for ND4J, Neanderthal is at worst equal in speed, but often faster;

- with arbitrary memory layout combinations, ND4J slows down, while Neanderthal keeps up its top speed.

If the only thing we had to do was a single matrix multiplication, this would have been the concluding article in the series. The difference would not matter in most programs, which have much larger inefficiencies in other places anyway.