A long time ago, when I started developing Neanderthal, the folk wisdom was that FFI in Java has huge overhead, so pure Java libraries are faster for small and medium-sized matrices, and you can hope to get some speedup from native bindings (such as JBlas) only when the matrices are very large. Neanderthal showed this popular opinion to be untrue: as a Clojure library backed by custom bindings to MKL, it is much faster than pure Java libraries even for rather small matrices.

At roughly the same time, the folks from Skymind started developing Deeplearning4J and ND4J. They took a similar route and also got quite good results, because they did many things right. Somehow, though, the actual performance numbers are difficult to find. Most folks who use DL4J and ND4J take for granted that ND4J is as fast as possible, and I don't blame them, for ND4J is much faster than any previous Java matrix library. Except Neanderthal, of course :)

I developed Neanderthal to meet some demanding computation requirements. I tirelessly optimized not only the speed, but also API details that can't be found in other libraries. During that process, I compared the results not only to Java wrappers, but to native execution itself, to make sure that the overhead is tiny, ideally non-existent. Whenever I measured ND4J, Neanderthal was faster. In some cases a little faster, in some cases much faster.

That's why I was surprised to hear that Adam Gibson and his team at Skymind ran some benchmarks in which ND4J came out faster than Neanderthal. Having measured Neanderthal extensively against MKL itself, and knowing that the overhead is really, really small, I simply could not see how that would be possible: it would mean that ND4J backed by MKL is faster than MKL itself. I was intrigued.

Adam pointed me to a code repository with some basic matrix multiplication benchmarks that they tried. Unfortunately, the code only contains ND4J calls, no Neanderthal calls. The results are not available, but the author, Paul Dubs, reports this: "The result was that for matrices smaller than 128x128 Neanderthal won, for larger ND4J."

In this experiment, I'll concentrate on replicating the tests that the guys from Skymind pointed out, comparing raw in-place matrix multiplication in Neanderthal and ND4J. We'll leave other issues, such as API ergonomics, GPU computing, and solving linear equations, for future posts. What I am measuring here is each library's approach: given that both libraries use MKL, I assume that the raw computation speed is the same, and ascribe any differences to the overhead that each library incurs in keeping data around and calling the appropriate operations.
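To make the measurement idea concrete, here is a minimal sketch in plain Java of how such a timing could be taken. It is only an illustration under my own assumptions, not the benchmark code from either project: `OverheadSketch`, `mmInPlace` and `averageNanos` are hypothetical names, and the naive triple loop merely stands in for whatever in-place multiplication call a library exposes. The point is the shape of the measurement: warm up the JIT, reuse the same preallocated result matrix, and average over many calls so that per-call overhead becomes visible.

```java
import java.util.Arrays;

public class OverheadSketch {

    // Hypothetical stand-in for a library's in-place multiply:
    // writes the product of two n x n row-major matrices a and b into c.
    static void mmInPlace(double[] a, double[] b, double[] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    // Average wall-clock nanoseconds per call, measured after a warm-up
    // phase that gives the JIT a chance to compile the hot path.
    static double averageNanos(Runnable op, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            op.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            op.run();
        }
        return (System.nanoTime() - start) / (double) runs;
    }

    public static void main(String[] args) {
        int n = 64; // arbitrary size chosen for the illustration
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        double[] c = new double[n * n]; // preallocated, reused on every call
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);

        double nanos = averageNanos(() -> mmInPlace(a, b, c, n), 100, 1000);
        System.out.println("average ns per multiplication: " + nanos);
    }
}
```

If two libraries delegate the multiplication itself to the same MKL routine, then running each library's call through the same kind of loop and subtracting the averages leaves, under the assumption stated above, mostly the difference in their per-call overhead.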