Benchmarking ND4J and Neanderthal

ND4J and Neanderthal are both libraries for fast matrix math on the JVM. ND4J targets Java users, while Neanderthal is aimed at Clojure users. Thanks to Clojure’s excellent Java interop, it is quite easy to use ND4J from Clojure as well, even though it doesn’t provide an idiomatic Clojure API out of the box.

Dragan Djuric, the creator of Neanderthal, recently conducted a micro-benchmark of both ND4J and Neanderthal. The operation under test is matrix multiplication, specifically a call to GEMM from Intel’s MKL library. The results were quite unexpected, since neither library should be doing much at that point: both basically pass the call on to MKL.

When the results showed Neanderthal to be 24 times faster at the smallest input, a 4x4 matrix, and still 20% faster at 4096x4096, I became curious about what was going on, especially since his ND4J code is based on my original benchmarks.

When I originally compared ND4J and Neanderthal matrix multiplication speeds, the results left me wondering, since ND4J was slower at small sizes, yet faster at larger sizes. For this reason I never actually published any numbers. I had based my comparison on Dragan’s benchmark code, but I didn’t notice that it used doubles instead of floats. His new benchmark has cleared up this confusion, and I’m glad that Dragan has shared both code and results.

In this post I try to validate Dragan’s results, show the detail that changes the numbers considerably, and rerun the benchmark after some additional optimizations have been added to ND4J.

Apples-to-apples comparison: Changing some code

In Dragan’s benchmark, Neanderthal wins by a large margin. So let’s take a look at the code to see if there is anything we can do to improve ND4J’s performance. Dragan uses this code to run the benchmark for ND4J:

(defn bench-nd4j-mmuli-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized m n)]
    (quick-bench
      (do (.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result)
          true))))

And while this looks correct, it actually has an issue. In ND4J, arrays are C-ordered by default, i.e. their memory layout is as if you were to allocate an array in C. Yet GEMM returns its result in F-order, i.e. with the memory layout that you would get if you allocated an array in Fortran. The difference is whether your two-dimensional array is organized as [rows][columns] or [columns][rows]. If you pass a C-ordered array to take the result here, ND4J will notice this, create a new array in F-order, and then transfer the results to the original result array. All of this takes time, and especially in a micro-benchmark, where this is called thousands of times per second, memory allocation can become quite the bottleneck.
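To make the two layouts concrete, here is a minimal Java sketch of how the same matrix element maps to a different position in a flat buffer under C-order and F-order, and what a conversion between the two involves. The class and method names are made up for this illustration and are not part of ND4J; the sketch only shows why an extra allocation and copy is needed when the result array has the wrong ordering.

```java
// Illustration only: hypothetical helper, not ND4J code.
final class MemoryLayout {

    /** C-order (row-major): the elements of one row sit next to each other. */
    static int offsetC(int row, int col, int rows, int cols) {
        return row * cols + col;
    }

    /** F-order (column-major): the elements of one column sit next to each other. */
    static int offsetF(int row, int col, int rows, int cols) {
        return col * rows + row;
    }

    /** Copies a C-ordered buffer into a freshly allocated F-ordered buffer. */
    static float[] toFOrder(float[] cOrdered, int rows, int cols) {
        float[] fOrdered = new float[rows * cols]; // the extra allocation
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                fOrdered[offsetF(r, c, rows, cols)] = cOrdered[offsetC(r, c, rows, cols)];
            }
        }
        return fOrdered;
    }

    public static void main(String[] args) {
        // The 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored row by row.
        float[] cOrdered = {1, 2, 3, 4, 5, 6};
        // Stored column by column, the same matrix becomes 1, 4, 2, 5, 3, 6.
        System.out.println(java.util.Arrays.toString(toFOrder(cOrdered, 2, 3)));
    }
}
```

For a 2x2 matrix the copy itself is trivial, but the allocation and bookkeeping around it happen on every single call, which is what hurts in a micro-benchmark.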

After changing the code to use F-ordered arrays it looked like this:

(defn bench-nd4j-mmuli-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized (int-array [m n]) \f)]
    (quick-bench
      (do (.mmuli ^INDArray m1 ^INDArray m2 ^INDArray result)
          true))))

When I ran it with a tiny matrix, it was a lot faster, about 5 times, but it was still about 2 times slower than running the same code directly from Java in my JMH benchmark suite. I’m not sure what causes this. But since we actually want to compare apples to apples, I decided to change the call itself: INDArray.mmuli performs some additional checks to support use cases that are not strictly matrix multiplication.

After checking Neanderthal’s source code to see if it would still be a fair comparison, I moved on to using Nd4j.gemm directly. It is the closest in actual functionality to Neanderthal’s mm! call. Both of them do some basic parameter checking before passing them on to MKL. In the case of ND4J it also enforces the ordering, as explained earlier. The following is the benchmark code that I ended up using:

(defn bench-nd4j-gemm-float [^long m ^long k ^long n]
  (let [m1 (Nd4j/rand m k)
        m2 (Nd4j/rand k n)
        result (Nd4j/createUninitialized (int-array [m n]) \f)]
    (quick-bench
      (do (Nd4j/gemm ^INDArray m1 ^INDArray m2 ^INDArray result false false 1.0 0.0)
          true))))

It turns out that this way of calling GEMM is just as fast when measured with Criterium (the benchmarking library that provides the quick-bench macro) as it is when measured with JMH.

I’ve also created a pull request for Neanderthal, so that the benchmark code there is closer to an apples-to-apples comparison.

First benchmarking results

Aside from this modification, I use Dragan’s original benchmark code, with Criterium as the benchmarking library, Neanderthal 0.19 and ND4J 1.0.0-beta. My computer is equipped with an Intel Core i7-6700K running at 4.6GHz and 32GB of RAM running at 2933MHz, and it uses Windows 10 as the operating system.

Since Windows likes doing Windows things in the background, I ran the benchmark 10 times for each matrix size, alternating between Neanderthal and ND4J, and averaged the numbers afterwards.

Library      Size       Time per Op (ns)  Diff vs Neanderthal
ND4J         2x2                     595  166 %
Neanderthal  2x2                     223
ND4J         4x4                     598  163 %
Neanderthal  4x4                     227
ND4J         8x8                     612  156 %
Neanderthal  8x8                     239
ND4J         16x16                   715  134 %
Neanderthal  16x16                   305
ND4J         32x32                  1312  69 %
Neanderthal  32x32                   774
ND4J         64x64                  4519  40 %
Neanderthal  64x64                  3208
ND4J         128x128               19288  18 %
Neanderthal  128x128               16285
ND4J         256x256              120588  1 %
Neanderthal  256x256              118917
ND4J         512x512              907426  3 %
Neanderthal  512x512              880935
ND4J         1024x1024           7119631  5 %
Neanderthal  1024x1024           6803776
ND4J         2048x2048          53491781  7 %
Neanderthal  2048x2048          49876333
ND4J         4096x4096         397762380  -8 %
Neanderthal  4096x4096         437036465
ND4J         8192x8192        3480873900  0 %
Neanderthal  8192x8192        3452838100

The table shows that from a matrix size of 256x256 upward, the two libraries perform within the margin of error of each other. But with smaller matrices, it is apparent that Neanderthal indeed has a lower overhead. The difference isn’t as large as Dragan found, and in absolute terms about 350ns to 400ns may seem insignificant, yet we should still try to get it down to the bare minimum. This is even more true if you consider that, for those tiny matrices, this overhead alone is larger than the total time that Neanderthal needs.
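For reference, the "Diff vs Neanderthal" column can be reproduced with a simple relative difference. The following Java sketch is my assumption of how such a column is computed, with made-up names; the results may differ from the table in the last digit, presumably because of rounding.

```java
// Illustration only: hypothetical helper for reading the benchmark tables.
final class RelativeDiff {

    /**
     * Percentage by which the candidate is slower (positive)
     * or faster (negative) than the baseline.
     */
    static double diffPercent(double candidateNs, double baselineNs) {
        return (candidateNs - baselineNs) / baselineNs * 100.0;
    }

    /** Absolute extra time per operation in nanoseconds. */
    static double overheadNs(double candidateNs, double baselineNs) {
        return candidateNs - baselineNs;
    }

    public static void main(String[] args) {
        // 2x2 row of the first table: ND4J 595 ns, Neanderthal 223 ns.
        System.out.println("2x2 diff %:      " + diffPercent(595, 223));
        System.out.println("2x2 overhead ns: " + overheadNs(595, 223));
        // 4096x4096 row: ND4J is faster here, so the result is negative.
        System.out.println("4096x4096 diff %: " + diffPercent(397762380L, 437036465L));
    }
}
```

Looking at the absolute overhead next to the percentage is useful for the small sizes: 372ns extra on a 223ns call is a large relative difference, while the same 372ns would vanish in the noise at 256x256.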

Investigating the source of added overhead

To find out where some of that latency was hiding, I used an even lower-level way of calling GEMM from Java. JavaCPP provides the bindings to the native libraries, and since those bindings are public static methods, they can also be called directly. Doing so tells us whether the additional latency originates on the Java side or on the native side. The result: 231 ± 4 ns per operation, which looks very much like it is within the margin of error of Neanderthal. The additional latency has to be on the Java side.

With those numbers in hand, @raver119 dug into the code and started investigating what might be causing it. He found one reason, and the fix has already landed on master and is therefore available in SNAPSHOT releases.

Repeating the Benchmark

With that change in place, I wanted to repeat the benchmark. Now something weird happened: using the Criterium-based benchmark code, both Neanderthal and ND4J were now 2 times slower than before. I changed back to the old version to make sure it wasn’t due to the change in ND4J, but it stayed this way.

Interestingly, my own benchmarks with JMH didn’t suffer from this, so I set out to port the Clojure benchmark code to Java. This was the first time I used the interop in this direction, calling Clojure from Java. While using Java from Clojure is quite a breeze, the other way around is pretty ugly as long as there is no specialized API around it. Anyway, I marched on and figured out how to do it (for more comments on this see Oddities).

Using the numbers that I originally collected, I validated that Neanderthal was still as fast as it was in the Criterium-based benchmark. The following table shows the results using JMH as the benchmarking framework, Neanderthal 0.19 and ND4J 1.0.0-SNAPSHOT.

Library      Size       Time per Op (ns)  Diff vs Neanderthal
ND4J         2x2                     309  32 %
Neanderthal  2x2                     234
ND4J         4x4                     319  31 %
Neanderthal  4x4                     243
ND4J         8x8                     322  29 %
Neanderthal  8x8                     249
ND4J         16x16                   420  31 %
Neanderthal  16x16                   320
ND4J         32x32                  1005  5 %
Neanderthal  32x32                   958
ND4J         64x64                  3786  29 %
Neanderthal  64x64                  2925
ND4J         128x128               18816  19 %
Neanderthal  128x128               16683
ND4J         256x256              104342  -3 %
Neanderthal  256x256              108048
ND4J         512x512              775124  1 %
Neanderthal  512x512              765648
ND4J         1024x1024           6534687  8 %
Neanderthal  1024x1024           6031096
ND4J         2048x2048          44854846  6 %
Neanderthal  2048x2048          42211136
ND4J         4096x4096         317275196  -1 %
Neanderthal  4096x4096         319272117
ND4J         8192x8192        2783549850  8 %
Neanderthal  8192x8192        2571527782

We can see that, especially for very small matrices, the gap has closed a lot. Neanderthal still wins here, though, and is still about 30% faster when the overhead dominates the actual calculation. So we still have to look for ways to reduce our overall overhead some more.

For larger sizes, as could already be seen in the first benchmark, the difference isn’t that clear. While preparing this post, I’ve seen the numbers fluctuate by about 10% in either direction, so anything within 10% of each other is a draw for me at the moment.

Oddities

While preparing this blog post I ran into several odd behaviors.

One is the Criterium benchmark seemingly getting slower over time, even though I restarted the JVMs. Only after several reboots did it go back to normal behavior. I’m stumped as to why that may happen.

Then there were those 10% swings for both libraries, even when a benchmark was run for many iterations. I may run a benchmark for quite a few iterations and get a number on which JMH is rather certain, showing a pretty low standard deviation, but once I repeat it I get a swing in either direction, again with a reported low standard deviation.

I guess that increasing the iteration time could reduce those swings a lot. But, given that I don’t want to compromise on some of the other options (i.e. using at least 2 forks, and using at least 10 benchmark iterations), using 1 minute for each benchmark would require over 8 hours of benchmarking. And since I’ve seen irregularities even with that on the larger sizes as well, I’d probably have to up that to 5 minutes, which would take almost two whole days to finish the benchmark for all sizes. Therefore, I’ll be content with saying that everything within 10% of each other is close enough to be considered a draw.
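The time estimates above follow from simple multiplication. Here is a back-of-the-envelope sketch in Java, under my own assumptions: 13 matrix sizes (2x2 through 8192x8192) for each of the 2 libraries, which gives 26 benchmarks, with warm-up iterations and JVM startup ignored. The names are made up for this illustration.

```java
// Illustration only: rough estimate of the total measurement time.
final class BenchmarkBudget {

    /** Total measurement time in minutes, ignoring warm-up and JVM startup. */
    static long totalMinutes(int benchmarks, int forks, int iterationsPerFork, int minutesPerIteration) {
        return (long) benchmarks * forks * iterationsPerFork * minutesPerIteration;
    }

    public static void main(String[] args) {
        int benchmarks = 13 * 2; // assumed: 13 sizes for each of the 2 libraries

        long oneMinute = totalMinutes(benchmarks, 2, 10, 1);   // 520 minutes
        long fiveMinutes = totalMinutes(benchmarks, 2, 10, 5); // 2600 minutes

        System.out.println("1 min per iteration: " + oneMinute / 60.0 + " hours");
        System.out.println("5 min per iteration: " + fiveMinutes / 60.0 + " hours");
    }
}
```

With these assumptions, 1 minute per iteration comes to 520 minutes, a bit under 9 hours, and 5 minutes per iteration comes to 2600 minutes, a little over 43 hours.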

And I’m not even sure I could run the whole benchmark for that long. Originally, I wanted to use at least 10 seconds for each benchmark iteration, but my computer crashes during the Neanderthal 64x64 benchmark if it runs for longer than 5 seconds. I suspect the mild overclock, since it seems to cope better once I go back to stock speeds, but I haven’t had any such issue with ND4J at that same size, or at any other time during the 3 years that I’ve had this computer.

I found yet another oddity while trying to make Neanderthal work in JMH. I ran into the issue that Clojure couldn’t cast an AOT-compiled version of its pretty printer to itself, and the results that Google turned up didn’t really help. In the end, I simply removed the AOT-compiled version from the uberjar with a Maven configuration option, and that resolved the issue.

Repeating my benchmarks

I’ve updated my benchmarking_nd4j repository to contain everything that I’ve used for the second round of benchmarks. If you want to repeat them on your own machine, you can clone the repository:

git clone https://github.com/treo/benchmarking_nd4j.git

Build an uberjar:

mvn clean package

And run it:

java -jar target/benchmarks.jar -f2 -i10 -wi 2 Neanderthal

This invocation will use 2 forks, 10 iterations per fork and 2 warm-up iterations. It uses the default iteration time of just one second. If you pass a name fragment, JMH will only run the benchmarks whose names match it. If you leave it out, it will run all benchmarks within the repository, which can take a considerable amount of time.

For more options you can run it as follows to print its help screen:

java -jar target/benchmarks.jar -h

Also, please note that since I’m using a SNAPSHOT version here, I’m not using the ND4J -platform artifact. For this reason, the jar will not work if you copy it to a machine with a different operating system or CPU architecture.

Conclusion

I’m very grateful to Dragan for conducting his benchmarks on both ND4J and Neanderthal. The investigation that they started has already borne fruit: we have found and fixed some issues, as you can see in the second benchmark.

And while the difference even in the first benchmark wasn’t as dramatic once the result array ordering is properly set, it has still shown that Neanderthal indeed has a very low overhead and that to get the full performance out of ND4J you should know what you are doing.

There are still some points where ND4J could lose some more overhead, and we are investigating them, so I’m looking forward to repeating these benchmarks as soon as we have them figured out as well.