Benchmarking ND4J and Neanderthal – Part 2

Dragan Djuric, the creator of Neanderthal, has conducted another series of benchmarks comparing ND4J and Neanderthal. This time the operation under test is a chain of matrix multiplications.

He claims that in one of the benchmarks Neanderthal is 1000 times faster. This piqued my interest again since those kinds of differences don’t magically show up – especially not in a library as widely used as ND4J.

In this post I investigate the source of this difference, show why it happens and what to do about it as a user of the library.

This is a follow up post to Benchmarking ND4J and Neanderthal, so for some more context read that post as well.

Comparing Apples and Pomelos

ND4J does not have a dedicated mm method like Neanderthal since chaining matrix multiplications can be easily achieved by chaining multiple mmul calls. That is also what Dragan did while trying to simulate mm for his benchmark. The following code snippet shows how he implemented it for his benchmark.

( defn bench-nd4j-mmul [ m k1 k2 k3 k4 k5 n ] ( with-release [ m1 ( Nd4j/rand ^ int m ^ int k1 ) m2 ( Nd4j/rand ^ int k1 ^ int k2 ) m3 ( Nd4j/rand ^ int k2 ^ int k3 ) m4 ( Nd4j/rand ^ int k3 ^ int k4 ) m5 ( Nd4j/rand ^ int k4 ^ int k5 ) m6 ( Nd4j/rand ^ int k5 ^ int n )] ( quick-bench ( do ( .mmul ^ INDArray m1 ( .mmul ^ INDArray m2 ( .mmul ^ INDArray m3 ( .mmul ^ INDArray m4 ( .mmul ^ INDArray m5 ^ INDArray m6 )) ))) true ))))

Translated to Java this invocation would look like this:

m1 . mmul ( m2 . mmul ( m3 . mmul ( m4 . mmul ( m5 . mmul ( m6 )))))

This kind of invocation will result in the computation going backwards, as if you literally had set the parenthesis in the multiplication like this: M = A(B(C(D(EF)))) .

Normally you wouldn’t do it like that. Instead it would probably look like this:

m1 . mmul ( m2 ). mmul ( m3 ). mmul ( m4 ). mmul ( m5 ). mmul ( m6 )

This second way of doing things is like setting the parenthesis for the multiplication like this: M = ((((AB)C)D)E)F .

The big difference between those two ways is how the temporary results are computed. Both ways of setting parenthesis are allowed because matrix multiplication is associative. Both of them arrive at the same result, but the matrix sizes at each different step may differ widely.

1000x difference?

Dragan has found Neanderthal to be 80 times and 1000 times faster in his non-uniform size benchmarks. Given what we now know about his implementation, and how it would be normally done, lets take a look how the differences play out.

In the following examination, I am using these matrix shapes, as they are what Dragan used in his benchmarks:

A: 13x13456 B: 13456x2300 C: 2300x125 D: 125x7700 E: 7700x810 F: 810x100000

During the calculation of M = A(B(C(D(EF)))) , i.e. the way Dragan implemented it, the computation carries through the column size of the last matrix.

Computation Step Result shape Required operations Size of result EF 7700x100000 1246 GFLOP 2937 MB D(EF) 125x100000 192 GFLOP 47 MB C(D(EF)) 2300x100000 57 GFLOP 877 MB B(C(D(EF))) 13456x100000 6188 GFLOP 5133 MB A(B(C(D(EF)))) 13x100000 34 GFLOP 4,95 MB

Now, lets take a look at how the more common usage would look like, i.e. using M = ((((AB)C)D)E)F

Computation Step Result shape Required operations Size of result AB 13x2300 0,805 GFLOP 0,11 MB (AB)C 13x125 0,007 GFLOP 0,006 MB ((AB)C)D 13x7700 0,025 GFLOP 0,38 MB (((AB)C)D)E 13x810 0,162 GFLOP 0,04 MB ((((AB)C)D)E)F) 13x100000 2,105 GFLOP 4,95 MB

As you can see, this way of structuring the order of the computation results with the number of rows of the first matrix being carried through the computation. The end result is still the same.

The difference in the required number of floating point operations to perform the actual computation is enormous. The first way requires a total of 7719 GFLOP, while the second one requires only about 3,1 GFLOP, with pretty much all the time spent on just the last multiplication.

Just the order of those operations results already in a huge difference, but there is even more to it. Once you take into consideration that each matrix at each step is also not just large, but absolutely enormous when compared to each other, you see how caching efficiency and memory access costs will dwarf even the huge difference in required operations.

So why do we see only a 1000x difference in speed? Looking at the GFLOPs alone it should be closer to 2500x. The reason is that on large multiplications MKL actually uses multiple cores. As my machine has 4 cores, as Dragan’s probably does. However, this often doesn’t scale linearly, especially when memory access and synchronization between those cores has to be accounted for.

Comparing apples to apples again

Let’s use Clojure threading macro to both, make the benchmark easier to read, and turn into more of an apples to apples comparison:

( defn bench-nd4j-mmul [ m k1 k2 k3 k4 k5 n ] ( with-release [ m1 ( Nd4j/rand ^ int m ^ int k1 ) m2 ( Nd4j/rand ^ int k1 ^ int k2 ) m3 ( Nd4j/rand ^ int k2 ^ int k3 ) m4 ( Nd4j/rand ^ int k3 ^ int k4 ) m5 ( Nd4j/rand ^ int k4 ^ int k5 ) m6 ( Nd4j/rand ^ int k5 ^ int n )] ( quick-bench ( do ( -> ^ INDArray m1 ( .mmul ^ INDArray m2 ) ( .mmul ^ INDArray m3 ) ( .mmul ^ INDArray m4 ) ( .mmul ^ INDArray m5 ) ( .mmul ^ INDArray m6 )) true ))))

And now the results aren’t as far away from each other as they were in Dragan’s benchmark.

( bench-nd4j-mmul 13 13456 2300 125 7700 810 100000 )

Evaluation count : 18 in 6 samples of 3 calls. Execution time mean : 46,150162 ms Execution time std-deviation : 5,780911 ms Execution time lower quantile : 39,230463 ms ( 2,5%) Execution time upper quantile : 52,629633 ms (97,5%) Overhead used : 1,397597 ns

( bench-neanderthal-mm 13 13456 2300 125 7700 810 100000 )

Evaluation count : 24 in 6 samples of 4 calls. Execution time mean : 26,642012 ms Execution time std-deviation : 1,391461 ms Execution time lower quantile : 24,798310 ms ( 2,5%) Execution time upper quantile : 27,893699 ms (97,5%) Overhead used : 1,397597 ns

As you can see: it is still pretty much the same 2x difference that he found for uniform sizes. Or in his own words:

Not something to write the blog post about.

Repeating the benchmarks

I have again added those benchmarks to my own JMH benchmark suite and included some micro optimizations for the case when all matrices are uniformly sized and I’ve added a chained gemm invocation for good measure.

It shows that, on the current ND4J SNAPSHOT, the performance of mmul and gemm is pretty much equal. Since I didn’t run into an odd behavior of criterium (the benchmarking library used in Dragan’s benchmark) this time and my JMH benchmark results were pretty close to those shown above, I’m not adding another table of numbers here.

I’ve created another pull request for Neanderthal so its benchmarks can reflect the differences that I’ve found here as well.

Conclusion

Benchmarking is hard. Fair benchmarking is even harder. Again, the difference turns out to be not as large as originally claimed. As ND4J doesn’t have a specialized mm call that takes any number of matrices, you will, also again, have to understand what you are actually doing when composing matrix multiplications in order to get the most performance out of ND4J.

This is even more apparent when you consider other matrix size constellations. Change the benchmark to start with 100000 and end with 13 and you will get the large difference again – because now it would be better to use the way Dragan implemented it originally. Neanderthal’s mm call includes an order optimization step, which given any number of matrices will decide up front in which order those matrices should be multiplied.

The biggest user of ND4J is Deeplearning4J, and this kind of chain matrix multiplications aren’t required there. And as no one else has asked for this functionality yet, no one has cared to implement a specialized method like that. This results in a situation where it is easy to create a benchmark that will show huge differences.

I’m personally not a huge fan of the “you are holding it wrong” type of argument, so I’m pretty impressed by what Neanderthal does to make cases like this work out of the box. In addition to this I’d actually say that even a 2x difference is something to write a blog post about. Especially since this gives us yet another benchmark that we can use to improve ND4J’s speed up out of the box.

Just like the last time, I am glad that Dragan created his benchmarks and published his numbers, as they show us where additional functionality and documentation is missing, and how we can improve ND4J even further.