The network, as currently implemented and shown above, threads each input vector through the layers, transforming it with a series of matrices and non-linear activation functions to produce an output vector. The heart of the implementation is the matrix-vector product operation (mv!), which is carried out by an optimized high-performance library (Neanderthal, in this case).

If we have only one input instance to process, there is not much we can do to speed it up. However, we often have many input instances to process, either at once or within a short period. Instead of calling the network function for each instance in a loop, we can process the whole batch at once, using matrix functions that do the equivalent work in an optimized way.

Consider a matrix \(A\) and its columns. If we want to transform all of these columns by a matrix transformation \(T\), we can invoke a matrix-vector product (mv!) for each column of \(A\). As I discussed in matrix transformations, we can get the same result with a single matrix-matrix multiplication of \(T\) and \(A\). Matrix-matrix multiplication offers more room for hardware optimization than a loop of simpler matrix-vector multiplications, so it will be faster.
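This equivalence is easy to check numerically. Here is a small sketch in NumPy (used purely for illustration; the code in this article uses Neanderthal): transforming each column of \(A\) with its own matrix-vector product produces the same matrix as one matrix-matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 4))   # the transformation matrix
A = rng.standard_normal((4, 3))   # a batch of three column vectors

# Transform each column with a separate matrix-vector product...
column_by_column = np.stack([T @ A[:, i] for i in range(A.shape[1])], axis=1)

# ...or transform the whole batch with one matrix-matrix product.
whole_batch = T @ A

print(np.allclose(column_by_column, whole_batch))  # prints True
```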

Let's see the difference!

(def t (dge 1000 1000))
(def a (dge 1000 10000))
(def y (dv 1000))
(def b (dge 1000 10000))

(time (dotimes [i 10000] (mv! t (col a i) y)))

"Elapsed time: 987.553046 msecs"

(time (mm! t a b))

"Elapsed time: 102.086963 msecs"

For this particular size of input (1000) and batch (10000), on my CPU (i7-4790k), matrix-matrix multiplication is roughly 10 times faster than the equivalent loop of 10000 matrix-vector multiplications.

The exact performance difference depends on the hardware and the matrix dimensions. As a rule of thumb, the speed-up is higher for larger matrices. Consequently, since the input dimension of a network stays constant, we expect the per-vector speed of the network to improve with larger batch sizes.
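A rough way to see this scaling is to fix the input dimension and time one matrix-matrix product per batch size, dividing by the number of columns. The sketch below does this in NumPy (again, only for illustration; the exact numbers depend on your machine and BLAS build). On typical hardware the per-column cost tends to drop as the batch grows.

```python
import time
import numpy as np

n = 1000  # fixed input dimension of the network
T = np.random.standard_normal((n, n))

for batch in (1, 10, 100, 1000):
    A = np.random.standard_normal((n, batch))
    start = time.perf_counter()
    B = T @ A  # one matrix-matrix product for the whole batch
    elapsed = time.perf_counter() - start
    print(f"batch {batch:5d}: {1e6 * elapsed / batch:10.2f} microseconds per column")
```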

On the GPU, the difference is even more pronounced than on the CPU. We will explore this soon, when we add support for GPU computing.