Here I’ll compare the small-matrix-multiplication performance of a few popular libraries with that of PaddedMatrices.jl:

I will benchmark the operation $\textbf{C} = \textbf{A} \times \textbf{B}$, where $\textbf{C}\in\mathbb{R}^{M\times N}$, $\textbf{A}\in\mathbb{R}^{M\times K}$, and $\textbf{B}\in\mathbb{R}^{K\times N}$.

I’ll consider every combination of $M\in\{3,\ldots,32\}$ and $N\in\{3,\ldots,32\}$ with $K=32$, using a column-major data layout. When the matrices are small enough to avoid memory-bandwidth problems, as is the case here, $K$ should affect runtime in a perfectly linear fashion: because the outer loop should be over $K$, it should not affect the shape of the kernel, vectorization, or possible register spills.

See here for an introduction to matmul kernels.

We’re testing 900 functions in total: one for each of the $30 \times 30$ $(M, N)$ combinations.

So that all matrix sizes are known at compile time, letting these templated C++ libraries take advantage of this information to optimize the operations, I generate the C++ files programmatically:
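For illustration, here is a minimal sketch (hypothetical, not the actual generated code) of what one such fixed-size C++ function might look like: all three dimensions are non-type template parameters, storage is column-major, and the outer loop runs over $K$ as described above.

```cpp
#include <cstddef>

// Hypothetical generated kernel: C = A * B with all dimensions fixed at
// compile time. Column-major layout, so element (i, j) of an m-row matrix
// lives at index i + j * m.
template <int M, int N, int K>
void matmul(double* __restrict C, const double* __restrict A,
            const double* __restrict B) {
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m)
            C[m + n * M] = 0.0;
    // Outer loop over K: changing K only scales the trip count, leaving the
    // inner (M, N) kernel shape, and hence vectorization, unchanged.
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            for (int m = 0; m < M; ++m)
                C[m + n * M] += A[m + k * M] * B[k + n * K];
}

// The generator would emit one instantiation per (M, N) pair, e.g.:
void matmul_8x6x32(double* C, const double* A, const double* B) {
    matmul<8, 6, 32>(C, A, B);
}
```

With the sizes baked in as template parameters, the compiler can fully unroll and vectorize each instantiation, which is the advantage the templated C++ libraries are being given here.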