2016-11-12 Tavian Barnes

If you need to multiply some matrices together very quickly, usually it's best to use a highly optimized library like ATLAS. But sometimes adding such a dependency isn't worth it, if you're worried about portability, code size, etc. If you just need good performance, rather than the best possible performance, it can make sense to hand-roll your own matrix multiplication function.

Unfortunately, the way that matrix multiplication is usually taught:

C_{i,j} = \sum_k A_{i,k} \, B_{k,j}

$\Bigg($ $\Bigg) = \Bigg($ $\Bigg) \, \Bigg($ $\Bigg)$

leads to a slow implementation:

void matmul(double *dest, const double *lhs, const double *rhs, size_t rows, size_t mid, size_t cols) { for (size_t i = 0; i < rows; ++i) { for (size_t j = 0; j < cols; ++j) { const double *rhs_row = rhs; double sum = 0.0; for (size_t k = 0; k < mid; ++k) { sum += lhs[k] * rhs_row[j]; rhs_row += cols; } *dest++ = sum; } lhs += mid; } }

This function multiplies a rows × mid matrix with a mid × cols matrix using the "linear algebra 101" algorithm. Unfortunately, it has a bad memory access pattern: we loop over dest and lhs pretty much in order, but jump all over the place in rhs , since it's stored row-major but we need its columns.

Luckily there's a simple fix that's dramatically faster: instead of computing each cell of the destination separately, we can update whole rows of it at a time. Effectively, we do this:

C_{i} = \sum_j A_{i,j} \, B_j

$\Bigg($ $\Bigg) = \Bigg($ $\Bigg) \, \Bigg($ $\Bigg)$

In code, it looks like this:

void matmul(double *dest, const double *lhs, const double *rhs, size_t rows, size_t mid, size_t cols) { memset(dest, 0, rows * cols * sizeof(double)); for (size_t i = 0; i < rows; ++i) { const double *rhs_row = rhs; for (size_t j = 0; j < mid; ++j) { for (size_t k = 0; k < cols; ++k) { dest[k] += lhs[j] * rhs_row[k]; } rhs_row += cols; } dest += cols; lhs += mid; } }