In a nice stroke of luck, the rate of change of the cost with respect to the bias has already been calculated, since it is equal to the error itself!

\(\nabla_b C^l = \delta^l\) (3)

What's left is to update the bias. Hey, but the bias is a vector, while the error is a matrix… How do I subtract a matrix from a vector? Do I need some sort of broadcast? No!

The error is a matrix only because we are processing the whole batch of samples at once, instead of one by one. We want to shrink many vector updates into one. Broadcasting would mean "expanding" the vector to fit a matrix; here, we are going in the opposite direction. The way to do it is to average the updates that the columns carry.

Each entry of the resulting vector should be the average of its respective row.

One naive way to implement this is to write a loop (map, fmap, or a low-level loop/recur) and call sum on each row. I guess that this idea popped up immediately in the minds of most readers.
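As a rough sketch (not the implementation we will keep), that naive approach could look something like this, assuming Neanderthal's rows, sum, and ncols functions; the matrix values are made up for illustration:

```clojure
(require '[uncomplicate.neanderthal.core :refer [rows ncols sum]]
         '[uncomplicate.neanderthal.native :refer [dv dge]])

;; Naive baseline: loop over the rows and average each one.
(defn row-averages [m]
  (dv (map (fn [r] (/ (sum r) (ncols m))) (rows m))))

;; The rows of this column-major 2x3 matrix are [1 3 5] and [2 4 6].
(row-averages (dge 2 3 [1 2 3 4 5 6]))
;; => a vector with entries [3.0 4.0]
```

It works, but it loops in Clojure and creates intermediate structures, giving up the vectorized operations we have relied on so far.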

There is, of course, an easier and faster way to do it with matrix operations. Recall what the matrix-vector product does: it multiplies a matrix by a vector, and the result is a vector.

This "structure" matches the problem we are dealing with. But, how does that help with summing the numbers? Luckily, each entry in the resulting vector is a dot product of a matrix row and the other vector, \(e = \sum\limits_{j=n} r_j x_j\). Now, imagine that \(x_j\) is always one, and we have our sum: \(e = \sum\limits_{j=n} r_j\)

Cool; if we only had a vector of ones, that would be one call to mv!. Maybe you have forgotten, since it was half a dozen articles ago, but we needed a vector of ones to implement broadcasting in the forward pass. That vector is still here! Problem solved; the third equation can be added to our implementation.

```clojure
...
(backward [_]
  (mul! (prime activ-fn z) a)              ;; from the last article (1 and 2)
  (mm! 1.0 (trans w) z 0.0 a-1)            ;; (2)
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b))  ;; (3)
```

Note that mv! does not only compute the matrix-vector product; it can also scale the product and add it to the destination vector. We have just saved ourselves some memory, since we can fuse the calculation of the bias change and its subtraction from the bias into one operation that does not need a place to store the intermediate result.
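Here is a standalone illustration of that fused update, with stand-in values for z and b (not the network's actual state): mv! computes b := alpha * z * ones + beta * b in place.

```clojure
(require '[uncomplicate.neanderthal.core :refer [mv! dim]]
         '[uncomplicate.neanderthal.native :refer [dv dge]])

(let [z    (dge 2 3 [1 2 3 4 5 6]) ;; stand-in error matrix, batch of 3
      ones (dv 1 1 1)
      b    (dv 10 10)]             ;; stand-in bias
  ;; b := (-1/3) * z * ones + 1.0 * b; the row sums are [9 12],
  ;; so b becomes [10 - 3, 10 - 4].
  (mv! (/ -1.0 (dim ones)) z ones 1.0 b))
;; => a vector with entries [7.0 6.0]
```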