Similarly to how we implemented weight decay, we will look at the basic equations of backpropagation, compare them to how velocity is tracked, and see whether momentum fits into the existing implementation. If it does, we can implement momentum at no additional computational cost!

The \(\nabla{C_0}\) has been calculated as usual, and stored in the matrix v in the FullyConnectedTraining type.

To implement momentum, we have to do two things: first, update the velocity with the gradient, and then add the updated velocity to the weights. The updated velocity then sits in place, waiting for the next cycle. When updating the velocity, we assume that the old values should be taken into account less and less; therefore, the old velocity is multiplied by mu, which is expected to be a number between 0 and 1.

\(v \rightarrow v' = \mu v - \eta \nabla{C}\) (1)

\(w \rightarrow w' = w + v'\) (2)
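
To make the equations concrete, here is a minimal, self-contained sketch of (1) and (2) on plain Neanderthal vectors. All values here are hypothetical, chosen only for illustration; in the real network, v and the gradient are matrices living in FullyConnectedTraining.

(require '[uncomplicate.neanderthal.core :refer [axpby! axpy!]]
         '[uncomplicate.neanderthal.native :refer [dv]])

(let [mu 0.9                 ;; hypothetical momentum coefficient
      eta 0.05               ;; hypothetical learning rate
      w (dv 0.5 -0.3)        ;; weights
      v (dv 0.0 0.0)         ;; velocity, starting at zero
      grad (dv 0.2 -0.1)]    ;; stand-in for the computed gradient
  (axpby! (- eta) grad mu v) ;; (1) v' = mu * v - eta * grad
  (axpy! 1.0 v w))           ;; (2) w' = w + v'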

You'll notice that our existing implementation already covers the second part. The weight update already adds v to w (including the weight decay from the previous article).

(axpby! 1.0 v (inc (* eta-avg (double lambda))) w) ;; w' = v + (1 + eta-avg * lambda) * w

The current implementation simply erases the old values of v when updating gradients, in effect calculating \(v \rightarrow v' = 0 \cdot v - \eta \nabla{C}\). The only change we need to make is to multiply v by mu instead of by zero!

We change this expression:

(mm! eta-avg z (trans a-1) 0.0 v) ;; v = eta-avg * z * a-1^T + 0.0 * v (old v is erased)

Into this:

(mm! eta-avg z (trans a-1) mu v) ;; v = eta-avg * z * a-1^T + mu * v (old v is kept, scaled by mu)
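
If you want to see the role of the beta argument in isolation, here is a tiny, self-contained check with hypothetical dimensions and values; it shows that a non-zero beta makes mm! accumulate into the destination matrix instead of overwriting it:

(require '[uncomplicate.neanderthal.core :refer [mm! trans]]
         '[uncomplicate.neanderthal.native :refer [dge]])

(let [z (dge 2 1 [1 2])            ;; 2x1 stand-in for the layer's z
      a-1 (dge 3 1 [1 0 1])        ;; 3x1 stand-in for the previous activation
      v (dge 2 3 (repeat 6 10.0))] ;; old velocity, all tens
  ;; each entry of v becomes 0.5 * (z * a-1^T) + 0.9 * 10.0,
  ;; so the old velocity survives, scaled by beta = 0.9
  (mm! 0.5 z (trans a-1) 0.9 v))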

We already had an implementation of momentum in place; we just need to switch it on! Note that this does not clash with the existing implementation of weight decay: decay scales w through the beta argument of axpby!, while momentum scales v through the beta argument of mm!, so the two never touch the same coefficient. Way to go, Clojure :)

The update to FullyConnectedTraining consists of adding mu to the second argument of the backward method, and changing 0.0 to mu in the appropriate mm! call.
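
For reference, here is a hedged sketch of how the relevant part of the updated backward method could look. It assumes, as in the earlier articles in this series, that the hyperparameters arrive bundled in a vector [eta lambda mu] and that eta-avg folds the minus sign and the batch size (dim ones) into the learning rate; the bias update and the propagation of the error to the previous layer are omitted.

(backward [_ [eta lambda mu]]
  (let [eta-avg (- (/ (double eta) (dim ones)))]
    ;; (1) velocity update: v' = mu * v + eta-avg * (z * a-1^T)
    (mm! eta-avg z (trans a-1) (double mu) v)
    ;; (2) weight update with decay: w' = v' + (1 + eta-avg * lambda) * w
    (axpby! 1.0 v (inc (* eta-avg (double lambda))) w)))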