Weight decay is a form of regularization, one of the techniques for reducing overfitting and improving generalization. This particular example cannot demonstrate overfitting, since the function we're learning can give us plenty of useful training data, so I will discuss overfitting once we have built up this infrastructure a bit more and tackle tougher problems.

On the technical level, weight decay reduces overfitting by keeping the weight values from growing too large, and that control over weight growth is exactly what we need to make this example work at all.

As we discussed in an earlier article, Learning and Backpropagation, the learning algorithm we are using is centered around the quest to minimize the cost function. So far we have been using the quadratic cost function, so I'll base the discussion on that same formula. In general, though, both the base cost function and the weight regularization term can be built from other formulas.

Here's the existing cost function that we are minimizing.

\(C(w,b) \equiv \frac{1}{2n} \sum\limits_x \lVert \mathbf{y}_x - \mathbf{a}_x \rVert ^2\)

The idea of weight decay is to add to this function a component that depends on the weights.

For example, something like this:

\(C(w,b) \equiv \frac{1}{2n} \sum\limits_x \lVert \mathbf{y}_x - \mathbf{a}_x \rVert ^2 + \frac{\lambda}{2n}\sum\limits_w w^2\)

More generally, if we denote the vanilla cost function by \(C_0\) and the weight component by \(C_w\), the regularized cost function \(C\) is computed as \(C = C_0 + C_w\). When \(C_w = \frac{\lambda}{2n}\sum\limits_w w^2\), we are doing L2 regularization.
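To make this concrete, here is a minimal sketch in Python of how the regularized cost could be computed. The function name and the assumption that the weights are stored as a list of NumPy matrices are mine, not something taken from the code in this series:

```python
import numpy as np

def regularized_cost(base_cost, weights, lmbda, n):
    """Return C = C_0 + C_w for L2 regularization.

    base_cost -- the vanilla quadratic cost C_0 over the training set
    weights   -- list of NumPy weight matrices, one per layer (assumed layout)
    lmbda     -- the regularization coefficient lambda
    n         -- the number of training examples
    """
    # C_w = (lambda / 2n) * sum of the squares of all the weights
    l2_penalty = (lmbda / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return base_cost + l2_penalty
```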

The value of \(C\) is always larger than the value of \(C_0\), since we are assuming that at least some weights are different from zero and we are summing their squares. Intuitively, \(C_w\) gets smaller as the absolute values of the weights get smaller, so this component favors small weights, ideally zeros. On the other hand, the \(C_0\) component favors weights that minimize the base cost, which are typically different from zero. The complete cost function balances these two tendencies, with the balance weighted by the coefficient \(\lambda\).

The question now is how this change in the cost function affects the backpropagation algorithm and the derivatives. Fortunately, the derivative of a sum is the sum of the derivatives. This is wonderful news: our backpropagation implementation can stay the same, and we only need to add the derivative of the penalty term. The gradient with respect to the weights is now:

\(\nabla{C} \equiv (\frac{\partial{C}}{\partial{w_1}},\frac{\partial{C}}{\partial{w_2}},\dots) = \nabla{C_0} + \frac{\lambda}{n} w\)
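The \(\frac{\lambda}{n} w\) term is simply the derivative of the penalty: for a single weight \(w\), \(\frac{\partial}{\partial w} \left( \frac{\lambda}{2n} w^2 \right) = \frac{\lambda}{n} w\), and all the other squared weights in the sum vanish when we differentiate with respect to \(w\).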

The gradient with respect to the biases stays the same, since the added penalty term does not depend on the biases.
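In gradient descent terms, the only change is an extra decay term in the weight update. Here is a minimal sketch under the same assumptions as above (weights as a list of NumPy matrices, eta as the learning rate, and grads_C0 as the gradients that the unchanged backpropagation already gives us; the names are illustrative):

```python
def update_weights(weights, grads_C0, eta, lmbda, n):
    """One gradient descent step with L2 weight decay.

    w <- w - eta * (grad_C0 + (lambda / n) * w)
      =  (1 - eta * lambda / n) * w - eta * grad_C0
    """
    return [(1 - eta * lmbda / n) * w - eta * g
            for w, g in zip(weights, grads_C0)]
```

The rewritten form also shows where the name comes from: each update first shrinks, or decays, every weight by a factor slightly below one, and only then applies the usual gradient step.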