Have you ever looked at the L² regularization term of a neural network’s cost function and wondered why it is scaled by both 2 and m?

Equation 1: An L²-regularized version of the cost function used in SGD for neural networks
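In other words, the regularized cost in question looks roughly like

J(W) = (1/m)∙Σ L(ŷ, y) + (λ/2m)∙Σ ∥W∥²,

where the first sum runs over the m training examples and the second over the weight matrices of the network’s layers: the average loss, plus the squared norms of the weights scaled by both 2 and m (and by the regularization strength λ).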

You may have encountered it in one of the numerous papers that use it to regularize a neural network model, or when taking a course on the subject of neural networks. Surprisingly, when the concept of L² regularization is presented in this context, the term is usually introduced along with these factors without further explanation.

I recently encountered the term again when working through the Improving Deep Neural Networks course (the 2nd course in Coursera’s excellent Deep Learning Specialization by Andrew Ng/deeplearning.ai), where indeed no explanation was given for these scaling factors, so I set out to search the interwebs. The following mini-post is a summary of what I learned during that search.

A Reminder: L² regularization for Gradient Descent in Neural Networks

Just to make sure we are all on the same page, here is a brief recap of what L² regularization is in the context of Stochastic Gradient Descent in Neural Networks.

Generally in machine learning, when we fit our model we search the solution space for the most fitting solution; in the context of neural networks, the solution space can be thought of as the space of all functions our network can represent (or more precisely, approximate to any desired degree). We know that the size of this space depends on (at least) the depth of the network and the activation functions used. We also know that with at least one hidden layer using a “squashing”¹ activation function, this space is already very large (this is the gist of the universal approximation theorem), and that by several measures it grows exponentially with the depth of the network.

When we are using Stochastic Gradient Descent (SGD) to fit our network’s parameters to the learning problem at hand, we take, at each iteration of the algorithm, a step in the solution space in the direction opposite to the gradient of the loss function J(θ; X, y) with respect to the network’s parameters θ. Since the solution space of deep neural networks is very rich, this method of learning might overfit to our training data. If no counter-measure is used, this overfitting may result in significant generalization error and bad performance on unseen data (or test data, in the context of model development). Those counter-measures are called regularization techniques.
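To have the notation in front of us: for a single training example (x, y) and learning rate α, that step is simply

θ ← θ − α∙∇J(θ; x, y),

so everything that follows about regularization amounts to changing what goes into J, and therefore into ∇J.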

Figure 1: An unregularized model can overfit to noise/outliers in the training data

“Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

— Ian Goodfellow

There are several regularization techniques in the context of iterative learning algorithms in general, such as early stopping, and of neural networks in particular, e.g. dropout. A common method in statistics and machine learning is to add a regularization term to the loss function, meant to incorporate a measure of model complexity into the function being minimized. This method is unique neither to iterative learning algorithms nor to neural networks; it relies only on the common formalization of numerous learning algorithms as optimization problems.

Now, instead of just searching the solution space for the solution with the minimal loss over the training set, we are also taking into account the simplicity of the solution. This increases the chance that a simpler, and thus a more generalizable, solution will be selected while retaining a low error on the training data. In the words of Tim Roughgarden, we become “biased toward simpler models, on the basis that they are capturing something more fundamental, rather than some artifact of the specific data set”.

Figure 2: Applying regularization can help prevent overfitting on the training data

Now that we have talked about generalization error and regularization in general, let’s go back to L² regularization. This technique, also known as Tikhonov regularization and, in statistics, as ridge regression, is a specific way of regularizing a cost function with the addition of a complexity-representing term. In the case of L² regularization in neural networks, the term is simply the squared Euclidean norm² of the weight matrix of each hidden layer (summed over all such layers in the case of multiple hidden layers, and including the output layer). An additional hyper-parameter, λ, is added to allow control over the strength of the regularization.
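To make the bookkeeping concrete, here is a minimal NumPy sketch of how such a term is typically added to a cost; the function and variable names (l2_penalty, data_loss and so on) are made up for illustration, not taken from any particular framework:

```python
import numpy as np

def l2_penalty(weight_matrices, lam, m):
    """Sum of squared L2 (Frobenius) norms of all weight matrices,
    scaled by lambda / (2 * m) as in Equation 1."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weight_matrices)

def regularized_cost(data_loss, weight_matrices, lam, m):
    """data_loss is assumed to be the average loss over the m training examples."""
    return data_loss + l2_penalty(weight_matrices, lam, m)

# Toy usage with two made-up weight matrices and a made-up average loss.
W1, W2 = np.random.randn(4, 3), np.random.randn(3, 1)
print(regularized_cost(data_loss=0.7, weight_matrices=[W1, W2], lam=0.1, m=1000))
```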

Adding the L² term usually results in much smaller weights across the entire model. Other types of term-based regularization might have different effects; e.g., L¹ regularization results in sparser solutions, where more parameters will end up with a value of zero.

Regularization, both with L¹ and L², also has a beautiful probabilistic interpretation: it is equivalent to adding a prior over the distribution of the weight matrix W; a Gaussian prior in the case of L², and a Laplace prior in the case of L¹. This transforms the optimization problem from performing maximum likelihood estimation (MLE) to performing maximum a posteriori (MAP) estimation; i.e. moving from frequentist inference to Bayesian inference. A thorough overview of this interpretation can be found in a great post by Brian Keng.
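To sketch the L² case: placing an isotropic Gaussian prior p(W) ∝ exp(−α∙∥W∥²/2) on the weights and maximizing the posterior p(W | X, y) ∝ p(y | X, W)∙p(W) is the same as minimizing

−log p(y | X, W) + (α/2)∙∥W∥² + const,

i.e. the original loss plus an L² penalty; a Laplace prior p(W) ∝ exp(−α∙∥W∥₁) yields the L¹ term in the same way.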

Going back to L² regularization, we end up with a term of the form

Equation 2: Adding L² regularization to a cost function

where ∥∙∥ is the L² norm. This is, indeed, the form we encounter in classical Tikhonov regularization.
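That is, without any extra scaling factors, the regularized cost is simply J(θ) + λ∙∥w∥², with a single knob, λ, controlling how strongly large weights are penalized.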

So where did the division by m and 2 come from? Halving the term seems especially redundant here, as it amounts only to division by a constant, and thus should make no difference if we also search for the optimal values of hyper-parameters such as λ, as we usually do. Indeed, in classical machine learning the same regularization term can be encountered without either of these factors.

A possible answer is that L² regularization probably entered the field of deep learning through the introduction of the related, but not identical, concept of weight decay.

Weight decay

The idea of weight decay is simple: to prevent overfitting, every time we update a weight w with the gradient ∇J with respect to w, we also subtract from it λ∙w. This gives the weights a tendency to decay towards zero, hence the name.
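Written out, a weight update with decay looks something like w ← w − α∙∇J − λ∙w, with α the learning rate; equivalently, each weight is first shrunk by a constant factor (1 − λ) and only then updated with the usual gradient step.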

This is actually quite an early concept in the history of deep learning. Simplifying neural networks by soft weight sharing, a 1992 paper by Nowlan and Hinton, casually attributes weight decay to a paper from June 1986 named Experiments on Learning by Back Propagation, by Plaut, Nowlan and Hinton. Since none of the few papers referenced there seems to use the concept, this might actually be the place where it was introduced in the context of neural networks. Even Hanson & Pratt (1988) seem to suggest in a footnote that they could not find, at the time, a published paper to cite in order to reference this concept.

Equation 3: Weight decay for neural networks

When looking at regularization from this angle, the common form starts to become clear. To get this term subtracted in the weight update, we “hijack” the cost function J and add a term whose gradient is exactly λ∙w; that term is, of course, 0.5 λ∙w². The gradient of J + 0.5 λ∙w² is thus ∇J + λ∙w, and the gradient descent step subtracts both ∇J and λ∙w from the weight, which is what we aimed for.

Equation 4: L² regularization from a weight decay perspective
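In symbols, the hijacked cost is J̃ = J + 0.5∙λ∙w², whose derivative with respect to w is ∇J + λ∙w; minimizing J̃ is therefore a compromise between minimizing the original cost and keeping the weights small.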

Plaut et al. already point to this relation to the L² norm themselves in the aforementioned paper:

“One way to view the term h∙w is as the derivative of 0.5 h∙w² , so we can view the learning procedure as a compromise between minimizing E (the error) and minimizing the sum of the squares of the weights.”

Indeed, L² regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate). This is not necessarily true for all gradient-based learning algorithms, and was recently shown to not be the case for adaptive gradient algorithms, such as Adam.
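Concretely, for SGD with learning rate α, one step on the L²-regularized cost J + (λ′/2)∙w² gives w ← w − α∙∇J − α∙λ′∙w = (1 − α∙λ′)∙w − α∙∇J, which is exactly weight decay with decay rate λ = α∙λ′; this is the sense in which the two are equivalent up to a rescaling by the learning rate.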

Why divide by m?

So, we are still left with the question of the division by m. After all, proper hyper-parameter optimization should also be able to handle changes in scale (at least in theory).

Option 1: A side-effect of batch gradient descent

Let’s look at this from the point of view of weight decay. In the stochastic variant of gradient descent (SGD), we evaluate the gradient of the loss function (with respect to the parameters θ) over a single training example at a time. Weight decay was originally introduced at that per-example level. When a single iteration of gradient descent is instead formalized over the entire training set, resulting in the algorithm sometimes called batch gradient descent, the whole cost is scaled by 1/m to make it comparable across datasets of different sizes, and that scaling factor gets automatically applied to the weight decay term as well.
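One way to write this out: the penalty is paid once per update in per-example SGD, so its natural batch analogue is a single penalty added to the summed cost, Σ Lᵢ + (λ/2)∙∥w∥²; dividing the whole expression by m then gives (1/m)∙Σ Lᵢ + (λ/2m)∙∥w∥², and the 1/m on the regularization term falls out for free.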

Option 2: Rescale to the weight of a single example

A second way of framing this, which I encountered in an answer on Cross Validated by user grez, is that this kind of scaling makes sure that the effect of the regularization term on the loss function corresponds to roughly that of a single training example. This ensures that the de facto effect of regularization does not explode as the amount of data increases, which might explain why this scaling factor started to show up specifically when SGD was used for neural networks, which saw their resurgence in the era of big data.

Thinking of the term from the perspective of a single example, consider the regularization term without such scaling: a term that might have had the weight of 1,000 examples in some of the first problems tackled by learning algorithms suddenly gets the same weight as 10,000,000 examples on each and every iteration of the algorithm in the era of big datasets. This argument also makes sense, in a way, from an information-theoretic perspective: that is a lot of weight to give to the assumption we mentioned earlier, that simpler models capture something fundamental, and it becomes perhaps too strong an assumption when training sets get really big.
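Put differently, with the 1/m scaling the full cost can be read as (1/m)∙[Σ Lᵢ + (λ/2)∙∥w∥²]: the penalty enters the average with the same 1/m weight as any single training example, behaving roughly like one extra pseudo-example whose “loss” is (λ/2)∙∥w∥², whether m is a thousand or ten million.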

Option 3: Training set representativeness

I’d also like to suggest a statistical point of view on the question. We can think of our training set as a sample of some unseen distribution of unknown complexity. Yet, whatever the complexity of this unseen distribution, it is always true that as our training set grows, so does the probability that it is a representative sample of this unseen source distribution. As a result, it also becomes more representative of any large enough sample of future unseen data (or a test set). In short, the more representative the training set, the less overfitting there is to be done.

In this sense, scaling the regularization term down as the number of examples increases encodes the notion that the more data we have, the less regularization we might need in any specific SGD step; after all, while the (averaged) loss term stays at roughly the same magnitude as m grows, so do the weights of the network, so dividing by m makes the regularization term shrink relative to the original loss term.

A recent paper (and its follow-up), suggesting that weight decay and dropout may not be necessary for object recognition networks if enough data augmentation is introduced, can perhaps be taken as supporting the notion that the more data we have, the less regularization is needed.

Option 4: Making λ comparable

A final nice motivation: by hopefully removing the need to change λ when m changes, this scaling makes λ itself comparable across datasets of different sizes. This makes λ a more representative estimator of the actual degree of regularization required by a specific model on a specific learning problem.

Option 5: Empirical value

Whichever intuitive justification you find most pleasing, the empirical value of scaling the regularization term by 1/m, at least for feed-forward networks using ReLU as an activation function, is demonstrated elegantly in the following notebook by the aforementioned grez:

https://github.com/grez911/machine-learning/blob/master/l2.ipynb
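The notebook speaks for itself, but as a self-contained toy, here is a rough sketch of how such a comparison might be set up; note that this is not a reproduction of that notebook (it uses a plain linear model rather than a ReLU network, and all numbers are made up):

```python
import numpy as np

def train(X, y, lam, scale_by_m, lr=0.1, steps=500):
    """Full-batch gradient descent on a linear model, with an L2 penalty that is
    either scaled by 1/m (scale_by_m=True) or left unscaled (scale_by_m=False)."""
    m, d = X.shape
    w = np.zeros(d)
    denom = m if scale_by_m else 1
    for _ in range(steps):
        grad_loss = X.T @ (X @ w - y) / m   # gradient of the (halved) mean squared error
        grad_pen = lam * w / denom          # gradient of (lam / (2 * denom)) * ||w||^2
        w -= lr * (grad_loss + grad_pen)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
for m in (100, 100_000):
    X = rng.normal(size=(m, 3))
    y = X @ true_w + rng.normal(scale=0.5, size=m)
    print(m,
          np.linalg.norm(train(X, y, lam=10.0, scale_by_m=True)),
          np.linalg.norm(train(X, y, lam=10.0, scale_by_m=False)))
```

With the unscaled penalty, the same λ shrinks the weights just as aggressively at m = 100,000 as it does at m = 100, while the 1/m-scaled version backs off as the dataset grows.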

Final Words

That’s it! I hope you enjoyed this foray into a small but significant mathematical term. If you have corrections, additions and other ideas as to why and how these scaling terms came to be, please let me know in the comments, or reach out through any of the means I link to on my personal website. :)