Part of the magic sauce for making deep learning models work in production is regularization. For this blog post I’ll use the definition from Ian Goodfellow’s book: regularization is “any modification we make to the learning algorithm that is intended to reduce the generalization error, but not its training error”. For a deeper theoretical understanding, I’d recommend checking out the chapter of the Deep Learning book dedicated to regularization.



Generalization in machine learning refers to how well the concepts learned by a model apply to examples not seen during training. The goal of most machine learning models is to generalize well from the training data in order to make good predictions on unseen data in the future. Overfitting happens when the model learns the details and noise of the training data too well but fails to generalize, so performance on test data is poor. It is a very common problem when the dataset is small compared with the number of model parameters that need to be learned, and it is particularly acute in deep neural networks, where it is not uncommon to have millions of parameters. For readers who would like a more intuitive description of overfitting, I’ve found a thread on Quora that explains it pretty well.

Overfitting symptom: Testing error >> Training error

Regularization is a key component in preventing overfitting. Some regularization techniques can also reduce model capacity while maintaining accuracy, for example by driving some of the parameters to zero. This can be desirable for shrinking model size or driving down the cost of evaluation in mobile environments where processor power is constrained.

The rest of this post reviews some of the most common regularization techniques used in industry today:

Dataset augmentation
Early stopping
Dropout layer
Weight penalty L1 and L2

Dataset augmentation

An overfitting model (neural network or any other type of model) can perform better if the learning algorithm processes more training data. While an existing dataset might be limited, for some machine learning problems there are relatively easy ways of creating synthetic data. For images, common techniques include translating the picture by a few pixels, rotation, and scaling. For classification problems it’s usually feasible to inject random negatives — e.g. unrelated pictures.



There is no general recipe for how synthetic data should be generated, and it varies a lot from problem to problem. The general principle is to expand the dataset by applying operations which reflect real-world variations as closely as possible. In practice, a better dataset significantly improves model quality, independent of the architecture.
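As a minimal sketch of the pixel-translation idea, a few lines of NumPy are enough; the `augment` helper below is mine, not from any library, and `np.roll` wraps pixels around the edge, which a real pipeline would replace with zero-padding:

```python
import numpy as np

def augment(image, rng):
    # Shift a 2-D grayscale image by up to 2 pixels in each direction.
    # A simplified sketch; real pipelines would also rotate, scale, etc.
    dy, dx = rng.integers(-2, 3, size=2)
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(0)
img = np.arange(784, dtype=float).reshape(28, 28)  # stand-in for one MNIST digit
batch = np.stack([augment(img, rng) for _ in range(8)])
print(batch.shape)  # (8, 28, 28) -- eight synthetic variants of one image
```

Which operations are label-preserving depends on the task: horizontal flips are fine for photos of objects, but would corrupt digit labels, for example.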

Early stopping

Early stopping combats overfitting by interrupting the training procedure once the model’s performance on a validation set gets worse. A validation set is a set of examples that we never use for gradient descent, but which is also not part of the test set. The validation examples are considered representative of future test examples. Early stopping is effectively tuning the number of epochs/steps as a hyper-parameter.

Intuitively, as the model sees more data and learns patterns and correlations, both training and test error go down. After enough passes over the training data, the model might start overfitting, learning noise specific to the training set. In this case training error keeps going down while test error (how well we generalize) gets worse. Early stopping is all about finding the moment of minimum test error.

In practice, instead of actually stopping, people usually set up checkpoints that save the model at regular intervals while training continues, and pick the best candidate after the fact.
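A minimal sketch of this loop, with hypothetical `train_step`/`validate` callbacks standing in for a real training framework, and a “patience” counter to tolerate noisy validation curves:

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` epochs.

    `train_step` runs one epoch of training; `validate` returns the current
    validation loss. Both are placeholder callbacks, not a real library API."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()
        loss = validate()
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
            # In practice: save a model checkpoint here.
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best_loss

# Toy validation curve: improves for a while, then overfits.
losses = iter([0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
epoch, loss = train_with_early_stopping(lambda: None, lambda: next(losses))
print(epoch, loss)  # 2 0.6 -- training stops long before max_epochs
```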

Dropout layer

At each training iteration, a dropout layer randomly removes some nodes in the network, along with all of their incoming and outgoing connections. Dropout can be applied to hidden or input layers.

Dropout layer with probability of keeping 0.5

Why dropout works:

Nodes become less dependent on the weights of other nodes (less co-adapted), and therefore the model is more robust. If a hidden unit has to work well with many different combinations of other hidden units, it’s more likely to do something individually useful.

Dropout can be viewed as a form of averaging multiple models (“ensembling”), a technique which improves performance in most machine learning tasks (ensembling is the intuition behind random forests and gradient-boosted decision trees, for example). Training a neural network with dropout can be seen as training a collection of 2^n thinned networks with parameter sharing, where each thinned network is trained very rarely, or not at all. Most of the thinned models will in fact never be sampled, and those that are will likely see only one training example, which makes this an extreme form of bagging. The trick that makes it work is the sharing of weights between all the models: each model is very strongly regularized by the others. With this method we don’t need to train separate models, which is in general quite expensive, yet we still get some of the benefits of ensemble methods.

1 neuron from Dropout layer

The original paper which introduced the concept was published in 2014. For a better understanding, I’d also recommend Geoffrey Hinton’s Coursera course, “Neural Networks for Machine Learning”.

An intuitive explanation for why dropout works might be as follows.

Imagine that you have a team of workers and the overall goal is to learn how to erect a building. When each of the workers is overly specialized, if one gets sick or makes a mistake, the whole building will be severely affected. The solution proposed by the “dropout” technique is to pick some of the workers at random every week and send them on a business trip. The hope is that the team as a whole still learns how to construct the building, and thus becomes more resilient to noise or to workers being on vacation.

Because of its simplicity and effectiveness, dropout is used today in various architectures, usually immediately after fully connected layers.



Some practical usage considerations:

A typical value for p (the probability of keeping a unit) is >= 0.5. p becomes another hyper-parameter, so finding the right value also depends on the problem and dataset.

For input layers the probability of keeping a neuron should be much higher; for input layers, injecting noise instead of dropout might also perform better.

Libraries like TensorFlow or Caffe2 already come with a dropout layer implementation. If you are curious about the details, here is the Caffe2 implementation. A trick in that code is to scale activations during training by 1/(probability of keeping), so no conversion is required at prediction time.
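That “inverted dropout” trick can be sketched in plain NumPy; the function below is my own illustration, not the Caffe2 code:

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Inverted dropout: zero units at random and scale the survivors by
    1/keep_prob during training, so inference needs no rescaling."""
    if not training:
        return x  # identity at prediction time
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(42)
activations = np.ones(1000)
out = dropout(activations, keep_prob=0.5, rng=rng)
# Roughly half the units are zeroed; the survivors are scaled up to 2.0,
# so the expected value of each unit is unchanged.
print(out.mean())
```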

Example code showing how to use it:

For experimentation purposes, I’ve modified a CNN implementation (LeNet-5) for MNIST from convolutional.py

Hyper-parameters were not optimized.

Dense-sparse-dense training

An interesting piece of related recent work (2016) which shows good results in various domains is dense-sparse-dense training. The technique consists of three steps:

1. Dense: perform initial regular training, but with the main purpose of identifying which weights are important, not learning the final weight values.
2. Sparse: drop the connections whose weights fall under a particular threshold, then retrain the sparse network to learn the weights of the important connections.
3. Dense: make the network dense again and retrain it with a small learning rate, a step which adds back capacity.

Photo from the original paper which describes the technique
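The sparse step amounts to magnitude-based pruning. As a sketch (the `prune_by_magnitude` helper is hypothetical, not code from the paper):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Build a mask that zeroes the smallest-magnitude weights.

    The threshold is chosen so that a `sparsity` fraction of the weights is
    dropped; retraining would then update only the surviving weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))          # stand-in for one layer's weight matrix
sparse_w, mask = prune_by_magnitude(w, sparsity=0.5)
print(mask.mean())  # 0.5 -- half the connections survive
```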

Weight penalty L1 and L2

Weight penalty is a standard way of regularizing, widely used when training other model types as well. It relies strongly on the implicit assumption that a model with small weights is somehow simpler than a network with large weights. The penalties keep the weights small, or remove them entirely (zero), unless large gradients counteract them, which also makes models more interpretable. An alternative name in the literature for weight penalties is “weight decay”, since they force the weights to decay towards zero.



L2 norm

penalizes the squared value of the weight (which explains the “2” in the name).

tends to drive all the weights to smaller values.

L1 norm

penalizes the absolute value of the weight (a V-shaped function).

tends to drive some weights to exactly zero (introducing sparsity in the model), while allowing some weights to stay large.
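The difference between the two penalties is visible in a single gradient step. In the sketch below (a hypothetical helper, plain NumPy), each penalty’s gradient is simply added to an SGD update: L2 contributes 2·λ·w, a pull proportional to the weight’s size, while L1 contributes λ·sign(w), a constant pull that hits small weights hardest and produces exact zeros:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, l1=0.0, l2=0.0):
    """One SGD step with optional L1/L2 penalty gradients added in."""
    penalty_grad = 2.0 * l2 * w + l1 * np.sign(w)
    return w - lr * (grad + penalty_grad)

w = np.array([0.5, -0.5, 2.0])
zero_grad = np.zeros_like(w)
# With no data gradient, L2 shrinks every weight by the same *fraction*...
print(sgd_step(w, zero_grad, l2=0.5))  # -> [0.45, -0.45, 1.8]
# ...while L1 subtracts the same *amount* from each, so small weights
# reach zero first while large weights are barely affected.
print(sgd_step(w, zero_grad, l1=0.5))  # -> [0.45, -0.45, 1.95]
```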

The diagrams below show how the weight values change when we apply different types of regularization. Note the sparsity in the weights when we apply L1.