The authors of this paper propose a method to increase training speed by freezing layers. They experiment with a few different ways of freezing the layers, and demonstrate a training speedup with little or no effect on accuracy.

What does Freezing a Layer mean?

Freezing a layer prevents its weights from being modified. This technique is often used in transfer learning, where the base model (trained on some other dataset) is frozen.

How does freezing affect the speed of the model?

If you don't want to modify the weights of a layer, the backward pass to that layer can be skipped entirely, provided no earlier layer still needs gradients. This gives a significant speed boost. For example, if the first half of your model is frozen, training skips about half of the backward computation. The forward pass still runs through every layer, so the total time per step drops by less than half.

On the other hand, the model still needs to be trained, so layers that are frozen too early will hurt the accuracy of its predictions.
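To get a feel for the size of the speedup, here is a rough cost model. It is only a sketch: the function name `relative_step_cost` is made up for illustration, and it assumes that every layer costs the same, that the backward pass through a layer costs about twice its forward pass, and that the frozen layers are the first ones in the model. None of these numbers come from the paper.

```python
def relative_step_cost(n_layers, n_frozen, backward_ratio=2.0):
    """Cost of one training step relative to a fully trainable model.

    Assumptions (illustrative, not from the paper): all layers cost the
    same, the backward pass through a layer costs `backward_ratio` times
    its forward pass, and the first `n_frozen` layers are the frozen ones.
    """
    if not 0 <= n_frozen <= n_layers:
        raise ValueError("n_frozen must be between 0 and n_layers")
    forward = n_layers                                   # forward pass always runs in full
    backward = backward_ratio * (n_layers - n_frozen)    # frozen prefix is skipped
    full = n_layers * (1.0 + backward_ratio)             # cost with nothing frozen
    return (forward + backward) / full


for frozen in (0, 5, 10):
    print(frozen, round(relative_step_cost(10, frozen), 3))
```

Under these assumptions, freezing the first half of a 10-layer model brings the step cost down to about two thirds of the original, and freezing everything leaves only the forward pass, about one third.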

What is the ‘novel’ approach?

The authors demonstrate a way to freeze the layers one by one, as early as possible, so that each step needs fewer and fewer backward passes, which in turn lowers training time.

At first, the entire model is trainable (exactly like a regular model). After a few iterations, the first layer is frozen and the rest of the model continues to train. After a few more iterations, the next layer is frozen, and so on.
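The layer-by-layer freezing described above can be sketched in a few lines of plain Python. The helper names `freeze_iterations` and `trainable_layers` are hypothetical, and the evenly spaced freeze points are an assumption made for illustration; the spacing the authors actually use may differ.

```python
def freeze_iterations(n_layers, total_iters, first_freeze_frac=0.5):
    """Iteration at which each layer is frozen.

    Assumption: the first layer is frozen after `first_freeze_frac` of
    training, the last layer trains until the end, and the layers in
    between are spaced evenly.
    """
    if n_layers == 1:
        return [total_iters]
    first = first_freeze_frac * total_iters
    step = (total_iters - first) / (n_layers - 1)
    return [round(first + i * step) for i in range(n_layers)]


def trainable_layers(t, freeze_at):
    """Indices of the layers that still receive gradient updates at iteration t."""
    return [i for i, t_i in enumerate(freeze_at) if t < t_i]


freeze_at = freeze_iterations(n_layers=5, total_iters=1000)
print(freeze_at)                         # freeze point of each layer
print(trainable_layers(0, freeze_at))    # everything trains at the start
print(trainable_layers(700, freeze_at))  # the earliest layers are frozen by now

# Share of per-layer backward passes that are still needed over the whole run.
print(sum(freeze_at) / (5 * 1000))
```

In a real training loop, the list returned by `trainable_layers` would decide which layers keep their gradients enabled at each step.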

Learning Rate Annealing

The authors used learning rate annealing to govern the learning rate of the model. The notable difference in their technique is that they annealed the learning rate layer by layer rather than for the whole model at once. They used the following equation:

Equation 2.0: α is the learning rate. t is the iteration number. i denotes the ith layer of the model

Equation 2.0 Explanation

The subscript i denotes the ith layer, so α sub i is the learning rate for the ith layer. Similarly, t sub i is the number of iterations the ith layer is trained for, that is, the iteration at which it is frozen. t is the current iteration number, counted for the whole model.
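With these symbols, a per-layer annealing schedule can be written out as follows. The cosine form is an assumption here, since the text above only defines the symbols, and t sub i is read as the iteration at which the ith layer is frozen.

```latex
\alpha_i(t) = \frac{1}{2}\,\alpha_i(0)\left(1 + \cos\!\left(\frac{\pi t}{t_i}\right)\right),
\qquad 0 \le t \le t_i
```

At t = 0 this gives the initial rate α sub i (0), and at t = t sub i the cosine term is -1, so the learning rate reaches zero and the layer can be frozen.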

Equation 2.1

This denotes the initial learning rate for the ith layer.

The authors experimented with different values for Equation 2.1.

Initial learning rate for Equation 2.1

The authors tried scaling the initial learning rate so that every layer gets a comparable amount of learning (equal area under its learning rate curve), even though the layers train for different numbers of iterations.

Remember that because the first layer of the model is frozen first, it would otherwise be trained for the least amount of time. To remedy that, they scaled the learning rate for each layer.
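The effect of the scaling is easy to check numerically. The sketch below assumes a cosine annealing schedule and an initial rate of `base_lr / t_i`, with `t` and `t_i` expressed as fractions of the whole training run. Both the schedule and the 1/t_i scaling are assumptions for illustration, and `layer_lr` and `lr_area` are hypothetical helper names.

```python
import math


def layer_lr(t, t_i, base_lr, scaled=True):
    """Learning rate of a layer at time t (t and t_i are fractions in (0, 1]).

    Assumptions: cosine annealing that reaches zero at t_i, and, when
    `scaled` is True, an initial rate of base_lr / t_i.
    """
    if t >= t_i:
        return 0.0                       # the layer is frozen from t_i onwards
    lr0 = base_lr / t_i if scaled else base_lr
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / t_i))


def lr_area(t_i, base_lr, scaled=True, steps=10000):
    """Area under the learning rate curve over the run (midpoint rule)."""
    dt = 1.0 / steps
    return sum(layer_lr((k + 0.5) * dt, t_i, base_lr, scaled) for k in range(steps)) * dt


for t_i in (0.5, 0.75, 1.0):
    print(t_i, round(lr_area(t_i, 0.1, scaled=False), 4), round(lr_area(t_i, 0.1, scaled=True), 4))
```

Without scaling, the area shrinks in proportion to t_i, so the layer that is frozen first gets the least total learning. With the scaling, the area is the same for every layer.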