Even though I’ve only been talking about training time so far, these techniques have one more big advantage.

The research shows that training with cyclical learning rates instead of fixed values achieves better classification accuracy, without the need to tune the learning rate and often in fewer iterations. So why does it improve our accuracy?

Although deep neural networks don’t usually converge to a global minimum, there is a notion of ‘good’ and ‘bad’ local minima in terms of generalization. Keskar et al. [6] argue that local minima with flat basins tend to generalize better. It should be intuitive that sharp minima are not the best, because slight changes to the weights tend to change the model’s predictions dramatically. If the learning rate is large enough, the intrinsic random motion across gradient steps prevents the optimizer from settling into any of the sharp basins along its optimization path. However, if the learning rate is small, the model tends to converge into the closest local minimum. Therefore, increasing the learning rate from time to time helps the optimization algorithm escape sharp minima and converge to a ‘good’ set of weights.
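To make the ‘cyclical learning rate’ idea concrete, here is a minimal sketch of the triangular schedule from Smith’s paper [2]. The step size and learning-rate bounds below are placeholder values for illustration, not recommendations; in practice you would pick them with a learning-rate range test.

```python
import numpy as np

def triangular_lr(iteration, step_size=2000, base_lr=1e-4, max_lr=1e-2):
    """Triangular cyclical schedule [2]: the learning rate ramps linearly
    from base_lr up to max_lr and back down once every 2 * step_size steps."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

Plotting this function over a few thousand iterations shows the repeated rise and fall of the learning rate; the periodic increases are what give the optimizer a chance to hop out of sharp basins.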

In their paper [5], I. Loshchilov and F. Hutter show that the SGDR (Stochastic Gradient Descent with Warm Restarts) method they propose improves error rates of state-of-the-art models on popular datasets.
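The core of SGDR is a cosine-shaped decay of the learning rate that is periodically ‘restarted’ back to its maximum. Here is a rough sketch of that schedule; the cycle length, multiplier, and learning-rate bounds are illustrative placeholders, not the paper’s settings.

```python
import math

def sgdr_lr(epoch, cycle_len=10, cycle_mult=2, lr_min=1e-5, lr_max=1e-2):
    """Cosine annealing with warm restarts [5]: within each cycle the learning
    rate decays from lr_max to lr_min along a cosine curve, then jumps back to
    lr_max; cycle_mult makes every new cycle longer than the previous one."""
    t_i, t_cur = cycle_len, epoch
    while t_cur >= t_i:          # figure out which cycle `epoch` falls into
        t_cur -= t_i
        t_i *= cycle_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

If you use PyTorch, torch.optim.lr_scheduler.CosineAnnealingWarmRestarts implements essentially the same schedule, so you don’t have to write it by hand.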

V. And that’s not even the cherry on top.

On top of all these advantages, SGDR gives you one more. Inspired by SGDR, Gao Huang, Yixuan Li, and their co-authors [7] wrote a follow-up paper, “Snapshot Ensembles: Train 1, Get M for Free”, in which they show how to get even better results when using ‘warm restarts’ with gradient descent.

It is known that the number of local minima grows exponentially with the number of parameters, and modern deep neural nets can contain millions of them. The authors show that while most of these minima have similar error rates, the corresponding neural networks tend to make different mistakes. This diversity can be exploited through ensembling: training several neural networks with different initializations. Unsurprisingly, they converge to different solutions, and averaging their predictions leads to drastic reductions in error rates.
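Averaging predictions is easy to express in code. Here is a minimal sketch assuming PyTorch and a list of already-trained classifiers; the function name is mine, not from the paper.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, x):
    """Average the softmax outputs of several trained classifiers.
    `models` is a list of trained networks (e.g. the same architecture trained
    from different random initializations), `x` is a batch of inputs."""
    with torch.no_grad():
        for model in models:
            model.eval()
        probs = [F.softmax(model(x), dim=1) for model in models]
    return torch.stack(probs).mean(dim=0)
```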

Gao Huang and his co-authors were able to get an ensemble of networks at the cost of training a single one. They did that by exploiting the fact that at the end of each cycle (at least the later ones) the network has converged to some local minimum, or is close to one. When the learning rate ‘restarts’, the model will most likely jump out of that basin and start converging towards a different optimum. So they trained their model with SGDR, saved the weights at the end of each cycle, and built an ensemble from the networks of the last M cycles. Their experiments showed that the local minima the model converges to are diverse enough that the snapshots do not overlap in the examples they misclassify. Using this method improved error rates even further on state-of-the-art models.
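Put together with the schedule above, the training loop might look roughly like this. It is only a sketch: `model`, `optimizer`, and `train_one_epoch` are assumed to exist, and `sgdr_lr` / `ensemble_predict` are the hypothetical helpers sketched earlier, not functions from the papers.

```python
import copy

snapshots, cycle_len, n_cycles, M = [], 10, 5, 3
for cycle in range(n_cycles):
    for epoch in range(cycle_len):
        lr = sgdr_lr(epoch, cycle_len=cycle_len, cycle_mult=1)  # anneal within this cycle
        for group in optimizer.param_groups:
            group["lr"] = lr
        train_one_epoch(model, optimizer)
    # the end of a cycle sits in (or near) a local minimum: take a snapshot
    snapshots.append(copy.deepcopy(model))

ensemble = snapshots[-M:]                  # the last M snapshots form the ensemble
# predictions = ensemble_predict(ensemble, x_batch)
```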

VI. Conclusion

Using the techniques described in this post, we can almost fully automate the way we work with the learning rate and actually get better results along the way. Although these techniques have been around for a while, Jeremy Howard mentioned that not many researchers actually use them, which is a pity considering how advantageous they are.

I want to say thank you to the fast.ai team for creating these amazing courses and giving so many people the opportunity to learn. The community they have built is amazingly helpful.

References

[1] fast.ai

[2] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. arXiv preprint arXiv:1506.01186, 2015.

[3] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

[4] http://www.deeplearningbook.org/

[5] I. Loshchilov and F. Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016.

[6] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836, 2016.

[7] Gao Huang, Yixuan Li, et al. Snapshot Ensembles: Train 1, Get M for Free. arXiv preprint arXiv:1704.00109, 2017.