A common problem we all face when working on deep learning projects is choosing a learning rate and optimizer (the hyper-parameters). If you’re like me, you find yourself guessing an optimizer and learning rate, then checking if they work (and we’re not alone).

To better understand the affect of optimizer and learning rate choice, I trained the same model 500 times. The results show that the right hyper-parameters are crucial to training success, yet can be hard to find.

Finally, I’ll discuss solutions to this problem, using automated methods to choose optimal hyper-parameters.

Experimental setup

I trained the basic convolutional neural network from TensorFlow’s tutorial series, which learns to recognize MNIST digits. This is a reasonably small network, with two convolutional layers and two dense layers, a total of roughly 3,400 weights to train. The same random seed is used for each training.

It should be noted that the results below are for one specific model and dataset. The ideal hyper-parameters for other models and datasets will differ.

(If you’d like to donate some GPU time to run a larger version of this experiment on CIFAR-10, please get in touch).

Which learning rate works best?

The first thing we’ll explore is how learning rate affects model training. In each run the same model is trained from scratch, varying only the optimizer and learning rate.

The model was trained with 6 different optimizers: Gradient Descent, Adam, Adagrad, Adadelta, RMS Prop and Momentum. For each optimizer it was trained with 48 different learning rates, from 0.000001 to 100 at logarithmic intervals.

In each run, the network is trained until it achieves at least 97% train accuracy. The maximum time allowed was 120 seconds. The experiments were run on an Nvidia Tesla K80, hosted by FloydHub. The source code is available for download.

Here is the training time for each choice of learning rate and optimizer: