Backpropagation: how to train your dragon

To better understand this training process, let’s go back once more to how linear regression works. In linear regression, the weights are trained with an optimization algorithm called gradient descent. First, the algorithm randomly guesses initial starting values for all of the weights. Second, it uses those weights to make a test prediction for each training instance and computes the sum of squared errors between the predictions and the ground truth values (i.e. the cost function). It then computes the gradient of the cost function and tweaks the weights in the “downhill” direction. With each baby step we take down the hill, we tweak the weights according to this update equation:

w ← w − η∇J(w)

…where J(w) is the cost function and η (“eta”) is the “learning rate” (the size of each baby step). Once we have an updated guess for the weights, we can make a new prediction (which should theoretically be a bit closer to the ground truth value) and the cycle starts again. In short, gradient descent uses the gradient of the error to iteratively tweak its guesses for the weights (descending along the negative gradient, hence “gradient descent”) until it arrives at the “bottom of the bowl” (i.e. minimizes the error) and achieves its best guess.
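To make this concrete, here is a minimal sketch of gradient descent for linear regression. The function and parameter names are illustrative assumptions rather than any particular library’s API:

```python
import numpy as np

# Minimal gradient descent for linear regression: make predictions, measure the
# error, then step the weights downhill along the negative gradient.
def gradient_descent(X, y, eta=0.01, n_steps=1000):
    w = np.random.randn(X.shape[1])          # random initial guess for the weights
    for _ in range(n_steps):
        predictions = X @ w                  # test prediction for every training instance
        errors = predictions - y
        grad = 2.0 * X.T @ errors / len(y)   # gradient of the mean squared error J(w)
        w -= eta * grad                      # the update equation: w <- w - eta * grad
    return w
```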

Figure 7 — A visualization of gradient descent. Initialize the weights with a random guess, then descend the gradient of error until you reach the bottom of the bowl.

NNs train their weights using a similar, but slightly more complicated methodology called backpropagation. Though the earliest forms of NNs had existed in theory for decades, researchers didn’t actually develop a good strategy for training them until 1986, when Rumelhart et al. unveiled backpropagation in their groundbreaking paper, “Learning Internal Representations by Error Propagation.”

As discussed above, a great way to train your weights is to iteratively refine an initial guess with many gradient descent steps. Backpropagation performs one gradient descent step for each “batch” of training instances (the size of a batch is up to you). One full pass through all of the training instances, i.e. through all of the batches, is called an “epoch.” It can take many epochs of training to arrive at a good set of weights. However, don’t train for too many epochs or you risk overfitting.
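A rough sketch of this batch-and-epoch bookkeeping might look like the following, where `train_step` is a hypothetical stand-in for a single gradient descent update on one batch:

```python
import numpy as np

def train(X, y, train_step, batch_size=32, n_epochs=10):
    """Run n_epochs passes over the data, one gradient descent step per batch."""
    n = len(y)
    for epoch in range(n_epochs):
        order = np.random.permutation(n)               # shuffle the instances each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]    # one "batch" of training instances
            train_step(X[batch], y[batch])             # one gradient descent step per batch
```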

Unlike the gradient descent example above, backpropagation does not make its initial guess for the weights with a naïve random initialization. This actually doesn’t work very well when training deep NNs! Instead, the scale of the random initial weights is chosen based on your choice of activation function. Further details are beyond the scope of this introduction, but two common schemes are “Xavier initialization” and “He initialization.”
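For a sense of what these schemes look like, here is a sketch of the common uniform Xavier and normal He variants for a single layer’s weight matrix (the exact variants and the weight-matrix shape are assumptions for illustration):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot initialization (often paired with tanh or sigmoid activations):
    # draw from a uniform range whose scale depends on the layer's fan-in and fan-out.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    # He initialization (often paired with ReLU activations):
    # draw from a Gaussian whose variance depends only on the fan-in.
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
```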

Once the weights are initialized, the algorithm begins training. For one batch of training instances, backpropagation makes initial test predictions by feeding all of the training instances in the current batch through the network in a “forward pass” (“left to right” in the network pictured in Figure 6), keeping track of the outputs of each neuron within each layer along the way. The output error is then computed by comparing initial predictions with ground truth values.
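Here is a sketch of such a forward pass for a small fully connected network with sigmoid activations (the activation choice and data layout are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases):
    """Feed one instance through the network, caching every layer's output."""
    activations = [x]                   # keep track of each neuron's output along the way
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)          # weighted sum of inputs, then activation
        activations.append(x)
    return activations                  # the cached outputs are needed for the reverse pass
```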

So now we have the total output error, but how do we compute the gradient of the error so we can plug it into our update equation and tweak the weights? The error gradient is determined by making a “reverse pass” through the network, starting with the output and ending with the input. Since we know the output error and we kept track of the input/output values of each neuron during the forward pass, we can propagate the error backwards through the network (hence “backpropagation”) and figure out how much each neuron contributed to the total error. This lets us compute the gradient of the error with respect to each weight, which tells us exactly how to tweak that weight. If you want more specifics on how this reverse pass actually works, check out the series of recorded lectures from Stanford’s amazing course on Convolutional Neural Networks, “CS231n.”
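Continuing the forward-pass sketch above, the reverse pass for the same sigmoid network with a squared-error cost might look something like this (illustrative only, not a production implementation):

```python
import numpy as np

def backward_pass(activations, weights, y_true):
    """Propagate the output error backwards, layer by layer.

    Assumes the sigmoid activations cached by forward_pass and a
    squared-error cost of the form J = 0.5 * ||output - y_true||^2.
    """
    grads_W, grads_b = [], []
    output = activations[-1]
    delta = (output - y_true) * output * (1 - output)            # error at the output layer
    for layer in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, activations[layer]))   # gradient for this layer's weights
        grads_b.insert(0, delta)                                 # gradient for this layer's biases
        if layer > 0:
            prev = activations[layer]
            # Send the error one layer back, scaled by the weights and the sigmoid derivative.
            delta = (weights[layer].T @ delta) * prev * (1 - prev)
    return grads_W, grads_b
```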

Finally, there are many different variations of the update equation above; each variant is called an “optimizer.” There are plenty to choose from, including Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent. Optimizers in the GD family share two weaknesses: the learning rate must be defined a priori, and they can get stuck in local minima or around saddle points when the error function is not convex (unlike in linear regression, where it is). A better alternative is to choose an optimizer that uses an “adaptive learning rate,” such as AdaGrad, AdaDelta, RMSProp, or Adam.
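As one example, here is a sketch of the Adam update rule, which adapts the effective step size for each weight individually; the hyperparameter defaults shown are the commonly cited ones:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a weight vector w given its gradient grad.

    m and v are running averages carried between steps; t is the
    1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad            # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2       # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-weight, adaptive step size
    return w, m, v
```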

Figure 8 is an animation put together by Alec Radford (see his other animations here). It demonstrates the differences between the behavior of several optimizers applied to the same problem. In this case, each optimization algorithm is searching for the global minimum of Beale’s function.

Figure 8 — Several optimizers as applied to Beale’s function. The gray star indicates the location of the global minimum. Source: Alec Radford’s post on visualizing optimization algorithms.

As you can see, Adadelta and RMSProp win the race while Nesterov Accelerated Gradient (NAG) and Momentum squiggle off and do other things before catching up. SGD is by far the slowest to find the global minimum.

In practice, almost any optimizer with an adaptive learning rate is well suited for training a deep NN. In particular, Adam is currently a popular choice; its fast convergence will help speed up training. If you’d like to read about optimizers more in depth, this article is a great one to start with.