Why are ReLUs not enough?

Artificial neural networks learn by a gradient-based process; the gradients are computed with an algorithm called backpropagation. I highly recommend this video series by 3Blue1Brown as an introduction. The basic idea is that a network updates its weights and biases by stepping in the direction opposite to the gradients, downhill on the loss.

A potential problem with backpropagation is that the gradients can become too small. This problem is known as vanishing gradients. When your network suffers from vanishing gradients, the weights barely adjust, and learning stalls. At a high level, the chain rule makes deep networks multiply many gradients together during backpropagation. If some of these factors are close to zero, the whole product collapses, and the layers further back receive almost no learning signal. Let’s have a look at how ReLUs, the de-facto standard of activation functions, prevent that.
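
To make this concrete, here is a minimal, deliberately simplified sketch (plain Python, made-up numbers): treat the backpropagated signal as a product of one hypothetical gradient factor per layer and watch what happens as the depth grows.

```python
# Toy model of backpropagation through depth: the signal that reaches
# the early layers is (roughly) a product of per-layer gradient factors.
per_layer_gradient = 0.5  # hypothetical factor, a bit below one

for depth in (5, 10, 20, 50):
    surviving_signal = per_layer_gradient ** depth
    print(f"depth {depth:>2}: surviving gradient ~ {surviving_signal:.1e}")

# depth  5: surviving gradient ~ 3.1e-02
# depth 50: surviving gradient ~ 8.9e-16 -> the early layers barely learn
```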

Here is how a ReLU maps the input (on the x-axis, named z in the literature) to the output (on the y-axis, named a for activation):

The rules of ReLU are straightforward. If z is smaller than zero, a is zero. If z is larger than zero, a stays z. In other words, ReLU replaces negative values with zero and leaves positive values unchanged. The gradient of this activation function is as simple as it gets: it is zero for values smaller than zero and one otherwise. That is why ReLU prevents vanishing gradients: there is no small in-between gradient that could shrink the product during backpropagation. Either the signal flows on unchanged (gradient one) or it is cut off entirely (gradient zero).
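
If you prefer code over rules, here is a small NumPy sketch of ReLU and its gradient. (Strictly speaking, the gradient at z = 0 is undefined; setting it to zero there, as below, is a common convention.)

```python
import numpy as np

def relu(z):
    """ReLU: negative inputs become zero, positive inputs pass through."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 0 for z < 0, 1 for z > 0 (0 at z == 0 by convention)."""
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```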

However, there is a potential issue with ReLUs: they can get trapped in a dead state. A single large weight update can push the pre-activation z below zero for every input, so the activation is stuck on the left side of zero in all following iterations. The affected cell cannot contribute to the learning of the network anymore, because its gradient stays zero. If this happens to many cells in your network, the power of the trained network stays below its theoretical capabilities. You got rid of vanishing gradients, but now you have to deal with dying ReLUs.
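
Here is a toy illustration of that dead state, using a hypothetical single neuron whose weight and bias (made-up numbers) keep z below zero for every input it sees:

```python
import numpy as np

# Hypothetical neuron after a large, unlucky weight update:
# w and b are such that z = w*x + b is negative for every input.
w, b = -3.0, -1.0
x = np.linspace(0.0, 1.0, 5)   # all inputs this neuron ever sees

z = w * x + b                  # pre-activations: all negative
a = np.maximum(0.0, z)         # ReLU activations: all zero
grad = (z > 0).astype(float)   # ReLU gradients: all zero

print(a)     # [0. 0. 0. 0. 0.]
print(grad)  # [0. 0. 0. 0. 0.] -> no gradient, no update: the neuron is dead
```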

ReLU is excellent for computational simplicity, so we want to keep as much of it as possible but also deal with the problem of dying neurons. Figuratively speaking, ReLUs’ graveyard is on the left side of the y-axis, where z has negative values. One prominent approach to turn the graveyard into a place of life is the so-called leaky ReLU. It looks like this:

As you can see, there is now a slope on the left side. This slope is usually tiny (e.g., 0.01), but it is there, so some learning always happens and the cell cannot die. Leaky ReLUs keep the computational simplicity, since there are still only two possible constant gradients (1 and 0.01). The price is that the slope on the left re-introduces a self-chosen amount of vanishing-gradient risk: a long chain of 0.01 gradients still shrinks quickly.
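
A minimal NumPy sketch of leaky ReLU, assuming the common slope of 0.01 for the negative side:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Positive inputs pass through; negative inputs are scaled by alpha."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Gradient: 1 for z > 0, alpha otherwise -- never exactly zero."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))       # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_grad(z))  # [ 0.01   0.01   1.     1.  ]
```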

Up to now, it seems that we have to choose between two risks: dying neurons (ReLU) or a self-imposed amount of risk for vanishing gradients (leaky ReLU). Where does the Scaled Exponential Linear Unit (SELU) stand in this trade-off? Here is what it looks like: