Activation functions take an input signal and convert it to an output signal. They introduce non-linearity into the network, which is why they are often called non-linearities. Neural networks are universal function approximators, and deep neural networks are trained using backpropagation, which requires differentiable activation functions. Backpropagation uses gradient descent on the loss function to update the network weights. Understanding activation functions is important because they play a crucial role in the quality of deep neural networks. In this article I list and describe the most common activation functions.

Identity or Linear Activation Function — The identity or linear activation function is the simplest of all. It applies the identity operation to the data, so the output is proportional to the input. The problem with a linear activation function is that its derivative is a constant, so its gradient is a constant too, and the descent proceeds along a constant gradient.

Equation for the Identity or Linear activation function: f(x) = x

Range: (-∞, +∞)

Examples: f(2) = 2, f(-4) = -4
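To make this concrete, here is a minimal NumPy sketch of the identity activation (the function name and sample values are mine, not from the article):

```python
import numpy as np

def identity(x):
    # f(x) = x: the output is exactly the input.
    return x

print(identity(np.array([2.0, -4.0])))  # [ 2. -4.]
```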

Heaviside step function (binary step, 0 or 1, high or low) — It is typically only useful within single-layer perceptrons, an early type of neural network that can be used for classification when the input data is linearly separable. Step functions are useful for binary classification tasks. The output is a certain value, A1, if the input sum is above a certain threshold, and A0 if it is below. The values used by the perceptron were A1 = 1 and A0 = 0.

Equation for the Heaviside/Binary Step function (0 or 1, high or low): f(x) = 0 for x ≤ 0, f(x) = 1 for x > 0

Range: {0, 1} — either 0 or 1

Examples: f(2) = 1, f(-4) = 0, f(0) = 0, f(1) = 1
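A small NumPy sketch of the step function, using the same threshold behaviour as the examples above (names are mine):

```python
import numpy as np

def binary_step(x):
    # 1 when the input is above the threshold (0 here), otherwise 0.
    return np.where(x > 0, 1, 0)

print(binary_step(np.array([2.0, -4.0, 0.0, 1.0])))  # [1 0 0 1]
```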

Sigmoid or Logistic activation function (Soft Step) — It is mostly used for binary classification problems (i.e. it outputs values in the range 0–1). It suffers from the vanishing gradient problem: the network refuses to learn, or learns very slowly after a certain number of epochs, because changes in the input (X) cause only very small changes in the output (Y). It was long a widely used activation function for classification problems, but it is prone to saturating the later layers, making training more difficult. Calculating the derivative of the sigmoid function is very easy.

For the backpropagation process in a neural network, your errors will be squeezed by (at least) a quarter at each layer. Therefore, the deeper your network is, the more knowledge from the data will be “lost”. Some “big” errors from the output layer might not be able to affect the weights of a neuron in a relatively shallow layer much (“shallow” meaning close to the input layer). — Source: https://github.com/Kulbear/deep-learning-nano-foundation/wiki/ReLU-and-Softmax-Activation-Functions

Sigmoid or Logistic activation function: f(x) = 1 / (1 + e^(-x))

Derivative of the sigmoid function: f'(x) = f(x)(1 − f(x))

Range: (0, 1)

Examples: f(4) = 0.982, f(-3) = 0.0474, f(-5) = 0.0067
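Here is a short NumPy sketch of the sigmoid and its derivative (names are mine); note the derivative peaks at 0.25, which is the “squeezed by (at least) a quarter” effect quoted above:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative reuses the forward value: f'(x) = f(x) * (1 - f(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

print(np.round(sigmoid(np.array([4.0, -3.0, -5.0])), 4))  # [0.982  0.0474 0.0067]
print(sigmoid_derivative(0.0))  # 0.25, the maximum of the derivative
```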

Hyperbolic tangent (TanH) — It looks like a scaled sigmoid function, but its output is centered around zero, so the gradients are larger. TanH converges faster than the sigmoid/logistic activation function.

Equation for the Hyperbolic Tangent (TanH) activation function: tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

Range: (-1, 1)

Examples: tanh(2) = 0.9640, tanh(-0.567) = -0.5131, tanh(0) = 0
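A quick NumPy sketch of tanh using the equation above (names are mine; in practice np.tanh computes the same thing more stably):

```python
import numpy as np

def tanh(x):
    # Zero-centered squashing into (-1, 1); equivalent to np.tanh(x).
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(np.round(tanh(np.array([2.0, -0.567, 0.0])), 4))  # close to the examples above
```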

Rectified Linear Unit (ReLU) — Networks with ReLUs have been reported to train about six times faster than the same networks with tanh. The output is zero when the input is less than zero; if the input is greater than or equal to zero, the output equals the input. When the input is positive, the derivative is 1, so there is none of the squeezing effect that occurs when backpropagating errors through the sigmoid function.

Equation for the Rectified Linear Unit (ReLU) activation function: f(x) = max(0, x)

Range: [0, +∞)

Examples: f(-5) = 0, f(0) = 0, f(5) = 5

From Andrej Karpathy’s CS231n course
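A minimal NumPy sketch of ReLU (function name and inputs are mine):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): zero for negative inputs, identity otherwise.
    return np.maximum(0, x)

print(relu(np.array([-5.0, 0.0, 5.0])))  # [0. 0. 5.]
```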

Leaky rectified linear unit (Leaky ReLU) — Leaky ReLUs allow a small, non-zero gradient when the unit is not active. 0.01 is the small non-zero gradient here (the slope used for negative inputs).

Equation for the Leaky Rectified Linear Unit (Leaky ReLU) activation function: f(x) = x for x > 0, f(x) = 0.01x for x ≤ 0

Range: (-∞, +∞)

Leaky Rectified Linear Unit (Leaky ReLU)
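A short NumPy sketch of Leaky ReLU with the 0.01 slope mentioned above (names are mine):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small non-zero slope (alpha) for negative inputs instead of a hard zero.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-5.0, 0.0, 5.0])))  # [-0.05  0.    5.  ]
```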

Parametric Rectified Linear Unit (PReLU) — It turns the coefficient of leakage into a parameter that is learned along with the other neural network parameters. Alpha (α) is the coefficient of leakage here.

For α ≤ 1: f(x) = max(x, αx)

Range: (-∞, +∞)

Equation for the Parametric Rectified Linear Unit (PReLU): f(x) = x for x > 0, f(x) = αx for x ≤ 0
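To illustrate the learnable leakage, here is a rough NumPy sketch of a PReLU forward pass together with the gradient of the output with respect to α; the helper names and the initial α = 0.25 are my own choices, not from the article:

```python
import numpy as np

def prelu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is a learned parameter.
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # Gradient of the output w.r.t. alpha: x on the negative side, 0 elsewhere.
    return np.where(x > 0, 0.0, x)

alpha = 0.25  # example initial value; updated by backprop in practice
x = np.array([-4.0, 3.0])
print(prelu(x, alpha))        # [-1.  3.]
print(prelu_grad_alpha(x))    # [-4.  0.]
```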

Randomized Leaky Rectified Linear Unit (RReLU) — Like Leaky ReLU, but the leakage coefficient α is sampled randomly from a uniform distribution during training and fixed to its average at test time.

Range: (-∞, +∞)

Randomized Leaky Rectified Linear Unit (RReLU)

Exponential Linear Unit (ELU) — Exponential linear units try to push the mean activation closer to zero, which speeds up learning. It has been shown that ELUs can obtain higher classification accuracy than ReLUs. α is a hyper-parameter to be tuned, with the constraint α ≥ 0.

Range: (-α, +∞)

Exponential Linear Unit (ELU): f(x) = x for x > 0, f(x) = α(e^x − 1) for x ≤ 0
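A brief NumPy sketch of ELU (names are mine; α = 1 is just a common default, not a value from the article):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs; alpha * (e^x - 1) saturates to -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(np.round(elu(np.array([-5.0, 0.0, 5.0])), 4))  # [-0.9933  0.      5.    ]
```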

Scaled Exponential Linear Unit (SELU) — SELU multiplies the ELU by a scale λ > 1 so that, with the right choice of λ and α, the activations self-normalize toward zero mean and unit variance.

Range: (-λα, +∞)

Scaled Exponential Linear Unit (SELU): f(x) = λx for x > 0, f(x) = λα(e^x − 1) for x ≤ 0
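A sketch of SELU under the usual published constants λ ≈ 1.0507 and α ≈ 1.6733 (the constants and names are my addition, not from the article):

```python
import numpy as np

# Constants from the original SELU paper (Klambauer et al., 2017).
LAMBDA, ALPHA = 1.0507, 1.6733

def selu(x):
    # Scaled ELU: lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise.
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(np.round(selu(np.array([-5.0, 0.0, 5.0])), 4))  # [-1.7463  0.      5.2535]
```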

S-shaped Rectified Linear Activation Unit (SReLU) — A piecewise-linear function made of three segments, whose four shape parameters are learned during training.

Range: (-∞, +∞)

S-shaped Rectified Linear Activation Unit

Adaptive Piecewise Linear (APL) — A learned sum of hinge-shaped functions, giving each neuron its own piecewise-linear activation.

Range: (-∞, +∞)

SoftPlus — The derivative of the softplus function is the logistic function. ReLU and softplus are largely similar, except near 0, where softplus is enticingly smooth and differentiable. It is much easier and more efficient to compute ReLU and its derivative than the softplus function, which has log(·) and exp(·) in its formulation.

Range: (0, ∞)

Softplus: f(x) = ln(1 + e^x)

Derivative of the softplus function: f'(x) = 1 / (1 + e^(-x)), i.e. the logistic function
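A small NumPy sketch of softplus and its derivative (names are mine; log1p improves accuracy when e^x is small):

```python
import numpy as np

def softplus(x):
    # f(x) = ln(1 + e^x), a smooth approximation of ReLU.
    return np.log1p(np.exp(x))

def softplus_derivative(x):
    # The derivative is exactly the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

print(np.round(softplus(np.array([-5.0, 0.0, 5.0])), 4))  # [0.0067 0.6931 5.0067]
```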

Bent identity — A smooth function that behaves like the identity for large inputs: f(x) = (√(x² + 1) − 1)/2 + x

Range: (-∞, +∞)

Bent Identity

Softmax — The softmax function converts a vector of raw scores into posterior probabilities, which provides a measure of certainty. It squashes the output of each unit to be between 0 and 1, just like the sigmoid function, but it also divides each output by the sum of all the outputs so that they total 1.

Equation for the Softmax function: softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

The output of the softmax function is equivalent to a categorical probability distribution: it tells you the probability that each class is the true one.
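A minimal NumPy sketch of softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the article specifies, and the score values are mine:

```python
import numpy as np

def softmax(x):
    # Subtracting the max first avoids overflow in exp without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(np.round(softmax(scores), 4))  # [0.659  0.2424 0.0986], sums to 1
```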

Conclusion: ReLU and its variants should be preferred over sigmoid or tanh activation functions, and ReLUs are faster to train. If ReLU is causing neurons to die, use Leaky ReLU or one of its other variants. Sigmoid and tanh suffer from the vanishing gradient problem and should not be used in hidden layers; ReLUs are best there. Prefer activation functions that are easily differentiable and easy to train.
