Therefore, the deeper our Neural Network is, the more the signal gets compressed at each layer; these losses compound layer after layer, so by the time we reach the early layers very little useful information (or gradient) is left overall.
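Roughly speaking, the gradient that reaches an early layer is a product of one sigmoid derivative per layer, and that derivative never exceeds 0.25. A quick numeric sketch in Python/NumPy (illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Best case: every layer contributes the maximum derivative of 0.25.
# Even then, the gradient shrinks geometrically with depth.
best_case = 0.25
for depth in (2, 5, 10, 20):
    print(depth, best_case ** depth)
# 2  -> 0.0625
# 5  -> ~0.00098
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```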


2. Rectified Linear Units (ReLU) >

From Wikipedia page >

The rectifier is, as of 2018, the most popular activation function for deep neural networks.

Most Deep Learning applications today use ReLU instead of logistic (sigmoid) activation functions, across Computer Vision, Speech Recognition and other deep Neural Network tasks.

Some of the ReLU variants include: Softplus (SmoothReLU), Noisy ReLU, Leaky ReLU, Parametric ReLU and the Exponential Linear Unit (ELU).

ReLU

ReLU: A Rectified Linear Unit (a unit employing the rectifier) outputs 0 if the input is less than 0, and the raw input otherwise; that is, if the input is greater than 0, the output is equal to the input, i.e. f(x) = max(0, x). The operation of ReLU is closer to the way our biological neurons work.

ReLU f(x)
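A minimal sketch of this definition in Python/NumPy:

```python
import numpy as np

def relu(x):
    # Output is 0 for negative inputs, and the raw input otherwise.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# [0.  0.  0.  0.5 2. ]
```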

ReLU is non-linear and, unlike the sigmoid function, it does not saturate in the positive direction, so gradients propagate through it much better during backpropagation; for larger Neural Networks, training models built on ReLU is also considerably faster than using Sigmoids. Its main advantages are (a quick sketch after this list illustrates the sparse-activation and gradient points):

Biological plausibility: One-sided, compared to the antisymmetry of tanh.

Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (having a non-zero output).

Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.

Efficient computation: Only comparison, addition and multiplication.

Scale-invariant: max(0, ax) = a · max(0, x) for a ≥ 0.
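A quick Python/NumPy sketch of the sparse-activation and gradient-propagation points, using a randomly initialised layer's pre-activations (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activations of a randomly initialised hidden layer: roughly half of
# them come out negative, so ReLU zeroes them (sparse activation, ~50% active).
z = rng.standard_normal(10_000)
relu_out = np.maximum(0, z)
print("fraction of active units:", np.mean(relu_out > 0))    # close to 0.5

# Gradient propagation: sigmoid'(z) is at most 0.25 and shrinks to 0 in both
# tails, while ReLU's gradient is exactly 1 wherever the unit is active.
sig = 1.0 / (1.0 + np.exp(-z))
print("largest sigmoid gradient:", (sig * (1 - sig)).max())  # <= 0.25
relu_grad = (z > 0).astype(float)
print("ReLU gradient values:", np.unique(relu_grad))         # [0. 1.]
```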

ReLUs aren’t without drawbacks: ReLU is not zero-centered, and it is non-differentiable at zero (though differentiable everywhere else).

Sigmoid Vs ReLU

Another problem we see in ReLU is the Dying ReLU problem, where some ReLU neurons essentially die: they remain inactive no matter what input is supplied, so no gradient flows through them and they never recover. If a large number of dead neurons exist in a Neural Network, its performance is affected. This can be corrected by using what is called Leaky ReLU, where the slope is changed to the left of x = 0 in the figure above, causing a small “leak” and extending the range of ReLU (a minimal sketch follows the figure below).

Leaky ReLU
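A minimal Leaky ReLU sketch in Python/NumPy (the slope value 0.01 used here is just a typical choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # A small positive slope (alpha) to the left of x = 0 instead of a flat 0,
    # so a small gradient still flows and the unit cannot fully "die".
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# [-0.1  -0.01  0.    1.   10.  ]
```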

3. Softmax >

Softmax is a very interesting activation function because it not only maps our output to a [0,1] range but also maps each output in such a way that the total sum is 1. The output of Softmax is therefore a probability distribution.

From Wikipedia >

The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Softmax Graphed

Mathematically, Softmax is the following function, where z is the vector of inputs to the output layer and j indexes the output units from 1, 2, 3, …, K: σ(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k).

Softmax Function
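A minimal Python/NumPy sketch of the softmax function; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result, since the extra factor cancels in the ratio:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) so exp() cannot overflow for large inputs.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0 — the outputs form a probability distribution
```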

In conclusion, Softmax is used for multi-class classification in the logistic regression model, whereas Sigmoid is used for binary classification in the logistic regression model, and for Softmax the probabilities always sum to one.

For a deeper understanding of all the main Activation Functions, I would advise you to graph them and their derivatives in Python/MATLAB/R, and to think about their ranges, minimum and maximum values, and how these are affected when numbers are multiplied with them.
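For example, a short Python/Matplotlib sketch along those lines (the choice of functions and the plotting layout are just illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)

# Each entry: (activation function, its derivative).
activations = {
    "sigmoid": (lambda x: 1 / (1 + np.exp(-x)),
                lambda x: (1 / (1 + np.exp(-x))) * (1 - 1 / (1 + np.exp(-x)))),
    "tanh":    (np.tanh,
                lambda x: 1 - np.tanh(x) ** 2),
    "ReLU":    (lambda x: np.maximum(0, x),
                lambda x: (x > 0).astype(float)),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for name, (f, df) in activations.items():
    axes[0].plot(x, f(x), label=name)   # the functions: compare their output ranges
    axes[1].plot(x, df(x), label=name)  # the derivatives: note where they vanish
axes[0].set_title("Activation functions")
axes[1].set_title("Derivatives")
for ax in axes:
    ax.legend()
plt.show()
```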