In this post we will explain everything you need to know about Artificial Neural Networks, both theoretically and programmatically.

Definition:

Artificial neural networks (ANNs) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems “learn” to perform tasks by considering examples, generally without being programmed with any task-specific rules.

The Architecture of an Artificial Neural Network:

An ANN is a set of connected neurons organized in layers:

input layer: brings the initial data into the system for further processing by subsequent layers of artificial neurons.

hidden layer: a layer in between the input and output layers, where artificial neurons take in a set of weighted inputs and produce an output through an activation function.

output layer: the last layer of neurons, which produces the program's final outputs.

Types of ANNs:

Perceptron:

The simplest and oldest model of an ANN, the Perceptron is a linear classifier used for binary predictions. This means that in order for it to work, the data must be linearly separable.

Its Architecture:

Multi-layer ANN:

More sophisticated than the perceptron, a multi-layer ANN (e.g. a Convolutional Neural Network or a Recurrent Neural Network) is capable of solving more complex classification and regression tasks thanks to its hidden layer(s).

Its Architecture:

Activation Functions:

Definition:

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.

Sigmoid:

A sigmoid function is a mathematical function having a characteristic “S”-shaped curve or sigmoid curve. Often, “sigmoid function” refers to the special case of the logistic function, which maps a set of inputs to probability outputs between 0 and 1. The sigmoid activation function is widely used in binary classification.

Equation:
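For an input x, the logistic sigmoid is defined as:

\sigma(x) = \frac{1}{1 + e^{-x}}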

Graphical Representation:

Programmatically:
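A minimal NumPy sketch of the sigmoid:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the (0, 1) range.
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # [0.1192 0.5    0.8808]
```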

Tan-h:

An alternative to the logistic sigmoid is the hyperbolic tangent, or tan-h function. Like the logistic sigmoid, the tan-h function is sigmoidal (“S”-shaped), but it outputs values in the range [-1, 1]; thus, strongly negative inputs to the tan-h map to strongly negative outputs.

Equation:
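For an input x, the hyperbolic tangent is defined as:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}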

Graphical Representation:

Programmatically:
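A minimal NumPy sketch of tan-h (NumPy also ships np.tanh; the explicit form is shown for clarity):

```python
import numpy as np

def tanh(x):
    # Sigmoidal like the logistic function, but ranges over (-1, 1).
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(tanh(np.array([-2.0, 0.0, 2.0])))  # [-0.964  0.     0.964]
```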

Softmax:

Unlike the Sigmoid activation function, the Softmax activation function is used for multi-class classification. The Softmax function calculates a probability distribution over ’n’ different classes: it computes the probability of each target class over all possible target classes. The calculated probabilities then help determine the target class for the given inputs.

Equation:
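For a vector x of n raw scores, the probability assigned to class i is:

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}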

Programmatically:
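A minimal NumPy sketch of the softmax:

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # it leaves the resulting probabilities unchanged.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

print(softmax(np.array([1.0, 2.0, 3.0])))  # [0.0900 0.2447 0.6652], sums to 1
```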

ReLU:

Instead of the sigmoid activation function, most modern artificial neural networks use rectified linear units (ReLUs) for the hidden layers. A rectified linear unit outputs 0 if its input is less than 0, and the raw input otherwise; that is, if the input is greater than 0, the output is equal to the input.

Equation:
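The rectified linear unit keeps positive inputs and zeroes out the rest:

f(x) = \max(0, x)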

Graphical Representation:

Programmatically:
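A minimal NumPy sketch of ReLU:

```python
import numpy as np

def relu(x):
    # Zero out negative inputs; pass positive inputs through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-3.0, 0.0, 4.0])))  # [0. 0. 4.]
```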

Leaky ReLU:

Definition:

The Leaky ReLU activation function works the same way as the ReLU activation function, except that instead of replacing negative inputs with 0, it multiplies them by a small alpha value in an attempt to avoid the “dying ReLU” problem.

Equation:
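With a small slope α (commonly 0.01) for negative inputs:

f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}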

Graphical Representation:

Programmatically:
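A minimal NumPy sketch of the Leaky ReLU:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by alpha instead of being zeroed out.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, 0.0, 4.0])))  # [-0.03  0.    4.  ]
```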

Why do we use activation functions?

After reading the whole Activation Functions section, you are probably asking yourself about their usefulness: why even bother using one? Why not just use the initial dot product output without feeding it to an activation function?

Well, without an activation function we would fail to introduce non-linearity into the network.

An activation function will allow us to model a response variable (target variable, class label, or score) that varies non-linearly with its explanatory variables.

Non-linear means that the output cannot be reproduced from a linear combination of the inputs.

Another way to think of it: without a non-linear activation function, an artificial neural network, no matter how many layers it has, behaves just like a single-layer perceptron, because composing linear layers gives you just another linear function.

Which activation function is the “best” to use?

There are many activation functions out there but which one is the best to use? This question has been asked by many beginners in the deep learning field (including myself).

The short answer for this question is: There is no “best” activation function.

The long answer is: the choice of activation function depends on the task you are dealing with. Is the task a regression or a classification problem? If it’s a classification problem, is it a binary or a multiclass classification task? It all comes down to the task you are dealing with.

For example, we can’t possibly use the Sigmoid or tan-h activation functions for a multiclass classification task; we would be better off using the Softmax activation function instead, as it outputs a vector of probabilities, one per class.

Another example: which activation would be better to use in the hidden layers of a multi-layered neural net? If you use the sigmoid activation function, you will be hit by the vanishing gradient problem: the gradient becomes so small in the earlier layers of a deep neural network that it barely affects their weights, so the network fails to optimize them. Should we use the tan-h activation function then? Not necessarily; it’s true that the tan-h function is an enhanced version of the sigmoid function, but the vanishing gradient problem is still persistently there.

The solution to our persistent problem is the ReLU activation function. But how does it solve the vanishing gradient problem? The derivative of the ReLU activation function is 1 if its inputs are positive, which ensures that the gradient does not get infinitesimally smaller passing from one layer to another. But what if our inputs are negative? Then we get stuck with what we call the “dying ReLU” problem: the ReLU activation function always outputs the same value (zero) for any input, which means it takes no role in discriminating between inputs. Once a ReLU ends up in this state, it is unlikely to recover, because the function gradient at 0 is also 0, so gradient descent learning will not alter the weights. Here the Leaky/Parametric ReLU comes to the rescue: instead of outputting a flat zero for negative values, it multiplies them by a small alpha value (e.g. α = 0.01).

As you can see from the previous explanation, the use of a certain activation function depends on the task you are dealing with and the problems you face while building your neural network.

Forward Propagation:

Definition:

Forward propagation is the process of feeding the neural network a set of inputs, computing the dot product of those inputs with their weights, feeding the result to an activation function, and comparing the final output to the actual value, called “the ground truth”.

Calculations:

Let’s demonstrate how the output gets calculated in a forward pass in a 3–4–1 Artificial Neural Network:

The initial weights of the first and second layer:

Input:

Output:

The activation function of the hidden layer will be the ReLU function and the activation function of the output layer will be the Sigmoid function.

Let’s begin our calculations:
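The exact weight values are shown in the notebook; the sketch below walks through the same forward pass with illustrative randomly initialized weights, so only the shapes (3–4–1) match the original:

```python
import numpy as np

np.random.seed(42)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative stand-ins for the weight matrices shown above.
W1 = np.random.randn(3, 4)       # input layer -> hidden layer (3 -> 4)
W2 = np.random.randn(4, 1)       # hidden layer -> output layer (4 -> 1)

x = np.array([[1.0, 0.0, 1.0]])  # a single 3-feature input row

hidden = relu(x @ W1)            # weighted sum, then ReLU
y_hat = sigmoid(hidden @ W2)     # weighted sum, then sigmoid
print(y_hat)                     # a probability between 0 and 1
```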

Cross entropy error:

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

Equation:
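Summing over all classes:

E = -\sum_{i} y'_i \log(y_i)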

Where ‘yi’ is the predicted probability value for class ‘i’ and ‘y′i’ is the true probability for that class.

Programmatically:
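A minimal NumPy sketch of the cross-entropy error:

```python
import numpy as np

def cross_entropy(y_hat, y_true, eps=1e-12):
    # Clip predictions away from 0 and 1 so log(0) never occurs.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_hat))

# Predicting 0.012 when the true label is 1 yields a high loss (~4.42).
print(cross_entropy(np.array([0.012]), np.array([1.0])))
```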

Conclusion:

This is, in a nutshell, how forward propagation works and how a neural network generates its predictions. The forward propagation step has always been less mathematically intense than the backpropagation step, which most students and beginners legitimately find harder to grasp.

Back-propagation:

Definition:

Backpropagation is a method used in artificial neural networks to calculate the gradient that is needed to update the weights of the network.

Calculations:

First, let’s lay out some important derivatives:

Cross entropy error derivative:
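Differentiating the cross-entropy above with respect to a predicted probability y_i gives:

\frac{\partial E}{\partial y_i} = -\frac{y'_i}{y_i}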

Sigmoid derivative:
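The sigmoid’s derivative can be written in terms of the sigmoid itself:

\sigma'(x) = \sigma(x)\,(1 - \sigma(x))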

Graphical Representation:

Programmatically:
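A minimal NumPy sketch of the sigmoid derivative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expressed in terms of the sigmoid itself.
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_derivative(0.0))  # 0.25, the derivative's maximum value
```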

ReLU derivative:
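The derivative is 1 for positive inputs and 0 otherwise (at exactly 0 it is undefined; 0 is used by convention):

f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}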

Graphical Representation:

Programmatically:
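A minimal NumPy sketch of the ReLU derivative:

```python
import numpy as np

def relu_derivative(x):
    # 1 where the input is positive, 0 everywhere else.
    return (x > 0).astype(float)

print(relu_derivative(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
```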

Let’s now demonstrate the calculations that go on underneath the backpropagation process:

An illustration of how a neural network backpropagates its error:

Programmatically:
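A sketch of one backpropagation step for the 3–4–1 network above, reusing the same illustrative weights; for a sigmoid output trained with cross-entropy, the output error conveniently simplifies to (y_hat - y):

```python
import numpy as np

np.random.seed(42)

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Same illustrative setup as the forward-pass sketch.
W1 = np.random.randn(3, 4)
W2 = np.random.randn(4, 1)
x = np.array([[1.0, 0.0, 1.0]])
y = np.array([[0.0]])            # the ground truth
lr = 0.1                         # learning rate

# Forward pass.
z1 = x @ W1
a1 = relu(z1)
y_hat = sigmoid(a1 @ W2)

# Backward pass: propagate the output error back through the layers.
delta_out = y_hat - y                                   # sigmoid + cross-entropy
grad_W2 = a1.T @ delta_out
delta_hidden = (delta_out @ W2.T) * relu_derivative(z1)
grad_W1 = x.T @ delta_hidden

# Gradient-descent update.
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```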

Let’s now predict the output with the newly updated weights.

Congratulations! You’ve just done forward and backpropagation with just a few lines of code.

Building a one hidden layer Neural Network:

Let’s now build a fully functional multilayer artificial neural network. Our neural network will tackle the XOR task (where the third column of the table is irrelevant).

The classification task

The ANN’s architecture will look like this:

The ANN we’ll build

Code:
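The full code lives in the notebook; below is a compact sketch along the same lines (the hidden-layer size, activations, seed, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

np.random.seed(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # Expects s = sigmoid(x), the already-activated value.
    return s * (1 - s)

# XOR of the first two columns; the third column is irrelevant.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

# One hidden layer of 4 neurons (an illustrative size).
W1 = 2 * np.random.random((3, 4)) - 1
W2 = 2 * np.random.random((4, 1)) - 1
lr = 1.0

for i in range(10000):
    # Forward pass.
    a1 = sigmoid(X @ W1)
    y_hat = sigmoid(a1 @ W2)

    # Report the cross-entropy loss every 1000 iterations.
    if i % 1000 == 0:
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        print(f"iteration {i}: loss {loss:.6f}")

    # Backward pass (sigmoid + cross-entropy simplifies to y_hat - y).
    delta_out = y_hat - y
    delta_hidden = (delta_out @ W2.T) * sigmoid_derivative(a1)

    # Gradient-descent update, averaged over the four examples.
    W2 -= lr * a1.T @ delta_out / len(X)
    W1 -= lr * X.T @ delta_hidden / len(X)

print(y_hat)  # predictions should approach [0, 1, 1, 0]
```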

Your results should look like this:

Visualization of the loss function:

As you can see in the results, the cross-entropy loss decreases dramatically from the first 1000 iterations to the last, which showcases how well our neural network is doing at finding the optimal weights.

The prediction output of our neural network is 0.00069824, which is very close to the ground truth of 0.

Note:

I strongly recommend you check out the notebook to visualize the plots that are missing in this blog post.

Link to Jupyter Notebook: https://github.com/AegeusZerium/DeepLearning/blob/master/Deep%20Learning/Artificial%20Neural%20Network.ipynb

Conclusion:

With the emergence of Big Data, greater computational power, and ever more challenging, mathematically intense tasks, neural networks will be used more than ever.
