Learning is the most important ability and attribute of an intelligent system. A system that acquires knowledge through experience, trial and error, or coaching exhibits early traces of intelligence. This post explains how ANNs learn.

In the previous post, ‘Layman’s Intro to AI’, we explored a simple analogy of how an Artificial Neural Network (ANN) comes to understand the ‘knowledge weight’ of a Cat (or what we termed the ‘catiness’).

Quick Recap

We said the best-fit arithmetic analogy of a Neural Network is the following equation (which is an oversimplified lie, btw):

E=(x*w)-y

Where ‘E’ is the error, which should tend to zero,

‘x’ is an input vector (the pixels of a cat image),

‘w’ is the knowledge weight that the network needs to learn (about the Catiness of a Cat)

and ‘y’ is the expected output (which in our case was the classification “Cat”).

The ‘*’ operator is a function called the Activation Function, which was introduced in the post titled “Mathematical foundation for Activation Functions”. This post now looks at the ‘minus’ operator (again as an analogy), which encapsulates the Loss (or Cost) and Learning Function of a Neural Network.

How to train your ANN?

In the simplified equation E=(x*w)-y, we said we do not know what the weight ‘w’ needs to be. Instead, we made guesses at ‘w’ to reduce the error E to zero. Any value of ‘w’ that best fits the equation and reduces the error to zero is considered ‘knowledge’.
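As a toy sketch of that guessing game (using plain multiplication for the ‘*’ operator and made-up numbers, not real image data):

```python
# Toy illustration: find a weight 'w' that drives E = (x * w) - y toward zero.
# Here '*' is plain multiplication (the real activation function comes later),
# and x, y are invented scalars standing in for the pixel vector and the label.
x = 2.0   # "input"
y = 6.0   # expected output

best_w, best_err = None, float("inf")
for step in range(601):
    w = step / 100           # guess w between 0.0 and 6.0
    E = (x * w) - y          # error for this guess
    if abs(E) < best_err:
        best_w, best_err = w, abs(E)

print(best_w)   # 3.0 -- the guess that reduces the error to zero
```

A real network never searches exhaustively like this; the rest of the post builds up the machinery (loss function, backpropagation) that replaces this brute-force guessing.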

How do we now fit this analogy to a real ANN?

Let’s say the network in the above illustration uses a Logistic Sigmoid as the activation function, as shown below

Logistic Sigmoid

The logit ‘𝛉’ in the above activation function is the transfer potential function, as shown below

Transfer Potential

So the ‘*’ operator is an Activation Function as follows:

Full form of the logistic sigmoid activation function

Here y’ (y-prime) is the output of the activation function.

So, the arithmetic analogy of E=(x*w)-y can be replaced with E=y’-y
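That substitution can be sketched in a few lines of Python. The input vector, weights, and expected output below are invented values for illustration, not a trained network:

```python
import math

def sigmoid(theta):
    # Logistic sigmoid: squashes any real-valued logit into (0, 1).
    return 1.0 / (1.0 + math.exp(-theta))

# Hypothetical input vector and knowledge weights
x = [0.5, 0.8, 0.2]
w = [1.2, -0.4, 2.0]

theta = sum(xi * wi for xi, wi in zip(x, w))  # transfer potential (the logit)
y_prime = sigmoid(theta)                      # y': output of the activation function
y = 1.0                                       # expected output ("Cat")

E = y_prime - y                               # the local error E = y' - y
```

Because the sigmoid output lies strictly between 0 and 1 while the expected output here is 1, the error E is a small negative number that training will try to push toward zero.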

Why do I use logistic sigmoid in this example?

Logistic sigmoids are easier to understand.

An important property of the logistic sigmoid is that it always produces a real value between zero and one as output.

Values closer to zero inhibit the neuron from firing, while values closer to one excite the neuron to fire.

It is real-valued and differentiable, which is a primary requirement for an activation function in an ANN.

It is also symmetrical about its mid-point, approaching its left and right asymptotes at the same rate (not a requirement for ANNs, but it helps in understanding the concept of thresholding).

The threshold lies at the mid-point of the sigmoid (0.5).

Here is an illustration of the logistic sigmoid from WolframAlpha, with the logit plotted between -10 and 10. Notice how the curve thresholds at 0.5 and switches over. This is quite intuitive.
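The same thresholding behaviour can be reproduced numerically with a small sketch (no plotting library needed):

```python
import math

def sigmoid(theta):
    # Logistic sigmoid: maps any real logit into (0, 1).
    return 1.0 / (1.0 + math.exp(-theta))

# Sample the logit between -10 and 10, as in the plot
for theta in (-10, -2, 0, 2, 10):
    print(theta, round(sigmoid(theta), 4))

# Large negative logits give outputs near 0 (the neuron is inhibited),
# large positive logits give outputs near 1 (the neuron is excited),
# and the curve crosses its 0.5 threshold exactly at theta = 0.
```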

So far, so good. But, how do we calculate the error? Enter the Loss Function…

Loss Function (or Cost function)

Training an ANN happens in iterations. Let’s consider an iteration to be a single forward pass of the input vector through the hidden units to the output of the neural network.

Whenever we see an error deviation in this single iteration, it is considered a local error.

The equation E=y’-y is actually a true representation of a local error (a standard linear error), where y’ is the output of the activation function and y is the actual expected output.

Training can also be done in batches. For example, if we have 500 pictures of cats for training, we can set a batch size of 250. This means we send 250 cat pictures one after another, capture the local error for each picture, and aggregate them into a global error. Here, we are running two iterations of batch size 250, and the global error is calculated for each iteration.

(One whole training run over all 500 pictures is called an epoch.)

The local errors can be aggregated into a global error using several techniques (but not limited to these), such as a simple sum of errors, the sum of squared errors, or the mean squared error.
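A minimal sketch of that aggregation, using invented local errors for a batch (shortened here to a handful of values) and mean squared error as one common choice:

```python
# Hypothetical local errors (y' - y) captured from one batch of pictures,
# shortened to a handful of made-up values for illustration.
local_errors = [0.12, -0.30, 0.05, -0.08, 0.21]

# Two common aggregation choices:
sum_squared = sum(e ** 2 for e in local_errors)   # sum of squared errors
mean_squared = sum_squared / len(local_errors)    # mean squared error (MSE)

print(round(mean_squared, 5))
```

Squaring makes every local error contribute positively, so errors in opposite directions do not cancel each other out in the global error.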

Now we know what the error is. How do we tell the ANN that what it produced was not the expected answer? (This is the concept of supervised learning, where we know what to expect.) Enter Backpropagation…

Backpropagation

Backpropagation is a powerful training tool used by most ANNs to learn the knowledge weights of the hidden units. By now, we have the activation function and the loss function. What we do not know yet is how to change the hidden units, or particularly the knowledge weights ‘w’ of the hidden units, in such a way that the error reduces to zero.

In a multi-layered ANN, every neuron affects many output units in the next layer. Let’s consider the illustration below.

Base schematic of a Multi-Layered Neural Network

Here, each hidden activity in layer ‘i’ can affect many outputs in layer ‘j’ and hence will have many different errors. It is prudent to combine these errors. The idea behind combining them is to compute the rate of change of error for all the units at the same time in every iteration.

The rate of change is nothing but the error derivative of the units. (At this point, I would advise you to brush up on your calculus a bit.)

The error derivative we are planning to find drives Gradient Descent. In other words, we are trying to find a way to reduce the error toward zero at every step of the iteration (hence a “descent” along the gradient of the error function).
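A toy sketch of one weight descending its gradient, built on the earlier analogy E=(x*w)-y. The starting weight, learning rate, and data are all invented, and plain multiplication stands in for the activation function to keep the derivative simple:

```python
# Toy gradient descent on the analogy E = (x * w) - y.
# We minimise the squared error E^2; its derivative w.r.t. w is 2 * E * x.
x, y = 2.0, 6.0
w = 0.0                  # initial (ignorant) guess at the knowledge weight
learning_rate = 0.1      # how big a step to take downhill

for _ in range(50):
    E = (x * w) - y              # current error
    grad = 2 * E * x             # rate of change of squared error w.r.t. w
    w -= learning_rate * grad    # step "downhill" along the gradient

print(round(w, 6))   # converges toward 3.0, where E = 0
```

Unlike the brute-force guessing earlier, each step here uses the derivative to move ‘w’ directly toward the value that zeroes the error.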

Fundamental Math behind Backpropagation

In order to truly understand backpropagation, we need to understand two simple truths:

We cannot directly arrive at the rate of change of the error with respect to the weights for all the units. Instead, we first compute the rate of change of the error with respect to the activation functions of the hidden activities. Once that is known, then using the chain rule, we can compute the rate of change of the error with respect to the weights of the hidden units.

(If you are still scratching your head, let’s break this down further)

In the above illustration, we have:

y(j) which is the output of the activation function in the layer ‘j’.

y(i) which is the output of the activation function in layer ‘i’.

Knowledge weight ‘w(ij)’ which is the strength of the connections between neurons in layer ‘i’ and layer ‘j’.

‘theta’ as a logit, which is the transfer potential into layer ‘j’.

Let’s assume an error E, which is the difference between the output y(j) and some expected value t(j), such that E=y(j)-t(j) (consistent with E=y’-y above).
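Putting those quantities together for a single connection, the chain rule can be sketched as below. The scalar values are invented (a real network would do this over whole vectors of units), and the squared error ½(y(j)-t(j))², a common choice, is used so that the derivative of the error with respect to the output is simply y(j)-t(j):

```python
import math

def sigmoid(theta):
    # Logistic sigmoid activation
    return 1.0 / (1.0 + math.exp(-theta))

# Invented scalar values for one connection from layer i to layer j
y_i  = 0.7    # output of the unit in layer i
w_ij = 0.9    # knowledge weight on the connection i -> j
t_j  = 1.0    # expected (target) output for the unit in layer j

theta = y_i * w_ij     # transfer potential (logit) into layer j
y_j = sigmoid(theta)   # output of the unit in layer j

# Chain rule, step by step:
dE_dy     = y_j - t_j          # rate of change of error w.r.t. the activation output
dy_dtheta = y_j * (1.0 - y_j)  # derivative of the logistic sigmoid
dtheta_dw = y_i                # derivative of the logit w.r.t. the weight

dE_dw = dE_dy * dy_dtheta * dtheta_dw   # what gradient descent needs
```

Notice the order: the error is first related to the activation output, then to the logit, and only then to the weight, exactly the indirect route described above.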

In order to compute the rate of change of error E with respect to the weight w(ij), we must compute it in the following order.