Derivation

Consider the following network architecture:

Neural Network Architecture for reference

Notations

For every layer in the network, there are certain parameters associated with it. Let us define some notations.

Notations used in the derivations

Now that we have the notations in place, let’s dive into the math.

We know that training a network requires three steps.

Forward propagation — The first step is calculating the activation of each layer one by one, starting from the first layer, until we have the activation of the last layer, a[3] or y’. Computing the cost — The second step is calculating the value of the cost function C(y’, y) and Backpropagation — The final step is updating the weights and biases of the network using the backpropagation algorithm.

Forward Propagation

Let X be the input vector to the neural network, i.e. a[0] = X.

Now, we need to calculate a[l] for every layer l in the network.

Before calculating the activation, a[l], we will calculate an intermediate value z[l]. Each element k in z[l] is just the sum of bias for the neuron k in the layer l with the weighted sum of the activation of the previous layer, l-1.

We can calculate z[l] from the following equation:

Equation for calculating the weighted input to the neurons in layer l

Where ‘.’ refers to the matrix multiplication operation, and + refers to the matrix addition.

Now that we have z[l], we can compute a[l] easily by applying the activation function g[l] element-wise to the vector z[l].

Equation for calculating the activation of layer l

With the above two equations, we can calculate the activation of each layer in the network. The activation of the last layer gives us the predicted output of the network, y’.

Selecting the Cost Function C, and computing the cost

Okay, now we know how we will forward propagate the input X through the network and obtain the predictions. But now that we have the predictions, we need a way to quantitatively determine how good/bad these predictions are.

So, how do we go about this?

To do this, we define some function, known as the cost function, which will be a function of the predicted values and the ground truth.

This function will estimate how badly the model is performing. Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between the input and the output. This cost function (also referred to as loss or error) can be estimated by iteratively running the model to compare estimated predictions, y’ against “ground truth” — the known values of y.

The objective of a the model, therefore, is to find parameters, weights and biases, that minimize the cost function.

That is why choosing the right cost function for achieving the desired result is a critical point of machine learning problems. The choice of the cost function depends on the application. Here is a good discussion about choosing the right cost function.

In this article we will be using the cross-entropy cost function, which is calculated as follows:

Equation for the Cross Entropy cost

Here y is the actual output, the ground truth, and y’ is the predicted output, or, a[3] in this case.

*Note: Here log refers to the natural logarithm.

Backpropagation

The goal of backpropagation is to compute the partial derivatives of the cost function C with respect to any weight w or bias b in the network.

Once we have these partial derivatives, we will update the weights and biases in the network by the product of some constant alpha and the partial derivative of that quantity with respect to the cost function. This is the popular gradient descent algorithm.

The partial derivatives give us the direction of greatest ascent. So, we take a small step in the opposite direction — the direction of the greatest descent, i.e. the direction which will take us to the local minima of the cost function.

Visualization of the Gradient Descent algorithm Source: Machine Learning Course by Andrew Ng

The update rules looks like this:

Equations of the update rules for the weights and biases of the layer l

Here alpha is known as the learning rate of the network, because it decides how big updates we perform, in other words, it tells us how big steps are we taking in the direction of the local minima, i.e. what is the rate at which we are learning.

To better understand the flow of computations, let us draw a computation graph.

Forward Propagation Computation Graph

From the above computation graph, we can make a few observations which will help us calculate the partial derivatives.

C is a function of a[3], a[3] is in turn a function of z[3], and z[3] is a function of w[3], b[3] and a[2]. So to calculate the partial derivative of the cost function C with respect to w[3], b[3], we will have to use the chain rule.

This gives us:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of layer 3

Similarly, we need the derivative of C with respect to w[2], b[2]. From the above computation graph, we can see that z[3] is a function of a[2], which(a[2]) is a function of z[2], and in turn z[2] is a function of w[2], b[2] and a[1].

So applying the chain rule again, we get:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of layer 2

Likewise, for calculating the partial derivatives with respect to w[1], b[1], we would use:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of layer 1

This can be better understood from the following figure:

Backpropagation Computation Graph

This gives us a better picture of how backpropagation actually works.

So now that we have the understanding of what backpropagation is actually doing, let us generalize the above equations for the partial derivatives, which would make it easier for us to convert them into code.

We have:

Equations for calculating the partial derivative of the cost function with respect to the weights and biases of any layer l

So for calculating the partial derivatives of C with respect to w[l], b[l], we need to calculate

Four fundamental partial derivatives required to perform backpropagation

where L is the last layer.

*We need to calculate dC/dz[L] separately, and then dC/dz[l] will be a general formula to calculate the partial in terms of the partial derivative of the l + 1 layer, i.e. dC/dz[l+1].

For our convenience we will assume the activation function for each layer g[l] to be the sigmoid activation function. As the activation function remains the same for all the layers, we will drop the superscript [l].

The sigmoid function is defined as:

Sigmoid function

Partial derivative of C with respect to z[L]

The partial derivative of C with respect to z[L] can be calculated as follows:

Derivation of the partial derivative of the cost function with respect to the activation of the last layer

Where .* represents element-wise multiplication of the matrices, also known as the Hadamard product. We multiply element-wise to make sure that all the dimensions of our matrix multiplications match up as expected.

Derivative of the activation function:

Derivation of the derivative of the sigmoid activation function

Putting it all together

Derivation of the partial derivative of the cost function with respect to z[L]

Partial derivative of C with respect to z[l]

We want the partial derivative of C with respect to z[l] in terms of the partial derivative of C with respect to the layer l+1, so that once we have z[L], we can calculate z[L-1], z[L-2].. and so on.

We can express C as a function of z[l + 1] for any given l. Therefore, we can write:

Equation for the partial derivative of the cost function with respect to z[l] in terms of the partial derivative of the cost function with respect to z[l + 1]

As we will first be calculating the dC/dz[L] term, we can then calculate dC/dz[l] for every l = L-1, L-2 .. and so on by substituting the values in the above equation.

We still have to calculate dz[l+1]/da[l] and da[l]/dz[l]. Let us see how we can do that.

Derivations of the partial derivative of z[l + 1] with respect to a[l] & the derivative of a[l] with respect to z[l]

Putting it all together we get:

Equation for the partial derivative of the cost function with respect to z[l]

Note that we have adjusted the terms to make sure our matrix multiplication dimensions match as expected. Here ‘.’ represents the matrix multiplication operation and .* represents the element-wise product as above.

Partial derivative of z[l] with respect to w[l]

Now let us calculate dz[l]/dw[l]

Derivation of partial derivative of z[l] with respect to w[l]

Partial derivative of C with respect to w[l]

Using the derived equations,

Equation for the partial derivative of the cost function with respect to w[l]

Partial derivative of z[l] with respect to b[l]

Derivation of the partial derivative of z[l] with respect to b[l]

Partial derivative of C with respect to b[l]

Using the derived equations,

Equation for the partial derivative of the cost function with respect to b[l]

Note that these derivations consider input X as only one training example, for a batch of m examples, we have to average the derivatives over the m examples.

That completes the derivation of the fundamental equations of backpropagation. It’s really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That’s all there really is to backpropagation — the rest is details.