Cost Function

Let’s start by defining the general equation for the cost function. This function represents the total error, i.e. the sum of the differences between the predicted values and the real (labeled) values.

Image 12: General Cost function. source: coursera.org

Since this is a classification problem, y can only take the discrete values {0,1}: an example either belongs to a class or it does not. For example, suppose we classify images of dogs (class 1), cats (class 2) and birds (class 3). If the input image is a dog, the output will be 1 for the dog class and 0 for the other classes.

This means that we want our hypothesis to satisfy

Image 13: Hypothesis function range values

So that’s why we will define our hypothesis as

Image 14: Hypothesis function

Here g is the Sigmoid function, since its output always lies in the range (0,1).
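To make this concrete, here is a minimal Python sketch of the Sigmoid function. The function name `sigmoid` and the sample inputs are my own choices for illustration, not taken from the images above.

```python
import math

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + e^(-z))."""
    # Branch on the sign so math.exp is never called with a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Very negative inputs give values near 0, very positive inputs values near 1.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

Whatever value z takes, the result stays between 0 and 1, which is exactly the range we need for the hypothesis.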

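Our goal is to optimize the cost function, so we need to find min J(θ). But if we plug the Sigmoid hypothesis into a squared-error cost, J(θ) becomes a “non-convex” function (“Image 15”), which means that it can have multiple local minimums, so gradient descent is not guaranteed to converge to the global minimum. What we want is a “convex” function, so that the gradient descent algorithm can find the global minimum (minimize J(θ)). To get that we use the log function. (Strictly speaking, this convexity holds for a single Sigmoid unit; with hidden layers the cost is in general still non-convex, but the log-based cost remains the standard choice.)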

Image 15: Convex vs Non-convex function. source: researchgate.com

So that’s why we use the following cost function for neural networks

Image 16: Neural Network cost function. source: coursera.org

When the labeled value y is equal to 1, the cost is -log(h(x)); otherwise it is -log(1-h(x)).

The intuition is pretty simple if we look at the function graphs. Let’s first look at the case where y=1. Then -log(h(x)) looks like the graph below. We are only interested in the (0,1) interval on the x-axis, since the hypothesis can only take values in that range (“Image 13”).

Image 17: Cost function -log(h(x)) . source: desmos.com

What we can see from the graph is that if y=1 and h(x) approaches 1 (x-axis), the cost approaches 0 (h(x)-y would be 0), since it’s the right prediction. If instead h(x) approaches 0, the cost goes to infinity (very large cost).

In the other case where y=0, the cost function is -log(1-h(x))

Image 18: -log(1-h) cost function. source: desmos.com

From the graph here we can see that if h(x) approaches 0, the cost approaches 0, since it’s also the right prediction in this case. If h(x) approaches 1 instead, the cost goes to infinity.
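The two cases can also be checked numerically. The sketch below is only an illustration; the helper name `example_cost` and the sample values of h are assumptions of mine.

```python
import math

def example_cost(h, y):
    """Cost of one prediction h in (0, 1) for a label y in {0, 1}."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# A confident right prediction costs almost nothing,
# a confident wrong prediction costs a lot.
for h in (0.01, 0.5, 0.99):
    print(f"h={h}: cost if y=1 -> {example_cost(h, 1):.3f}, "
          f"cost if y=0 -> {example_cost(h, 0):.3f}")
```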

Since y (the labeled value) is always equal to 0 or 1, we can write the cost function as a single equation.

Image 19: Cost function equation. source: coursera.org

If we fully write our cost function with the summation we would get:

Image 20: Cost function in case of one output node. source: coursera.org
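As a rough sketch of what this summation computes, here is the single-output cost averaged over m training examples, leaving out the regularization term. The function name `cost_single_output` and the plain-list inputs are assumptions made for illustration.

```python
import math

def cost_single_output(predictions, labels):
    """Average cost J over m examples for a network with one output node."""
    m = len(labels)
    total = 0.0
    for h, y in zip(predictions, labels):
        # Because y is 0 or 1, only one of the two terms is non-zero.
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / m

good = cost_single_output([0.9, 0.1, 0.8], [1, 0, 1])
bad = cost_single_output([0.1, 0.9, 0.2], [1, 0, 1])
print(good, bad)  # the second value is larger, since those predictions are wrong
```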

This is for the case where there is only one node in the output layer of the Neural Network. If we generalize this for multiple output nodes (multiclass classification) what we get is:

Image 21: Generalized Cost function. source: coursera.org

The right-hand parts of the equations represent the cost function “regularization”. This regularization prevents the model from “overfitting” the data by reducing the magnitude of the θ values.
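To see how the two parts fit together, here is a hedged sketch of the generalized cost with K output nodes plus the regularization penalty. The name `cost_multiclass`, the list-of-lists layout and the choice to skip the first (bias) column in the penalty are my assumptions; the last one follows the usual Coursera convention.

```python
import math

def cost_multiclass(H, Y, thetas, lam):
    """Regularized cost for K output nodes.

    H, Y: m rows of K values (predictions in (0, 1) and one-hot labels).
    thetas: list of weight matrices; column 0 of each row is the bias weight.
    lam: regularization strength (lambda).
    """
    m = len(Y)
    data_term = 0.0
    for h_row, y_row in zip(H, Y):
        for h, y in zip(h_row, y_row):
            data_term += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    # Assumption: bias weights (column 0) are left out of the penalty.
    penalty = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return data_term / m + lam / (2.0 * m) * penalty
```

Calling it with `lam=0` gives the unregularized cost, and a larger `lam` penalizes large weights more strongly.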

Forward Propagation Calculation

Forward propagation is the process of computing the Neural Network output value for a given input. We need this output to calculate the cost value. It is the same mathematical process as the one described in section 2, “Model Representation Mathematics”, at the end of which we get our hypothesis value (“Image 7”).

Once we have the h(x) value (hypothesis), we use the Cost function equation (“Image 21”) to calculate the cost for the given set of inputs.

Image 22: Calculate Forward propagation

Here we can notice how forward propagation works and how a Neural Network generates the predictions.
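A minimal sketch of forward propagation for a single input vector could look like this. The function names, the layout of the weight matrices (one row per unit of the next layer, bias weight first) and the all-zero example weights are assumptions for illustration only.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp is never called with a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def forward(x, thetas):
    """Return the activations of every layer for one input vector x."""
    activations = [list(x)]
    for theta in thetas:
        a_with_bias = [1.0] + activations[-1]          # prepend the bias unit
        z = [sum(w * a for w, a in zip(row, a_with_bias)) for row in theta]
        activations.append([sigmoid(v) for v in z])    # a(l+1) = g(z(l+1))
    return activations

thetas = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # 2 inputs -> 2 hidden units
    [[0.0, 0.0, 0.0]],                   # 2 hidden units -> 1 output
]
h = forward([1.0, 2.0], thetas)[-1]
print(h)  # [0.5], since every z is 0 when all weights are 0
```

The last entry of the returned list is the hypothesis h(x), which is what goes into the cost function.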

Backpropagation Algorithm

What we want to do is minimize the cost function J(θ) by finding the optimal set of values for θ (weights). Backpropagation is the method we use to compute the partial derivatives of J(θ) with respect to each θ.

These partial derivative values are then used in the Gradient descent algorithm (“Image 23”) to calculate the θ values for the Neural Network that minimize the cost function J(θ).

Image 23: General form of gradient descent. source: coursera.org
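To illustrate the update rule, here is a tiny gradient descent sketch on a toy one-parameter cost. The toy cost J(θ) = (θ - 3)², the learning rate and the number of steps are all assumptions picked just to show the mechanics; in the Neural Network the derivative would come from Backpropagation.

```python
def gradient_descent(grad_fn, theta, alpha, steps):
    """Repeat theta := theta - alpha * dJ/dtheta for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# Toy cost J(theta) = (theta - 3)^2, whose derivative is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0, alpha=0.1, steps=100)
print(theta_min)  # very close to 3, the minimum of the toy cost
```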

Backpropagation algorithm has 5 steps:

1. Set a(1) = X for the training examples
2. Perform forward propagation and compute a(l) for the other layers (l = 2…L)
3. Use y and compute the delta value for the last layer: δ(L) = h(x) - y
4. Compute the δ(l) values backwards for each layer (described in the “Math behind Backpropagation” section)
5. Calculate the derivative values Δ(l) = (a(l))^T ∘ δ(l+1) for each layer, which represent the derivative of the cost J(θ) with respect to θ(l) for layer l
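The five steps can be sketched in plain Python for a single training example and a network with one hidden layer. This is only an illustration under my own assumptions: the function name `backprop_single`, the weight layout (one row per unit of the next layer, bias weight first), no regularization, and gradients stored in the same shape as the weight matrices.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp is never called with a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def backprop_single(x, y, theta1, theta2):
    """Gradients of the (unregularized) cost for one example."""
    # Step 1: a(1) is the input, with the bias unit prepended.
    a1 = [1.0] + list(x)
    # Step 2: forward propagation through the hidden and output layers.
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a3 = [sigmoid(sum(w * a for w, a in zip(row, a2))) for row in theta2]
    # Step 3: delta of the last layer is h(x) - y.
    d3 = [h - t for h, t in zip(a3, y)]
    # Step 4: propagate delta backwards (the bias unit at index 0 is skipped).
    d2 = [sum(theta2[k][j] * d3[k] for k in range(len(d3))) * a2[j] * (1.0 - a2[j])
          for j in range(1, len(a2))]
    # Step 5: derivative for each weight = delta of the next layer * activation.
    grad2 = [[d * a for a in a2] for d in d3]
    grad1 = [[d * a for a in a1] for d in d2]
    return grad1, grad2

theta1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]  # 2 inputs -> 2 hidden units
theta2 = [[0.2, -0.1, 0.3]]                    # 2 hidden units -> 1 output
grad1, grad2 = backprop_single([1.0, 2.0], [1.0], theta1, theta2)
print(grad1, grad2)
```

A useful sanity check for code like this is to compare a few entries against a numerical derivative of the cost.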

Backpropagation is about determining how changing the weights impacts the overall cost in the neural network.

It propagates the “error” backwards through the neural network. On the way back, it finds how much each weight contributes to the overall “error”. The weights that contribute more to the overall “error” will have larger derivative values, which means that they will change more (when computing Gradient descent).

Now that we have a sense of what the Backpropagation algorithm is doing, we can dive deeper into the concepts and math behind it.

Why derivatives?

The derivative of a function (in our case J(θ)) with respect to each variable (in our case each weight θ) tells us the sensitivity of the function to that variable, i.e. how changing the variable impacts the function value.

Let’s look at a simple example neural network

Image 24: Simple Neural Network

There are two input nodes, x and y. The output function calculates the product of x and y. We can now compute the partial derivatives for both nodes.

Image 25: Derivatives with respect to y and x of f(x,y) = xy function

The partial derivative with respect to x says that if the value of x increases by some small amount ϵ, the function (product xy) increases by 7ϵ, and the partial derivative with respect to y says that if the value of y increases by ϵ, the function increases by 3ϵ. (These numbers correspond to inputs x = 3 and y = 7, since ∂f/∂x = y and ∂f/∂y = x.)
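We can check this sensitivity numerically with a small sketch. The input values x = 3 and y = 7 are inferred from the 7ϵ and 3ϵ in the text, and the step size `eps` is an arbitrary small number.

```python
def f(x, y):
    return x * y

x, y, eps = 3.0, 7.0, 1e-6

# Nudge one input by eps and see how much the output moves per unit of eps.
df_dx = (f(x + eps, y) - f(x, y)) / eps
df_dy = (f(x, y + eps) - f(x, y)) / eps
print(df_dx, df_dy)  # approximately 7 and 3
```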

As we defined, the Backpropagation algorithm calculates the derivative of the cost function with respect to each θ weight parameter. By doing this we determine how sensitive the cost function J(θ) is to each of these θ weight parameters. It also helps us determine how much we should change each θ weight parameter when computing the Gradient descent, so that at the end we get the model that best fits our data.

Math behind Backpropagation

We will be using the neural network model below as a starting point to derive the equations.

Image 26: Neural Network

In this model we have 3 output nodes (K) and 2 hidden layers. As previously defined, the cost function for the neural network is:

Image 27: Generalized Cost function. source: coursera.org

What we need is to compute the partial derivative of J(θ) with respect to each θ parameter. We are going to leave out the summation, since we are using a vectorized implementation (matrix multiplication). We can also leave out the regularization (right part of the equation above) and compute it separately at the end: since it is added to the rest of the cost, its derivative can be computed independently.

NOTE: Vectorized implementation will be used so we calculate for all training examples at once.

We start by defining the derivative rules that we will use.

Image 28: Derivative Rules

Now we define the basic equations for our neural network model, where l is the layer notation and L stands for the last layer.

Image 29: Initial Neural Network model equations

In our case L has the value 4, since we have 4 layers in our model. So let’s start by computing the partial derivative with respect to the weights between the 3rd and 4th layer.

Image 30: Derivative of θ parameters between 3rd and 4th layer

Step (6) — Sigmoid derivative

To explain step (6) we need to calculate the derivative of the sigmoid function.

Image 31: Derivative of Sigmoid function
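The well-known result of this derivation is g'(z) = g(z)(1 - g(z)). The sketch below compares that formula with a numerical derivative; the function names and the test point z = 0.7 are my own choices.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp is never called with a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_derivative(z):
    """g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z, eps = 0.7, 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_derivative(z), numerical)  # the two values should agree closely
```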

For the last layer L we get,

Image 32: Output layer equation

so,

Image 33: Output layer equation

Step (11): Get rid of the summation (Σ)

Also, in the last step (11) it’s important to note that we need to multiply δ by the transpose of a in order to get rid of the summation (1…m for training examples).

δ: matrix with dimensions [number_of_training_examples, output_layer_size], which also means that we get rid of the second summation (1…K for the number of output nodes).

a: matrix with dimensions [hidden_layer_size, number_of_training_examples], so the product of the two has dimensions [hidden_layer_size, output_layer_size].

Now we continue with the next derivative, for the θ parameters between the 2nd and 3rd layer. For this derivation we can start from step (9) (“Image 30”). Since θ(2) is inside the a(3) function, we need to apply the “Chain Rule” when calculating the derivative (step (6) from the derivative rules in “Image 28”).

Image 34: Derivative of θ parameters between 2nd and 3rd layer

Now we have the derivative for the θ parameters between the 2nd and 3rd layer. What is left to do is compute the derivative for the θ parameters between the input layer and the 2nd layer. By doing this we will see that the same process (equations) is repeated, so we can derive general δ and derivative equations. Again we continue from step (3) (“Image 34”).

Image 35: Derivative of θ parameters between input and 2nd layer

From the equation above we can derive the equations for the δ parameter and for the derivative with respect to the θ parameter.

Image 36: Recursive δ equation

Image 37: Derivative of J (cost) with respect to θ in layer l equation

At the end we get three matrices of calculated derivatives, one for each θ weight matrix and with the same dimensions, containing the derivative for each θ parameter.

Add the regularization

As already mentioned, regularization is needed to prevent the model from overfitting the data. We have already defined the regularization for our cost function, which is the right part of the equation defined in “Image 21”.

Image 38: Regularization equation for Cost function

In order to add the regularization to the gradient (partial derivative) we need to compute the partial derivative of the regularization term above.

Image 39: Regularization equation for gradient (partial derivative)

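This means adding the θ values of each layer, scaled by a constant factor, element by element to the corresponding partial derivatives with respect to θ. With the Coursera regularization term (λ/2m times the sum of the squared θ values) that factor is λ/m.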
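As a small sketch of this step, the function below adds the regularization part to an already computed gradient matrix. The name `add_regularization`, the list-of-lists layout and the decision to leave the first (bias) column unregularized are assumptions on my side, the last one following the usual Coursera convention.

```python
def add_regularization(grad, theta, lam, m):
    """Add (lam / m) * theta to the gradient, leaving the bias column (index 0) as is."""
    return [[g if j == 0 else g + lam / m * w
             for j, (g, w) in enumerate(zip(g_row, w_row))]
            for g_row, w_row in zip(grad, theta)]

grad = [[1.0, 1.0]]
theta = [[5.0, 2.0]]
print(add_regularization(grad, theta, lam=3.0, m=6))  # [[1.0, 2.0]]
```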

Code Implementation

We can now implement all the equations in code, calculating the Cost and the derivatives (using Backpropagation) so that we can later use them in the Gradient descent algorithm to optimize the θ parameters of our model.

Image 40: Code implementation of Neural Network Cost function and Backpropagation algorithm

Conclusion

Hopefully this was clear and easy to understand. If you think that some part needs better explanation please feel free to add a comment or suggestion. For any questions feel free to contact me.

Hope you enjoyed it!

Helpful links