1. Introduction

“You don’t understand anything until you learn it more than one way.” – Marvin Minsky

“No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either.” – Marvin Minsky

“A wealth of information creates a poverty of attention.” – Herbert Simon

This post, Deep Learning from first principles in Python, R and Octave – Part 8, is the final post in my Deep Learning from first principles series. In this post, I discuss and implement a key functionality needed while building Deep Learning networks, viz. ‘Gradient Checking’. Gradient Checking is an important method to verify the correctness of an implementation, specifically of the forward propagation and backward propagation cycles. In addition, I also discuss some tips for tuning the hyper-parameters of a Deep Learning network, based on my experience.

My posts in this ‘Deep Learning Series’ so far are:

1. Deep Learning from first principles in Python, R and Octave – Part 1 In part 1, I implement logistic regression as a neural network in vectorized Python, R and Octave

2. Deep Learning from first principles in Python, R and Octave – Part 2 In the second part I implement a simple Neural network with just 1 hidden layer and a sigmoid activation output function

3. Deep Learning from first principles in Python, R and Octave – Part 3 The 3rd part implemented a multi-layer Deep Learning Network with sigmoid activation output in vectorized Python, R and Octave

4. Deep Learning from first principles in Python, R and Octave – Part 4 The 4th part deals with multi-class classification. Specifically, I derive the Jacobian of the Softmax function and enhance my L-Layer DL network to include Softmax output function in addition to Sigmoid activation

5. Deep Learning from first principles in Python, R and Octave – Part 5 This post uses the Softmax classifier implemented earlier to classify MNIST digits with an L-layer Deep Learning network

6. Deep Learning from first principles in Python, R and Octave – Part 6 The 6th part adds more bells and whistles to my L-Layer DL network by including different initialization types, namely He and Xavier. Besides this, L2 regularization and random dropout are added.

7. Deep Learning from first principles in Python, R and Octave – Part 7 The 7th part deals with Stochastic Gradient Descent Optimization methods including momentum, RMSProp and Adam

8. Deep Learning from first principles in Python, R and Octave – Part 8 – This post implements a critical function for ensuring the correctness of an L-Layer Deep Learning network implementation, using Gradient Checking

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way up to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and is included in its entirety in the Appendix sections. My book is available on Amazon as a paperback ($18.99) and in a Kindle version ($9.99/Rs 449).

You may also like my companion book “Practical Machine Learning with R and Python – Machine Learning in stereo”, available on Amazon in paperback ($9.99) and Kindle ($6.99) versions. This book is ideal as a quick reference for the various ML functions and associated measurements in both R and Python, which are essential before delving deep into Deep Learning.

Gradient Checking is based on the following approach. One iteration of Gradient Descent computes and updates the parameters $\theta$ by doing

$\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$

To minimize the cost we will need to minimize $J(\theta)$.

Let $g(\theta)$ be a function that computes the derivative $\frac{\partial J(\theta)}{\partial \theta}$. Gradient Checking allows us to numerically evaluate the implementation of the function $g(\theta)$ and verify its correctness.
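To make the update rule concrete, here is a minimal sketch in Python of a single gradient descent step for a hypothetical one-parameter cost $J(\theta) = \theta^{2}$ (the cost function and learning rate here are purely illustrative, not taken from the actual implementation):

```python
# Minimal sketch of one gradient descent update for a hypothetical
# one-parameter cost J(theta) = theta^2, whose analytic derivative is 2*theta.
def cost(theta):
    return theta ** 2

def grad(theta):
    # Analytic derivative dJ/dtheta - this is the quantity Gradient Check verifies
    return 2 * theta

alpha = 0.1                            # learning rate
theta = 3.0                            # initial parameter
theta = theta - alpha * grad(theta)    # theta := theta - alpha * dJ/dtheta
print(theta)                           # 3.0 - 0.1 * 6.0 = 2.4
```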

We know the derivative of a function is given by

$\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$

Note: The above derivative is the 2-sided derivative. The 1-sided derivative is given by

$\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta)}{\epsilon}$

Gradient Checking is based on the 2-sided derivative because the error is of the order $O(\epsilon^{2})$, as opposed to $O(\epsilon)$ for the 1-sided derivative.
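As a quick illustration of this difference in accuracy, here is a minimal sketch using a hypothetical cost $J(\theta) = \theta^{3}$, whose true derivative at $\theta = 2$ is 12, comparing the 1-sided and 2-sided approximations for the same $\epsilon$:

```python
# Compare the 1-sided and 2-sided numerical derivatives of J(theta) = theta^3
# at theta = 2. The true derivative is 3 * theta^2 = 12.
def J(theta):
    return theta ** 3

theta, epsilon = 2.0, 1e-4
true_grad = 3 * theta ** 2                                              # 12.0

one_sided = (J(theta + epsilon) - J(theta)) / epsilon                   # error ~ O(epsilon)
two_sided = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)   # error ~ O(epsilon^2)

print(abs(one_sided - true_grad))   # ~6e-04 : error of the order of epsilon
print(abs(two_sided - true_grad))   # ~1e-08 : error of the order of epsilon^2
```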

Hence Gradient Check uses the 2-sided derivative as follows

$gradapprox = \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$

In Gradient Check the following is done

A) Run one normal cycle of your implementation by doing the following

a) Compute the output activation by running 1 cycle of forward propagation

b) Compute the cost using the output activation

c) Compute the gradients using backpropagation (‘grad’)

B) Perform gradient check steps as below

a) Set $\epsilon$ to a small value (e.g. $10^{-7}$). Flatten all ‘weights’ and ‘bias’ matrices and vectors to a single column vector $\theta$.

b) Initialize $\theta^{+}$ by bumping $\theta$ up by adding $\epsilon$ ($\theta^{+} = \theta + \epsilon$)

c) Perform forward propagation with $\theta^{+}$

d) Compute the cost with $\theta^{+}$, i.e. $J(\theta^{+})$

e) Initialize $\theta^{-}$ by bumping $\theta$ down by subtracting $\epsilon$ ($\theta^{-} = \theta - \epsilon$)

f) Perform forward propagation with $\theta^{-}$

g) Compute the cost with $\theta^{-}$, i.e. $J(\theta^{-})$

h) Compute $\frac{\partial J(\theta)}{\partial \theta}$, or ‘gradapprox’, as $\frac{J(\theta^{+}) - J(\theta^{-})}{2\epsilon}$ using the 2-sided derivative.

i) Compute the L2 norm, or Euclidean distance, between ‘grad’ and ‘gradapprox’. If the difference is of the order of $10^{-7}$ or $10^{-5}$, the implementation is correct. In the Deep Learning Specialization Prof Andrew Ng mentions that if the difference is of the order of $10^{-7}$ then the implementation is correct. A difference of $10^{-5}$ is also ok. Anything more than that is a cause of worry and you should look at your code more closely. To see more details click Gradient checking and advanced optimization
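The steps above can be sketched in vectorized Python roughly as follows. This is a simplified sketch, not the actual implementation from the post: it assumes a hypothetical helper forward_propagation(X, Y, parameters) that returns the cost J, a dictionary ‘parameters’ of weight and bias matrices, and a ‘grad’ column vector already computed by backpropagation and flattened in the same key order as the parameters.

```python
import numpy as np

def gradient_check(parameters, grad, X, Y, forward_propagation, epsilon=1e-7):
    """Numerically verify the analytic gradients ('grad', a flattened column
    vector) against the 2-sided finite-difference approximation 'gradapprox'.
    Note: 'forward_propagation' is a hypothetical helper assumed to return
    the cost J for the given parameters."""
    # a) Flatten all 'weights' and 'bias' matrices into a single column vector theta
    keys = sorted(parameters.keys())
    shapes = [parameters[k].shape for k in keys]
    theta = np.concatenate([parameters[k].reshape(-1, 1) for k in keys])
    num_params = theta.shape[0]
    gradapprox = np.zeros((num_params, 1))

    def unflatten(vec):
        # Rebuild the parameter dictionary from a flattened column vector
        params, start = {}, 0
        for k, shape in zip(keys, shapes):
            size = int(np.prod(shape))
            params[k] = vec[start:start + size].reshape(shape)
            start += size
        return params

    for i in range(num_params):
        # b)-g) Bump theta[i] up and down by epsilon and compute the cost each time
        theta_plus, theta_minus = np.copy(theta), np.copy(theta)
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        J_plus = forward_propagation(X, Y, unflatten(theta_plus))
        J_minus = forward_propagation(X, Y, unflatten(theta_minus))
        # h) 2-sided derivative
        gradapprox[i] = (J_plus - J_minus) / (2 * epsilon)

    # i) Normalized Euclidean (L2) distance between 'grad' and 'gradapprox'
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    return difference   # ~1e-7 suggests backpropagation is correct; larger values are a red flag
```

Note that in such a sketch the analytic gradients must be flattened in exactly the same order as the parameters; otherwise the computed difference will be large even when backpropagation itself is correct.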

You can clone/download the code from Github at DeepLearning-Part8

After spending the better part of 3 days, I now realize how critical Gradient Check is for ensuring the correctness of your implementation. Initially, I was getting a very high difference and did not know how to interpret the results or debug my implementation. After many hours of staring at the results, I was finally able to arrive at a way to localize issues in the implementation. In fact, I did catch a small bug in my Python code, which did not exist in the R and Octave implementations. I will demonstrate this below.