Introduction

a. A robot may not injure a human being or, through inaction, allow a human being to come to harm.

b. A robot must obey orders given it by human beings except where such orders would conflict with the First Law.

c. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Isaac Asimov's Three Laws of Robotics

Any sufficiently advanced technology is indistinguishable from magic.

Arthur C. Clarke

In this 5th part of Deep Learning from first principles in Python, R and Octave, I solve the MNIST data set of handwritten digits (shown below) from the basics. To do this, I construct an L-layer, vectorized Deep Learning implementation from scratch in Python, R and Octave and use it to classify the MNIST data set. The MNIST training set contains 60,000 handwritten digits from 0-9, and the test set contains 10,000 digits. MNIST is a popular data set for running Deep Learning tests and has been rightfully termed the ‘drosophila’ of Deep Learning by none other than the venerable Prof Geoffrey Hinton.

The ‘Deep Learning from first principles in Python, R and Octave’ series so far includes Part 1, where I implemented logistic regression as a simple Neural Network, and Part 2, which implemented the most elementary neural network, with 1 hidden layer, any number of activation units in that layer, and a sigmoid activation at the output layer. This post, ‘Deep Learning from first principles in Python, R and Octave – Part 5’, largely builds upon Part 3, in which I implemented a multi-layer Deep Learning network with an arbitrary number of hidden layers and activation units per hidden layer, and an output layer based on the sigmoid unit for binary classification. In Part 4, I derived the Jacobian of the Softmax, the Cross Entropy loss and the gradient equations for a multi-class Softmax classifier, and also implemented a simple Neural Network with Softmax classification in Python, R and Octave. In this post I combine Part 3 and Part 4 to build an L-layer Deep Learning network, with an arbitrary number of hidden layers and hidden units, which can do both binary (sigmoid) and multi-class (softmax) classification.

Note: A detailed discussion of the derivation for multi-class classification can be seen in my video presentation Neural Networks 5.

The generic, vectorized L-layer Deep Learning Network implementations in Python, R and Octave can be cloned/downloaded from GitHub at DeepLearning-Part5. This implementation allows for an arbitrary number of hidden layers and hidden layer units. The activation function at the hidden layers can be one of sigmoid, relu and tanh (leaky relu will be added soon). The output activation can be ‘sigmoid’ for binary classification, or ‘softmax’ for multi-class classification. Feel free to download and play around with the code!

I thought the exercise of combining the two parts (Part 3 & Part 4) would be a breeze, but it was anything but. Incorporating a Softmax classifier into the generic L-layer Deep Learning model was a challenge. Moreover, I found that I could not use batch gradient descent on all 60,000 training samples, as my laptop ran out of memory. So I had to implement Stochastic Gradient Descent (SGD) in Python, R and Octave. In addition, I also had to implement a numerically stable version of Softmax, as the naive softmax and its derivative would result in NaNs.

Numerically stable Softmax

The Softmax function can be numerically unstable because of the division of very large exponentials. To handle this problem we have to implement the stable Softmax function as below.



$softmax(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} = \frac{C\,e^{z_i}}{C\sum_{j=1}^{n} e^{z_j}} = \frac{e^{z_i+\log C}}{\sum_{j=1}^{n} e^{z_j+\log C}}$

Therefore

$softmax(z)_i = \frac{e^{z_i+D}}{\sum_{j=1}^{n} e^{z_j+D}}, \quad \text{where } D = \log C$

Here ‘D’ can be anything. A common choice is

$D = -\max(z_1, z_2, \ldots, z_n)$
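To see the instability concretely, here is a small numerical check (an illustrative addition with hypothetical values, not part of the original code): the naive form overflows for large scores, while the shifted form does not.

import numpy as np

Z = np.array([1000.0, 1001.0, 1002.0])              # hypothetical large scores
naive = np.exp(Z) / np.sum(np.exp(Z))               # exp(1000) overflows to inf -> [nan nan nan]
shifted = Z - np.max(Z)                             # largest exponent becomes exp(0) = 1
stable = np.exp(shifted) / np.sum(np.exp(shifted))  # [0.09003057 0.24472847 0.66524096]
print(naive, stable)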

Here is the stable Softmax implementation in Python.

# A numerically stable Softmax implementation
def stableSoftmax(Z):
    # Compute the softmax of Z in a numerically stable way
    shiftZ = Z.T - np.max(Z.T, axis=1).reshape(-1, 1)
    exp_scores = np.exp(shiftZ)
    # Normalize the scores for each example
    A = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    cache = Z
    return A, cache

While trying to create the L-layer generic Deep Learning network in the 3 languages, I found it useful to ensure that the model executed correctly on smaller data sets first. You can run into numerous problems while setting up the matrices, which become extremely difficult to debug (a quick shape check, sketched further below, helps here). So in this post, I run the model on the 2 smaller data sets used in my earlier posts (Part 3 & Part 4), in each of the languages, before running the generic model on MNIST.

Here is a fair warning: if you think you can dive directly into Deep Learning with just some basic knowledge of Machine Learning, you are bound to run into serious issues. Moreover, your knowledge will be incomplete. It is essential that you have a good grasp of Machine and Statistical Learning, the different algorithms, and the measures and metrics for selecting models. It would help to be conversant with all the ML models, ML concepts, validation techniques, classification measures etc. Check out the internet/books for background.

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in Kindle version ($9.99/Rs449).

You may also like my companion book ‘Practical Machine Learning with R and Python: Second Edition – Machine Learning in stereo’, available on Amazon in paperback ($10.99) and Kindle ($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measures in both R and Python, which are essential to delve deep into Deep Learning.
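On the earlier point about matrix setup being hard to debug: a minimal shape sanity check along the following lines can catch dimension mismatches early. This is only an illustrative sketch with hypothetical layer sizes and a hand-rolled relu pass; it is not the initialization or forward-propagation code in DLfunctions51.py.

import numpy as np

# Hypothetical layer sizes: input, 2 hidden layers, output
layersDimensions = [2, 9, 9, 1]
parameters = {}
for l in range(1, len(layersDimensions)):
    parameters['W' + str(l)] = np.random.randn(layersDimensions[l], layersDimensions[l-1]) * 0.01
    parameters['b' + str(l)] = np.zeros((layersDimensions[l], 1))

# Assert the shapes while pushing a tiny batch of shape (features, samples) through the layers
X = np.random.randn(layersDimensions[0], 5)
A = X
for l in range(1, len(layersDimensions)):
    W, b = parameters['W' + str(l)], parameters['b' + str(l)]
    assert W.shape == (layersDimensions[l], layersDimensions[l-1]), W.shape
    assert b.shape == (layersDimensions[l], 1), b.shape
    A = np.maximum(0, np.dot(W, A) + b)   # relu forward pass, just to exercise the shapes
print("All layer shapes consistent; final activation shape:", A.shape)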

1. Random dataset with Sigmoid activation – Python

This random data set with 9 clusters was used in my post Deep Learning from first principles in Python, R and Octave – Part 3, and is used here to test the complete L-layer Deep Learning network with Sigmoid activation.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_blobs
exec(open("DLfunctions51.py").read()) # Cannot import in Rmd.

# Create a random data set with 9 centers
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9, cluster_std = 1.3, random_state = 4)
Y1 = Y1.reshape(400, 1)
Y1 = Y1 % 2
X2 = X1.T
Y2 = Y1.T

# Set the dimensions of the L-layer DL network
layersDimensions = [2, 9, 9, 1]

# Execute DL network with hidden activation=relu and sigmoid output function
parameters = L_Layer_DeepModel(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate = 0.3, num_iterations = 2500, print_cost = True)
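A quick scatter plot of this data (a small illustrative addition, using only matplotlib and the X1, Y1 created above) shows the effect of the Y1 % 2 relabelling: the 9 blobs collapse into 2 interleaved classes, which is what makes this a non-trivial binary problem.

import matplotlib.pyplot as plt
# Colour the 400 points by their binary label (X1 is 400 x 2 before the transpose above)
plt.scatter(X1[:, 0], X1[:, 1], c=Y1.ravel(), s=15, cmap=plt.cm.Spectral)
plt.title("9 blobs relabelled into 2 classes with Y1 % 2")
plt.show()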

2. Spiral dataset with Softmax activation – Python

The Spiral data set was used in my post Deep Learning from first principles in Python, R and Octave – Part 4, and is used here to test the complete L-layer Deep Learning network with multi-class Softmax activation at the output layer.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_blobs
exec(open("DLfunctions51.py").read())

# Create the spiral data set with 3 classes and 100 points per class
N = 100  # number of points per class
D = 2    # dimensionality
K = 3    # number of classes
X = np.zeros((N*K, D))
y = np.zeros(N*K, dtype='uint8')
for j in range(K):
    ix = range(N*j, N*(j+1))
    r = np.linspace(0.0, 1, N)                                  # radius
    t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2   # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j

X1 = X.T
Y1 = y.reshape(-1, 1).T

numHidden = 100
numFeats = 2
numOutput = 3
# Set the dimensions of the layers
layersDimensions = [numFeats, numHidden, numOutput]
parameters = L_Layer_DeepModel(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.6, num_iterations = 9000, print_cost = True)
## Cost after iteration 0: 1.098759
## Cost after iteration 1000: 0.112666
## Cost after iteration 2000: 0.044351
## Cost after iteration 3000: 0.027491
## Cost after iteration 4000: 0.021898
## Cost after iteration 5000: 0.019181
## Cost after iteration 6000: 0.017832
## Cost after iteration 7000: 0.017452
## Cost after iteration 8000: 0.017161
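To see what the trained network has learnt on the spiral data, the decision regions can be plotted. The sketch below assumes that predict_proba(parameters, X, outputActivationFunc="softmax") returns a predicted class label per column of X, which is how it is used with confusion_matrix in the MNIST section further below; if your copy of DLfunctions51.py behaves differently, adapt accordingly.

import numpy as np
import matplotlib.pyplot as plt

# Build a grid over the feature space (assumes X, y and parameters from the block above)
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict a class for every grid point; the grid is passed as (features x samples)
grid = np.c_[xx.ravel(), yy.ravel()].T
Z = predict_proba(parameters, grid, outputActivationFunc="softmax")
Z = np.array(Z).reshape(xx.shape)   # assumes one label per grid point

# Plot the decision regions and overlay the spiral points
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=15, cmap=plt.cm.Spectral)
plt.title("Decision regions on the spiral data")
plt.show()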

3. MNIST dataset with Softmax activation – Python

In the code below, I execute Stochastic Gradient Descent on the MNIST training data of 60,000 digits, with a mini-batch size of 1000. Python takes about 40 minutes to crunch the data. In addition, I also compute the Confusion Matrix and other metrics like Accuracy, Precision and Recall for the MNIST data set. I get an accuracy of 0.93 on the MNIST test set. This accuracy can be improved by choosing more hidden layers or more hidden units, and possibly also by tweaking the learning rate and the number of epochs.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
exec(open("DLfunctions51.py").read())
exec(open("load_mnist.py").read())

# Read the MNIST training and test sets
training = list(read(dataset='training', path=".\\mnist"))
test = list(read(dataset='testing', path=".\\mnist"))

# Create labels and pixel arrays
lbls = []
pxls = []
print(len(training))
for i in range(60000):
    l, p = training[i]
    lbls.append(l)
    pxls.append(p)
labels = np.array(lbls)
pixels = np.array(pxls)
y = labels.reshape(-1, 1)
X = pixels.reshape(pixels.shape[0], -1)
X1 = X.T
Y1 = y.T

# Set the dimensions of the layers. The MNIST data is 28x28 pixels = 784.
# Hence the input layer is 784. For the 10 digits the Softmax classifier
# has to handle 10 outputs
layersDimensions = [784, 15, 9, 10]
np.random.seed(1)
costs = []

# Run Stochastic Gradient Descent with Learning Rate=0.01, mini batch size=1000,
# number of epochs=3000
parameters = L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.01, mini_batch_size = 1000, num_epochs = 3000, print_cost = True)

# Compute the Confusion Matrix on the Training set
proba = predict_proba(parameters, X1, outputActivationFunc="softmax")
a = confusion_matrix(Y1.T, proba)
print(a)
print('Accuracy: {:.2f}'.format(accuracy_score(Y1.T, proba)))
print('Precision: {:.2f}'.format(precision_score(Y1.T, proba, average="micro")))
print('Recall: {:.2f}'.format(recall_score(Y1.T, proba, average="micro")))

# Read the test data
lbls = []
pxls = []
print(len(test))
for i in range(10000):
    l, p = test[i]
    lbls.append(l)
    pxls.append(p)
testLabels = np.array(lbls)
testPixels = np.array(pxls)
ytest = testLabels.reshape(-1, 1)
Xtest = testPixels.reshape(testPixels.shape[0], -1)
X1test = Xtest.T
Y1test = ytest.T

# Compute the Confusion Matrix on the Test set
probaTest = predict_proba(parameters, X1test, outputActivationFunc="softmax")
a = confusion_matrix(Y1test.T, probaTest)
print(a)
print('Accuracy: {:.2f}'.format(accuracy_score(Y1test.T, probaTest)))
print('Precision: {:.2f}'.format(precision_score(Y1test.T, probaTest, average="micro")))
print('Recall: {:.2f}'.format(recall_score(Y1test.T, probaTest, average="micro")))

##1. Confusion Matrix of Training set (rows: actual digits 0-9, columns: predicted digits 0-9)
## [[5854    0   19    2   10    7    0    1   24    6]
##  [   1 6659   30   10    5    3    0   14   20    0]
##  [  20   24 5805   18    6   11    2   32   37    3]
##  [   5    4  175 5783    1   27    1   58   60   17]
##  [   1   21    9    0 5780    0    5    2   12   12]
##  [  29    9   21  224    6 4824   18   17  245   28]
##  [   5    4   22    1   32   12 5799    0   43    0]
##  [   3   13  148  154   18    3    0 5883    4   39]
##  [  11   34   30   21   13   16    4    7 5703   12]
##  [  10    4    1   32  135   14    1   92  134 5526]]
##2. Accuracy, Precision, Recall of Training set
## Accuracy: 0.96
## Precision: 0.96
## Recall: 0.96
##3. Confusion Matrix of Test set (rows: actual digits 0-9, columns: predicted digits 0-9)
## [[ 954    1    8    0    3    3    2    4    4    1]
##  [   0 1107    6    5    0    0    1    2   14    0]
##  [  11    7  957   10    5    0    5   20   16    1]
##  [   2    3   37  925    3   13    0    8   18    1]
##  [   2    6    1    1  944    0    7    3    4   14]
##  [  12    5    4   45    2  740   24    8   42   10]
##  [   8    4    4    2   16    9  903    0   12    0]
##  [   4   10   27   18    5    1    0  940    1   22]
##  [  11   13    6   13    9   10    7    2  900    3]
##  [   8    5    1    7   50    7    0   20   29  882]]
##4. Accuracy, Precision, Recall of Test set
## Accuracy: 0.93
## Precision: 0.93
## Recall: 0.93
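The training above uses L_Layer_DeepModel_SGD because, as mentioned in the introduction, the full 60,000-sample batch would not fit in memory. For readers who want to see the shape of such a mini-batch loop, here is a minimal, self-contained sketch of mini-batch SGD for a single softmax layer on random data. It only illustrates the loop structure (shuffle each epoch, one gradient update per mini-batch); it is not the L_Layer_DeepModel_SGD implementation itself.

import numpy as np

np.random.seed(1)
nFeatures, nClasses, m = 20, 10, 5000
X = np.random.randn(nFeatures, m)          # features x samples
y = np.random.randint(0, nClasses, m)      # integer class labels
W = np.zeros((nClasses, nFeatures))
b = np.zeros((nClasses, 1))
learningRate, mini_batch_size, num_epochs = 0.01, 1000, 5

for epoch in range(num_epochs):
    permutation = np.random.permutation(m)             # shuffle the samples each epoch
    Xs, ys = X[:, permutation], y[permutation]
    for k in range(0, m, mini_batch_size):
        Xb, yb = Xs[:, k:k+mini_batch_size], ys[k:k+mini_batch_size]
        nb = Xb.shape[1]
        # Forward pass with a numerically stable softmax
        Z = np.dot(W, Xb) + b
        Zshift = Z - np.max(Z, axis=0, keepdims=True)
        A = np.exp(Zshift) / np.sum(np.exp(Zshift), axis=0, keepdims=True)
        # Gradient of the cross-entropy loss (A - one-hot(y)) and one update per mini-batch
        A[yb, np.arange(nb)] -= 1
        dW = np.dot(A, Xb.T) / nb
        db = np.sum(A, axis=1, keepdims=True) / nb
        W -= learningRate * dW
        b -= learningRate * db
    # (optionally compute and print the cost here, as print_cost=True does above)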

4. Random dataset with Sigmoid activation – R code

This is the random data set used in the Python code above, which was saved as a CSV file. The code below tests an L-layer DL network with Sigmoid activation in R.

source("DLfunctions5.R")
library(ggplot2)  # for plotting the cost curve
# Read the random data set
z <- as.matrix(read.csv("data.csv", header=FALSE))
x <- z[, 1:2]
y <- z[, 3]
X <- t(x)
Y <- t(y)

# Set the dimensions of the layers
layersDimensions = c(2, 9, 9, 1)
# Sigmoid activation unit in the output layer
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
                            hiddenActivationFunc = 'relu',
                            outputActivationFunc = "sigmoid",
                            learningRate = 0.3,
                            numIterations = 5000,
                            print_cost = TRUE)

# Plot the cost vs iterations
iterations <- seq(0, 5000, 1000)
costs = retvals$costs
df = data.frame(iterations, costs)
ggplot(df, aes(x=iterations, y=costs)) + geom_point() + geom_line(color="blue") +
  ggtitle("Costs vs iterations") + xlab("Iterations") + ylab("Loss")