Deep learning is a subfield of machine learning based on artificial neural networks. It has several variants, such as the Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), which can be applied to many fields including computer vision, natural language processing and machine translation.

Deep learning is taking off for three main reasons:

Automatic feature engineering: while most machine learning algorithms require human expertise for feature engineering and extraction, deep learning automatically handles the choice of variables and their weights

Huge datasets: the continuous collection of data has led to large databases, which make it possible to train deeper neural networks

Hardware evolution: modern GPUs (Graphics Processing Units) allow faster algebraic computation, which is the core of DL

In this blog, we will focus mainly on the Multi-Layer Perceptron (MLP): we will detail the mathematical background behind the success of deep learning and explore the optimization algorithms used to improve its performance.

Table of contents

The summary is as follows:

1 - Definition

2 - Learning algorithm

3 - Parameter initialization

4 - Forward - Backpropagation

5 - Activation functions

6 - Optimization algorithm

1 - Definition

A neuron

It is a block of mathematical operations linking entities.

Let’s consider the problem of estimating the price of a house based on its size. It can be schematized as follows:

When the house is described by more variables, the graph becomes as follows:

Each neuron is divided into two main blocks:

Computation of $z$ using the inputs $x_i$:

$$z=\sum_i w_i \star x_i +b$$

Computation of $a$, which is equal to $y$ at the output layer, using $z$:

$$a=\psi(z)$$

$w_i$ are the weights, $b$ is the bias and $\psi$ is said to be the activation function.
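The two blocks of a neuron can be sketched in a few lines of plain Python. This is only an illustration: the function name `neuron`, the sigmoid activation and the numeric values are assumptions chosen for the example, not values from a trained model.

```python
import math


def sigmoid(z):
    """One possible activation function psi."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation):
    """Compute a = psi(z) with z = sum_i w_i * x_i + b."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)


# Hypothetical, already normalized description of a house with two features.
x = [0.5, -1.0]
w = [2.0, 1.0]  # illustrative weights, not learned values
b = 0.0

# z = 2.0 * 0.5 + 1.0 * (-1.0) + 0.0 = 0.0, hence a = sigmoid(0) = 0.5
a = neuron(x, w, b, sigmoid)
```

Passing the activation as an argument keeps the linear part and the non-linear part separate, which mirrors the two blocks described above.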

In general, a neural network of this kind, better known as an MLP (Multi-Layer Perceptron), is a feedforward neural network organized into several layers, in which information flows from the input layer to the output layer only. Each layer consists of a defined number of neurons. We distinguish:

The input layer

The hidden layers

The output layer

The following graph represents a neural network with 5 neurons at the input, 3 in the first hidden layer, 3 in the second hidden layer and 2 at the output.

Some variables in the hidden layers can be interpreted based on the input features. In the case of house pricing, and under the assumption that the first neuron of the first hidden layer pays more attention to the variables $x_1$ and $x_2$, it can be interpreted, for instance, as a measure of the family size the house is suited for.

DL as a supervised task

In most DL problems, we predict an output $y$ from a set of variables $X$. In this case, we suppose that for each row $X_i$ of the database we have the corresponding target $y_i$; the data is therefore labeled.

Applications : Real Estate, Speech Recognition, Image Classification …

The used data can be:

Structured: explicit databases with features well defined

Unstructured: Audio, Image, Text, …

Universal approximation theorem

Deep learning in real life is the approximation of a given function $f$. This approximation is possible and accurate thanks to the following theorem:

A multi-layer perceptron with a single hidden layer containing a finite number of neurons can approximate any continuous function $f$ on compact${}^{(*)}$ subsets of $R^n$.

The class of deep neural networks is a universal approximator $\iff$ the activation function is not polynomial.

${}^{(*)}$ In finite dimension, a set is said to be compact if it is closed and bounded.

The main takeaway of this theorem is that a neural network can, in principle, approximate any problem that can be expressed as a continuous function on a compact set. The theorem guarantees that suitable weights exist, not that training will find them.

Data Preprocessing

In any machine learning project, we divide our data into 3 sets:

Train set: used to train the algorithm and construct batches

Dev set: used to fine-tune the algorithm and evaluate bias and variance

Test set: used to estimate the generalization error/precision of the final algorithm

The following table sums up the split between the three sets according to the size $m$ of the data set:

| | Train | Dev | Test |
|---|---|---|---|
| $m=10^4$ | 60% | 20% | 20% |
| $m=10^6$ | 96% | 2% | 2% |
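A split following these proportions can be sketched as follows. The helper name `split_dataset`, the shuffling step and the fixed seed are assumptions made for the illustration; the article only prescribes the proportions.

```python
import random


def split_dataset(rows, train_frac, dev_frac, seed=0):
    """Shuffle the rows, then cut them into train / dev / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test


rows = list(range(10_000))  # stand-in for a data set with m = 10^4 rows
train, dev, test = split_dataset(rows, train_frac=0.6, dev_frac=0.2)
```

Shuffling before cutting avoids a split that depends on the order in which the rows were collected.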

Standard deep learning algorithms require a large dataset, with a number of samples around $500k$ rows. Now that the data is ready, we will see the training algorithm in the next section.

Usually, before splitting the data, we also normalize the inputs, a step detailed later in this article.

2 - Learning algorithm

Learning in neural networks is the step of calculating the parameters (weights and biases) associated with the various regressions throughout the network. In other words, we aim to find the parameters that give the best prediction/approximation $\hat{y}_i$ of the real value $y_i$, starting from the input $x_i$.

For this, we define an objective function called the loss function and denoted J which quantifies the distance between the real and the predicted values on the overall training set.

We minimize J following two major steps:

Forward propagation: we propagate the data through the network, either entirely or in batches, and we calculate the loss function on this batch, which is nothing but the sum of the errors committed at the predicted output for the different rows.

Backpropagation: consists of calculating the gradients of the cost function with respect to the different parameters, then applying a descent algorithm to update them.

We iterate the same process a number of times, called the number of epochs. After defining the architecture, the learning algorithm is written as follows:

Initialization of the model parameters, a step equivalent to injecting noise into the model.

For $k=1,2,\dots,N$ ($N$ is the number of epochs):

Perform forward propagation: $\forall i$, compute the predicted value of $x_i$ through the neural network: $\hat{y}_i^{\theta}$

Evaluate the function: $J(\theta)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}_i^{\theta}, y_i)$, where $m$ is the size of the training set, $\theta$ the model parameters and $\mathcal{L}$ the cost${}^{(*)}$ function

Perform backpropagation: apply a descent method to update the parameters: $\theta:=G(\theta)$

${}^{(*)}$ The cost function $\mathcal{L}$ evaluates the distance between the real and predicted values on a single point.

3 - Parameter initialization

The first step after defining the architecture of the neural network is parameter initialization. It is equivalent to injecting initial noise into the model’s weights.

Zero initialization: one can think of initializing the parameters with 0’s everywhere, i.e. $W^{[i]}=0, b^{[i]}=0$. Using the forward propagation equations, we note that all the hidden units will be symmetric, which penalizes the learning phase.

Random initialization: a commonly used alternative, which consists of injecting random noise into the parameters. If the noise is too large, some activation functions might saturate, which might later affect the computation of the gradient.

Two of the most famous initialization methods are:

He’s: it consists of filling the parameters with values randomly sampled from a centered normal distribution $\mathcal{N}(0, \frac{2}{n_i})$.

Xavier’s (also called Glorot’s, after the same author): same approach with a different variance: $\mathcal{N}(0, \frac{2}{n_i+n_{i+1}})$.

where $n_i$ is the number of nodes in the $i^{th}$ layer.
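The two variance choices, $\frac{2}{n_i}$ and $\frac{2}{n_i+n_{i+1}}$, can be sketched as below. The function name `init_layer`, the scheme labels `"fan_in"` and `"fan_sum"`, the mapping of $n_i$ to `n_in` and $n_{i+1}$ to `n_out`, and the zero biases are all assumptions made for the illustration.

```python
import math
import random


def init_layer(n_in, n_out, scheme, seed=0):
    """Return (W, b) with W of shape [n_in, n_out] and b of length n_out.

    scheme="fan_in"  samples W from N(0, 2 / n_in)
    scheme="fan_sum" samples W from N(0, 2 / (n_in + n_out))
    """
    rng = random.Random(seed)
    if scheme == "fan_in":
        variance = 2.0 / n_in
    elif scheme == "fan_sum":
        variance = 2.0 / (n_in + n_out)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    W = [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.0] * n_out  # assumption: biases start at zero, only W carries noise
    return W, b


W, b = init_layer(n_in=5, n_out=3, scheme="fan_sum")
```

Only the weights need to be random to break the symmetry between hidden units, which is why the biases can start at zero in this sketch.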

4 - Forward - Backpropagation

Before diving into the algebra behind deep learning, we will first set the notation used to write out the equations of both the forward propagation and the backpropagation.

Neural Network’s representation

The neural network is a sequence of regressions, each followed by an activation function. Together they define what we call the forward propagation. $W^{[i]}$ and $b^{[i]}$ are the learned parameters at each layer $i$. The backpropagation is also a sequence of algebraic operations, carried out from the output towards the input.

Forward propagation

Algebra through the network

Let us consider a neural network having L layers as follows:

We consider the $1^{st}$ node of the $2^{nd}$ hidden layer, denoted $a^{[2]}_1$.

It is computed using all the neurons of the previous layer as follows:

$$z^{[2]}_1=\sum_{l=1}^3 w^{[2]}_{1,l} a^{[1]}_l+b^{[2]}_1 \rightarrow a^{[2]}_1=\psi^{[2]}(z^{[2]}_1)$$

In general, considering the $j^{th}$ node of the $i^{th}$ layer, we have the following equations:

$$z^{[i]}_j=\sum_{l=1}^{n_{i-1}} w^{[i]}_{j,l} a^{[i-1]}_l+b^{[i]}_j \rightarrow a^{[i]}_j=\psi^{[i]}(z^{[i]}_j)$$

with $n_{i-1}$ being the number of neurons in the $(i-1)^{th}$ layer and ${W}^T$ the transpose of the matrix $W$.

Finally, we denote:

$$W^{[i]}=[w^{[i]}_1, w^{[i]}_2,.., w^{[i]}_{n_i}] \text{ where } dim(w^{[i]}_j)=[n_{i-1},1]$$

$$b^{[i]}={}^T[b^{[i]}_1, b^{[i]}_2,.., b^{[i]}_{n_i}]$$

$$\mathcal{Z}^{[i]}={}^T[z^{[i]}_1, z^{[i]}_2,.., z^{[i]}_{n_i}]; \mathcal{A}^{[i]}={}^T[a^{[i]}_1, a^{[i]}_2,.., a^{[i]}_{n_i}]$$

$$\mathcal{A}^{[i]}=\psi^{[i]}(\mathcal{Z}^{[i]})={}^T[\psi^{[i]}(z^{[i]}_1), \psi^{[i]}(z^{[i]}_2),.., \psi^{[i]}(z^{[i]}_{n_i})]$$

Thus:

$$\mathcal{A}^{[i]}=\psi^{[i]}(\mathcal{Z}^{[i]})=\psi^{[i]}({W^{[i]}}^T\mathcal{A}^{[i-1]}+b^{[i]})$$

where

$$dim(\mathcal{Z}^{[i]})=dim(\mathcal{A}^{[i]})=[n_i,1] \\ dim({W^{[i]}}^{T})={}^Tdim(W^{[i]})=[n_i,n_{i-1}] \\ dim(b^{[i]})=[n_i,1]$$

Algebra through the training set

Let us consider the prediction of the output for a single row of the data frame, denoted $x^{(j)}$, through the neural network. We set $a^{[0][j]}=x^{(j)}$ and, at each layer $[i]$, we compute:

$$z^{[i][j]}={W^{[i]}}^{T}a^{[i-1][j]}+b^{[i]}\text{ and } a^{[i][j]}=\psi^{[i]}(z^{[i][j]})$$

until $\hat{y}^{(j)}=a^{[L][j]}=\psi^{[L]}(z^{[L][j]})$, where $L$ is the number of layers.

When dealing with an $m$-row data set, repeating these operations separately for each row is very costly.

We have, at each layer $[i]$:

$$z^{[i][1]}={W^{[i]}}^{T}a^{[i-1][1]}+b^{[i]}\text{ and }a^{[i][1]}=\psi^{[i]}(z^{[i][1]}) \\ \vdots \\ z^{[i][m]}={W^{[i]}}^{T}a^{[i-1][m]}+b^{[i]}\text{ and }a^{[i][m]}=\psi^{[i]}(z^{[i][m]})$$

We can use linear algebra to parallelize it as follows:

$$Z^{[i]}={W^{[i]}}^{T}A^{[i-1]}+b^{[i]} \\ A^{[i]}=\psi^{[i]}(Z^{[i]})$$

Considering $n_i$ the number of neurons in the $i^{th}$ layer, the columns of $Z^{[i]}$ and $A^{[i]}$ are the vectors computed for each row of the data set:

$$Z^{[i]}=\begin{bmatrix} z^{[i][1]} & z^{[i][2]} & \dots & z^{[i][m]} \end{bmatrix} \\ A^{[i]}=\begin{bmatrix} a^{[i][1]} & a^{[i][2]} & \dots & a^{[i][m]} \end{bmatrix}$$

Where:

$$dim(Z^{[i]})=dim(A^{[i]})=[n_i,m] \\ dim({W^{[i]}}^{T})={}^Tdim(W^{[i]})=[n_i,n_{i-1}] \\ dim(b^{[i]})=[n_i,1]$$

The parameter $b^{[i]}$ uses broadcasting to repeat itself through the columns. This can be summarized in the following graph:
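The batched forward pass $A^{[i]}=\psi^{[i]}({W^{[i]}}^{T}A^{[i-1]}+b^{[i]})$ can be sketched in plain Python, with matrices stored as lists of rows. The helper names, the tiny 2-2-1 architecture and all numeric values are assumptions chosen so that the result can be checked by hand; a real implementation would rely on a linear algebra library.

```python
def matmul_transposed(W, A):
    """Return W^T A for W of shape [n_prev, n] and A of shape [n_prev, m]."""
    n_prev, n, m = len(W), len(W[0]), len(A[0])
    return [[sum(W[l][j] * A[l][k] for l in range(n_prev)) for k in range(m)]
            for j in range(n)]


def forward(X, params, activations):
    """Propagate a batch X of shape [n_0, m] through every layer."""
    A = X
    for (W, b), psi in zip(params, activations):
        Z = matmul_transposed(W, A)
        # b[j] is added to every column of row j: this is the broadcasting step.
        A = [[psi(z + b[j]) for z in row] for j, row in enumerate(Z)]
    return A


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


# Hypothetical network: 2 inputs -> 2 hidden units (ReLU) -> 1 linear output.
params = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.0, 1.0]),  # W[1] is [2, 2], b[1] has 2 entries
    ([[1.0], [1.0]], [0.5]),                  # W[2] is [2, 1], b[2] has 1 entry
]
X = [[1.0, 0.0], [2.0, -1.0]]  # two samples stored as columns: (1, 2) and (0, -1)
Y_hat = forward(X, params, [relu, identity])
print(Y_hat)  # [[6.5, 0.5]]
```

Each column of `X` travels through the network independently, so the whole batch is handled by the same two matrix products instead of one pass per row.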

Backpropagation

Backpropagation is the second step of learning. It consists of injecting the error committed in the prediction (forward) phase back into the network and updating its parameters so that it performs better on the next iteration. It therefore amounts to optimizing the function $J$, usually through a descent method.

Computational graph

Most of the descent methods require the computation of the gradient of the loss function, denoted $\nabla_{\theta}J(\theta)$.

In a neural network, the operation is carried out using a computational graph which decomposes the function $J$ into several intermediate variables.

Let us consider the following function: $f(x,y,z)=(x+y).z$

The main objective is to calculate $\nabla f(x,y,z)$ at $(-2,5,-4)$ where:

$$\nabla f(x,y,z)={}^T \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} & \frac{\partial f}{\partial z} \end{bmatrix}$$

Let $q=x+y \rightarrow f=q.z$. We carry out the computation using two passes:

Forward propagation: computes the value of $f$ from inputs to output:

$$f(-2,5,-4)=-12$$

Backpropagation: recursively applies the chain rule to compute gradients from output to inputs:

$$\frac{\partial f}{\partial f}=1\\ \frac{\partial f}{\partial q}=z=-4\\ \frac{\partial f}{\partial z}=q=3\\ \frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}.\frac{\partial q}{\partial x}+\frac{\partial f}{\partial z}.\frac{\partial z}{\partial x}=z.1+q.0=z=-4\\ \frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}.\frac{\partial q}{\partial y}+\frac{\partial f}{\partial z}.\frac{\partial z}{\partial y}=z.1+q.0=z=-4$$

Hence:

$$\nabla f(x,y,z)|_{(-2,5,-4)}={^T}\begin{bmatrix} -4 & -4 & 3 \end{bmatrix}$$
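The two passes of this worked example translate directly into code. The function name `forward_backward` is an assumption for the illustration; the intermediate variable `q` is the one introduced in the text.

```python
def forward_backward(x, y, z):
    """Evaluate f = (x + y) * z and its gradient with a forward and a backward pass."""
    # Forward pass: compute and keep the intermediate variable q.
    q = x + y
    f = q * z

    # Backward pass: chain rule from the output back to the inputs.
    df_dq = z            # f = q * z
    df_dz = q
    df_dx = df_dq * 1.0  # dq/dx = 1
    df_dy = df_dq * 1.0  # dq/dy = 1
    return f, (df_dx, df_dy, df_dz)


f, grad = forward_backward(-2.0, 5.0, -4.0)
print(f, grad)  # -12.0 (-4.0, -4.0, 3.0)
```

The backward pass reuses `q` from the forward pass, which is the reason the intermediate values of a network are cached during forward propagation.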

Equations

The derivatives can be summarized as follows.

Mathematically, we compute the gradients of the cost function $J$ w.r.t. the architecture’s parameters $W^{[i]}$ and $b^{[i]}$. For a given parameter $\alpha$, we set $d\alpha^{[i]}=\frac{\partial J}{\partial \alpha^{[i]}}$. With the convention $Z^{[i]}={W^{[i]}}^{T}A^{[i-1]}+b^{[i]}$ used in the forward propagation, we have at the $i^{th}$ layer:

$$dZ^{[i]}=dA^{[i]}\star\psi'^{[i]}(Z^{[i]})\\ dA^{[i-1]}=W^{[i]}dZ^{[i]}\\ dW^{[i]}=A^{[i-1]}{dZ^{[i]}}^{T}\\ db^{[i]}=\sum_{j=1}^{m}dZ^{[i][j]}$$

where $\star$ is the element-wise multiplication and $dZ^{[i][j]}$ is the $j^{th}$ column of $dZ^{[i]}$. The dimensions agree with those of the forward pass: $dim(dW^{[i]})=[n_{i-1},n_i]$ and $dim(db^{[i]})=[n_i,1]$.

We recursively apply these equations for $i=L,L-1,...,1$

Gradient Checking

When carrying out the backpropagation, an additional check is added to make sure that the algebraic computations are correct. Algorithm:

We first reshape and stack all the parameters $W^{[i]}$ and $b^{[i]}$ into one vector denoted $\theta$

We carry out the same manoeuvre for their derivatives $dW^{[i]}$ and $db^{[i]}$, and we denote $d\theta$ the resulting vector

$\forall i$, we compute: $d\theta_{approx}^{[i]}=\frac{J(\theta_1,\theta_2,...,\theta_i+\epsilon,...)-J(\theta_1,\theta_2,...,\theta_i-\epsilon,...)}{2\epsilon}$, an $O(\epsilon^2)$ approximation of $\frac{\partial J}{\partial\theta_i}=d\theta^{[i]}$ (where $\epsilon$ is very small, $\approx 10^{-7}$)

We check the following quantity: $\frac{\|d\theta_{approx}-d\theta\|_2}{\|d\theta_{approx}\|_2+\|d\theta\|_2}$

It should be close to the value of $\epsilon$; an error is suspected when the value of the quantity is near $10^{-3}$.
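The check can be sketched on a small objective whose gradient is known in closed form. The objective `J`, its gradient `analytic_grad` and the deliberately wrong gradient are assumptions made up for the example; in practice `J` would be the network loss and the analytic gradient would come from backpropagation.

```python
import math


def numerical_gradient(J, theta, eps=1e-7):
    """Centered finite differences of J at theta, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad


def relative_error(d_approx, d_theta):
    """||d_approx - d_theta|| / (||d_approx|| + ||d_theta||) with Euclidean norms."""
    def norm(v):
        return math.sqrt(sum(c * c for c in v))
    diff = [a - b for a, b in zip(d_approx, d_theta)]
    return norm(diff) / (norm(d_approx) + norm(d_theta))


def J(t):
    """Hypothetical objective: t0^2 + 3 * t0 * t1."""
    return t[0] ** 2 + 3.0 * t[0] * t[1]


def analytic_grad(t):
    return [2.0 * t[0] + 3.0 * t[1], 3.0 * t[0]]


theta = [1.0, -2.0]
d_approx = numerical_gradient(J, theta)
error = relative_error(d_approx, analytic_grad(theta))
# A gradient with a forgotten term is flagged by a much larger ratio.
buggy_error = relative_error(d_approx, [2.0 * theta[0], 3.0 * theta[0]])
```

Since the numerical gradient costs two evaluations of $J$ per parameter, this kind of check is usually run only while debugging and not during training.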

Summing up in blocks

We can sum up the Forward and Backward propagation in the following block:

Parameters vs Hyperparameters

Parameters, denoted $\theta$, are the elements which we learn through the iterations and on which we apply backpropagation and update: $W^{[i]}$ and $b^{[i]}$

Hyperparameters are all the other variables we define in our algorithm, which can be tuned in order to improve the neural network:

Learning rate $\alpha$

Number of iterations

Choice of activation functions

Number of layers $L$

Number of units in each layer



5 - Activation functions

Activation functions are a kind of transfer function that selects the data propagated in the neural network. The underlying interpretation is to allow a neuron in the network to propagate the learning data (if it is in a learning phase) only if it is sufficiently excited.

Here is a list of the most common functions:

ReLU:

$$\psi(x)=x\mathcal{1}_{x\geq 0}$$

Sigmoid:

$$\psi(x)=\frac{1}{1+e^{-x}}$$

Tanh:

$$\psi(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$$

LeakyReLU:

$$\psi(x)=x\mathcal{1}_{x\geq 0}+\alpha x\mathcal{1}_{x\leq 0}$$

Remark : if the activation functions are all linear, the neural network is precisely equivalent to a simple linear regression
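The four functions listed in this section can be written directly from their formulas. The default slope `alpha=0.01` for LeakyReLU is an assumption (the article leaves $\alpha$ unspecified), and the formulas are transcribed literally, so `sigmoid` and `tanh` will overflow for very negative inputs.

```python
import math


def relu(x):
    return x if x >= 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))


def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs (assumed default).
    return x if x >= 0 else alpha * x


for name, psi in [("relu", relu), ("sigmoid", sigmoid),
                  ("tanh", tanh), ("leaky_relu", leaky_relu)]:
    print(name, [round(psi(x), 4) for x in (-2.0, 0.0, 2.0)])
```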

6 - Optimization algorithm

Risk

Let us consider a neural network denoted by $f$. The real objective to optimize is defined as the expected loss over all the corpora:

$$R(f)=\int p(X,Y)\mathcal{L}(f(X),Y)dXdY$$

where $X$ is an element from a continuous space of observables to which corresponds a target $Y$, and $p(X,Y)$ is the joint probability of observing the couple $(X, Y)$.

Empirical risk

Since we cannot have all the corpora, and hence do not know the distribution $p$, we restrict the estimation of the risk to a dataset that is representative of the overall corpora, and consider all its cases equiprobable.

In this case: $\int=\sum$ and $p(X,Y)=\frac{1}{m}$, where $m$ is the size of the representative corpora. Hence, we iteratively optimize the loss function defined as follows:

$$J(\theta)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}_i^{\theta}, y_i)$$

Moreover, we assume that:

$$\min_f R(f)\approx \min_{\theta} J(\theta)$$

Normalizing inputs

There exist many techniques and algorithms, mainly based on gradient descent, which carry out the optimization. In the sections below, we will go through the most famous ones. It is important to note that these algorithms might get stuck in local minima, and nothing guarantees reaching the global one.

Before optimizing the loss function, we need to normalize the inputs in order to speed up the learning. In this case, the contour lines of $J(\theta)$ become tighter and more symmetric, which helps gradient descent find the minimum faster and thus in fewer iterations.

Standardization is the commonly used approach; it consists of subtracting the mean of each variable and dividing by its standard deviation. Considering $\theta={}^T[\theta_1 \theta_2]$, the following image illustrates the effect of normalizing the input on the contour lines of $J$ (standardized data on the right):

Let $X$ be a variable in our database, we set:

$$X:=\frac{X-\mu}{\sigma}$$

where $\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)}$ and $\sigma=\sqrt{\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)^2}$
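Standardizing one variable can be sketched as follows, taking $\sigma$ to be the standard deviation, i.e. the square root of the mean squared deviation from $\mu$. The function name `standardize` and the sample values are assumptions made for the illustration.

```python
import math


def standardize(column):
    """Return the standardized values of one variable, plus its mean and std."""
    m = len(column)
    mu = sum(column) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / m)
    return [(x - mu) / sigma for x in column], mu, sigma


sizes = [50.0, 80.0, 110.0]  # hypothetical house sizes
scaled, mu, sigma = standardize(sizes)
# After the transformation the variable has mean 0 and standard deviation 1.
```

The values of `mu` and `sigma` are returned because the same transformation has to be reapplied, with the same statistics, to any data the model sees later.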

Gradient descent

In general, we tend to construct a convex and differentiable function $J$, for which any local minimum is a global one. Mathematically speaking, finding the global minimum of a convex function is equivalent to solving the equation $\nabla J(\theta)=0$; we denote $\theta^{\star}$ its solution. Most of the algorithms used are of the kind $\theta_{k+1}=\theta_{k}+\alpha_kd_k$, with $\theta_0$ an initial guess, where $\alpha_k$ is the step size and $d_k$ the descent direction. A first-order Taylor expansion gives:

$$J(\theta_{k+1})=J(\theta_{k})+\alpha_k\nabla J(\theta_k)d_k+o(\alpha_k)$$

Since we seek to have $J(\theta_{k+1})<J(\theta_{k})$, we need the inner product $\nabla J(\theta_k)d_k$ to be as negative as possible for a direction of given norm, meaning $d_k=-\nabla J(\theta_k)$.

Algorithm:

$\theta_0$ is given

for $k=0,1,\dots$ until the stopping criterion is met: $\theta_{k+1}=\theta_{k}-\alpha_k\nabla J(\theta_k)$

Choice of $\alpha_k$:

$\alpha_k=\alpha$, a fixed step size

$\alpha_k$ minimizes $t\rightarrow J(\theta_k-t\nabla J(\theta_k))$

$\alpha_k$ follows a certain decay law (see Learning rate decay section)
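With a fixed step size, the algorithm fits in a few lines. The quadratic objective, the step `alpha=0.1` and the stopping rule on the gradient norm are assumptions chosen for the illustration.

```python
def gradient_descent(grad, theta0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Iterate theta <- theta - alpha * grad(theta) with a fixed step size."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)
        # Stopping criterion (assumed): the gradient norm is close to zero.
        if sum(c * c for c in g) ** 0.5 < tol:
            break
        theta = [t - alpha * c for t, c in zip(theta, g)]
    return theta


def grad_J(t):
    """Gradient of the convex example J = (t0 - 3)^2 + 2 * (t1 + 1)^2."""
    return [2.0 * (t[0] - 3.0), 4.0 * (t[1] + 1.0)]


theta_star = gradient_descent(grad_J, [0.0, 0.0])
# theta_star is close to the minimizer (3, -1) of the example objective.
```

On this example each coordinate of the error shrinks by a constant factor per iteration (0.8 and 0.6), which is why a fixed step is sufficient; a step that is too large would make the iterates diverge instead.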

Mini-batch gradient descent

This technique consists of dividing the training set into batches $(X^{\{1\}},y^{\{1\}}), (X^{\{2\}}, y^{\{2\}}),...,(X^{\{n\}}, y^{\{n\}})$. The training algorithm is as follows:

for $t=1,\dots,n$:

Carry out forward propagation on $X^{\{t\}}$

Compute the cost function normalized by the size of the batch

Carry out the backpropagation using $(X^{\{t\}}, y^{\{t\}}, \hat{y}^{\{t\}})$

Update the weights $W^{[l]}$ and $b^{[l]}$, $\forall l$

Choice of the mini-batch size:

Small number of rows, $\sim 2000$ rows

Typical size: a power of 2, which is good for memory

The mini-batch should fit in CPU/GPU memory

Remark : in the case where there is only one data line in the batch, the algorithm is called stochastic gradient descent
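Building the batches can be sketched with a small generator. The name `mini_batches`, the shuffling with a fixed seed and the row-wise storage of `X` (one sample per row) are assumptions made for the illustration.

```python
import random


def mini_batches(X, y, batch_size, seed=0):
    """Yield shuffled (X_batch, y_batch) pairs; each row of X is one sample."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [X[i] for i in chunk], [y[i] for i in chunk]


X = [[float(i)] for i in range(10)]  # 10 hypothetical samples with one feature
y = [2.0 * i for i in range(10)]
batches = list(mini_batches(X, y, batch_size=4))
print([len(X_batch) for X_batch, _ in batches])  # [4, 4, 2]
```

The last batch is smaller whenever the batch size does not divide the number of rows, and setting `batch_size=1` corresponds to the stochastic case mentioned in the remark.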

Gradient descent with momentum

A variant of gradient descent which includes the notion of momentum. The algorithm is as follows:

Initialize $V_{dW}=0_{dW}$, $V_{db}=0_{db}$

On iteration $k$:

Compute $dW$ and $db$ on the current mini-batch

$V_{dW}=\beta V_{dW}+(1-\beta)dW$; $V_{db}=\beta V_{db}+(1-\beta)db$

Update the parameters using the averaged gradients: $W:=W-\alpha V_{dW}$; $b:=b-\alpha V_{db}$

($\alpha, \beta$) are hyperparameters. Since $d\theta$ is calculated on a mini-batch, the resulting gradient $\nabla J$ is very noisy; the exponentially weighted average introduced by the momentum gives a better estimate of the derivatives.
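One momentum iteration can be sketched on a flat parameter vector, where the step is taken along the exponentially weighted average $V$ of the gradients. The function name `momentum_step`, the values `alpha=0.1` and `beta=0.9`, and the one-dimensional objective are assumptions made for the illustration.

```python
def momentum_step(theta, velocity, grad, alpha=0.1, beta=0.9):
    """One update: V <- beta * V + (1 - beta) * grad, then theta <- theta - alpha * V."""
    velocity = [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grad)]
    theta = [t - alpha * v for t, v in zip(theta, velocity)]
    return theta, velocity


def grad_J(t):
    """Gradient of the example objective J = (t0 - 3)^2."""
    return [2.0 * (t[0] - 3.0)]


theta, velocity = [0.0], [0.0]
for _ in range(500):
    theta, velocity = momentum_step(theta, velocity, grad_J(theta))
# theta[0] is now close to the minimizer 3.
```

Returning the velocity alongside the parameters makes the state of the optimizer explicit: it has to be carried from one mini-batch to the next.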

RMSprop

Root Mean Square prop (RMSprop) is very similar to gradient descent with momentum. The only difference is that it uses the second-order moment instead of the first-order one, plus a slight change in the parameters’ update:

Initialize $S_{dW}=0_{dW}$, $S_{db}=0_{db}$

On iteration $k$:

Compute $dW$ and $db$ on the current mini-batch

$S_{dW}=\beta S_{dW}+(1-\beta)dW^{2}$; $S_{db}=\beta S_{db}+(1-\beta)db^{2}$ (the squares are element-wise)

Update the parameters: $W:=W-\frac{\alpha}{\sqrt{S_{dW}}+\epsilon}dW$; $b:=b-\frac{\alpha}{\sqrt{S_{db}}+\epsilon}db$

($\alpha, \beta$) are hyperparameters and $\epsilon$ ensures numerical stability ($\approx 10^{-8}$)
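One RMSprop iteration can be sketched on a flat parameter vector. The function name `rmsprop_step`, the values `alpha=0.01` and `beta=0.9`, and the one-dimensional objective are assumptions made for the illustration.

```python
import math


def rmsprop_step(theta, s, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One update: S <- beta * S + (1 - beta) * grad^2, then a rescaled gradient step."""
    s = [beta * si + (1.0 - beta) * g * g for si, g in zip(s, grad)]
    theta = [t - alpha * g / (math.sqrt(si) + eps)
             for t, g, si in zip(theta, grad, s)]
    return theta, s


def grad_J(t):
    """Gradient of the example objective J = (t0 - 3)^2."""
    return [2.0 * (t[0] - 3.0)]


theta, s = [0.0], [0.0]
for _ in range(2000):
    theta, s = rmsprop_step(theta, s, grad_J(theta))
# With a fixed alpha, theta[0] ends up hovering near the minimizer 3.
```

Because the gradient is divided by the square root of its running second moment, the size of each step depends little on the scale of the gradient, so a fixed `alpha` leaves a small residual oscillation around the minimum.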

Adam

Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. It can be seen as a combination of RMSprop and gradient descent with momentum: like RMSprop, it uses the squared gradients to scale the learning rate, and like gradient descent with momentum, it uses a moving average of the gradient instead of the gradient itself. The main idea is to avoid oscillations during optimization by accelerating the descent in the right direction, say $dW$, using the moment $V_{dW}$. If the descent is slow, so that $V_{dW}$ and $S_{dW}$ are small, choosing a larger step $\alpha$ solves the problem; moreover, dividing by $\sqrt{S_{dW}}$ accelerates the optimization further. The algorithm of the Adam optimizer is the following:

Initialize: $V_{dW}=0$, $S_{dW}=0$, $V_{db}=0$, $S_{db}=0$

On iteration $k$:

Computation of $dW$ and $db$ through backpropagation

Momentum: $V_{dW}=\beta_1V_{dW}+(1-\beta_1)dW$; $V_{db}=\beta_1V_{db}+(1-\beta_1)db$

RMSprop: $S_{dW}=\beta_2 S_{dW}+(1-\beta_2)dW^2$; $S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2$

Correction: $V^{corr}_{dW}=\frac{V_{dW}}{1-\beta_1^k}$; $S^{corr}_{dW}=\frac{S_{dW}}{1-\beta_2^k}$; $V^{corr}_{db}=\frac{V_{db}}{1-\beta_1^k}$; $S^{corr}_{db}=\frac{S_{db}}{1-\beta_2^k}$

Parameters’ update: $W=W-\alpha\frac{V^{corr}_{dW}}{\sqrt{S^{corr}_{dW}}+\epsilon}$; $b=b-\alpha\frac{V^{corr}_{db}}{\sqrt{S^{corr}_{db}}+\epsilon}$

The corrected values are used only in the update; the running averages $V$ and $S$ themselves are left unchanged from one iteration to the next.
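One Adam iteration can be sketched on a flat parameter vector. The function name `adam_step` is an assumption, and the defaults `alpha=0.001`, `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are commonly used values, not values prescribed by this article. The bias-corrected moments are kept in separate variables so that the running averages are not overwritten.

```python
import math


def adam_step(theta, v, s, grad, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration k (k starts at 1)."""
    # Momentum: moving average of the gradient.
    v = [beta1 * vi + (1.0 - beta1) * g for vi, g in zip(v, grad)]
    # RMSprop: moving average of the squared gradient.
    s = [beta2 * si + (1.0 - beta2) * g * g for si, g in zip(s, grad)]
    # Bias correction, stored separately from the running averages.
    v_corr = [vi / (1.0 - beta1 ** k) for vi in v]
    s_corr = [si / (1.0 - beta2 ** k) for si in s]
    theta = [t - alpha * vc / (math.sqrt(sc) + eps)
             for t, vc, sc in zip(theta, v_corr, s_corr)]
    return theta, v, s


# First iteration on a single parameter with gradient 2.0.
theta, v, s = adam_step([1.0], [0.0], [0.0], grad=[2.0], k=1)
# After correction the ratio V / sqrt(S) is about 1, so the step is about alpha.
```

The first iteration shows the purpose of the correction: without it, the zero initialization of $V$ and $S$ would bias both averages towards zero during the early steps.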



Learning rate decay

The main objective of the learning rate decay is to slowly reduce the learning rate over time/iterations. It is justified by the fact that we can afford to take big steps at the beginning of the learning, but when approaching the global minimum, we slow down and thus decrease the learning rate. There exist many learning rate decay laws; here are some of the most common:

We decrease the learning rate by epoch, i.e. one pass through the data (all the mini-batches):

$$\alpha(epoch\_num)=\frac{1}{1+\beta.epoch\_num}\alpha_0$$

We can exponentially decrease the learning rate:

$$\alpha(epoch\_num)=0.95^{epoch\_num}\alpha_0$$

We can also consider the following decay law:

$$\alpha(epoch\_num)=\frac{k}{\sqrt{epoch\_num}}\alpha_0$$

($\alpha_0$, $k$, $\beta$) are hyperparameters
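The three decay laws translate into one-line functions. The function names are assumptions made for the illustration, and the initial rate and hyperparameter values in the usage loop are arbitrary.

```python
import math


def inverse_decay(alpha0, epoch_num, beta):
    """alpha0 / (1 + beta * epoch_num)"""
    return alpha0 / (1.0 + beta * epoch_num)


def exponential_decay(alpha0, epoch_num, base=0.95):
    """base ** epoch_num * alpha0"""
    return (base ** epoch_num) * alpha0


def sqrt_decay(alpha0, epoch_num, k):
    """k / sqrt(epoch_num) * alpha0, defined for epoch_num >= 1"""
    return k / math.sqrt(epoch_num) * alpha0


for epoch_num in (1, 4, 16):
    print(epoch_num,
          inverse_decay(0.1, epoch_num, beta=1.0),
          exponential_decay(0.1, epoch_num),
          sqrt_decay(0.1, epoch_num, k=1.0))
```

Note that the third law is undefined at `epoch_num = 0`, so epochs have to be counted from 1 when it is used.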

Regularization

When training a neural network, it might suffer from:

High bias : or underfitting, where the network fails to find the path in the data, in this case, J t r a i n J_{train} J t r a i n ​ is very high the same as J d e v J_{dev} J d e v ​ . Mathematically speaking, when performing cross-validation; the mean of J J J on all the considered folds is high.

: or underfitting, where the network fails to find the path in the data, in this case, is very high the same as . Mathematically speaking, when performing cross-validation; the mean of on all the considered folds is high. High variance or overfitting, the model fits perfectly on the training data but fails to generalize on unseen data, in this case, J t r a i n J_{train} J t r a i n ​ is very low and J d e v J_{dev} J d e v ​ is relatively high. Mathematically speaking, when performing cross-validation; the variance of J J J on all the considered folds is high.

Let’s consider a game of darts, where hitting the red target is the best-case scenario. Having a low bias (first row) means that on average we are close to the goal. With a low variance, the hits are all concentrated around the target (the variance of the hits’ distribution is low). When the variance is high, under the assumption of a low bias, the hits are spread out but still centered on the red circle. Conversely, we can define high bias with a low or high variance.

Mathematically speaking, let $f$ be the true regression function: $y=f(x)+\epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. We fit a hypothesis $h(x)=Wx+b$ with MSE and let $x_0$ be a new data point with $y_0=f(x_0)+\epsilon$. The expected error is defined as $\mathbb{E}[(y_0-h(x_0))^2]$, and we can assert that:

$$\begin{aligned}\mathbb{E}[(y_0-h(x_0))^2]=\;&\mathbb{E}[(h(x_0)-\bar{h}(x_0))^2]\quad\textbf{(Variance)}\\&+(\bar{h}(x_0)-f(x_0))^2\quad\textbf{(Squared bias)}\\&+\mathbb{E}[(y_0-f(x_0))^2]\quad\textbf{(Intrinsic)}\end{aligned}$$

where $\bar{Z}=\mathbb{E}[Z]$
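The decomposition can be checked numerically with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: the true function is taken to be $f=\sin$, the training inputs are drawn uniformly on $[-3, 3]$, and the function name `bias_variance_at_point` is hypothetical. The linear hypothesis $h(x)=Wx+b$ is refit on many independent training sets so that the expectations can be estimated by averages.

```python
import numpy as np


def bias_variance_at_point(x0=1.0, sigma=0.3, n_train=30, n_trials=2000, seed=0):
    """Estimate each term of the decomposition at a single point x0."""
    rng = np.random.default_rng(seed)
    f = np.sin  # assumed true regression function
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Draw a fresh training set y = f(x) + eps.
        x = rng.uniform(-3.0, 3.0, size=n_train)
        y = f(x) + rng.normal(0.0, sigma, size=n_train)
        # Least-squares (MSE) fit of h(x) = W x + b.
        W, b = np.polyfit(x, y, deg=1)
        preds[t] = W * x0 + b
    # Independent noisy observations y0 = f(x0) + eps.
    y0 = f(x0) + rng.normal(0.0, sigma, size=n_trials)
    expected_error = np.mean((y0 - preds) ** 2)
    variance = np.var(preds)                   # E[(h - h_bar)^2]
    bias_sq = (np.mean(preds) - f(x0)) ** 2    # (h_bar - f)^2
    noise = sigma ** 2                         # E[(y0 - f)^2]
    return expected_error, variance, bias_sq, noise


err, var, bias_sq, noise = bias_variance_at_point()
print("expected error:", err)
print("variance + squared bias + noise:", var + bias_sq + noise)
```

Up to sampling noise, the two printed values should agree. A straight line cannot follow a sine curve, so in this setup the squared bias is expected to dominate the variance.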

A trade-off must be found between variance and bias to reach the optimal model complexity, either by using the AIC criterion or by using cross-validation. Here is a simple schema to follow to solve bias/variance issues:

L1 - L2 regularization

Regularization is an optimization technique that helps prevent overfitting. It consists of adding a term to the objective function to minimize, as follows:

L1 regularization: $J$ becomes:

$$J(\theta)=\frac{1}{m}\sum_{i=1}^m cost(\hat{y}_i^{\theta}, y_i)+\frac{\lambda}{2m}\|\theta\|_1$$

Where $\|\theta\|_1=\sum_{i}|\theta^{[i]}|$

L2 regularization: $J$ becomes:

$$J(\theta)=\frac{1}{m}\sum_{i=1}^m cost(\hat{y}_i^{\theta}, y_i)+\frac{\lambda}{2m}\|\theta\|_2^2$$

Where $\|\theta\|_2^2=\theta^T\theta$

$\lambda$ is the hyperparameter of the regularization
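As a rough illustration, both penalty terms can be computed with NumPy. The function names are hypothetical, and the L1 term is assumed to use the plain (unsquared) L1 norm, which is the usual convention, with the same $\frac{\lambda}{2m}$ factor as the L2 term.

```python
import numpy as np


def l1_penalty(theta, lam, m):
    """lam / (2m) * ||theta||_1, assuming the unsquared L1 norm."""
    return lam / (2 * m) * np.sum(np.abs(theta))


def l2_penalty(theta, lam, m):
    """lam / (2m) * ||theta||_2^2 = lam / (2m) * theta^T theta."""
    return lam / (2 * m) * np.sum(theta ** 2)


def regularized_cost(data_cost, theta, lam, m, kind="l2"):
    """Add the chosen penalty to the average data cost (1/m) * sum(cost)."""
    penalty = l1_penalty if kind == "l1" else l2_penalty
    return data_cost + penalty(theta, lam, m)


theta = np.array([1.0, -2.0, 3.0])
print(regularized_cost(0.4, theta, lam=2.0, m=4, kind="l1"))
print(regularized_cost(0.4, theta, lam=2.0, m=4, kind="l2"))
```

The L2 term grows quadratically with the parameters, so it penalizes large weights much more strongly than the L1 term does.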

Backpropagation and regularization: the update of the parameters during backpropagation depends on the gradient $\nabla J$, to which a new regularization term is added. In L2 regularization, it becomes as follows:

$$d\theta^{reg}=d\theta+\frac{\lambda}{m}\theta\;\rightarrow\;\theta:=\theta\left(1-\frac{\lambda}{m}\alpha\right)-\alpha\, d\theta$$

Considering $\lambda \gg 1$, minimizing the cost function leads to small parameter values because of the term $\frac{\lambda}{2m}\|\theta\|_2^2$, which simplifies the network and makes it more consistent, hence less exposed to overfitting.
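The two forms of the L2 update are algebraically the same, which a short NumPy sketch can confirm. The function names are hypothetical and the numbers are arbitrary.

```python
import numpy as np


def l2_gradient_update(theta, d_theta, alpha, lam, m):
    """Gradient step using the regularized gradient d_theta + (lam / m) * theta."""
    d_theta_reg = d_theta + (lam / m) * theta
    return theta - alpha * d_theta_reg


def weight_decay_update(theta, d_theta, alpha, lam, m):
    """Same step written as a shrink of theta followed by the usual update."""
    return theta * (1.0 - alpha * lam / m) - alpha * d_theta


theta = np.array([0.5, -1.0, 2.0])
d_theta = np.array([0.1, 0.2, -0.3])
a = l2_gradient_update(theta, d_theta, alpha=0.01, lam=10.0, m=100)
b = weight_decay_update(theta, d_theta, alpha=0.01, lam=10.0, m=100)
print(np.allclose(a, b))
```

The second form shows why L2 regularization is often called weight decay: at each step the parameters are first multiplied by a factor slightly below 1.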

Dropout regularization

Roughly speaking, the main idea is to sample a uniform random variable for each node of each layer, and to keep the node with probability $p$ or remove it with probability $1-p$, which thins the network. The main intuition behind dropout is that the network shouldn’t rely on a specific feature but should instead spread out the weights. Mathematically speaking, when dropout is off and considering the $j^{th}$ node of the $i^{th}$ layer, we have the following equations:

$$\begin{aligned}z^{[i]}_j&={W^{[i]}_j}^T\mathcal{A}^{[i-1]}+b^{[i]}_j\\\rightarrow a^{[i]}_j&=\psi^{[i]}(z^{[i]}_j)\end{aligned}$$

When dropout is on, the equations become as follows:

$$\begin{aligned}r^{[i-1]}_j&\sim Bernoulli(p^{(i-1)})\\\hat{\mathcal{A}}^{[i-1]}&=\mathcal{A}^{[i-1]}\cdot r^{[i-1]}_j\\\hat{z}^{[i]}_j&={W^{[i]}_j}^T\hat{\mathcal{A}}^{[i-1]}+b^{[i]}_j\\\rightarrow a^{[i]}_j&=\psi^{[i]}(\hat{z}^{[i]}_j)\end{aligned}$$

Where $p^{(i-1)}$ is a hyperparameter.
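A minimal NumPy sketch of the dropout forward pass for one layer is given below. It is vectorized over all the nodes of the layer rather than written node by node, and the function name, argument names and array shapes are assumptions made for illustration. The Bernoulli mask is obtained by comparing uniform samples with the keep probability. The sketch follows the equations above as written, so it does not rescale the kept activations.

```python
import numpy as np


def dropout_forward(A_prev, W, b, p_keep, activation, rng):
    """Forward pass of one layer with dropout applied to its inputs.

    A_prev: (n_prev, m) activations of layer i-1
    W:      (n, n_prev) weights, row j plays the role of W_j^T
    b:      (n, 1) biases
    """
    # Bernoulli(p_keep) mask: a uniform sample below p_keep keeps the node.
    r = rng.random(A_prev.shape) < p_keep
    A_hat = A_prev * r           # dropped nodes contribute zero
    z_hat = W @ A_hat + b
    return activation(z_hat), r


rng = np.random.default_rng(0)
A_prev = rng.normal(size=(4, 3))
W = rng.normal(size=(2, 4))
b = np.zeros((2, 1))
a, mask = dropout_forward(A_prev, W, b, p_keep=0.8, activation=np.tanh, rng=rng)
print(a.shape, mask.mean())
```

With `p_keep=1.0` the mask keeps every node and the layer behaves exactly as in the equations without dropout.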

Early stopping

This technique is quite simple and consists of stopping the iterations around the point where $J_{train}$ and $J_{dev}$ start separating:
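One common way to turn this idea into code is a patience rule on $J_{dev}$: training stops once the dev loss has not improved for a given number of epochs. This is an assumed variant, not necessarily the exact criterion meant above, and the function name and `patience` parameter are hypothetical.

```python
def early_stopping_epoch(dev_losses, patience=3):
    """Return the epoch with the best dev loss seen before stopping.

    Stops scanning once the dev loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # J_dev has stopped improving
    return best_epoch


# Dev loss decreases, then rises while training would keep going.
dev_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.65, 0.7]
print(early_stopping_epoch(dev_losses, patience=3))
```

In practice the parameters saved at the returned epoch are the ones kept for the final model.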

Gradient problems

The computation of gradients suffers from two major problems: vanishing gradients and exploding gradients. To illustrate both situations, let’s consider a neural network where all the activation functions $\psi^{[i]}$ are linear, $W^{[i]}=\begin{bmatrix} 1.5 & 0\\0 & 1.5 \end{bmatrix}$ and $b^{[i]}=0, \forall i=1,...,L-1$. For an input $x$ we then have:

$$\hat{y}=W^{[L]}\begin{bmatrix} 1.5^{L-1} & 0\\0 & 1.5^{L-1} \end{bmatrix}x$$

We note that $1.5^{L-1}$ explodes exponentially as a function of the depth $L$. If we use $0.5$ instead of $1.5$, then $0.5^{L-1}$ vanishes exponentially as well.

The same issue occurs with gradients.
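The effect of depth can be seen numerically by multiplying the same diagonal weight matrix $L-1$ times. This is only a sketch of the toy linear network described above; the function name is hypothetical.

```python
import numpy as np


def deep_linear_product(scale, depth):
    """Product of the first depth-1 weight matrices, each equal to scale * I."""
    W = scale * np.eye(2)
    return np.linalg.matrix_power(W, depth - 1)


for depth in (5, 20, 50):
    exploding = deep_linear_product(1.5, depth)[0, 0]
    vanishing = deep_linear_product(0.5, depth)[0, 0]
    print(depth, exploding, vanishing)
```

Even for moderate depths, the first value grows by several orders of magnitude while the second one shrinks towards zero.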

References