6. Optimization algorithm

Risk

Let us consider a neural network denoted by f. The real objective to optimize is the expected loss over all the corpora, called the risk:

R(f) = E[L(f(X), Y)] = ∫ L(f(X), Y) p(X, Y) dX dY

where X is an element from a continuous space of observables to which corresponds a target Y, and p(X, Y) is the joint probability of observing the couple (X, Y).

Empirical risk

Since we cannot access all the corpora, and hence ignore the true distribution, we restrict the estimation of the risk to a dataset that is representative of the overall corpora and consider all its examples equiprobable.

In this case, we can assert that the integral becomes a sum (∫ = ∑) and that p(X, Y) = 1/m, where m is the size of the representative corpus. Hence, we iteratively optimize the empirical loss function defined as follows:

J(θ) = (1/m) ∑ᵢ L(f(xᵢ), yᵢ)

There exist many techniques and algorithms, mainly based on gradient descent, which carry out this optimization. In the sections below, we will go through the most famous ones. It is important to note that these algorithms might get stuck in local minima, and nothing guarantees reaching the global one.

Normalizing inputs

Before optimizing the loss function, we need to normalize the inputs in order to speed up learning. The contour lines of J(θ) then become tighter and more symmetric, which helps gradient descent find the minimum faster and thus in fewer iterations.

Standardization is the commonly used approach; it consists of subtracting the mean of each variable and dividing by its standard deviation. The following image illustrates the effect of normalizing the input on the contour lines of J (standardized data on the right).

Let X be a variable in our database; we set:

X ← (X − μ)/σ
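As a concrete sketch of this transformation (assuming the data sits in a NumPy array with one row per example and one column per variable):

```python
import numpy as np

def standardize(X):
    """Per column: subtract the mean and divide by the standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
X_std = standardize(X)
# Each column of X_std now has mean 0 and standard deviation 1,
# so both features live on the same scale.
```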

Gradient descent

In general, we tend to construct a convex and differentiable function J where any local minimum is a global one. Mathematically speaking, finding the global minimum of a convex function is equivalent to solving the equation ∇J(θ) = 0; we denote θ⋆ its solution.

Most of the used algorithms are of the kind:

θ ← θ − α∇J(θ)

where α is the learning rate.
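A minimal sketch of this generic update rule, applied to an illustrative one-dimensional quadratic (the function, learning rate, and iteration count are arbitrary choices, not from the text):

```python
def gradient_descent(grad, theta0, alpha=0.1, n_iters=100):
    """Repeatedly apply theta <- theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)
    return theta

# J(theta) = theta**2 has gradient 2*theta and its minimum at theta = 0.
theta_star = gradient_descent(lambda t: 2 * t, theta0=5.0)
# theta_star is now very close to 0.
```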

Mini-batch gradient descent

This technique consists of dividing the training set into batches:

Choice of the mini-batch size:

For a small training set (∼2000 rows or fewer), simply use the whole set as a single batch.

Typical sizes are powers of 2 (e.g. 64, 128, 256, 512), which work well with memory.

The mini-batch should fit in CPU/GPU memory.

Remark: in the case where there is only one data row per batch, the algorithm is called stochastic gradient descent.
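One possible way to slice a dataset into mini-batches (the helper name and the batch size are illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size=64, seed=0):
    """Shuffle the rows once, then yield successive (X, y) mini-batches."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
sizes = [len(X_batch) for X_batch, _ in minibatches(X, y, batch_size=4)]
# 10 rows with batch_size=4 gives batches of 4, 4 and 2 rows.
```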

Gradient descent with momentum

A variant of gradient descent which includes the notion of momentum; the algorithm is as follows:

v ← βv + (1 − β)dθ
θ ← θ − αv

where dθ denotes the gradient ∇J(θ) computed on the current mini-batch.

(α,β) are hyperparameters.

Since dθ is calculated on a mini-batch, the resulting gradient ∇J is very noisy; the exponentially weighted average introduced by the momentum gives a better estimation of the derivatives.
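A minimal numeric sketch of these two update equations, again on an illustrative quadratic J(θ) = θ² (the hyperparameter values are arbitrary):

```python
def momentum_update(theta, v, dtheta, alpha=0.1, beta=0.9):
    """v <- beta*v + (1 - beta)*dtheta, then theta <- theta - alpha*v."""
    v = beta * v + (1 - beta) * dtheta
    theta = theta - alpha * v
    return theta, v

# Minimize J(theta) = theta**2, whose gradient is 2*theta.
theta, v = 5.0, 0.0
for _ in range(500):
    theta, v = momentum_update(theta, v, 2 * theta)
# theta is now very close to the minimum at 0.
```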

RMSprop

Root Mean Square prop is very similar to gradient descent with momentum; the only difference is that it includes the second-order moment instead of the first-order one, plus a slight change in the parameters’ update:

s ← βs + (1 − β)dθ²
θ ← θ − α·dθ/(√s + ϵ)

(α, β) are hyperparameters and ϵ ensures numerical stability (≈10⁻⁸).
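The same toy quadratic can illustrate this update as well (a sketch with arbitrary hyperparameter values; note that without learning rate decay, RMSprop ends up oscillating in a small neighborhood of the minimum):

```python
import math

def rmsprop_update(theta, s, dtheta, alpha=0.01, beta=0.9, eps=1e-8):
    """Accumulate the second-order moment s, then scale the step by 1/sqrt(s)."""
    s = beta * s + (1 - beta) * dtheta ** 2
    theta = theta - alpha * dtheta / (math.sqrt(s) + eps)
    return theta, s

# Minimize J(theta) = theta**2, whose gradient is 2*theta.
theta, s = 5.0, 0.0
for _ in range(2000):
    theta, s = rmsprop_update(theta, s, 2 * theta)
# theta ends up in a small neighborhood of 0.
```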

Adam

Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. Adam can be seen as a combination of RMSprop and gradient descent with momentum.

It uses squared gradients to scale the learning rate, as RMSprop does, and takes advantage of momentum by using the moving average of the gradient instead of the gradient itself, as gradient descent with momentum does.

The main idea is to avoid oscillations during optimization by accelerating the descent in the right direction.

The algorithm of the Adam optimizer is the following:

v ← β₁v + (1 − β₁)dθ
s ← β₂s + (1 − β₂)dθ²
v̂ = v/(1 − β₁^t),  ŝ = s/(1 − β₂^t)
θ ← θ − α·v̂/(√ŝ + ϵ)

where t is the iteration number, v̂ and ŝ are the bias-corrected moments, and (α, β₁, β₂, ϵ) are hyperparameters.
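Putting the moment updates, the bias correction, and the scaled step together on the same toy quadratic (a sketch; the hyperparameter values are common defaults, not prescribed by the text):

```python
import math

def adam_update(theta, m, v, dtheta, t, alpha=0.01,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """First- and second-order moments with bias correction, then a scaled step."""
    m = beta1 * m + (1 - beta1) * dtheta
    v = beta2 * v + (1 - beta2) * dtheta ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize J(theta) = theta**2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_update(theta, m, v, 2 * theta, t)
# theta ends up in a small neighborhood of 0.
```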

Learning rate decay

The main objective of learning rate decay is to slowly reduce the learning rate over time/iterations. It finds justification in the fact that we can afford to take big steps at the beginning of learning, but when approaching the global minimum, we slow down and thus decrease the learning rate.

There exist many learning rate decay laws; here are some of the most common:

α = α₀/(1 + decay_rate·epoch)
α = α₀·k^epoch (exponential decay)
α = α₀·k/√epoch
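A small sketch of two such decay laws (the constants α₀, decay_rate and k are illustrative):

```python
alpha0, decay_rate, k = 0.2, 1.0, 0.95

def inverse_decay(epoch):
    """alpha0 / (1 + decay_rate * epoch)"""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(epoch):
    """alpha0 * k**epoch"""
    return alpha0 * k ** epoch

rates = [round(inverse_decay(e), 3) for e in range(4)]
# rates == [0.2, 0.1, 0.067, 0.05]: the step size shrinks over the epochs.
```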

Regularization

Variance/bias

When training a neural network, it might suffer from:

High bias, or underfitting: the network fails to capture the pattern in the data. In this case, J_train​ is very high, and so is J_dev​. Mathematically speaking, when performing cross-validation, the mean of J over all the considered folds is high.

High variance, or overfitting: the model fits perfectly on the training data but fails to generalize to unseen data. In this case, J_train​ is very low and J_dev​ is relatively high. Mathematically speaking, when performing cross-validation, the variance of J over all the considered folds is high.

Let’s consider the dartboard game, where hitting the red target is the best-case scenario. Having a low bias (first line) means that on average we are close to the goal. In the case of a low variance, the hits are all concentrated around the target (the variance of the hits’ distribution is low). When the variance is high, under the assumption of a low bias, the hits are spread out but still around the red circle.

Vice versa, we can combine a high bias with either a low or a high variance.

Mathematically speaking, let f be the true regression function: y = f(x) + ϵ, where ϵ ~ N(0, σ²).

We fit a hypothesis h(x) = Wx + b with MSE and consider a new data point x₀ with y₀ = f(x₀) + ϵ. The expected error can be decomposed as:

E[(y₀ − h(x₀))²] = Bias²(h(x₀)) + Var(h(x₀)) + σ²
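This decomposition follows by using the independence of ϵ and then adding and subtracting E[h(x₀)] inside the square; a standard derivation sketch:

```latex
\begin{aligned}
\mathbb{E}\big[(y_0 - h(x_0))^2\big]
  &= \mathbb{E}\big[(f(x_0) + \epsilon - h(x_0))^2\big] \\
  &= \mathbb{E}\big[(f(x_0) - h(x_0))^2\big] + \sigma^2
     \quad \text{($\epsilon$ independent, zero mean)} \\
  &= \big(f(x_0) - \mathbb{E}[h(x_0)]\big)^2
     + \mathbb{E}\big[(h(x_0) - \mathbb{E}[h(x_0)])^2\big] + \sigma^2 \\
  &= \mathrm{Bias}^2\big(h(x_0)\big) + \mathrm{Var}\big(h(x_0)\big) + \sigma^2
\end{aligned}
```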

A trade-off must be found between variance and bias to find the optimal complexity of the model, either by using the AIC criterion or by using cross-validation.

Here is a simple schema to follow to solve bias/variance issues:

L1 — L2 regularization

Regularization is an optimization technique that prevents overfitting.

It consists of adding a penalty term to the objective function to minimize, as follows:

L2: J_reg(θ) = J(θ) + (λ/2m)·∥θ∥₂²
L1: J_reg(θ) = J(θ) + (λ/m)·∥θ∥₁

λ is the hyperparameter of the regularization.

Backpropagation and regularization

The update of the parameters during backpropagation depends on the gradient ∇J, to which a new regularization term is added. With L2 regularization, it becomes as follows:

θ ← θ − α(∇J(θ) + (λ/m)θ) = (1 − αλ/m)θ − α∇J(θ)

Considering λ ≫ 1, minimizing the cost function leads to small parameter values because of the term (λ/2m)∥θ∥², which simplifies the network and makes it more consistent, hence less exposed to overfitting.
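A one-step sketch of this regularized update (the values of α, λ and m are hypothetical; note the (1 − αλ/m) shrink factor, often called weight decay):

```python
import numpy as np

def l2_regularized_step(theta, grad, alpha=0.1, lam=0.01, m=100):
    """theta <- theta - alpha * (grad + (lam/m) * theta)."""
    return theta - alpha * (grad + (lam / m) * theta)

theta = np.array([1.0, -2.0])
theta_new = l2_regularized_step(theta, grad=np.zeros(2))
# Even with a zero data gradient, the weights are pulled toward 0.
```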

Dropout regularization

Roughly speaking, the main idea is to sample a uniform random variable for each node of each layer, and keep the node with probability p or remove it with probability 1 − p, which thins the network.

The main intuition of dropout is based on the idea that the network shouldn't rely on a specific feature but should instead spread out the weights!

Mathematically speaking, when dropout is off and considering the j-th node of the i-th layer, we have the following equations:

z[i]_j = w[i]_j·a[i−1] + b[i]_j,  a[i]_j = ψ[i](z[i]_j)

When dropout is on, the equations become as follows (inverted dropout, which rescales by 1/p to preserve the expected activation):

d[i]_j ~ Bernoulli(p),  a[i]_j = (d[i]_j/p)·ψ[i](z[i]_j)
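A sketch of the inverted-dropout mask applied to a layer's activations (the keep probability p = 0.8 and the layer size are illustrative):

```python
import numpy as np

def inverted_dropout(a, p=0.8, seed=0):
    """Keep each activation with probability p, then rescale by 1/p so the
    expected value of the layer's output is unchanged."""
    rng = np.random.default_rng(seed)
    d = rng.random(a.shape) < p   # Bernoulli(p) keep-mask
    return a * d / p

a = np.ones(1000)
a_drop = inverted_dropout(a)
# Roughly 20% of the activations are zeroed; the rest are scaled up by 1/p,
# so the mean activation stays close to 1.
```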

Early stopping

This technique is quite simple and consists of stopping the iterations around the point where J_train​ and J_dev​ start to separate:
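One possible way to implement this stopping rule (the patience window is an illustrative convention, not from the text):

```python
def early_stopping_epoch(dev_losses, patience=3):
    """Return the epoch at which to stop: when the dev loss has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(dev_losses) - 1

# Dev loss improves until epoch 2, then starts rising: stop shortly after.
stop = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.8, 0.9, 1.1])
```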

Gradient problems

The computation of gradients suffers from two major problems: gradient vanishing and gradient exploding.

To illustrate both situations, let’s consider a neural network where all the activation functions ψ[i] are linear and all the weight matrices are 1.5·I, so that the forward signal scales like 1.5^(L−1).

We note that 1.5^(L−1) will explode exponentially as a function of the depth L. If we use 0.5 instead of 1.5, then 0.5^(L−1) will vanish exponentially as well.

The same issue occurs with gradients.
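A scalar sketch of this depth effect (the weight values 1.5 and 0.5 and the depth of 50 come from the example above):

```python
def forward_linear(w, depth, x=1.0):
    """Forward pass through `depth` linear layers sharing the scalar weight w."""
    a = x
    for _ in range(depth):
        a = w * a
    return a

exploded = forward_linear(1.5, 50)   # ~6.4e8: grows exponentially with depth
vanished = forward_linear(0.5, 50)   # ~8.9e-16: shrinks exponentially
```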

Conclusion

As a data scientist, it is very important to be aware of the mathematics running behind neural networks. It allows better understanding and faster debugging.

Do not hesitate to check my previous article dealing with:

Happy Machine Learning!
