Improvements in hardware (GPUs) and software (advanced models and AI research) have also contributed to deepening what can be learned from data using neural networks.

When it comes to unstructured data (images, text, voice, videos), hand-engineered features are time-consuming, brittle, and not scalable in practice. That is why neural networks have become more and more popular: they can automatically discover, from raw data, the representations needed for feature detection or classification. This replaces manual feature engineering and allows a machine both to learn the features and to use them to perform a specific task.

The success of these models depends heavily on the quality of the feature engineering phase: the closer we work with the business to extract relevant knowledge from the structured data, the more powerful the model will be.

Traditional machine learning models have always been very effective at handling structured data and have been widely used by businesses for credit scoring, churn prediction, consumer targeting, and so on.

Now that we have understood the basic architecture of a deep neural network, let us find out how it can be used for a given task.

Given a finite set of m inputs (e.g. m words or m pixels), we multiply each input by a weight (theta 1 to theta m), sum up this weighted combination of inputs, add a bias, and finally pass the result through a non-linear activation function. This produces the output Yhat.
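As a rough sketch, this single-neuron computation can be written in a few lines of Python. The sigmoid activation, the example inputs, and the weight values here are illustrative assumptions, not taken from the text:

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, theta, bias):
    # Weighted sum of the m inputs, plus a bias, passed through the activation
    z = np.dot(theta, x) + bias
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])       # m = 3 example inputs
theta = np.array([0.1, 0.4, -0.2])   # one weight per input (theta 1 to theta m)
y_hat = neuron_forward(x, theta, bias=0.05)
```

Any other non-linearity (ReLU, tanh, ...) would fit in the same place; the structure "weighted sum, plus bias, through an activation" is what defines the neuron.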

Training a Neural Network

Let us say, for a set of X-ray images, we need the model to automatically distinguish those that are related to a sick patient from the others.

For that, machine learning models, like humans, need to learn to differentiate between the two categories of images by observing examples of both sick and healthy individuals. From these examples, they automatically pick up the patterns that best describe each category. This is what we call the training phase.

Concretely, a pattern is a weighted combination of some inputs (images, parts of images or other patterns). Hence, the training phase is nothing more than the phase during which we estimate the weights (also called parameters) of the model.

When we talk about estimation, we talk about an objective function that we have to optimize. This function should be constructed to best reflect the performance of the training phase. For prediction tasks, this objective function is usually called the loss function and measures the cost incurred by incorrect predictions. When the model predicts something very close to the true output, the loss is very low, and vice versa.

Given the input data, we calculate an empirical loss (binary cross-entropy loss for classification, mean squared error loss for regression) that measures the total loss over our entire dataset:
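Written out, the empirical loss averages a per-example loss over the n training pairs; the two instantiations mentioned above are (notation is a standard choice, assumed here):

```latex
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\!\left( f\!\left(x^{(i)}; \theta\right),\, y^{(i)} \right)

% Binary cross-entropy (classification):
\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]

% Mean squared error (regression):
\mathcal{L}(\hat{y}, y) = (\hat{y} - y)^2
```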

Since the loss is a function of the network weights, our task is to find the set of weights theta that achieves the lowest loss:
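In symbols, with J(theta) denoting the empirical loss over the dataset:

```latex
\theta^{*} = \underset{\theta}{\arg\min}\; J(\theta)
```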

If we only have two weights, theta 0 and theta 1, we can plot the loss function as a surface over the weight space. What we want is to find the minimum of this loss and, consequently, the values of the weights at which the loss attains its minimum.

To minimize the loss function, we can apply the gradient descent algorithm:

First, we randomly pick an initial p-vector of weights (e.g. drawn from a normal distribution). Then, we compute the gradient of the loss function at this initial p-vector. The gradient points in the direction that maximizes the loss function, so we take a small step in the opposite direction and update the weights accordingly using the update rule. We repeat this process until convergence, reaching the lowest point of the landscape (a local minimum).