What is a Neural Network? No maths explanation

Disclaimer: This blog post is heavily inspired by 3Blue1Brown’s video here. Please check it out! It is a really amazing and educational channel

Disclaimer 2: This is intentionally a very simplistic explanation of what a Neural Network is. The intention of this blog post is only to shed some light on the concept of a Neural Network, not to provide technical details on how to implement one.

Hidden layers, linear algebra, calculus, feedforward, backpropagation… the first time you read about Neural Networks, the amount of terms and math most articles and books spit out can be overwhelming. While you need to fully understand these concepts if you want to work with Neural Networks, they can be set aside when you are first introduced to the idea.

In this blog post I am going to explain what a Neural Network is and how it works. For clarity, I will take the simplest model: a fully connected Neural Network with one hidden layer and one output layer. Fully connected just means that every node in a layer is connected to every node in the next layer. Bear with me.

Structure of a Neural Network

A Neural Network, much like Logistic Regression, is a way of modeling a function that fits our data. In this case it is a (very) complicated non-linear function with (usually) thousands of parameters. A typical fully connected Neural Network consists of an input layer, one or more hidden layers and an output layer. Each node in a hidden layer is called an activation unit or neuron. In a fully connected Neural Network, each neuron in a layer is connected to each neuron in the next layer by a weight. Those weights indicate how much a neuron influences the neurons it is connected to. A big weight will push the connected neuron towards a greater value, which in turn will push the neurons after it towards greater values as well, and so on.

Please note I’m skipping the concept of bias for clarity. I believe that, though being an important part of the network parameters, it can be skipped in an introduction to the topic.

The number of neurons in the input layer is equal to the number of features that each sample in our dataset is composed of.

Imagine that we want our network to predict the price of a house. To do so we get a dataset with samples consisting of 5 house features and a price. These 5 features, properly preprocessed and converted into numbers, are going to be the input of our network, while the price is going to be the expected result, or ground truth.

Since each unit in a layer affects each unit in the following layer, if we have one hidden layer with 3 neurons and an output layer with a single neuron, that already makes a network with 18 parameters to tune: 5*3 for the input layer to the hidden layer + 3*1 for the hidden layer to the output layer.
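If you want to check that count yourself, here is a minimal sketch in Python. The function name count_weights and the list-of-layer-sizes format are my own choices for illustration, and, like the rest of this post, it ignores biases.

```python
def count_weights(layer_sizes):
    """Count the weights of a fully connected network, ignoring biases."""
    total = 0
    # Every pair of consecutive layers contributes (neurons in) * (neurons out) weights.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
    return total


# The house price network: 5 input features, 3 hidden neurons, 1 output.
house_network = [5, 3, 1]
print(count_weights(house_network))  # 5*3 + 3*1 = 18
```

Adding a layer or widening an existing one only means changing the list, which makes it easy to see how fast the count grows.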

That may sound like very few parameters, but this was just a simple, illustrative example. The classic example when studying Neural Networks is the MNIST dataset of handwritten digits. Each digit is represented as a 28x28 pixel grayscale image. When the image is unfolded to feed it to a network, it becomes a vector of 28*28 = 784 input features. As you can see, things escalate quickly. And this is only for 28x28 pixel images!

This rapid growth in the number of parameters makes Neural Networks computationally very expensive, and this is why until recent years they’ve been mostly an academic research topic.

Training a Neural Network

In order to train a Neural Network, we pass every input sample through all the layers in the network: we calculate the dot product of the input and the weights connecting it to the hidden layer, and pass the result on to the next layer. This is called the forward pass. Once we arrive at the output layer, we calculate the prediction error by comparing our result with the true value. We know that value because it is part of our dataset. In our particular example, it would be the price of the house.
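To make the forward pass a bit more concrete, here is a small sketch in plain Python for the 5-3-1 house network. Everything in it is made up for illustration: the feature values, the weights and the true price are arbitrary numbers, the squared error is just one common way of measuring the prediction error, and the relu function is one possible choice for the step each hidden neuron applies to its dot product (this post does not go into those details).

```python
def dot(values, weights):
    """Dot product: multiply matching entries and add everything up."""
    return sum(v * w for v, w in zip(values, weights))


def relu(x):
    """Assumed activation for the hidden neurons: negative values become 0."""
    return max(0.0, x)


def forward(features, hidden_weights, output_weights):
    """Pass one sample through the hidden layer and then the output layer."""
    hidden = [relu(dot(features, w)) for w in hidden_weights]
    return dot(hidden, output_weights)


# One made-up house with 5 preprocessed features.
features = [1.0, 2.0, 0.5, 0.0, 3.0]
# 3 hidden neurons, each with 5 incoming weights (arbitrary values).
hidden_weights = [
    [0.1, 0.2, 0.0, 0.0, 0.1],
    [0.0, -0.3, 0.4, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.5, 0.1],
]
# 1 output neuron with 3 incoming weights (arbitrary values).
output_weights = [1.0, 2.0, 0.5]

prediction = forward(features, hidden_weights, output_weights)
true_price = 1.5  # made-up ground truth, in some scaled unit
error = (prediction - true_price) ** 2  # squared error for this sample
print(prediction, error)
```

Note how the code mirrors the description: dot products into the hidden layer, then one more dot product into the output, then a comparison with the value we already know from the dataset.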

This error will then be used to adjust the weights on all the layers, so that it is minimized. To minimize the error we can use an algorithm like gradient descent. Gradient descent and similar algorithms take as input the samples and a variable called the learning rate. This variable describes the size of the step we take towards the minimum of the error function. This is very important and requires a bit of tuning for each problem. If you take too small a step, the algorithm will be very slow and can get trapped in what we call a local minimum.

If you take steps that are too big, on the other hand, you may overshoot the global minimum. One needs to find the balance via experimentation.
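Here is a tiny sketch of how the learning rate changes things. It is not a Neural Network: I am assuming a toy error function with a single weight, (w - 3)**2, whose minimum sits at w = 3, and the three learning rates are picked by hand. A bowl like this has no local minima, so it only shows the "too slow" and "overshoot" cases.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimize the toy error (w - 3)**2 and return the final value of w."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)  # slope of the toy error at the current w
        w = w - learning_rate * gradient  # take a step against the slope
    return w


# Same number of steps, three different step sizes.
for learning_rate in (0.001, 0.1, 1.1):
    print(learning_rate, gradient_descent(learning_rate))
```

With 0.001 the weight is still far from 3 after 50 steps, with 0.1 it ends up very close to 3, and with 1.1 every step jumps over the minimum and lands farther away than before.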

The process of propagating the error back to all the layers is called… well, backpropagation.

Mathematically, backpropagation can be a bit tricky to understand if you (like me) are not used to calculus. However, the concept of what it is doing is fairly easy to understand: during backpropagation, we propagate the calculated error back through all the layers. This is done in such a way that the weights that contributed more to the error are adjusted by a greater amount than those that contributed less.
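For the curious, this is roughly what one round of forward pass plus backpropagation could look like for the small 5-3-1 house network. Treat it as a sketch under my own assumptions rather than the way any particular library does it: the error is half the squared difference, the hidden neurons use a ReLU (negative values become 0), biases are left out, and the sample, weights and learning rate are made-up numbers.

```python
def dot(values, weights):
    return sum(v * w for v, w in zip(values, weights))


def train_step(features, target, hidden_weights, output_weights, learning_rate):
    """One forward pass and one backpropagation update for a single sample."""
    # Forward pass (ReLU on the hidden layer is an assumption of this sketch).
    pre_activations = [dot(features, w) for w in hidden_weights]
    hidden = [max(0.0, p) for p in pre_activations]
    prediction = dot(hidden, output_weights)

    # Error signal at the output for the loss 0.5 * (prediction - target)**2.
    output_error = prediction - target

    # Send the error back: each hidden neuron gets a share proportional to the
    # weight that connects it to the output (and nothing if it was switched off).
    hidden_errors = [
        output_error * w_out * (1.0 if p > 0 else 0.0)
        for w_out, p in zip(output_weights, pre_activations)
    ]

    # Weights that contributed more to the error receive a bigger correction.
    new_output_weights = [
        w - learning_rate * output_error * h
        for w, h in zip(output_weights, hidden)
    ]
    new_hidden_weights = [
        [w - learning_rate * h_err * x for w, x in zip(weights, features)]
        for weights, h_err in zip(hidden_weights, hidden_errors)
    ]
    return new_hidden_weights, new_output_weights, 0.5 * output_error ** 2


# Made-up sample and starting weights.
features = [1.0, 2.0, 0.5, 0.0, 3.0]
target = 1.5
hidden_weights = [
    [0.1, 0.2, 0.0, 0.0, 0.1],
    [0.0, -0.3, 0.4, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.5, 0.1],
]
output_weights = [1.0, 2.0, 0.5]

losses = []
for _ in range(50):
    hidden_weights, output_weights, loss = train_step(
        features, target, hidden_weights, output_weights, learning_rate=0.01
    )
    losses.append(loss)
print(losses[0], losses[-1])
```

Repeating the step on the same sample makes the error shrink, which is all backpropagation is trying to achieve. A real training run would loop over every sample in the dataset instead of just one.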

We will repeat this process a number of times (epochs) until we obtain an acceptable accuracy. As always, we need to be careful not to overfit by training for too many epochs. For that, we will set aside a subset of the dataset that we’ll call the validation set, and make sure its accuracy keeps pace with the accuracy on the training set.
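One common way of acting on that check is called early stopping: you stop training once the validation error has not improved for a few epochs. The sketch below is my own illustration of that idea, not something from the video. The function name, the patience parameter and the list of validation errors are all made up.

```python
def epochs_before_overfitting(validation_errors, patience=2):
    """Return the epoch (counting from 1) with the best validation error.

    Stops looking once the error has failed to improve for `patience` epochs.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error = error
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation stopped keeping pace, so we stop here
    return best_epoch


# Made-up validation errors: they improve for a while and then get worse.
made_up_validation_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(epochs_before_overfitting(made_up_validation_errors))  # 4
```

In this made-up run the validation error is lowest at epoch 4, so training past that point would mostly be memorizing the training set.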

Making predictions

Once the network is trained, in order to make a prediction we only need to perform a forward pass and get the result from the output layer. If the result is a continuous value, like the one in our example (remember, house prices), the output is already what we want. If instead we’re working on a classification problem (for example disease prediction), then the output of the network is not exactly what we want. The output in this case will be the probability that the input sample belongs to the positive class. This is a number between 0 and 1. Thus we need a threshold to decide whether we classify it as positive or not. A usual threshold is 0.5, though it depends on the problem at hand:

If output >= 0.5 then we classify as diseased

If output < 0.5 then we classify as not diseased
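In code, the rule above is a one-liner. The function name classify and its threshold parameter are just names I picked for this sketch.

```python
def classify(probability, threshold=0.5):
    """Turn the network's output probability into a class label."""
    return "diseased" if probability >= threshold else "not diseased"


print(classify(0.73))
print(classify(0.20))
print(classify(0.60, threshold=0.7))  # a stricter threshold for a different problem
```

Keeping the threshold as a parameter makes it easy to move it up or down when the problem calls for it.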

Final notes

The number of hidden layers, units per hidden layer, learning rate and many other choices are hyperparameters of a Neural Network. To find the best hyperparameter values for your problem, you must experiment by modifying them and retraining your network until you get results you’re happy with. Unfortunately there are no rules for how to initially choose those hyperparameters other than intuition and experience. It is for all these reasons that training a Neural Network is computationally hard.

I hope this post was useful and you have a slightly clearer idea of what a Neural Network really does. If you are interested in knowing more technical details, let me know in the comments and I may write a complete, step-by-step implementation with explanations.

Cheers!