One of the main questions throughout human history and the study of the brain is this: how do we perceive the world the way we do?

Our eyes are just sensors that help us build an understandable representation of the reality surrounding us. For instance, properties such as colours are not inherent to the objects in the physical world, but instead represent our mental abstraction of sensing reflections of light at different wavelengths. As depressing as it might sound, everything is “dark” in our physical reality. An object appears red when it reflects the particular wavelength we mentally associate with the colour red, and absorbs all the rest.

This ability of visual perception goes far beyond identifying different wavelengths. The most interesting aspect arises from our ability to build a conceptual reality from the manner these photons are perceived relative to each other in space. For instance, we understand the concept of a chair because we expect it to have certain geometric properties: to our eyes, the light reflected from a chair must maintain a spatial relation that resembles what we conceptualize as the shape of a chair.

Computers have also recently become capable of visual recognition, learning how to see. This has been made possible by advances in Artificial Intelligence, and in particular the application of Convolutional Neural Networks (CNNs). There are plenty of articles on the subject out there, but very few explain the intuition behind the method beyond the statistics and linear algebra.

The Building Blocks of a Convolutional Neural Network

To understand how CNNs can be used to perform vision tasks, we first have to understand their components. The list of concepts involved can be overwhelming, but they can all be narrowed down to the terms already present in the name: convolution, Deep Neural Networks (DNNs), and the fusion of the two.

What is a Convolution?

A convolution is a mathematical operation that combines two functions into a third one. For instance, say we have two functions, f(t) and g(t), and we are interested in sliding one over the other and calculating the area of their intersection at every offset. The result is a new function, denoted (f*g)(t).
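Formally, this overlap-while-sliding idea is captured by the standard definition of convolution: the kernel g is flipped and shifted across f, and the overlapping area is integrated at each offset t:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```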

Source: Brian Amberg

In this case, we are sliding g(t) (called the kernel) over f(t), and the response (f*g)(t) changes according to the overlap between the two functions. Convolution is one of the most widely used techniques in Signal Processing, and it carries over naturally to Computer Vision, which can be seen as processing the signal of multiple RGB sensors. The intuition behind the effectiveness of convolutions is their ability to filter a given input signal into a combined, more useful result.
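As a minimal illustration of this filtering idea, here is a one-dimensional convolution in Python with NumPy, where an averaging kernel smooths out sharp jumps in the input signal (the signal and kernel values are arbitrary examples):

```python
import numpy as np

# A noisy step signal f(t), sampled at discrete points.
signal = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0], dtype=float)

# An averaging kernel g(t): sliding it over the signal replaces each
# sample with the mean of its neighbourhood (a simple low-pass filter).
kernel = np.ones(3) / 3

smoothed = np.convolve(signal, kernel, mode="same")
print(smoothed)  # the isolated dip at index 4 is softened by its neighbours
```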

In the particular case of images, the signal is better understood in terms of matrices rather than waveforms. Our functions f(t) and g(t) therefore become an image matrix and a kernel matrix, respectively. Once again, we can apply the convolution by sliding one of the matrices over the other:

Source: Machine Learning Guru

What you are seeing above is simply the element-wise multiplication of the sliding window of the image with the kernel, followed by summing up the result. The power of convolutions in the context of Computer Vision is that they are great feature extractors for the domain of RGB sensors. Taken individually, each pixel (RGB sensor) is rather meaningless for understanding what the image contains. It is the pixels' relation to one another in space that gives true meaning to the image. The same applies to the way you are reading this article on your computer: your brain matches black pixels with each other in space to form the concept of characters.
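To make the sliding-window arithmetic concrete, here is a minimal, unoptimized sketch of this operation in Python (strictly speaking it computes a cross-correlation, since the kernel is not flipped, which is also the convention used in CNNs):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image`: at each position, multiply the
    overlapping window element-wise with the kernel and sum the result."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(window * kernel)
    return output

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)          # toy 2x2 kernel
print(convolve2d(image, kernel))                   # 4x4 response map
```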

By sliding different convolution kernels over an image, one can perform all sorts of image manipulations. This is how many of the tools found in image editing software such as Photoshop or GIMP work:

Source: Gimp Documentation
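For instance, the classic sharpen and edge-detect filters are nothing more than particular kernels. Here is a sketch of how they could be applied, assuming SciPy is available (the convolve2d function defined above would work just as well):

```python
import numpy as np
from scipy.signal import convolve2d

# Classic 3x3 kernels used by image editing software.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

image = np.random.rand(64, 64)  # stand-in for a grayscale photo
sharpened = convolve2d(image, sharpen, mode="same", boundary="symm")
edges = convolve2d(image, edge_detect, mode="same", boundary="symm")
```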

What are Deep Neural Networks?

The concept of learning within Machine Learning can be seen as trying to approximate an unknown function from the given data. Let's say that you are interested in predicting a label y given an input x. In this scenario, you would like your model to minimize the error between its prediction and the ground truth. To do so, we modify the weights and biases in our network in such a manner that we map the input x to the output y as well as we can.

To better understand this, let's take a simple linear regression problem. The equation of a line can be described as y = mx + b, where m is the slope and b is the intercept with the y-axis (typically called the bias). If there are only two points, one can solve for the slope and bias directly (m = (y2 − y1)/(x2 − x1), and b follows by substitution), but the problem becomes harder when we have n > 2 points and no line perfectly fits them all. In that case, we are interested in approximating a line that minimizes the distance to each datapoint. To do this, we can follow these steps:

1. Give random values to m and b.

2. Calculate the error, which can be described as the Mean Squared Error (MSE), to know how far we are from the optimal line.

3. Using the derivatives of this error function with respect to m and b, calculate the gradient to know in which direction to move the two values in order to reduce the error (the explicit formulas are given after this list).

4. Repeat steps 2 and 3 until the error stops decreasing.
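For completeness, the MSE over n datapoints (x_i, y_i) and its derivatives with respect to m and b are:

```latex
\mathrm{MSE}(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2
```

```latex
\frac{\partial\,\mathrm{MSE}}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (m x_i + b) \bigr),
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)
```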

Source: Alykhan Tejani

On the left side of the figure you can see the error surface, which is given by the MSE applied to our particular datapoints. On the right side, we can see a representation of our datapoints and the line defined by m and b. At the beginning the line is completely off, and this is reflected in the high error value. Nevertheless, as we calculate the derivatives and move towards where the error function decreases, we end up with values of m and b for a line that fits the datapoints.
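Putting the four steps together, a minimal gradient-descent implementation of this line-fitting procedure might look as follows (the toy data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

# Toy datapoints roughly following y = 2x + 1, with added noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

m, b = rng.random(), rng.random()  # step 1: random initial values
lr = 0.01                          # learning rate (step size)

for _ in range(2000):
    pred = m * x + b
    grad_m = -2 * np.mean(x * (y - pred))  # step 3: gradient w.r.t. m
    grad_b = -2 * np.mean(y - pred)        # step 3: gradient w.r.t. b
    m -= lr * grad_m                       # move against the gradient
    b -= lr * grad_b

mse = np.mean((y - (m * x + b)) ** 2)      # step 2: final error
print(f"m = {m:.2f}, b = {b:.2f}, MSE = {mse:.3f}")  # m, b approach 2 and 1
```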

In Deep Neural Networks (DNNs), this is the core idea behind what happens in each neuron. Given that our problems are far more complex than approximating a line, we need more neurons, usually structured in layers. Deep Learning minimizes the error function by using Gradient Descent, and since the learning is structured in layers, DNNs end up learning a hierarchy in the data.

The main building blocks of a Neural Network are artificial neurons. There are different types of these, with the simplest one being the perceptron. One can see a perceptron as a unit that takes several binary inputs x1, x2, …, xn, and generates a single binary output y:

Each input is given a weight w1, w2, …, wn, representing the importance it has for the output. The output is then calculated from the weighted sum of the inputs and a given threshold: if this sum is greater than the threshold, the perceptron outputs 1, and 0 otherwise. This can be put more easily in algebraic terms:
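```latex
y =
\begin{cases}
1 & \text{if } \sum_{j=1}^{n} w_j x_j > \text{threshold} \\
0 & \text{otherwise}
\end{cases}
```

Moving the threshold to the other side of the inequality as a bias term, b = −threshold, gives the more common formulation of a weighted sum plus bias, which is where the “weights and biases” mentioned earlier come from.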

The artificial neurons of the network are connected to one another through layers. The first set of neurons, connected to the input, forms the input layer. The last layer of the network, which provides the prediction, is called the output layer. There can be any number of layers in between; these are known as hidden layers.

The name Deep Learning comes from the fact that there are more than two layers of neurons involved in the architecture. Going beyond two layers used to be impractical, yielding weak results. These limitations were overcome by the availability of larger volumes of training data and accelerated computing. It was at this moment that the community started adding more layers, resulting in what is known as Deep Neural Networks. The communication through layers was able to describe a hierarchical representation of the data, where each layer of neurons represents a level of abstraction.

Learning Visual Representations through Convolutional Neural Networks

Combining the power of convolutions to extract spatial features with the hierarchical learning ability of Deep Neural Networks yields what is known as Convolutional Neural Networks. The two techniques are merged into a single solution by substituting the classical layers of neurons present in DNNs with convolutional layers. The weights and biases are thus encoded in the kernels, and these are the learned parameters that optimize the mapping from the raw input (images) to the intended predictions (e.g. the class of the image). The initial layers will encode features such as edges, while later layers will encode higher and higher abstractions, such as shapes and textures.

Source: Nvidia

In the figure above, each neuron represents a convolution operation over the image fed from the input or from previous layers. We can have multiple convolutions receiving the same input in order to extract different kinds of features. Through the layers, the convolution operations keep transforming the matrices from previous layers into a more and more compressed and useful representation for the task at hand. For instance, if our network is performing Face Recognition, the initial layers will extract edges from the face, these edges will be convolved into shapes, the shapes will be convolved into facial features that are useful for distinguishing individuals, all the way to the output, where we have an identifier of the given face.
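As a rough illustration of what such an architecture looks like in code, here is a minimal CNN sketch in PyTorch (the layer sizes, input resolution, and number of classes are arbitrary choices for the example, not a recipe for a real face-recognition system):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Early layers: kernels that learn low-level features (edges).
            nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Deeper layers: kernels that combine edges into shapes/textures.
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Output layer: maps the extracted features to class scores.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)        # extract spatial features
        x = x.flatten(start_dim=1)  # flatten for the output layer
        return self.classifier(x)

model = TinyCNN()
logits = model(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image
print(logits.shape)  # torch.Size([1, 10])
```

Note that the kernel weights in the Conv2d layers are exactly the learned parameters discussed above: training adjusts them with gradient descent, just as m and b were adjusted in the line-fitting example.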

Conclusion

CNNs can be applied to any domain in which the data has some kind of spatial structure. This usually translates into problems that have images as input.

Advantages

When compared to classical Computer Vision techniques, CNNs are easier to apply from a development perspective. There are plenty of tutorials and open-source frameworks available that can deal with the most common tasks, such as image classification or object detection, by simply feeding your data into already available models.
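For instance, running a pretrained image classifier takes only a few lines. Here is a sketch using torchvision (assuming torchvision 0.13 or newer is installed; the exact weights API varies between versions):

```python
import torch
from torchvision import models

# Load a ResNet-18 pretrained on ImageNet, with its default weights.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

preprocess = weights.transforms()  # the resize/normalize pipeline the model expects

image = torch.rand(3, 256, 256)    # stand-in for a loaded RGB photo
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))

class_id = logits.argmax(dim=1).item()
print(weights.meta["categories"][class_id])  # predicted ImageNet label
```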

Given enough data, CNNs outperform other, primarily shallow, models.

Limitations