A primer on convolutional neural networks (CNNs)

Nine times out of ten, when you hear about scientific barriers that have been overcome through the application of deep learning techniques, convolutional neural networks are involved. Also called CNNs or ConvNets, they’re the spearheads of deep learning, especially when it comes to computer vision applications.

Today, they can even learn to sort images by category, in some cases with better results than manual sorting. If any single method justifies the current craze around deep learning, it’s CNNs. What’s particularly interesting about CNNs is that they’re also easy to understand, once you break them down into their basic components.

A CNN compares images fragment by fragment. The fragments a CNN looks for are called features. By finding approximate features that are roughly similar in two different images, a CNN detects similarities much better than a full image-to-image comparison would.

Traditional supervised learning techniques can provide very good results, but their performance depends strongly on the quality of the features found beforehand. There are several methods for extracting and describing features.

In practice, the classification error is never zero. The results can then be improved by devising new feature extraction methods better suited to the images being studied, or by using a “better” classifier.

But in 2012, a revolution occurred: at the annual ILSVRC computer vision competition, a new deep learning algorithm surpassed all previous benchmark records! This was a convolutional neural network called AlexNet.

Convolutional neural networks follow a methodology similar to traditional supervised learning methods: they receive input images, detect the features of each of them, and then train a classifier on those features.

One of the first CNN architectures, created by Yann LeCun. Source: Paper

Side note: I would highly recommend reading Yann LeCun’s paper.

However, the features are learned automatically! CNNs do all the hard work of feature extraction and description: during the training phase, the classification error is minimized in order to optimize the parameters of the classifier AND the features! In addition, the specific architecture of the network makes it possible to extract features of varying complexity, from the simplest to the most sophisticated. This automatic extraction and prioritization of features, adapted to the given problem, is one of the strengths of convolutional neural networks: there’s no need to implement a hand-crafted extraction algorithm.

Unlike traditional supervised learning techniques, convolutional neural networks learn the features of each image on their own. That’s where their strength lies: the network does all the feature-extraction work automatically.

There are four types of layers for a convolutional neural network: the convolutional layer, the pooling layer, the ReLU correction layer, and the fully-connected layer. Next, I’ll explain the purposes of these different layers.

1. The convolution layer

A convolutional layer is the key component of convolutional neural networks, and generally constitutes at least their first layer.

Its purpose is to locate the presence of a set of features in the images received as input. For this, we perform convolutional filtering: the principle is to “drag” a window representing the feature across the image and to compute the convolution product between the feature and each portion of the scanned image. A feature is then seen as a filter: the two terms are equivalent in this context.

The convolution layer thus receives several images as input and calculates the convolution of each of them with each filter. The filters correspond exactly to the features you want to find in the images.

We obtain for each pair (image, filter) an activation map, or feature map, which indicates where the features are in the image: the higher the value, the more the corresponding place in the image looks like the feature.
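To make the sliding-window idea concrete, here is a minimal NumPy sketch of this computation. It is an illustration only, not how real frameworks implement it (strictly speaking, deep learning libraries compute cross-correlation, and they use fast batched operations rather than Python loops); the vertical-edge filter below is just an example of what a learned feature might look like.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and compute the dot product
    at each position, producing an activation (feature) map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y:y + kh, x:x + kw]
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

# A small image with a vertical edge in the middle, and a
# vertical-edge filter: high values in the feature map mark the
# places where the image looks like the feature.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)
print(convolve2d(image, kernel))  # middle column is 2, elsewhere 0
```

The resulting 3×3 feature map peaks exactly where the filter lines up with the edge in the image, which is what “the higher the value, the more the corresponding place looks like the feature” means in practice.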

Unlike traditional methods, features are not pre-defined according to a particular formalism (such as SIFT), but learned by the network during the training phase! The filter kernels are the weights of the convolution layer. They are initialized and then updated by backpropagation of the gradient.

This is the strength of convolutional neural networks — they’re able to determine all the discriminating elements of an image, by adapting to the problem. For example, if the question is to distinguish cats from dogs, the automatically defined features can describe the shape of the ears or paws.

2. The pooling layer

This type of layer is often placed between two convolution layers: it receives several feature maps as input and applies the pooling operation to each of them.

The pooling operation consists of reducing the size of the images while preserving their important characteristics.

For this, the image is cut into regular cells, and the maximum value is kept within each cell. In practice, small square cells are often used to avoid losing too much information. The most common choices are adjacent 2×2-pixel cells that don’t overlap, or 3×3-pixel cells spaced with a stride of 2 pixels (so that they overlap).

The same number of feature maps are output as input, but these are much smaller.
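The max pooling operation described above can be sketched in a few lines of NumPy. This is a hypothetical helper for illustration, with the cell size and stride as parameters so that both configurations mentioned (2×2 cells with stride 2, or 3×3 cells with stride 2) can be tried.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Cut the feature map into cells of `size` x `size` pixels,
    `stride` pixels apart, and keep the maximum value of each cell."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            cell = feature_map[y * stride:y * stride + size,
                               x * stride:x * stride + size]
            out[y, x] = cell.max()
    return out

fm = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 1, 3],
], dtype=float)
print(max_pool(fm))  # 2x2 output: [[4, 2], [2, 5]]
```

Note how the 4×4 input becomes a 2×2 output: the same information about where the strong activations are is kept, with a quarter of the values.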

The pooling layer reduces the number of parameters and calculations in the network. This improves the efficiency of the network and reduces the risk of overfitting.

The maximum values are located less precisely in the feature maps obtained after pooling than in those received as input, and this is actually a big advantage! Indeed, when you want to recognize a dog, for example, its ears don’t need to be located as precisely as possible: knowing that they’re near the head is enough.

Thus, the pooling layer makes the network less sensitive to the position of features: the fact that a feature is a little higher or lower, or even that it has a slightly different orientation should not cause a radical change in the classification of the image.

3. The ReLU correction layer

ReLU (Rectified Linear Units) denotes the real nonlinear function defined by ReLU(x)=max(0,x).

The ReLU correction layer therefore replaces all negative values received as inputs with zeros. It plays the role of an activation function.
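Since ReLU acts element-wise, the whole correction layer amounts to a single NumPy call (a minimal sketch):

```python
import numpy as np

def relu(x):
    # Replace every negative value with zero; positive values pass unchanged.
    return np.maximum(0, x)

fm = np.array([[-1.5, 2.0],
               [0.0, -3.0]])
print(relu(fm))  # negatives zeroed: [[0, 2], [0, 0]]
```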