For a non-edge containing grid (eg. the background sky), most of the pixels are the same value, so the overall output of the kernel at that point is 0. For a grid with an vertical edge, there is a difference between the pixels to the left and right of the edge, and the kernel computes that difference to be non-zero, activating and revealing the edges. The kernel only works only a 3×3 grids at a time, detecting anomalies on a local scale, yet when applied across the entire image, is enough to detect a certain feature on a global scale, anywhere in the image!

So the key difference we make with deep learning is ask this question: Can useful kernels be learnt? For early layers operating on raw pixels, we could reasonably expect feature detectors of fairly low level features, like edges, lines, etc.

There’s an entire branch of deep learning research focused on making neural network models interpretable. One of the most powerful tools to come out of that is Feature Visualization using optimization[3]. The idea at core is simple: optimize a image (usually initialized with random noise) to activate a filter as strongly as possible. This does make intuitive sense: if the optimized image is completely filled with edges, that’s strong evidence that’s what the filter itself is looking for and is activated by. Using this, we can peek into the learnt filters, and the results are stunning:

Feature visualization for 3 different channels from the 1st convolution layer of GoogLeNet[3]. Notice that while they detect different types of edges, they’re still low-level edge detectors.

Feature Visualization of channel 12 from the 2nd and 3rd convolutions[3]

One important thing to notice here is that convolved images are still images. The output of a small grid of pixels from the top left of an image will still be on the top left. So you can run another convolution layer on top of another (such as the two on the left) to extract deeper features, which we visualize.

Yet, however deep our feature detectors get, without any further changes they’ll still be operating on very small patches of the image. No matter how deep your detectors are, you can’t detect faces from a 3×3 grid. And this is where the idea of the receptive field comes in.

Receptive field

A essential design choice of any CNN architecture is that the input sizes grow smaller and smaller from the start to the end of the network, while the number of channels grow deeper. This, as mentioned earlier, is often done through strides or pooling layers. Locality determines what inputs from the previous layer the outputs get to see. The receptive field determines what area of the original input to the entire network the output gets to see.

The idea of a strided convolution is that we only process slides a fixed distance apart, and skip the ones in the middle. From a different point of view, we only keep outputs a fixed distance apart, and remove the rest[1].

3×3 convolution, stride 2

We then apply a nonlinearity to the output, and per usual, then stack another new convolution layer on top. And this is where things get interesting. Even if were we to apply a kernel of the same size (3×3), having the same local area, to the output of the strided convolution, the kernel would have a larger effective receptive field:

This is because the output of the strided layer still does represent the same image. It is not so much cropping as it is resizing, only thing is that each single pixel in the output is a “representative” of a larger area (of whose other pixels were discarded) from the same rough location from the original input. So when the next layer’s kernel operates on the output, it’s operating on pixels collected from a larger area.

(Note: if you’re familiar with dilated convolutions, note that the above is not a dilated convolution. Both are methods of increasing the receptive field, but dilated convolutions are a single layer, while this takes place on a regular convolution following a strided convolution, with a nonlinearity inbetween)

Feature visualization of channels from each of the major collections of convolution blocks, showing a progressive increase in complexity[3]

This expansion of the receptive field allows the convolution layers to combine the low level features (lines, edges), into higher level features (curves, textures), as we see in the mixed3a layer.

Followed by a pooling/strided layer, the network continues to create detectors for even higher level features (parts, patterns), as we see for mixed4a.

The repeated reduction in image size across the network results in, by the 5th block on convolutions, input sizes of just 7×7, compared to inputs of 224×224. At this point, each single pixel represents a grid of 32×32 pixels, which is huge.

Compared to earlier layers, where an activation meant detecting an edge, here, an activation on the tiny 7×7 grid is one for a very high level feature, such as for birds.

The network as a whole progresses from a small number of filters (64 in case of GoogLeNet), detecting low level features, to a very large number of filters(1024 in the final convolution), each looking for an extremely specific high level feature. Followed by a final pooling layer, which collapses each 7×7 grid into a single pixel, each channel is a feature detector with a receptive field equivalent to the entire image.

Compared to what a standard feedforward network would have done, the output here is really nothing short of awe-inspiring. A standard feedforward network would have produced abstract feature vectors, from combinations of every single pixel in the image, requiring intractable amounts of data to train.

The CNN, with the priors imposed on it, starts by learning very low level feature detectors, and as across the layers as its receptive field is expanded, learns to combine those low-level features into progressively higher level features; not an abstract combination of every single pixel, but rather, a strong visual hierarchy of concepts.

By detecting low level features, and using them to detect higher level features as it progresses up its visual hierarchy, it is eventually able to detect entire visual concepts such as faces, birds, trees, etc, and that’s what makes them such powerful, yet efficient with image data.

A final note on adversarial attacks

With the visual hierarchy CNNs build, it is pretty reasonable to assume that their vision systems are similar to humans. And they’re really great with real world images, but they also fail in ways that strongly suggest their vision systems aren’t entirely human-like. The most major problem: Adversarial Examples[4], examples which have been specifically modified to fool the model.

To a human, both images are obviously pandas. To the model, not so much.[4]

Adversarial examples would be a non-issue if the only tampered ones that caused the models to fail were ones that even humans would notice. The problem is, the models are susceptible to attacks by samples which have only been tampered with ever so slightly, and would clearly not fool any human. This opens the door for models to silently fail, which can be pretty dangerous for a wide range of applications from self-driving cars to healthcare.

Robustness against adversarial attacks is currently a highly active area of research, the subject of many papers and even competitions, and solutions will certainly improve CNN architectures to become safer and more reliable.

Conclusion

CNNs were the models that allowed computer vision to scale from simple applications to powering sophisticated products and services, ranging from face detection in your photo gallery to making better medical diagnoses. They might be the key method in computer vision going forward, or some other new breakthrough might just be around the corner. Regardless, one thing is for sure: they’re nothing short of amazing, at the heart of many present-day innovative applications, and are most certainly worth deeply understanding.

References

Further Resources