The Convolutional Layer

First, a smidge of theoretical background

When you first saw the term “convolution,” you might have recognized it as the mathematical operation commonly used in signal processing. Suppose you have a system (e.g. a circuit) and you measure the output signal, f(t), given an exceedingly narrow input pulse (approximated as a Dirac delta function). This is great; however, what you really want is to be able to predict the output signal, y(t), given any input signal, g(t), you choose. To compute y(t), just convolve the two signals, f*g. What you’re really doing here is sweeping a flipped copy of g(t) across f(t) to generate y(t). This operation can be summarized in equation form:

y(t) = (f * g)(t) = ∫ f(τ) g(t − τ) dτ, where the integral runs over all τ

Check out this YouTube video, Signals and Systems: Convolution Theory, by UConn for a more in-depth example. The main thing to remember is that one function is swept over another function to generate some kind of output. Take a look at this animation from Wikipedia to visualize what is happening:

Figure 3 — animation from Wikipedia showing the convolution (black curve) of a box-shaped pulse (red box) with an exponential signal (blue curve).

In neural networks, the mechanics of a convolutional layer are not exactly identical to the mathematical operation (strictly speaking, most deep learning libraries implement cross-correlation, which skips the flipping step), but the general idea is the same: something called a “kernel” gets swept over an input array and generates an output array.

A warped wall detector: a qualitative look at kernels

If you visualize your input image as a 2-D array of pixels, imagine a much smaller 2-D array called a “kernel” sweeping across it, kind of like this:

Figure 4 — Visualization of a kernel (blue box) sweeping over all of the pixels in an image.

Imagine that you’re a competitive ninja and you’re trying to figure out which obstacles you should train on the most in order to boost your chances of success. You have lots of photos of different American Ninja Warrior course configurations that you have collected over the years, like this one:

Figure 5 — A typical American Ninja Warrior course.

You decide that the first thing you want to do is count up all of the course configurations that include a warped wall. You don’t want to sit around counting warped walls by hand, so you decide to train a CNN to perform this “object detection” task automatically for you.

Let’s think conceptually about how this CNN might work within the blueprint of what we just talked about (remember the kernels sweeping across pixel arrays?). Imagine a kernel that has somehow learned to detect warped walls. As it sweeps across the input image, it looks down at all the pixels within each one of its “receptive fields” and detects whether or not a warped wall is there (it only looks at one receptive field at a time). If a warped wall is present, the kernel sends a “yes” to the output array. If not, the kernel sends a “no.” Then it sweeps to its next receptive field and looks for more warped walls. It keeps doing this until it has swept over the entire input image.

In short, the convolution of the kernel and the input image generates an output array of yes’s and no’s. Take a look at Figure 6 to visualize what is going on:

Figure 6 — a “warped wall detector.” A kernel detects warped walls by sweeping across an input image. The convolution of the kernel and input image generates an output array of yes’s and no’s.

The output array then becomes the input array for the next layer of the network. Of course, this is a highly simplified conceptualization of what’s really happening. You may be asking: How does the kernel know what types of features it’s looking for? What exactly is the kernel? Where does the kernel come from?

The kernel and where it comes from

The kernel is just a 2-D array of WEIGHTS. This bears repeating:

A kernel is a 2-D array of weights.

Remember how linear regression trains its weights using gradient descent? A CNN does the same thing: it trains all of its weights via gradient descent, with the gradients computed through backpropagation. The weights associated with the convolutional layers in a CNN are what make up the kernels (remember that not every layer in a CNN is a convolutional layer). Until the weights are trained, none of the kernels know which “features” they should detect.

So if each kernel is just an array of weights, how do these weights operate on the input image during convolution? The network simply performs an element-wise multiplication between the kernel and the input pixels within its receptive field, then sums everything up and sends that value to the output array. In Figure 7, you can see how the first element in the output array is calculated:

Figure 7 — element-wise multiplication.
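Here’s a minimal sketch of that calculation in NumPy, using the 2×2 kernel from Figure 7 (the input pixel values are made up for illustration):

```python
import numpy as np

# The 2x2 kernel from Figure 7 (an array of weights)
kernel = np.array([[2, 1],
                   [0, 2]])

# A hypothetical 2x2 receptive field of input pixels
patch = np.array([[3, 1],
                  [0, 4]])

# Element-wise multiply, then sum everything up:
# this single number becomes one element of the output array
output_element = np.sum(patch * kernel)
print(output_element)  # 3*2 + 1*1 + 0*0 + 4*2 = 15
```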

Once the first element of the output array has been filled in, the kernel sweeps over to its next stop. The next element of the output array is calculated, then the next, and so on until the kernel has swept over the entire input image and the entire output array has been filled. Take a look at Figure 8 to help visualize this process. The tiny numbers in the dark blue box sweeping over the input image correspond to the kernel weights. Notice how these weights never change as the kernel performs its full sweep:

Figure 8 — a kernel sweeps over an image, generating each element of the output array along the way.

In Figures 7 and 8, the kernel we used was [[2,1], [0,2]], but that was just an arbitrary example so that we could get a feel for how element-wise multiplication works. Remember that each element of a kernel is actually a WEIGHT that the network learns during backpropagation.
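If you’d like to see the whole sweep spelled out, here’s a minimal sketch in plain NumPy, assuming stride 1 and no padding (a “valid” convolution). As noted earlier, skipping the flip makes this technically a cross-correlation, which is the CNN convention:

```python
import numpy as np

def convolve2d(image, kernel):
    """Sweep the kernel over the image, filling in one output element per stop."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1   # output height (no padding, stride 1)
    ow = image.shape[1] - kw + 1   # output width
    output = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]      # current receptive field
            output[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return output

image = np.arange(16).reshape(4, 4)   # a made-up 4x4 "image"
kernel = np.array([[2, 1], [0, 2]])   # the kernel from Figures 7 and 8
print(convolve2d(image, kernel))      # a 3x3 output array
```

Notice that the kernel’s weights stay fixed for the entire sweep; only the receptive field moves.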

The set of weights assigned to a kernel actually has rich visual meaning and encodes which “feature” that kernel will look for as it sweeps across an image.

A deeper look at how kernels encode visual meaning

Let’s go back to our earlier example of a dog and cat classifier. If we had to dictate to the computer which features it should use in order to discriminate between dogs and cats, we might focus on the ears (floppy vs. pointy), the nose (big vs. small), and the eyes (round pupils vs. vertical pupils).

CNNs train their weights automatically, so we have no control over which features the network chooses to use. However, we can come up with our own kernels to get a feel for how they can be used to detect different features. Take a look at four simple kernels in Figure 9:

Figure 9 — Kernels.

The kernels displayed in Figure 9 detect horizontal lines (top-left), vertical lines (top-right), 45-degree lines (bottom-left), and 135-degree lines (bottom-right). Each kernel is shown as an array of weights (left) and a pixel representation (right). Notice how the pixel representations all look kind of like filters. For example, the pixel representation of the “horizontal lines” kernel blocks everything behind it except for a horizontal strip running across the center. In fact, kernels can actually be represented as a small image the size of the receptive field!
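We can write these kernels down explicitly. Here’s a sketch assuming the standard 3×3 line-detection kernels, one per orientation in Figure 9 (the exact weights in the figure may differ slightly):

```python
import numpy as np

# Standard 3x3 line-detection kernels, one per orientation in Figure 9
# (an assumption: the exact Figure 9 weights may differ slightly)
horizontal = np.array([[-1, -1, -1],
                       [ 2,  2,  2],
                       [-1, -1, -1]])
vertical = np.array([[-1,  2, -1],
                     [-1,  2, -1],
                     [-1,  2, -1]])
diag_45 = np.array([[-1, -1,  2],
                    [-1,  2, -1],
                    [ 2, -1, -1]])
diag_135 = np.array([[ 2, -1, -1],
                     [-1,  2, -1],
                     [-1, -1,  2]])

# Their element-wise sum gives the edge detection kernel we build below
print(horizontal + vertical + diag_45 + diag_135)
# [[-1 -1 -1]
#  [-1  8 -1]
#  [-1 -1 -1]]
```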

Coming up with kernels that have interesting applications is pretty math-ish, so we won’t get into the details here (if you’d like to read more about kernels, read these computer science notes from Cornell). You can actually use kernels to apply many interesting effects to an image, such as sharpening, blurring, and embossing. This is actually how many of the filters in your favorite graphics editor work!

We can approximate a pretty good edge detection kernel by combining all of the kernels in Figure 9 (using an element-wise sum): [[-1,-1,-1], [-1,8,-1], [-1,-1,-1]]. We can easily convolve this kernel with an input image using the Python image processing library, OpenCV. Take a look at the code below and the result in Figure 10:
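A minimal version of that code might look like this (the image path is a placeholder):

```python
import cv2
import numpy as np

# The edge detection kernel we built by summing the four line detectors
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

# Load the image (placeholder filename) and convolve it with the kernel;
# cv2.filter2D performs the sweep for us, and -1 keeps the input's bit depth
img = cv2.imread('jessie_graff.jpg')
output = cv2.filter2D(img, -1, kernel)

cv2.imwrite('edges.jpg', output)
```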

Figure 10 — Original image (left) convolved with an edge detecting kernel (right).

In Figure 10 we display the convolution of our image of Jessie Graff with the edge detection kernel we just came up with. Wow! The output array is actually ANOTHER IMAGE…

Feature hierarchies

If the input and output arrays for all of the convolutional layers are images, then we can visualize a CNN as a stack of images:

Figure 11 — Hierarchical layers.

Figure 11 displays an input image (bottom panel) with two convolutional layers stacked on top. Each pixel in the first convolutional layer (middle panel) can “see” only the pixels contained within its receptive field in the input image. Now let’s take a look at the second convolutional layer (top panel). Each pixel can see all of the pixels contained within its receptive field in the first layer. In turn, each pixel within that receptive field can see all of the pixels in the input image that are contained within its receptive field. This means that as you travel towards the top-most layers of the network, each pixel has more and more information about the input image encoded in it. In this way, the structure of CNNs can be thought of as “hierarchical.”
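A quick back-of-the-envelope sketch makes this concrete. Assuming stride-1 convolutional layers with 3×3 kernels, each layer you stack adds 2 pixels to the side length of a pixel’s effective receptive field in the input image:

```python
def receptive_field(num_layers, kernel_size=3):
    # With stride 1, each stacked conv layer adds (kernel_size - 1)
    # pixels to the side length of the effective receptive field
    return 1 + num_layers * (kernel_size - 1)

for n in [1, 2, 3]:
    side = receptive_field(n)
    print(f"{n} conv layer(s): each pixel sees a {side}x{side} patch of the input")
```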

A hierarchical structure makes so much sense when you’re working with images! Kernels in the lower convolutional layers focus on detecting small-scale features while kernels in the upper convolutional layers focus on detecting large-scale features.

A CNN works by learning a feature hierarchy, where features at higher levels of the hierarchy are formed by composing lower-level features. For example, consider the feature hierarchy for a warped wall:

Figure 12 — feature hierarchy for a warped wall.

In Figure 12, we can see how large-scale features (middle tier) are amalgamations of many different small-scale features (bottom tier). Interestingly, research indicates that animal brains actually process images in a similar, hierarchical manner. The hierarchical structure of CNNs is why we can take a network that was originally trained for a specific task and reuse the bottom layers of that network for a completely different task with surprisingly excellent results. This is called using a “pre-trained model.”

Feature maps

Just one more thing before we move on. The image that results from the convolution of the input image and kernel is called a “feature map.” You can design a convolutional layer with as many feature maps as you want; each map will have its own kernel associated with it. One convolutional layer can have hundreds of feature maps!

Taking another look at the LeNet-5 architecture (Figure 13), you can see that the first convolutional layer has 6 feature maps and the second has 16. When you hear someone say that a CNN is “wide,” they mean that each convolutional layer has many feature maps.
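As a sketch of what this looks like in code (using the Keras API here purely as an example), the number of feature maps is just a parameter you pass to each convolutional layer:

```python
import tensorflow as tf

# A minimal sketch of LeNet-5's two convolutional layers
# (feature-map counts from Figure 13; activations and other details omitted)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),         # 32x32 grayscale input image
    tf.keras.layers.Conv2D(6, kernel_size=5),  # 6 feature maps, each with its own 5x5 kernel
    tf.keras.layers.AveragePooling2D(),        # subsampling between the conv layers
    tf.keras.layers.Conv2D(16, kernel_size=5), # 16 feature maps
])
model.summary()  # prints each layer's output shape and weight count
```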