The most notable observation from the figure above is that each filter (i.e., filter-1, filter-2, and so on) in Step-1 actually comprises a set of three convolution kernels (Wt-R, Wt-G, and Wt-B). Each of these kernels is reserved for the Red (R), Green (G), and Blue (B) channel of the input image, respectively.

During forward propagation, the R, G, and B pixel values from the image are multiplied against the Wt-R, Wt-G, and Wt-B kernels respectively, producing an intermediate activation map for each channel (not shown in the figure). The outputs from the three kernels are then added to produce one activation map per filter.
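This per-channel multiply-then-sum behavior can be verified directly. Below is a minimal PyTorch sketch (the names `filt` and `image` and the random data are illustrative, not from the article): a single filter over an RGB input is equivalent to convolving each channel with its own kernel and adding the three intermediate maps.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One filter for a 3-channel (RGB) input: three 3x3 kernels,
# one per channel (the Wt-R, Wt-G, Wt-B of the figure).
# Weight shape: (out_channels=1, in_channels=3, 3, 3).
filt = torch.randn(1, 3, 3, 3)
image = torch.randn(1, 3, 8, 8)  # a dummy 8x8 RGB image

# One call convolves all three channels and sums the results.
out = F.conv2d(image, filt)

# Equivalent: convolve each channel with its own kernel, then add
# the three intermediate maps to get one activation map.
per_channel = [F.conv2d(image[:, c:c + 1], filt[:, c:c + 1]) for c in range(3)]
manual = per_channel[0] + per_channel[1] + per_channel[2]

print(torch.allclose(out, manual, atol=1e-5))  # True
```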

Subsequently, each of these activation maps is passed through the ReLU function, and finally run through the max-pooling layer, the latter being primarily responsible for reducing the spatial dimensions of the output activation map. What we have at the end is a set of activation maps, each usually half the height and width of the input image, whose signals now span a set of 32 two-dimensional tensors (32 being our chosen number of filters).
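The whole first unit (convolution, ReLU, 2x2 max-pooling) can be sketched in a few lines of PyTorch. The sizes here are illustrative assumptions: a 64x64 RGB input, 3x3 kernels with `padding=1` so the convolution preserves the spatial size, and the pooling alone halves each dimension.

```python
import torch
import torch.nn as nn

# Illustrative first convolution block: 32 filters over an RGB image,
# followed by ReLU and 2x2 max-pooling (which halves each spatial dim).
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

image = torch.randn(1, 3, 64, 64)  # a dummy 64x64 RGB image
activations = block(image)
print(activations.shape)           # torch.Size([1, 32, 32, 32])
```

The output is exactly the set described above: 32 two-dimensional activation maps, each half the height and width of the input.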

The output of a convolution layer often serves as the input to the subsequent convolution layer. Thus, if our second convolution unit started as follows:

conv_out_2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)(relu_out)

then the framework needs to instantiate 64 filters, each filter using a set of 32 unique kernels.
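This bookkeeping is visible in the weight tensor itself. A short sketch (the layer here is a standalone example, not tied to the earlier snippet): PyTorch stores a `Conv2d` weight as (out_channels, in_channels, height, width), so 64 filters over 32 input maps means 64 sets of 32 kernels.

```python
import torch.nn as nn

# Second convolution layer: 32 input channels (the 32 activation maps
# from the first layer) and 64 filters. Each of the 64 filters holds
# 32 distinct kernels, one per input channel.
conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
print(conv2.weight.shape)  # torch.Size([64, 32, 3, 3])
```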

The Why?

Another subtle but important point that often escapes scrutiny is why we used 32 filters for our first convolution layer. In many popular architectures, the number of filters grows larger (e.g., 64 for the second layer, 128 for the third, and so on) as we go deeper in the network.

Matt Zeiler, in this paper, employs the deconvolution operator to visualize how the kernels at different layers and depths of a deep convolution architecture get tuned during the training process. The general consensus is that in an optimally trained convolution network, the filters closest to the input image become sensitive to basic edges and patterns, while the filters in deeper layers become sensitized to progressively higher-order shapes and patterns. The phenomenon is very well summarized in these diagrams extracted from Matt’s paper:

Visualization of the activation of filters on the first and second (outermost) layers.

Visualization of the filter activations on the third layer.

Visualizations of the filter activations on the 4th and 5th layers.

Another question I wondered about for a considerable time is why different filters, even within a given layer, get tuned to a specific shape or pattern. After all, there is nothing extraordinary about the initial weights in any kernel that would guarantee the observed outcome. Precisely to that point: the process of stochastic gradient descent (SGD) automagically corrects the weights so that the kernels acquire the specialized features above. It is only important that:

the kernels (or weight matrices) be initialized randomly, so that each kernel is optimized toward a unique solution space, and

we define enough filters to capture the various features in our dataset, while striking a balance against the incurred computational cost.
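The first requirement is what frameworks already do by default, though it can also be made explicit. A hedged sketch (the seed and Kaiming scheme are illustrative choices, not prescribed by the article): with random initialization, no two kernels start out identical, so SGD can pull each toward a different feature detector.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# 32 filters over an RGB input, explicitly re-initialized with Kaiming
# normal initialization, a common choice for layers followed by ReLU.
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')

# Flatten each of the 32 kernels and count the distinct ones:
# random initialization makes them all different starting points.
flat = conv.weight.detach().reshape(32, -1)
print(torch.unique(flat, dim=0).shape[0])  # 32 distinct kernels
```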

And finally, many papers also suggest that visualizations of the filter activations often provide a window into the performance of a convolution architecture. A balanced and performant network displays activations like those discussed above, marked by the manifestation of well-defined edge and shape detectors. A network that over-fits, under-fits, and/or generalizes poorly often fails to show these characteristics. Hence, it is always a good idea to probe the network using the process used in (2) to see whether an experimental convolution network is yielding good results.

References: