Background

There are different types of semantic segmentation networks; the focus here is on Fully Convolutional Networks (FCNs). The first FCN was proposed in this paper from Berkeley. FCNs are built by extending ordinary convolutional neural networks (CNNs) and thus have more parameters and take longer to train than the latter. The work described here stemmed from an effort to build an FCN small enough to be trained on a typical laptop in a few minutes. The idea was to first build a dataset containing multiple MNIST digits in every image. The code used to generate this derived dataset is here. Let us call it M2NIST (multi-digit MNIST) to avoid any confusion.

M2NIST

Every image in M2NIST is grayscale (single channel), 64x84 pixels in size, and contains up to 3 digits from the MNIST dataset. A typical image looks like this:

A multi-digit image from M2NIST

The labels for the M2NIST dataset are segmentation masks. A segmentation mask is a binary image (pixel values 0 or 1) with the same height and width as the multi-digit image but with 10 channels, one for every digit from 0 to 9. The k-th channel in the mask has only those pixels set to 1 that coincide with the location of digit k in the input multi-digit. If digit k is not present in the multi-digit, the k-th channel in the mask has all its pixels set to 0. On the other hand, if the multi-digit contains more than one instance of the k-th digit, the k-th channel has all those pixels set to 1 that coincide with any of the instances. For example, the mask for the multi-digit above looks like this:

Mask for the multi-digit above. Only the channels for digits 2, 3, and 9 have some pixels set to 1

To keep things simple, the M2NIST dataset combines digits from MNIST without applying any transform such as rotation or scaling. M2NIST does ensure that the digits do not overlap.
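To make the construction concrete, here is a minimal sketch of how such a sample could be generated. This is not the linked generator code; it assumes MNIST is loaded through tensorflow.keras.datasets and uses a simple rejection loop to keep the 28x28 digit bounding boxes from overlapping:

```python
import numpy as np
from tensorflow.keras.datasets import mnist  # assumed source of MNIST digits

H, W, N_CLASSES = 64, 84, 10

def make_m2nist_sample(images, labels, rng, max_digits=3):
    """Paste up to max_digits non-overlapping MNIST digits onto a 64x84
    canvas and build the matching 10-channel segmentation mask."""
    canvas = np.zeros((H, W), dtype=np.uint8)
    mask = np.zeros((H, W, N_CLASSES), dtype=np.uint8)
    occupied = np.zeros((H, W), dtype=bool)
    for _ in range(rng.integers(1, max_digits + 1)):
        i = rng.integers(len(images))
        digit, k = images[i], labels[i]      # a 28x28 digit of class k
        for _ in range(100):                 # crude rejection sampling
            r, c = rng.integers(H - 28), rng.integers(W - 28)
            if not occupied[r:r+28, c:c+28].any():
                canvas[r:r+28, c:c+28] = digit
                mask[r:r+28, c:c+28, k] = digit > 0   # channel k marks digit k
                occupied[r:r+28, c:c+28] = True
                break
    return canvas, mask

(x_train, y_train), _ = mnist.load_data()
rng = np.random.default_rng(0)
image, mask = make_m2nist_sample(x_train, y_train, rng)
```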

The Idea Behind FCNs

The idea behind FCNs is very simple. Like CNNs, FCNs use a cascade of convolution and pooling layers. The convolution and max-pooling layers reduce the spatial dimensions of the input image and combine local patterns to generate increasingly abstract ‘features’. This cascade is called an encoder because it encodes the raw input into abstract features.

In a CNN, the encoder is followed by a few fully-connected layers that mix together the local features produced by the encoder into global predictions that tell a story about the presence or absence of the objects of interest.

CNN = Encoder + Classifier

Typical CNN architecture. Source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html
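For illustration, here is a minimal Keras sketch of the encoder-plus-classifier structure. The layer counts and widths are arbitrary choices for this example, not the architecture in the figure:

```python
from tensorflow.keras import layers, models

# Encoder: convolution + max-pooling shrink H and W while deepening channels.
encoder = models.Sequential([
    layers.Input((64, 84, 1)),
    layers.Conv2D(32, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),                  # 64x84 -> 32x42
    layers.Conv2D(64, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),                  # 32x42 -> 16x21
])

# Classifier: fully-connected layers mix the local features into
# one global prediction per class.
cnn = models.Sequential([
    encoder,
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),  # one probability per digit
])
cnn.summary()
```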

In an FCN, we are interested in predicting masks. A mask has n channels if there are n classes of objects that could be present in an input image. The pixel at row r and column c in the k-th channel of the mask predicts the probability that the pixel at (r, c) in the input belongs to class k. This is also known as pixel-wise dense prediction. Because the probabilities of belonging to the different classes must add up to 1 for any pixel, the values at (r, c) across channels 1 to n sum to 1.

Mask with channel IDs for an M2NIST image containing digits 2, 3, and 9. The values at any position (r, c) must sum to 1 across channels 0 to 9.
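This per-pixel normalization is exactly a softmax taken along the channel axis; a small NumPy illustration:

```python
import numpy as np

def pixelwise_softmax(logits):
    """Normalize an (H, W, n) volume so that the n channel values at each
    pixel form a probability distribution."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # for stability
    return e / e.sum(axis=-1, keepdims=True)

probs = pixelwise_softmax(np.random.randn(64, 84, 10))
assert np.allclose(probs.sum(axis=-1), 1.0)  # sums to 1 at every (r, c)
```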

Let us understand how FCNs achieve pixel-wise dense prediction. FCNs first gradually expand the output features from the encoder stage using transpose convolution. Transpose convolution redistributes the features back to the pixel positions they came from. To understand how transpose convolution works, refer to this excellent post:

It is important to stress that transpose convolution does not undo convolution. It merely redistributes the output of some convolution in a fashion that is consistent with, but in the opposite direction of, the way in which convolution combines multiple values.

Transpose convolution redistributes one value to the (many) positions it would have come from. Source: https://towardsdatascience.com/up-sampling-with-transposed-convolution-9ae4f2df52d0
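In Keras, for example, a transpose convolution with stride 2 and 'same' padding doubles the height and width of its input; a quick shape check (the filter count of 256 is an arbitrary choice):

```python
import numpy as np
from tensorflow.keras import layers

# A stride-2 transpose convolution doubles H and W (with 'same' padding).
upsample = layers.Conv2DTranspose(filters=256, kernel_size=3,
                                  strides=2, padding='same')

x = np.zeros((1, 14, 14, 512), dtype='float32')  # a batch of one feature map
print(upsample(x).shape)                         # -> (1, 28, 28, 256)
```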

This expansion, or up-sampling as it is called, is repeated using multiple transpose convolutions until the features have the same height and width as the input image. This essentially gives us features for every pixel position and constitutes the decoder stage of an FCN.

FCN = Encoder + Decoder

Typical FCN architecture. The first stage is the encoder stage, similar to that in a CNN, which reduces the height (H) and width (W) of the input and increases the thickness or number of channels (C). The second stage is a decoder that uses transpose convolution (deconvolution) to up-sample features from the encoder to the same size as the input image. The figure shows the output H and W after every layer. The thickness or number of channels in the output is not shown but qualitatively represented. Source: https://www.doc.ic.ac.uk/~jce317/semantic-segmentation.html

The output of the decoder is a volume with shape HxWxC, where H and W are the dimensions of the input image and C is a hyper-parameter. The C channels are then combined into n channels in a pixel-wise fashion, n being the number of object classes we care about. This pixel-wise combination of feature values is done using a normal 1x1 convolution. 1x1 convolutions are commonly used for this kind of ‘dimension reduction’.

In most cases we have C > n, so it makes sense to call this operation a dimension reduction. It is also worth mentioning that, in most implementations, this dimension reduction is applied to the output of the encoder stage instead of the decoder’s output. This is done to reduce the size of the network.

Whether the encoder’s output is up-sampled by the decoder and then reduced to n channels, or the encoder’s output is immediately reduced to n channels and then up-sampled by the decoder, the final result has shape HxWxn. A softmax classifier is then applied pixel-wise to predict the probability of each pixel belonging to each of the n classes.
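Putting the pieces together, here is a minimal illustrative FCN for M2NIST in Keras, using the common memory-saving ordering in which the 1x1 convolution is applied to the encoder’s output before up-sampling. The layer sizes are arbitrary choices for this sketch, not a reference implementation:

```python
from tensorflow.keras import layers, models

n_classes = 10

fcn = models.Sequential([
    # Encoder: 64x84x1 -> 16x21x64
    layers.Input((64, 84, 1)),
    layers.Conv2D(32, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),
    # 1x1 convolution reduces the thickness to n_classes early, so the
    # decoder works on thin volumes: 16x21x64 -> 16x21x10
    layers.Conv2D(n_classes, 1, activation='relu'),
    # Decoder: two stride-2 transpose convolutions restore H and W:
    # 16x21x10 -> 32x42x10 -> 64x84x10
    layers.Conv2DTranspose(n_classes, 3, strides=2, padding='same'),
    layers.Conv2DTranspose(n_classes, 3, strides=2, padding='same'),
    # Pixel-wise softmax across the 10 channels
    layers.Softmax(axis=-1),
])
fcn.summary()
```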

To take a concrete example, suppose the encoder’s output has shape 14x14x512, as in the FCN diagram above, and the number of classes, n, is 10. One option is to first reduce the thickness dimension using 1x1 convolutions. This gives us a 14x14x10 volume which is then up-sampled to 28x28x10, 56x56x10 and so on, until the output has shape HxWx10. The second option is to up-sample first, which gives us 28x28x512, 56x56x512 and so on until we reach HxWx512, and then use 1x1 convolution to reduce the thickness to HxWx10. Clearly the second option consumes more memory, since the intermediate outputs with thickness 512 are far larger than the intermediate outputs with thickness 10 produced by the first approach.
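A rough element count makes the gap concrete; for brevity, this follows each option only up to the 56x56 stage:

```python
# Intermediate decoder activations (element counts) for the 14x14x512
# encoder output above, before the final HxWx10 result:
reduce_then_upsample = 14*14*10 + 28*28*10 + 56*56*10   # option 1: ~41k
upsample_then_reduce = 28*28*512 + 56*56*512            # option 2: ~2.0M
print(upsample_then_reduce / reduce_then_upsample)      # roughly 49x more
```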

With the encoder-decoder architecture in mind, let us see how to reuse parts of a CNN as the encoder for an FCN.