In this corner… Implementation #2: “Transfer Learning” with VGG16

How does transfer learning work?

For our second implementation of a gym equipment classifier, we’re going to use VGG16 (or…part of it). The VGG16 architecture, developed by the “Visual Geometry Group” (VGG) at Oxford, was the runner-up in the ILSVRC-2014 challenge. It was trained on a subset of the ImageNet dataset, the world’s largest and most comprehensive database of hand-annotated images. ImageNet contains over 14 million images that fall into over 20,000 categories. All of the ILSVRC-2014 challengers were trained on the same subset, which contained 1,000 of those categories: everything from dogs to pianos to space shuttles (the full list can be found here).

The idea of “transfer learning” is that part of a “pre-trained” network can be recycled and reused in another network. This saves a lot of time because we now only need to train the non-recycled layers of the new network. It’s also helpful if we have limited training data. If the pre-trained network was trained on a vast dataset, then recycling layers from that network can effectively “transfer” the knowledge gained from the large dataset to the new model, which can only “see” the small dataset. Transfer learning is basically like cheap cable tv for a resource-constrained network.

You might be asking: how is it helpful to recycle layers from VGG16, which was trained to classify dogs and pianos, when we want to build an image classifier for gym equipment?

Remember that a CNN learns feature hierarchies. Features from higher levels of the hierarchy are formed by the composition of lower-level features. The hierarchy of a warped wall is displayed in Figure 5. The lower layers learn low-level features while the upper layers learn high-level features. While high-level features are very specific to each unique class, low-level features, like edges and corners, are shared by many different classes. For example, a piano and a bench press probably share lots of low-level features. This is why, when using a pre-trained network, we can get away with recycling the lower layers. However, we must re-train the top layers.

Figure 5: Feature hierarchy of a warped wall.

Overview of the VGG16 architecture

VGG16 has 16 weighted layers (13 convolutional and 3 fully connected, counting the output layer), not including the pooling layers. Its architecture is considered one of the simplest of all of the state-of-the-art CNNs, especially compared to architectures like GoogLeNet and ResNet. In the image below, you can see that one of the major differences between VGG16 and LeNet-5 is that VGG16 stacks convolutional layers one on top of the other, instead of following each one with a pooling layer.

The VGG16 architecture. The “frozen” pre-trained layers that we recycle and reuse in the new network are labeled with snowflakes.
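To make the stacking pattern concrete, here is a Keras sketch of VGG16’s first two convolutional blocks: two convolutions back to back, then a single pooling layer (LeNet-5, by contrast, follows each convolution with a pooling layer). This is just the first slice of the architecture, not the full network:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# First two blocks of VGG16: convolutional layers are stacked
# back-to-back, with a pooling layer only at the end of each block.
model = Sequential([
    Conv2D(64, (3, 3), activation='relu', padding='same',
           input_shape=(224, 224, 3)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),  # 224x224 -> 112x112
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),  # 112x112 -> 56x56
])

print(model.output_shape)  # (None, 56, 56, 128)
```

The same pattern repeats (with 256- and 512-filter blocks) three more times before the fully connected layers.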

The layers that are labeled above with snowflakes have been “frozen.” These are the layers that we’re going to recycle. Notice how everything is frozen except for the fully connected layers. When we build the new network, we will reuse the frozen layers from VGG16 and build 2 fully connected layers on top of that. When we train the new network, the weights for the recycled layers will remain frozen during backpropagation while the weights for the unfrozen layers will be tweaked and optimized.
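To see what “remain frozen during backpropagation” means, here is a toy NumPy sketch (the dimensions and data are made up, and this is a plain linear network rather than a CNN): gradients are computed as usual, but the update is simply never applied to the frozen layer’s weights.

```python
import numpy as np

# Toy two-layer linear network trained by gradient descent: layer 1 plays
# the role of the recycled, frozen layers; layer 2 is new and trainable.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))      # "pre-trained" weights: frozen
W2 = rng.normal(size=(4, 2))      # new weights: trainable
W1_before, W2_before = W1.copy(), W2.copy()

x = rng.normal(size=(5, 3))       # a batch of 5 inputs
y = rng.normal(size=(5, 2))       # targets

lr = 0.01
for _ in range(100):
    h = x @ W1                    # forward pass through the frozen layer
    pred = h @ W2
    grad_pred = 2 * (pred - y) / len(x)   # squared-error gradient
    # Only the un-frozen layer's weights get updated;
    # grad_W1 = x.T @ (grad_pred @ W2.T) is simply never applied.
    W2 -= lr * (h.T @ grad_pred)

print(np.array_equal(W1, W1_before))  # True: frozen weights untouched
```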

Implementing a pre-trained model

Let’s take a look at some code to see what the implementation looks like!

In the interest of time and space, the code above assumes that we have already trained the weights for the fully connected layers and that we are ready to make a prediction. If you want to see how the weights were trained, check out the full code in my GitHub repo or read this excellent post on the Keras blog that talks in depth about how to do this.

In order to make a prediction on a new image, we first resize the image, load the resized image into an array, and make sure it is in RGB (so that the size and color scale are consistent with the training images). Next, we use the expand_dims function in NumPy to add an extra outer dimension to the image array so that it’s in the correct format: the network expects a batch of images as its input, so a single image must be wrapped in a batch of size 1.
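For example, expand_dims turns a single image array into a batch of one:

```python
import numpy as np

# A single RGB image as an array: (height, width, channels).
img_array = np.zeros((224, 224, 3))

# The network expects a *batch* of images, so add an outer batch
# dimension of size 1 with expand_dims.
batch = np.expand_dims(img_array, axis=0)

print(img_array.shape)  # (224, 224, 3)
print(batch.shape)      # (1, 224, 224, 3)
```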

Now we’re ready to send the input image through the network. We implement the network in two parts: first, we send the input through the frozen layers of VGG16, then build the new layers on top of that.

Let’s begin with the first part. Another reason that Keras is so convenient is that you can easily instantiate pre-loaded models, such as VGG16. We instantiate the frozen layers of VGG16 by passing 2 parameters to applications.VGG16():

- include_top=False, which leaves off the “top” layers (the fully connected layers), and
- weights='imagenet', which loads the weights that were trained on the ImageNet dataset.

We send the input through these frozen layers and store the output in the variable bottleneck_prediction. This output will be the input for the second part of the network.

Next, we implement the second part of the network: a free-standing network with 2 fully connected layers, including the output layer. We then load the weights that we previously trained for these top layers and saved in a file called “top_model_weights.h5”. Finally, we pass bottleneck_prediction to model.predict_classes() in order to make the final prediction.

If you are leveraging a pre-trained network in your model and you find that your evaluation metric is still subpar, you can proceed by un-freezing even more layers and re-training the weights for all of the un-frozen layers. In other words, you can keep removing snowflakes and re-training more and more of the weights until you are happy with the results.
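In Keras, “removing a snowflake” is just a matter of flipping a layer’s trainable flag back on. Here’s a sketch that un-freezes only the last convolutional block (weights=None is used here purely to skip the ImageNet download for illustration; in real fine-tuning you’d pass weights='imagenet'):

```python
from keras import applications

# Build the VGG16 convolutional base (weights=None skips the download;
# use weights='imagenet' when actually fine-tuning).
base_model = applications.VGG16(include_top=False, weights=None,
                                input_shape=(224, 224, 3))

# "Remove snowflakes": un-freeze the last convolutional block (block5)
# and keep every layer below it frozen.
for layer in base_model.layers:
    layer.trainable = layer.name.startswith('block5')

trainable = [layer.name for layer in base_model.layers if layer.trainable]
print(trainable)  # only the block5 layers
```

After un-freezing, re-compile the model with a small learning rate so that backpropagation only gently adjusts the newly trainable weights instead of wrecking what they learned from ImageNet.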

As we discussed earlier, VGG16 was trained on 1,000 categories of images, including Dalmatians, space shuttles, and pianos. Of those 1,000 categories, only one (“dumbbell”) falls into the broader category of gym equipment, and it doesn’t resemble any of our 5 classes. I was skeptical about whether or not this implementation would work. In the next section, we’ll find out!