In this post, we document how we used Google’s TensorFlow to build an image recognition engine. We used Inception to extract features from the images and then trained a support vector machine (SVM) classifier on those features to recognise the object; in other words, we used transfer learning. Our aim is to build a system that helps a user who has a zip puller find the matching puller in a database. This piece also covers how the Inception network sees the input images and assesses how well the extracted features can be classified.

Our puller project with TensorFlow

Recently, Oursky took on a small zip puller recognition project. One of our teams had to build a system that lets users match an image of a puller with the most similar puller in the database. The sample size for the trial is small (12 pullers), which has implications discussed below as we share our experience of trying out Google’s TensorFlow.

Images showing 12 different pullers

Our first test was to compare the histogram of oriented gradients (HOG) feature computed on the input image with the HOG features of all the puller model images rendered from their computer-aided design (CAD) models. This solution works, but the matching performance is poor if the input image background has a strong texture.

To address the problem with textured backgrounds, we also tested an alternative solution: a relatively shallow convolutional neural network (CNN) with two convolutional layers and two fully connected layers [1] for classifying the puller image. However, since our data set is too small (around 200 puller images for each type) and lacks variety, the classification performance is poor. It is basically no different from random guessing.

Training a CNN from scratch with a small data set is indeed a bad idea. The common approach for using CNN to do classification on a small data set is not to train your own network, but to use a pre-trained network to extract features from the input image and train a classifier based on those features. This technique is called transfer learning. TensorFlow has a tutorial on how to do transfer learning on the Inception model; Kernix also has a nice blog post talking about transfer learning and our work is largely based on that.

Brief overview on classification

In a classification task, we first need to gather a set of training examples. Each training example is a pair of input features and labels. We would like to use these training examples to train a classifier, and hope that the trained classifier can tell us a correct label when we feed it an unseen input feature.
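To make the idea of training examples concrete, here is a minimal sketch using scikit-learn. The 2-d features and the labels `puller_a` and `puller_b` are made up for illustration; they are not taken from our data set.

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: each row of X_train is an input feature,
# and the entry of y_train at the same position is its label.
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                    [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]])
y_train = np.array(["puller_a", "puller_a", "puller_a",
                    "puller_b", "puller_b", "puller_b"])

# Train a classifier on the (feature, label) pairs.
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)

# Ask the trained classifier to label features it has never seen.
X_unseen = np.array([[0.05, 0.05], [0.95, 1.0]])
predicted = classifier.predict(X_unseen)
print(predicted)
```

The first unseen point sits next to the `puller_a` examples and the second next to the `puller_b` examples, so that is how the classifier labels them.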

There are lots of learning algorithms for classification, for example SVMs, random forests, and neural networks. How well a learning algorithm performs depends heavily on the input feature. An input feature is a representation that captures the essence of the object under classification.

For example, in image recognition, the raw pixel values could be an input feature. However, when raw pixel values are used as the input feature, the feature is usually too high-dimensional and too generic for a classifier to work well. In this case, we can either use a more complex classifier such as a deep neural network, or use some domain knowledge to design a better input feature.[2]

For our puller classification task, we will use SVM for classification, and use a pre-trained deep CNN from TensorFlow called Inception to extract a 2048-d feature from each input image.
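To put the 2048-d feature in perspective, the sketch below flattens an image into a raw-pixel feature. It assumes a 299×299 RGB image, which is the input size Inception v3 is normally used with; the pixel values are random placeholders rather than a real puller photo.

```python
import numpy as np

# Placeholder 299x299 RGB image with random pixel values (assumption:
# 299x299 is the usual Inception v3 input size).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(299, 299, 3), dtype=np.uint8)

# Using raw pixels as the input feature means flattening the image
# into one long vector.
raw_feature = image.reshape(-1).astype(np.float32) / 255.0

print(raw_feature.shape)          # (268203,)
print(raw_feature.shape[0] / 2048)  # how many times larger than a 2048-d feature
```

The raw-pixel feature has 299 × 299 × 3 = 268,203 dimensions, roughly 130 times more than the 2048-d feature we extract with Inception.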

Bottleneck features of a deep CNN

The common structure of a CNN for image classification has two main parts: 1) a long chain of convolutional layers, and 2) a few (or even just one) fully connected layers. The long chain of convolutional layers is for feature learning. The learned feature is fed into the fully connected layers for classification.

The feature that feeds into the last classification layer is also called the bottleneck feature. The following image shows the structure of TensorFlow’s Inception network we are going to use. We have indicated the part of the network that we are getting the output from as our input feature.

TensorFlow Inception model, indicating the bottleneck feature

How Inception sees a puller

Training a CNN means it learns a bunch of image filters (kernels).

For example, if the input of the convolutional layer is an image with 3 channels and the kernel size for this layer is 3×3, there will be an independent set of three 3×3 kernels for each output channel. Each kernel in a set convolves with the corresponding channel of the input, producing three convolved images. The sum of those convolved images forms one channel of the output.

The illustration below shows a convolution step.

Illustration of convolution
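The convolution step described above can be sketched in plain NumPy. This is a simplified illustration, assuming no padding and a stride of 1, and it does not flip the kernel (strictly speaking a cross-correlation, which is what CNN libraries usually compute). The function name `convolve_one_output_channel` is our own.

```python
import numpy as np

def convolve_one_output_channel(image, kernels):
    """Compute one output channel of a convolutional layer.

    image:   array of shape (height, width, channels)
    kernels: array of shape (k_height, k_width, channels),
             i.e. one kernel per input channel
    """
    height, width, channels = image.shape
    k_height, k_width, k_channels = kernels.shape
    assert channels == k_channels, "need one kernel per input channel"

    out = np.zeros((height - k_height + 1, width - k_width + 1))
    for c in range(channels):
        # Slide the kernel for channel c over channel c of the input
        # and add the result to the running sum.
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + k_height, j:j + k_width, c]
                out[i, j] += np.sum(patch * kernels[:, :, c])
    return out

# A 5x5 image with 3 channels and a set of three 3x3 kernels, all ones.
image = np.ones((5, 5, 3))
kernels = np.ones((3, 3, 3))
output_channel = convolve_one_output_channel(image, kernels)
print(output_channel.shape)  # (3, 3)
print(output_channel[0, 0])  # 27.0
```

Each output pixel here is 27 because every 3×3 patch contributes 9 per channel and the three channels are summed. A layer with 32 output channels would repeat this with 32 independent kernel sets.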

As the output of each convolutional layer [3] is a multi-channel image, we can also view it as multiple grayscale images. By plotting those grayscale images, we can understand how the Inception network sees an image. The following images are extracted at different stages of the convolutional layer chain. The extraction points are marked as A, B, C and D in the Inception model figure.

This is an input image.

All the 32 149×149 images at stage A:

Inception Output image at Stage A

All the 32 147×147 images at stage B:

Inception Output image at Stage B

All the 288 35×35 images at stage C:

Inception Output image at stage C

All the 768 17×17 images at stage D:

Inception Output image at stage D

Here we can see the images become more and more abstract going down the convolutional layer chain. We can also see that some of the images highlight the puller, while others highlight the background.
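For readers who want to produce similar pictures, the following sketch tiles the channels of a layer output into one large grayscale mosaic. It is an illustration, not the plotting code we used: the helper `tile_channels` is our own, and the stage A output is replaced by random numbers of the same shape (149×149 with 32 channels).

```python
import numpy as np

def tile_channels(feature_map, columns):
    """Arrange the channels of a (height, width, channels) array into a
    2-d mosaic, scaling each channel to the range [0, 1] independently."""
    height, width, channels = feature_map.shape
    rows = -(-channels // columns)  # ceiling division
    mosaic = np.zeros((rows * height, columns * width), dtype=np.float32)
    for index in range(channels):
        channel = feature_map[:, :, index].astype(np.float32)
        span = channel.max() - channel.min()
        if span > 0:
            channel = (channel - channel.min()) / span
        else:
            channel = np.zeros_like(channel)
        r, c = divmod(index, columns)
        mosaic[r * height:(r + 1) * height, c * width:(c + 1) * width] = channel
    return mosaic

# Random stand-in for the stage A output: 32 channels of 149x149.
stage_a = np.random.default_rng(0).normal(size=(149, 149, 32))
mosaic = tile_channels(stage_a, columns=8)
print(mosaic.shape)  # (596, 1192)

# To display it, one could use matplotlib:
#   import matplotlib.pyplot as plt
#   plt.imshow(mosaic, cmap="gray"); plt.axis("off"); plt.show()
```

Scaling each channel separately keeps faint channels visible, at the cost of making brightness incomparable between channels.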

Why is the bottleneck feature good?

The bottleneck feature of the Inception network is a 2048-d vector. The following figure shows the bottleneck feature of the previous input image as a bar chart.

Bottleneck feature in bar chart form

For the bottleneck feature to be a good feature for classification, we would like the features representing the same type of puller to be close (think of the feature as a point in 2048-d space) to each other, while features representing different types of puller should be far apart. In other words, we would like to see features in a data set clustering themselves according to their types.
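One simple way to check this numerically is to compare the average distance between features of the same class with the average distance between features of different classes. The sketch below does this on synthetic 2048-d features; the data, the three classes, and the helper `mean_intra_inter_distance` are all made up for illustration and are not our puller features.

```python
import numpy as np

def mean_intra_inter_distance(features, labels):
    """Return (mean same-class distance, mean different-class distance)
    using Euclidean distance between every pair of rows."""
    diffs = features[:, None, :] - features[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    same_class = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    intra = distances[same_class & not_self].mean()
    inter = distances[~same_class].mean()
    return intra, inter

# Synthetic data: 3 classes, 10 samples each, scattered around
# well-separated class centres in 2048-d space.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
centres = rng.normal(scale=5.0, size=(3, 2048))
features = centres[labels] + rng.normal(scale=1.0, size=(30, 2048))

intra, inter = mean_intra_inter_distance(features, labels)
print(intra < inter)  # True
```

A good feature gives a same-class distance that is clearly smaller than the different-class distance, as in this synthetic case.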

It is hard to see this kind of clustering in a 2048-d feature data set. However, we can apply dimensionality reduction[4] to the bottleneck features and transform them into 2-d features, which are easy to visualize. The following image is the scatter plot of the transformed features in our puller data set[5]. Different puller types are shown in different colors.

Scatter plot of transformed feature of the puller dataset

As we can see, points of the same color are mostly clustered together. There is a good chance that we can use the bottleneck feature to train a classifier with high accuracy.
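A reduction of this kind can be sketched with the t-SNE implementation in scikit-learn. The choice of scikit-learn, the perplexity value, and the synthetic features below are assumptions for illustration; the only thing taken from our setup is that the algorithm is t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for bottleneck features:
# 4 made-up classes with 15 samples each, in 2048-d space.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 15)
centres = rng.normal(scale=5.0, size=(4, 2048))
features = centres[labels] + rng.normal(size=(60, 2048))

# Reduce the 2048-d features to 2-d. The perplexity must be smaller
# than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedded = tsne.fit_transform(features)
print(embedded.shape)  # (60, 2)

# To draw the scatter plot, one could use matplotlib:
#   import matplotlib.pyplot as plt
#   plt.scatter(embedded[:, 0], embedded[:, 1], c=labels); plt.show()
```

Keep in mind that t-SNE is meant for visualization: distances in the 2-d plot are not a faithful copy of distances in the original 2048-d space.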

Code for extracting the Inception bottleneck feature

The Inception v3 model can be downloaded here.
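The sketch below is an illustration under stated assumptions rather than the code from our project. It uses the Keras applications API bundled with TensorFlow, and it assumes that the globally average-pooled output of the last convolutional block plays the role of the 2048-d bottleneck feature. The function names `build_extractor` and `extract_bottleneck` are our own.

```python
import numpy as np
import tensorflow as tf

def build_extractor(weights="imagenet"):
    """Build Inception v3 without its final classification layer.

    include_top=False drops the classification layer, and pooling="avg"
    applies global average pooling, so the model outputs one 2048-d
    vector per image. Pass weights=None to skip downloading the
    pre-trained ImageNet weights (useful only for checking shapes).
    """
    return tf.keras.applications.InceptionV3(
        include_top=False,
        weights=weights,
        pooling="avg",
        input_shape=(299, 299, 3),
    )

def extract_bottleneck(model, images):
    """images: array of shape (n, 299, 299, 3) with pixel values in [0, 255]."""
    # np.array makes a copy, so the caller's array is not modified
    # when the pixel values are rescaled to [-1, 1].
    batch = np.array(images, dtype=np.float32)
    batch = tf.keras.applications.inception_v3.preprocess_input(batch)
    return model.predict(batch, verbose=0)
```

Typical usage would be `model = build_extractor()` followed by `features = extract_bottleneck(model, images)`, where `images` have already been resized to 299×299. The resulting rows can then be fed to a classifier.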

Training an SVM classifier

SVM is a linear binary classifier.

The goal of the SVM is to find a hyper-plane that separates the training data correctly into two half-spaces while maximising the margin between those two classes.

Although SVM is a linear classifier, which can only deal with linearly separable data sets, we can apply the kernel trick to make it work for cases that are not linearly separable.

A commonly used kernel besides linear is the radial basis function kernel (RBF kernel).

The hyper-parameters for SVM include the type of kernel and the regularization parameter *C*. If using the RBF kernel, there is an additional parameter *γ* that controls the width of the radial basis function.
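To show what *γ* does, here is a small NumPy sketch of the RBF kernel in its usual form, k(x, x') = exp(-γ‖x − x'‖²). The three sample points and the two *γ* values are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel value for every pair of rows taken from a and b."""
    squared_distance = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * squared_distance)

# Three points at distances 0, 1 and 5 from the first point.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])

narrow = rbf_kernel(points, points, gamma=1.0)   # large gamma: narrow function
wide = rbf_kernel(points, points, gamma=0.01)    # small gamma: wide function

print(np.round(narrow[0], 3))
print(np.round(wide[0], 3))
```

With *γ* = 1 the point at distance 5 has a kernel value of practically zero, while with *γ* = 0.01 it is still about 0.78. A larger *γ* therefore makes each training sample influence only its close neighbourhood.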

Usually the bottleneck feature from a deep CNN is linearly separable. However, we will consider the RBF kernel as well.

We used a simple grid search to select the hyper-parameters. In other words, we tried out all the hyper-parameter combinations in the ranges we specified, and evaluated the performance of each trained classifier using cross validation.

The rule of thumb for *C* and *γ* is to try values that differ by orders of magnitude.

We used 10-fold cross validation.

SVM is a binary classifier. However, we could use the one-vs-all or one-vs-one approach to make it a multi-class classifier.
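The difference between the two approaches is easy to see by counting the binary classifiers each one trains. The sketch below uses scikit-learn wrappers on synthetic data with 12 classes, matching the number of pullers in our trial; the data itself is made up.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic data with 12 classes (20 samples per class).
X, y = make_blobs(n_samples=240, centers=12, n_features=16, random_state=0)

# One-vs-all (also called one-vs-rest): one binary SVM per class.
one_vs_rest = OneVsRestClassifier(LinearSVC()).fit(X, y)

# One-vs-one: one binary SVM per pair of classes.
one_vs_one = OneVsOneClassifier(LinearSVC()).fit(X, y)

print(len(one_vs_rest.estimators_), len(one_vs_one.estimators_))  # 12 66
```

For 12 classes, one-vs-all needs 12 binary classifiers, while one-vs-one needs 12 × 11 / 2 = 66, each trained on the samples of only two classes.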

This may seem like a lot of work for training an SVM classifier, but it is just a few function calls when using a machine learning software package like scikit-learn.

Code for training the SVM classifier
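As a minimal sketch of the procedure described above (grid search over the kernel, *C* and *γ* with 10-fold cross validation), the following uses scikit-learn on synthetic data. The parameter ranges, the 64-d synthetic features, and the use of `StratifiedKFold` are assumptions for illustration; they are not the exact settings or data from our project.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the bottleneck features: 12 classes,
# 20 samples each, 64-d to keep the example fast.
X, y = make_blobs(n_samples=240, centers=12, n_features=64,
                  cluster_std=2.0, random_state=0)

# Candidate hyper-parameters, spaced by orders of magnitude.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100],
     "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
]

# 10-fold cross validation that keeps the class balance in each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Try every combination and keep the one with the best mean accuracy.
# SVC handles the multi-class case internally with one-vs-one.
search = GridSearchCV(SVC(), param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is a classifier retrained on the whole data set with the best combination, ready to call `predict` on new features.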

SVM training result

The following is the training result we got, which is perfect! Though this might be due to overfitting…

We’ve used it to build a mobile app and a web front-end for the puller classifier for field testing.

Puller Matcher screenshot

Since the classifier works with unseen samples, it seems that the overfitting issue is not so serious.

Conclusion

A pre-trained deep CNN, the Inception network in particular, can be used as a feature extractor for general image classification tasks.

The bottleneck feature of the Inception network should be a good feature for classification. We extracted the bottleneck features from our data set and applied dimensionality reduction for visualization. The result shows a nice clustering of the samples according to their class.

The SVM classifier trained on the bottleneck feature has a perfect result, and the classifier seems to work on unseen samples.

Footnotes

1: ^ The model is based on one of the TensorFlow tutorials on CIFAR-10 classification, with some tweaks to deal with a larger image size.

2: ^ Sometimes it is the other way round: the dimension of the input feature is too small, and we need to apply some transformation to the input feature to expand its dimension. The process of picking a good feature to learn from is called feature engineering. It is a difficult task. One of the reasons why deep learning is so popular is that we can feed raw and generic input to the network, and it can automatically learn good features during training. However, the trade-off is a huge training data set and a long training time.

3: ^ Note that one convolutional layer does not necessarily consist of just one convolution operation; it can also contain multiple convolution operations, pooling operations, or other operations.

4: ^ The algorithm for dimensionality reduction we use is t-SNE.

5: ^ We didn’t use the full data set for the classification. Instead, we removed images with low variety from the data set, which resulted in a data set of around 400 images.