TensorFlow Inception

The TensorFlow Inception model has been trained to recognize objects of ~1000 different classes. If you feed an image to the network, it will spit out the likelihood of each class for the object shown in the image.

To use the Inception model with OpenCV we have to load the binary ‘tensorflow_inception_graph.pb’ and the list of class names from ‘imagenet_comp_graph_label_strings.txt’. You can get these files by downloading and unzipping ‘inception5h.zip’ (see sample code for link).

Classifying objects in an image

To classify the object shown in an image we will write the following helper function:
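A sketch of such a helper could look like this, assuming opencv4nodejs; padToSquare is a utility from the sample code and is not defined here, and net and classNames are the loaded network and class list:

```javascript
// pick the classes whose confidence exceeds minConfidence, sorted descending
const extractResults = (probabilities, classNames, minConfidence) =>
  probabilities
    .map((confidence, idx) => ({ className: classNames[idx], confidence }))
    .filter((res) => res.confidence > minConfidence)
    .sort((r0, r1) => r1.confidence - r0.confidence);

const classifyImg = (cv, net, classNames, img) => {
  // Inception expects 224x224 input: resize the largest dimension to 224 and
  // pad the rest with white pixels (padToSquare is a helper from the sample code)
  const maxImgDim = 224;
  const white = new cv.Vec(255, 255, 255);
  const imgResized = padToSquare(img.resizeToMax(maxImgDim), white);

  // forward pass the blob through the network
  const inputBlob = cv.blobFromImage(imgResized);
  net.setInput(inputBlob);
  const outputBlob = net.forward();

  // the output is a 1xN matrix of class probabilities
  return extractResults(outputBlob.getDataAsArray()[0], classNames, 0.05);
};
```

The interesting logic lives in extractResults, which is plain JavaScript and works on any array of probabilities.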

This function does the following things:

Prepare the input image

First of all, we have to know that the TensorFlow Inception net accepts 224x224 input images. That's the reason why we resize the image such that its largest dimension is 224 and pad the remaining dimension with white pixels, such that width = height (padToSquare).
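The resize and padding arithmetic boils down to scaling by 224 / max(width, height) and filling the shorter side up to 224; here is a small standalone sketch of that math (the function name is mine):

```javascript
// compute the resized dimensions and the total white padding needed for 224x224
const getResizeAndPadding = (width, height, targetDim = 224) => {
  const scale = targetDim / Math.max(width, height);
  const newWidth = Math.round(width * scale);
  const newHeight = Math.round(height * scale);
  return {
    newWidth,
    newHeight,
    padX: targetDim - newWidth,  // white columns to add
    padY: targetDim - newHeight  // white rows to add
  };
};
```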

Pass the image through the network

We can simply create a blob from the image, set it as the network's input (net.setInput) and call net.forward() to forward pass the input and retrieve the output blob.

Extract the result from the output blob

For the purpose of generalization, the output blob is simply expressed as a matrix (cv.Mat) and its dimensionality depends on the model. With Inception it's easy: the blob is simply a 1xN matrix (where N equals the number of classes), which describes the probability distribution over all classes. Each entry holds a floating point number representing the confidence for the corresponding class. The entries add up to 1.0 (100%) in total.

We want to take a closer look at the most probable classes for our image, thus we are looking for the classes with a confidence larger than a minConfidence (5% in this example).

That's easy to achieve: we simply threshold all values in the matrix at 0.05 and find all entries which are not set to zero (findNonZero). Lastly, we sort the result by confidence and return the pairs of className and confidence.

Test it!

Now we will read some sample data that we want the network to recognize:
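Reading the sample images could be sketched like this, assuming opencv4nodejs and that the images live in a ./data folder; the paths are my assumptions, the image names follow from the outputs below:

```javascript
const imageNames = ['banana', 'husky', 'car', 'lenna'];
const toImagePath = (name) => `./data/${name}.jpg`;

// requires opencv4nodejs (npm install opencv4nodejs)
const loadTestImages = (cv) =>
  imageNames.map((name) => ({ name, img: cv.imread(toImagePath(name)) }));
```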

If we run the prediction for each image we will get the following output (or see title image):

banana:

banana (0.95)

husky:

Siberian husky (0.78)

Eskimo dog (0.21)

car:

sports car (0.57)

racer (0.12)

lenna:

sombrero (0.34)

cowboy hat (0.3)

Quite interesting. We get a pretty precise description of the contents of the husky and banana images. For the car we may get different categories of cars, but we can definitely say that a car is shown in the image. Of course the net cannot be trained on infinitely many classes, which is why it does not return a description like "woman" for the last image. However, it recognizes the hat.

COCO SSD

Ok, that worked pretty well, but how do we deal with images that show multiple objects? To recognize multiple objects in a single image, we will utilize what's called a Single Shot Multibox Detector (SSD). In our second example we will look at an SSD model trained on the COCO (Common Objects in Context) dataset. The model we are using has been trained on 84 different classes.

Since this one comes as a Caffe model, we have to load the binary 'VGG_coco_SSD_300x300_iter_400000.caffemodel' as well as the prototxt file 'deploy.prototxt':
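A minimal loading sketch, assuming opencv4nodejs and that both files sit next to the script (the paths are assumptions):

```javascript
// requires opencv4nodejs (npm install opencv4nodejs)
const loadSSD = () => {
  const cv = require('opencv4nodejs');
  const prototxtPath = './deploy.prototxt';
  const modelPath = './VGG_coco_SSD_300x300_iter_400000.caffemodel';
  return cv.readNetFromCaffe(prototxtPath, modelPath);
};
```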

Classification with COCO

Our classify function looks mostly the same as with Inception, but this time the input will be 300x300 images and the output will be a 1x1xNx7 matrix.

The output is a 1x1xNx7 matrix because the detection layer emits a 4D blob, but we are actually only interested in the Nx7 part. To map the 3rd and 4th dimension into a 2D matrix we can use the flattenFloat utility. Comparing this to the Inception output matrix, this time N does not correspond to the number of classes but to the number of objects detected. Furthermore, we now end up with 7 entries per object.
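To illustrate what such a flattening utility does, here is a plain JavaScript sketch (my own stand-in, not the flattenFloat from the sample code): it just reshapes a flat list of N * 7 values into N rows of 7 entries each.

```javascript
// reshape a flat array of N * cols values into N rows of cols entries each
const flattenTo2D = (data, cols) => {
  const rows = [];
  for (let i = 0; i < data.length; i += cols) {
    rows.push(data.slice(i, i + cols));
  }
  return rows;
};
```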

Why 7 entries?

Remember, the problem is a little bit different here. We want to detect multiple objects per image, thus we cannot simply assign a single confidence value to each class. What we actually want is a rectangle indicating each object's location in the image. Below you can find what each entry corresponds to:

0. the index of the image in the input batch (always 0 here, since we pass a single image)

1. the class label of the object

2. its confidence

3. leftmost x of the rectangle

4. top y of the rectangle

5. rightmost x of the rectangle

6. bottom y of the rectangle

Note that the rectangle coordinates are relative to the image dimensions (values between 0 and 1).
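Assuming the usual SSD output layout (image index, class id, confidence, then relative box coordinates), mapping one 7-entry row to a labeled, absolute rectangle could look like this (the function and field names are mine):

```javascript
// turn one 7-entry SSD prediction row into a labeled, absolute rectangle
const toDetection = (entry, classNames, imgWidth, imgHeight) => {
  const [, classId, confidence, left, top, right, bottom] = entry;
  return {
    className: classNames[classId],
    confidence,
    rect: {
      x: Math.round(left * imgWidth),
      y: Math.round(top * imgHeight),
      width: Math.round((right - left) * imgWidth),
      height: Math.round((bottom - top) * imgHeight)
    }
  };
};
```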

The output matrix gives us quite some information about the result, which is pretty neat. We can filter the result by confidence again and draw a rectangle into the image for each recognized object.

Let’s see it in action!

For the sake of simplicity I will skip the code for drawing the rectangles and all the other stuff for visualization. If you want to know how to do that, you can look at the sample code.

Let’s feed an image with cars into the network and filter the result for detections with the className ‘car’:

Nice! And now something more difficult. Let’s try uhmm… a breakfast table maybe?

There we go!

Some final words

That's how you can use OpenCV and Node.js to recognize objects in images with neural nets. If you want to play around with it, I would recommend checking out the Caffe Model Zoo, which offers a bunch of trained models for different use cases that you can simply download.

If you did some awesome stuff with DNNs in OpenCV, I would love to hear about it! Feel free to leave a comment below.

If you liked this article, feel free to clap and comment. I would also highly appreciate you supporting the opencv4nodejs project by leaving a star on GitHub. Furthermore, feel free to contribute or get in touch if you are interested :).