Here’s what we’re building:

The TensorFlow Object Detection API lets you recognize the location of objects in an image, which can lead to some super cool applications. But because I spend more time taking pictures of people than of things, I wanted to see if the same technology could be applied to recognizing faces. Turns out it worked pretty well! I used it to build the Taylor Swift detector pictured above.

In this post I’ll outline the steps I took to get from a collection of T-Swift images to an iOS app that made predictions against a trained model:

1. Preprocess images: resize, label, split them into training and test sets, and convert them to the Pascal VOC format
2. Convert the images and labels to TFRecords for feeding to the Object Detection API
3. Train the model on Cloud ML Engine using MobileNet
4. Export the trained model and deploy it to ML Engine for serving
5. Build an iOS frontend that makes prediction requests against the trained model (in Swift, obviously)

Here’s an architecture diagram of how it all fits together:

And if you’d rather skip to the code, you can find it on GitHub.

Looking at it now, it all seems so simple

Before I dive into the steps, it would help to explain some of the technology and terms we’ll be using.

The TensorFlow Object Detection API is a framework built on top of TensorFlow for identifying objects in images. For example, you can train it with lots of photos of cats, and once it’s trained you can pass in an image of a cat and it’ll return a list of rectangles where it thinks there’s a cat in the image. And while it has API in the name, you can think of it more as a set of handy utilities for transfer learning.

But training a model to recognize objects in an image takes time and tons of data. The coolest part of the Object Detection API is that it supports five pre-trained models for transfer learning. Here’s an analogy to help you understand how transfer learning works: when a child is learning their first language, they are exposed to many examples and corrected if they misidentify something. For example, the first time they learn to identify a cat they’ll see their parents point to the cat and say the word “cat,” and this repetition strengthens pathways in their brain. When they then learn how to identify a dog, the child doesn’t need to start from scratch. They can use a similar recognition process to the one they used for the cat, but apply it to a slightly different task. That’s how transfer learning works too.

I don’t have time to find and label thousands of TSwift images, but I can make use of the features those models learned from millions of images by modifying just their last few layers and applying them to my specific detection task (finding TSwift).

Step 1: Preprocessing images

Big thank you to Dat Tran, who wrote this awesome post on training a raccoon detector with TF Object Detection. I followed his blog post for labeling images and converting them to the correct format for TensorFlow. His post has the details; I’ll summarize my steps here.

My first step was downloading 200 images of Taylor Swift from Google Images. Turns out there’s a Chrome extension for that — it’ll download all results from a Google Images search. Before labeling my images I split them into two datasets: train and test. I reserved the test set to test the accuracy of my model on images it didn’t see during training. Per Dat’s recommendations, I wrote a resize script to make sure none of the images were wider than 600px.

Because the Object Detection API tells us where our object is in the image, we can’t just pass it images and labels as training data. We also need to pass it bounding boxes identifying where the object is in each image, along with the label associated with each bounding box (in our dataset we’ll only have one label, tswift).

To generate the bounding boxes for our images I used LabelImg, as recommended in Dat’s raccoon detector blog post. LabelImg is a Python program that lets you hand label images and generates an xml file for each image with the bounding boxes and associated labels (I did spend an entire morning labeling tswift images while people walked by my desk with concerned glances). Here’s how it works — I define the bounding box on an image and give it the label tswift:

Then LabelImg generates an xml file that looks like the following:
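
(A minimal sketch of the Pascal VOC-style output; the filename and pixel coordinates are illustrative, and a few fields LabelImg also writes are omitted:)

<annotation>
  <folder>images</folder>
  <filename>tswift1.jpg</filename>
  <size>
    <width>600</width>
    <height>400</height>
    <depth>3</depth>
  </size>
  <object>
    <name>tswift</name>
    <bndbox>
      <xmin>120</xmin>
      <ymin>45</ymin>
      <xmax>380</xmax>
      <ymax>360</ymax>
    </bndbox>
  </object>
</annotation>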

Now I have an image, a bounding box, and a label but I need to convert this into a format that TensorFlow accepts — a binary representation of this data called a TFRecord. I wrote a script to do this based on the guide provided in the Object Detection repo. To use my script, you’ll need to clone the tensorflow/models repo locally and package the Object Detection API:

# From tensorflow/models/research/
python setup.py sdist
(cd slim && python setup.py sdist)

Now you’re ready to run the TFRecord script. Run the command below from the tensorflow/models/research directory, and pass it the following flags (run it twice: once for training data, once for test data):

python convert_labels_to_tfrecords.py \
--output_path=train.record \
--images_dir=path/to/your/training/images/ \
--labels_dir=path/to/training/label/xml/

Step 2: Training a TSwift detector on Cloud Machine Learning Engine

I could train this model on my laptop, but that would take time and lots of resources, and if I had to put my computer away and do something else the training job would abruptly stop. That’s what the cloud is for! We can leverage the cloud to run our training across many cores to get the entire job done in a few hours. And when I use Cloud ML Engine I can run a training job even faster by leveraging GPUs (graphics processing units), which are specialized silicon chips that excel at the type of computations our model performs. Utilizing this processing power, I can kick off a training job and then go jam out to TSwift for a few hours while my model trains.

Setting up Cloud ML Engine

With all my data in TFRecord format I’m ready to upload it to the cloud and start training. First I created a project in the Google Cloud console and enabled Cloud ML Engine:

Then I’ll create a Cloud Storage bucket to package up all the resources for my model. Make sure to specify a region for the bucket (don’t choose multi-regional):

I’ll create a /data subdirectory within this bucket to put the training and test TFRecord files:

The Object Detection API also needs a pbtxt file that maps labels to an integer ID. Since I only have one label this will be very short:
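
(A minimal sketch; the id must start at 1, and the name must match the label used in the xml annotations:)

item {
  id: 1
  name: 'tswift'
}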

Adding the MobileNet checkpoints for transfer learning

I’m not training this model from scratch so when I run training I’ll need to point to the pre-trained model I’ll be building on. I chose to use a MobileNet model — MobileNets are a series of small models optimized for mobile. While I won’t be serving my model directly on a mobile device, MobileNet will train quickly and allow for faster prediction requests. I downloaded this MobileNet checkpoint for use in my training. A checkpoint is a binary file that contains the state of a TensorFlow model at a specific point in the training process. After downloading and unzipping the checkpoint, you’ll see that it contains three files:
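
(These are the standard TensorFlow checkpoint shards, named along the lines of:)

model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta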

I’ll need all of them to train the model, so I put them in the same data/ directory in my Cloud Storage bucket.

There’s one more file to add before running the training job. The Object Detection script needs a way to find our model checkpoint, label map, and training data. We’ll do that with a config file. The TF Object Detection repo has sample config files for each of the five pre-trained model types. I used the one for MobileNet here and updated all of the PATH_TO_BE_CONFIGURED placeholders with the corresponding paths in my Cloud Storage bucket. In addition to connecting my model to the data in Cloud Storage, this file also configures several hyperparameters for my model like convolution size, activation functions, and steps.
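
To give a sense of what that looks like, here is roughly the shape of the path-related sections after replacing the placeholders (bucket paths and file names are illustrative; the full MobileNet config contains many more settings):

fine_tune_checkpoint: "gs://${YOUR_GCS_BUCKET}/data/model.ckpt"

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://${YOUR_GCS_BUCKET}/data/train.record"
  }
  label_map_path: "gs://${YOUR_GCS_BUCKET}/data/label_map.pbtxt"
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://${YOUR_GCS_BUCKET}/data/test.record"
  }
  label_map_path: "gs://${YOUR_GCS_BUCKET}/data/label_map.pbtxt"
}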

Here are all the files that should be in my /data Cloud Storage bucket before I begin training:
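
(Roughly; file names are illustrative:)

model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
ssd_mobilenet_v1_coco.config
label_map.pbtxt
train.record
test.record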

I’ll also create train/ and eval/ subdirectories in my bucket — this is where TensorFlow will write my model checkpoint files while running training and evaluation jobs.

Now I’m ready to run training, which I can do through the gcloud command line tool. Note that you need to clone the tensorflow/models repo locally and run the training command from the research directory:
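
(A sketch of that command, assuming the bucket layout above, the tarballs packaged in Step 1, and an illustrative config file name; the sample cloud.yml in the Object Detection repo configures the GPU workers:)

gcloud ml-engine jobs submit training tswift_object_detection_$(date +%s) \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
-- \
--train_dir=gs://${YOUR_GCS_BUCKET}/train \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/ssd_mobilenet_v1_coco.config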

While training was running, I also kicked off the evaluation job. This evaluates the accuracy of my model using data it hasn’t seen before:
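
(The evaluation command follows the same pattern, again with illustrative names; evaluation only needs a single GPU, hence the BASIC_GPU scale tier:)

gcloud ml-engine jobs submit training tswift_object_detection_eval_$(date +%s) \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=gs://${YOUR_GCS_BUCKET}/train \
--eval_dir=gs://${YOUR_GCS_BUCKET}/eval \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/ssd_mobilenet_v1_coco.config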

You can verify that your job is running correctly and inspect the logs for a specific job by navigating to the Jobs section of ML Engine in your Cloud console:

Step 3: Deploying the model to serve predictions

To deploy the model to ML Engine I need to convert my model checkpoints to a ProtoBuf. In the train/ subdirectory of my bucket, I can see checkpoint files saved from a few points throughout my training process:

The first line of the checkpoint file tells me the latest checkpoint path — I’ll download the three files from that checkpoint locally. There should be a .index, .meta, and .data file for each checkpoint. With these saved in a local directory, I can use Object Detection’s handy export_inference_graph script to convert them to a ProtoBuf. To run the script below, you’ll need to define the local path to your MobileNet config file, the checkpoint number of the model checkpoint you downloaded from the training job, and the name of the directory you’d like the exported graph to be written to:
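
(A sketch of that invocation, with placeholders for the values you need to define; encoded_image_string_tensor is used as the input type so the served model can accept base64-encoded images later:)

# From tensorflow/models/research/
python object_detection/export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path ${LOCAL_PATH_TO_MOBILENET_CONFIG} \
--trained_checkpoint_prefix ${LOCAL_CHECKPOINT_DIR}/model.ckpt-${CHECKPOINT_NUMBER} \
--output_directory ${PATH_TO_EXPORT_DIRECTORY}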

After this script runs, you should see a saved_model/ directory inside the output directory you specified. Upload the saved_model.pb file (don’t worry about the other generated files) to the /data directory in your Cloud Storage bucket.

Now you’re ready to deploy the model to ML Engine for serving. First, use gcloud to create your model:

gcloud ml-engine models create tswift_detector

Then create the first version of your model by pointing it to the saved model ProtoBuf you just uploaded to Cloud Storage:

gcloud ml-engine versions create v1 --model=tswift_detector --origin=gs://${YOUR_GCS_BUCKET}/data --runtime-version=1.4

Once the model deploys I’m ready to use ML Engine’s online prediction API to generate a prediction on a new image.

Step 4: Building a prediction client with Firebase Functions and Swift

I wrote an iOS client in Swift for making prediction requests against my model (because why write a TSwift detector in any other language?). The Swift client uploads an image to Cloud Storage, which triggers a Firebase Function that makes the prediction request in Node.js and saves the resulting prediction image and data to Cloud Storage and Firestore.

First, in my Swift client I added a button for users to access their device’s photo library. Once the user selects a photo, this triggers an action which uploads the image to Cloud Storage:

Next I wrote the Firebase Function triggered on uploads to the Cloud Storage bucket for my project. It takes the image, base64 encodes it, and sends it to ML Engine for prediction. You can find the full function code here. Below I’ve included the part of the function where I make a request to the ML Engine prediction API (thank you to Bret McGowen for his expert Cloud Functions help on getting this working!):

In the ML Engine response, we get:

detection_boxes, which we can use to define a bounding box around Taylor if she was detected in the image

detection_scores, which give a confidence value for each detection box. I’ll only include detections that have a score higher than 70%

detection_classes, which tells us the label ID associated with the detection. In this case it will always be 1, since there’s only one label
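
For context, the prediction response is shaped roughly like this (values are illustrative; detection_boxes are normalized [ymin, xmin, ymax, xmax] coordinates):

{
  "predictions": [
    {
      "detection_boxes": [[0.12, 0.28, 0.87, 0.71], ...],
      "detection_scores": [0.94, ...],
      "detection_classes": [1, ...],
      "num_detections": 100
    }
  ]
}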

In the function I use detection_boxes to draw a box on the image if Taylor was detected, along with the confidence score. Then I save the new boxed image to Cloud Storage, and write the image’s filepath to Cloud Firestore so I can read the path and download the new image (with the rectangle) in my iOS app:

Finally, in my iOS app I can listen for updates to the Firestore path for the image. If a detection was found, I’ll download the image and display it in my app along with the detection confidence score. This function will replace the comment in the first Swift snippet above:

Woohoo! We’ve got a working Taylor Swift detector. Note that the focus here was not on accuracy (I only had 140 images in my training set) so the model did incorrectly identify some images of people you might mistake for tswift. But if I find time to hand label more images I will update the model and publish the app in the App Store :)

What’s next?

This post covered a lot of information. Want to build your own? Here’s a breakdown of the steps with links to resources:

Preprocessing data: I followed Dat’s blog post on using LabelImg to hand label images and generate xml files with bounding box data. Then I wrote this script to convert the labeled images to TFRecords

Training and evaluating an Object Detection model: using the approach from this blog post, I uploaded training and test data to Cloud Storage and used ML Engine to run training and evaluation

Deploying the model to ML Engine: I used the gcloud CLI to deploy my model to ML Engine

Making prediction requests: I used the Firebase SDK for Cloud Functions to make an online prediction request to my ML Engine model. This request was triggered by an upload to Firebase Storage from my Swift app. In my function, I wrote the prediction metadata to Firestore.

Have questions or topics you’d like me to cover in future posts? Leave a comment or find me on Twitter @SRobTweets.