
INTRODUCTION

Machine learning is becoming more and more popular in applications: intelligent YouTube or Netflix recommendations, live text translation by Google Translate. Combining the power of mobile with artificial intelligence and machine learning leads to a great user experience. However, since training models is a very computationally complex process and smartphones are low-power devices, machine learning for mobile will inevitably require training on a local computer or server.

RECOGNITION MODELS

Accurate modern object recognition models may contain millions of parameters. For example, Google’s Inception-v3 model, shown in [Fig. 1] (where one block represents one layer), is able to distinguish between a spotted salamander and a fire salamander [Fig. 2].

Fig. 1: Inception-v3 diagram

Fig. 2: Photos of spotted and fire salamanders

Unfortunately, training such complex models requires huge computing power; e.g., Inception-v3 requires two weeks of training with 8 NVIDIA Tesla K40 graphics cards. To accelerate the process, Google has released a pre-trained version of the Inception model that can be adapted to a new task. This process is called transfer learning: instead of training from scratch, only the weights of the last layers are retrained to recognize new objects. It’s not as effective as training from scratch, but surprisingly effective for many applications. Best of all, it can achieve satisfactory results in approximately 30 minutes on a laptop, without requiring a GPU.

SIZE PROBLEM

Inception-v3 is a great model, but it is slow and bulky for mobile devices. It occupies a lot of space and memory (almost 100 MB), and processing a single 224×224 input image takes 200-300 ms on a decent phone (Nexus 5). Fortunately, Google has also released models optimized for mobile: “MobileNet”.

MobileNets are a class of convolutional neural networks created to be fast, resource-efficient and reasonably accurate. (More info: https://arxiv.org/pdf/1704.04861.pdf)

Google released many types of MobileNet [Fig. 3]:

Fig. 3: MobileNet model types

Where:

MACs (Multiply Accumulates) – proportional to the required computing power,

parameters – proportional to memory usage.

Additionally, every model comes in normal and quantized variants. A quantized model uses 8-bit weights instead of 32-bit ones. As a result, the model size decreases by up to 75% (at the cost of slightly worse accuracy), and thanks to the 8-bit computation, the processing time decreases as well.

GATHER TRAINING DATA

To get started, we need training data for the objects we want to recognize: at least 1000 images of every object. To make this process faster, we can record a video and split it into frames. I will use FFmpeg for this.

If the movie resolution is high, we should reduce it first. With FFmpeg, we can call the command below:

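The embedded Gist is no longer available; a minimal sketch of such a scaling command, assuming the recording is named movie.mp4 and the result should be resized.mp4 (both names are illustrative):

```shell
# Scale movie.mp4 down to the desired width; the height value of -1
# tells FFmpeg to adjust the height and preserve the aspect ratio.
DESIRED_WIDTH=500
ffmpeg -i movie.mp4 -vf "scale=${DESIRED_WIDTH}:-1" resized.mp4
```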

If we pass desired_width as 500, FFmpeg will scale the width down to 500 px, and because of the passed height value of -1, it will automatically adjust the height to maintain the aspect ratio.

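The original Gist is missing; one way to confirm the new resolution, assuming ffprobe (shipped with FFmpeg) and a scaled file named resized.mp4:

```shell
# Print the width and height of the first video stream, e.g. "500,281".
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height -of csv=p=0 resized.mp4
```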

Finally, we can split the movie into frames with:

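The Gist itself is gone; a sketch of the splitting command, assuming the (scaled) movie is named resized.mp4 and the frames should land in a frames/ folder:

```shell
# Extract one frame per second of video (fps=1) as numbered JPEGs.
mkdir -p frames
ffmpeg -i resized.mp4 -vf fps=1 frames/frame_%04d.jpg
```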

If the movie is recorded at 30 fps and we pass an fps value of:

30 – it will return images of every frame,

15 – it will return every second frame,

1 – it will return one frame every second of the movie.

This process should be repeated for every object we want to recognize.

TRAINING TIME

I assume that you have already installed TensorFlow. If not, please follow this guide: https://www.tensorflow.org/install/.

To start retraining, execute the retrain.py script:

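The original Gist is unavailable; a hedged sketch of the invocation, with illustrative values (the image_dir path and all numbers are my assumptions, not the article’s original choices):

```shell
# Retrain the final layer on our own images; the flags are explained below.
# architecture is deliberately left out, so Inception-v3 is used by default.
python retrain.py \
  --image_dir=training_images \
  --learning_rate=0.01 \
  --testing_percentage=10 \
  --validation_percentage=10 \
  --train_batch_size=100 \
  --validation_batch_size=-1 \
  --eval_step_interval=100 \
  --how_many_training_steps=4000
```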

Where:

image_dir – a path to the folder with the structure like this:

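The Gist with the layout is missing; a sketch of the expected structure, using hypothetical label names (each sub-folder becomes a label and holds that object’s training images):

```shell
# Create an example image_dir layout: one sub-folder per object to recognize.
mkdir -p training_images/fire_salamander training_images/spotted_salamander
# training_images/
# ├── fire_salamander/      <- frame_0001.jpg, frame_0002.jpg, ...
# └── spotted_salamander/   <- frame_0001.jpg, frame_0002.jpg, ...
```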

learning_rate – controls the size of the updates to the final layer during training,

testing_percentage – what percentage of images to use as a test set,

validation_percentage – what percentage of images to use as a validation set,

train_batch_size – how many images to train on at a time,

validation_batch_size – how many images to use in an evaluation batch. This validation set is used much more often than the test set, and is an early indicator of how accurate the model is during training. A value of -1 causes the entire validation set to be used, which leads to more stable results across training iterations, but may be slower on large training sets,

flip_left_right – whether to randomly flip half of the training images horizontally,

random_scale – a percentage determining how much to randomly scale up the size of the training images,

random_brightness – a percentage determining how much to randomly multiply the training image input pixels up or down,

eval_step_interval – how often to evaluate the training results,

how_many_training_steps – how many training steps to run before ending,

architecture – the name of a model architecture (which will be downloaded automatically).

At first, I recommend leaving the architecture parameter blank; the Inception-v3 model will then be selected by default. This will verify whether the quality of your training data is sufficient. If the accuracy is satisfactory, you can try the smaller MobileNet architectures.

We can observe the learning process in the console window, or graphically in the form of graphs [Fig. 4], by calling:

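The Gist is not available; assuming the default summaries location that retrain.py writes to (/tmp/retrain_logs), the command is likely:

```shell
# Start TensorBoard on the training summaries written by retrain.py.
tensorboard --logdir /tmp/retrain_logs
```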

And opening http://localhost:6006/ in our browser.

Fig. 4: TensorBoard interface

On completion of the learning process, the model will be saved to /tmp/output_graph.pb and the labels file to /tmp/output_labels.txt.

As you can see, retraining a model to recognize custom objects is pretty easy and takes less than an hour, including learning time, on a decent laptop. In the next article, I will show how to make use of the generated model to visualize results of recognized objects.