This text will be useful for mobile developers who want to train existing ML models on custom data and use them to build mobile apps.

Why isn’t my computer’s performance enough to train an ML model?

If you want to train a perceptron to perform the XOR operation, you can do that even on an old mobile device.

But some image-recognition solutions require much more computing power. For example, it takes weeks (or even months) of computation on a powerful CPU to train YOLO (You Only Look Once, an object-detection algorithm). On top-end GPUs, the training time can be reduced from days to hours. Of course, you can also spend thousands of dollars on the latest Nvidia Tesla GPU, but if you don’t do this kind of work regularly, it may be too costly. Sometimes the computation for ML algorithms also has to be parallelized across several such GPUs. Therefore, it’s often more appropriate to use cloud computing.

What determines the training time of a model?

It depends on many parameters: the size of the dataset, the number of weights (the trainable parameters of the neural network), the number of iterations, etc. Training a neural network can be described as “weight calibration”, and the array of these weights plus the structure of the network form a pre-trained model, which is what will be loaded onto the mobile device in our example.

What are epoch, step, iteration, loss, batch size, tensor shape, and over-fitting?

Gradient descent and backpropagation algorithms are widely used to train neural networks.

The dataset is divided into several batches; the batch size is the number of samples in each batch.

One epoch is when the entire dataset is passed forward and backward (computing backpropagation) through the neural network exactly once.

To find the weight values with the lowest error, the optimizer moves along the loss surface in the direction opposite to the gradient (a vector that points in the direction of steepest increase) by a step, and this takes many iterations. The number of iterations is the number of batches required to complete one epoch.

Loss is a number computed by a loss function; it indicates how badly the model predicts the target value. If the model’s prediction is perfect (which is unlikely), the loss is zero; otherwise, it is greater than zero.
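To tie these terms together, here is a minimal sketch of a training loop in plain Python/NumPy (this is an illustration, not code from the article; the toy dataset, batch size, and learning rate are made up):

import numpy as np

# Toy dataset: learn y = 2*x using a single weight w
x = np.linspace(0.0, 1.0, 16)          # 16 samples
y = 2.0 * x
w = 0.0                                # the single "weight" we calibrate
batch_size = 4                         # 16 / 4 = 4 iterations per epoch
learning_rate = 0.5                    # the size of one step

for epoch in range(5):
    for i in range(0, len(x), batch_size):           # one iteration per batch
        xb, yb = x[i:i + batch_size], y[i:i + batch_size]
        pred = w * xb
        loss = np.mean((pred - yb) ** 2)             # mean squared error
        grad = np.mean(2.0 * (pred - yb) * xb)       # d(loss)/d(w)
        w -= learning_rate * grad                    # move against the gradient
    print(f"epoch {epoch}: loss = {loss:.5f}, w = {w:.3f}")

With each epoch the printed loss decreases as w approaches 2.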

Every artificial neural network has an input and an output, so to “feed” it data (and get the output value), the data has to be converted to the appropriate format: an N-dimensional array (a tensor). The shape is the number of elements in each of its dimensions.

Converting an N-dimensional array to a tensor using TensorFlow in the Python console

For example, a picture made of only 4 pixels (green, black, blue, red) can be represented as an N-dimensional array:

[
  [ [0, 255, 0], [0, 0, 0] ],
  [ [0, 0, 255], [255, 0, 0] ]
]

or, normalized to the 0-1 range:

[
  [ [0, 1.0, 0], [0, 0, 0] ],
  [ [0, 0, 1.0], [1.0, 0, 0] ]
]

with the shape [2, 2, 3] (height, width, RGB channels).
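As the console caption above suggests, such an array can be converted to a tensor; here is a minimal sketch with TensorFlow (an illustration, not the article’s original console session):

import tensorflow as tf

# The normalized 2x2 RGB picture from above
pixels = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
]

tensor = tf.convert_to_tensor(pixels, dtype=tf.float32)
print(tensor.shape)   # (2, 2, 3): height, width, RGB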

Let’s plot the function `f(x) = x` with red dots whose placement isn’t quite accurate, and suppose you want to build an ML model that draws the rest of the chart.

Over-fitting is when your model fits the training data too closely; in this example, it follows every irregularity of the red dots you drew. Further drawing will not be successful or accurate, that is, the model makes big mistakes on the test data.

An appropriate fit is when your model correctly captures the patterns in your data; in this example, it can correctly plot the rest of the chart, that is, it works well on the test data.

Under-fitting is when your model performs poorly on both training and test data.

But sometimes, in real life, it’s acceptable to use models that could be called “over-fitted”, which work well only on a certain range of data. For example, the formula for adding velocities (at the speeds familiar from everyday life) is quite simple: it’s just addition, v = v1 + v2. But over a much wider range of velocities, up to the speed of light, it takes a more complex form: v = (v1 + v2) / (1 + v1*v2/c^2). So the first formula works only on a limited range of data, while the second works on a much wider range that includes the first. Yet quite often the second formula can be neglected and, for the sake of simplicity, the first one can be used.
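As a rough numeric illustration of that point (a sketch, not from the article):

# Classical vs. relativistic velocity addition
c = 299_792_458.0                       # speed of light, m/s

def add_classical(v1, v2):
    return v1 + v2

def add_relativistic(v1, v2):
    return (v1 + v2) / (1.0 + v1 * v2 / c ** 2)

# At everyday speeds the two formulas are practically identical:
print(add_classical(30.0, 20.0))        # 50.0 m/s
print(add_relativistic(30.0, 20.0))     # 49.99999999999... m/s

# Near the speed of light only the second formula gives a physical result:
print(add_classical(0.8 * c, 0.8 * c) / c)      # 1.6 (faster than light)
print(add_relativistic(0.8 * c, 0.8 * c) / c)   # ~0.976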

Step 1. Prepare your project for training on Google Cloud.

For example, I chose an open-source project for recognizing objects and their coordinates on a photo — YOLO v3 (Keras). First, let’s edit the structure of our project:

trainer        # directory with the training module
    __init__.py
    ...        # files of our open-source project go here
setup.py       # dependencies

We will use Google Cloud Storage to store our large dataset. Let’s create a storage bucket using this command in the Google Cloud console:

gsutil mb -p [PROJECT_NAME] -c [STORAGE_CLASS] -l [BUCKET_LOCATION] -b on gs://[BUCKET_NAME]/

Where PROJECT_NAME is the name of our project on Google Cloud.

STORAGE_CLASS is the storage class: Multi-Regional, Regional, Nearline, or Coldline Storage. You can read more about storage classes here.

BUCKET_LOCATION is the location of our storage bucket.

For my example, I used the following parameters: storage class: Coldline, region: us-east1.

Next, you need to download the dataset. I used the Pascal VOC dataset.

To copy these files to Cloud Storage via the Google Cloud console, use the command:

gsutil -m cp -R [SOURCE_LOCAL_LOCATION] gs://[BUCKET_NAME]

All gsutil commands are described here.

It’s desirable to do all file I/O operations through the bucket, which is conveniently supported by the tensorflow.python.lib.io module.

See this file in the repository for details.
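For illustration, here is a minimal sketch of such I/O through that module (the bucket name and paths are hypothetical):

from tensorflow.python.lib.io import file_io

# Read the annotation file directly from the bucket
with file_io.FileIO('gs://my-bucket/2012_train.txt', mode='r') as f:
    annotation_lines = f.readlines()

# Copy a locally saved checkpoint back to the bucket
file_io.copy('logs/trained_weights.h5',
             'gs://my-bucket/output/trained_weights.h5',
             overwrite=True)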

For this particular example — YOLO training — we still need to create a train-file that contains the paths to the pictures from the dataset, the coordinates of objects on them and their type (class):
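Each line of that file is expected to look roughly like this (the path is illustrative; there is one space-separated group per object on the photo):

path/to/image1.jpg x11,y11,x12,y12,some_class x21,y21,x22,y22,some_class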

Here x11, y11, x12, y12 are the coordinates of an object’s bounding rectangle on the photo, and some_class is the object’s class (a number; all the classes can be viewed in classes.txt).

To automate this process, there is a voc_annotation.py script in the repository.

2012_train.txt, the result of voc_annotation.py

We now have the dataset and annotation files stored in the Cloud Storage bucket.

Step 2. Create a Cloud ML training job.

To create a Cloud ML job, we run a job-submission command in the Google Cloud console.

I’ve written a separate bash script for that.
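The script essentially wraps the job-submission command; roughly, it looks like this (the job name, module path, and config file are assumptions, and the exact flags depend on your gcloud version):

gcloud ml-engine jobs submit training yolo_training_1 \
    --module-name trainer.train \
    --package-path trainer \
    --job-dir gs://[BUCKET_NAME]/output \
    --region us-east1 \
    --config config.yaml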

If you look at the logs, you will see that the loss value decreases step by step with each epoch.

The best loss value is close to zero.

After training, we will find our pre-trained model files in the Cloud Storage bucket.

You can also see the full code in this repository — https://github.com/dneprDroid/keras-yolo3

Step 3. Load the model on your mobile device.

Let’s consider two cases:

— CoreML (iOS/macOS)

— Metal Performance Shaders (iOS/macOS)

3.1 CoreML

To load the model with CoreML, we first need to convert it to the appropriate format (*.mlmodel).

For Keras (*.h5):
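A minimal conversion sketch with coremltools (the file path, input/output names, and scaling are assumptions, not the article’s exact script):

import coremltools

coreml_model = coremltools.converters.keras.convert(
    'model_data/yolo.h5',            # trained Keras model
    input_names='input1',
    image_input_names='input1',      # expose the input as an image (CVPixelBuffer on iOS)
    image_scale=1.0 / 255.0,         # normalize pixel values to 0..1, as during training
    output_names=['output1', 'output2', 'output3'],
)
coreml_model.save('Yolov3.mlmodel')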

For Tensorflow (*.pb, *.proto):
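A similar sketch with the tfcoreml converter (the frozen-graph path and tensor names are assumptions; the variable names follow the description below):

import tfcoreml

input_tensor_shapes = {'input1:0': [1, 416, 416, 3]}          # [batch, height, width, RGB]
output_tensor_names = ['output1:0', 'output2:0', 'output3:0']

tfcoreml.convert(
    tf_model_path='model_data/yolo.pb',
    mlmodel_path='Yolov3.mlmodel',
    input_name_shape_dict=input_tensor_shapes,
    output_feature_names=output_tensor_names,
)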

Here input_tensor_shapes holds the shape of the input N-dimensional array; in the case of YOLO v3 (tiny) it is [416, 416, 3], in the format [height, width, RGB channels].

And output_tensor_names holds the names of the output tensors; in the case of YOLO they are output1, output2, and output3, with shapes [13, 13], [26, 26], and [52, 52] respectively.

When you add a CoreML model to the Xcode project, the Yolov3 class is auto-generated. Let’s see how and where our model loads.

As we can see, the model is loaded from the Yolov3.mlmodelc directory in the app bundle, which stores files including model.espresso.net (the model structure) and model.espresso.weights (the weights).

It should be mentioned that the model files are not encrypted, so someone could easily “steal” them for use in another application.

CoreML model in the app bundle

To process a picture, we call the generated prediction method, passing it a CVPixelBuffer.

You can get a CVPixelBuffer from the camera video stream in the captureOutput delegate method of AVCaptureVideoDataOutputSampleBufferDelegate.

It should be mentioned that if you take a CVPixelBuffer from an AVCaptureSession (AVFoundation), it uses the RGB color model.

But if you take a CVPixelBuffer from [ARFrame capturedImage] (ARKit), it uses the YUV format.

RGB vs YUV

Here’s what your neural network can “see” when you make a mistake and pass YUV data instead of the expected RGB:

YUV image is displayed as RGB

And yes, this can adversely affect the results and quality of recognition. Therefore, the image should always be converted to the expected color format and resized to the expected input size.

The full code is available in the repository here.

CoreML can run on both the CPU and the GPU. The GPU implementation is based on Metal Performance Shaders.

3.2 Metal Performance Shaders

The Metal Performance Shaders framework contains a collection of highly optimized compute and graphics shaders that are designed to integrate easily and efficiently into your Metal app. These data-parallel primitives are specially tuned to take advantage of the unique hardware characteristics of each GPU family to ensure optimal performance.

If you look at the stack trace of your CoreML app, you can see classes from Espresso (Apple’s internal C++ ML framework) and Metal command buffers.

Let’s consider how to run our model on a Metal command buffer.

The computational graph of our model is created sequentially using MPS nodes.

SomeNode1, SomeNode2, …, SomeNodeN are node classes; there are input and output nodes. Some node classes are MPSCNNPoolingMaxNode (max pooling), MPSCNNConvolutionNode (convolution), MPSCNNNeuronReLUNode (ReLU activation), and others.

The node graph of our ML model can be viewed with Python via keras.utils.plot_model (for Keras models). For example, the graph of the YOLO model looks like this.
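For example, a minimal sketch with Keras (the model path is an assumption; plot_model also needs pydot and graphviz installed):

from keras.models import load_model
from keras.utils import plot_model

model = load_model('model_data/yolo.h5')   # pre-trained Keras model
plot_model(model, to_file='yolo_graph.png', show_shapes=True)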

A device (MTLDevice) defines the interface to a GPU and can be used for graphics and parallel computing.

To work through the GPU-CPU shared memory space, the input data should be prepared by converting it to an MTLTexture and then creating an MPSImage from it.

After calling [MPSNNGraph executeAsync], our entire graph is encoded into an MTLCommandBuffer for execution on the GPU.

In the completion callback of [MPSNNGraph executeAsync], we get outputImage as an MPSImage and can then copy values from it.

A detailed example of using Metal Performance Shaders for YOLO can be found in this repository.

If you don’t want your ML model to be extracted and easily integrated into someone else’s app, you should use Metal Performance Shaders directly instead of CoreML.