Gesture-Based Interface

We’ll use machine learning to train a model that recognizes simple hand gestures. This will let us interact with an app that displays a web browser in AR.

1 — Training a Machine Learning Model

Usually, when training a machine learning model, we need a large dataset to avoid biased recognition. If we submit only a few pictures taken from similar angles or under similar light/shadow conditions, the recognition might either fail on the cases we care about, or match our gesture along with many other cases we don’t want.

Once we gather enough images to compose our dataset, we can train a model using tools like Turi Create, Caffe, or TensorFlow. If you want to train a model online, you can use Custom Vision:

Here’s a list of custom Core ML models where you can find many nice trained models ready to go: https://github.com/likedan/Awesome-CoreML-Models

If you’re using CustomVision.ai, use the “General Compact” domain, and once you’ve submitted the whole dataset and tagged the pictures of each gesture, you can then train the model and export it to Core ML format.

I learned this approach from this repository: https://github.com/hanleyweng/Gesture-Recognition-101-CoreML-ARKit. You can even re-use its Core ML model, which is what I’ll do here.

Now that we’ve trained and exported our model, keep it somewhere safe, because now it’s time to…

2 — Use ARKit to Create an AR iOS App

Let’s create an iOS project using the Augmented Reality template.

I’m not a big fan of using templates because they come with too much boilerplate code, but for the purposes of this tutorial, we’ll use it. If you want to understand step-by-step how to start an ARKit project, check out my other article on this topic:

I haven’t written a single line of code and there are already 81 lines.

As I said, these templates have a lot of boilerplate code. Drag and drop your model into the project navigator. If you’re re-using the model from the project I mentioned (as I am), it’ll look like this.

Make sure ‘Target Membership’ is checked for your app’s target.

Let’s clean this up… All this boilerplate code is just noise we don’t need. Delete every function except viewWillAppear and viewWillDisappear.

This is what our code looks like right now.

Note that I also removed the delegate protocol conformance and the unused imports of UIKit and SceneKit, since we’re already importing ARKit. We’ll use the Vision framework for the image analysis, so go ahead and add:

import Vision
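For reference, here’s a rough sketch of what the stripped-down view controller might look like at this point (the sceneView outlet comes from the template; everything else has been trimmed away):

import ARKit
import Vision

class ViewController: UIViewController {

    // Connected by the template in Interface Builder
    @IBOutlet var sceneView: ARSCNView!

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        // Pause the AR session when the view goes away (template behavior)
        sceneView.session.pause()
    }
}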

Now let’s go from top to bottom and declare some properties we’ll use later in our code:

The first line inside our class scope is an outlet for an ARSCNView, which is already connected in the storyboard since we started the project from the template.

We’ll constantly try to recognize gestures in each frame coming from the camera, and we can’t block the UI thread many times per second to analyze those frames. So we handle this work on a serial queue, which is our first property.

The second property is an array of VNRequest. We’ll create an object of this type further down in this view controller.

The third and last property is a simple UIWebView .
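A sketch of how these three properties might be declared (the names and the queue label are my own choices, not necessarily the article’s):

// Serial queue so frame analysis never blocks the UI thread
let dispatchQueueML = DispatchQueue(label: "gesture.ml.queue") // label is arbitrary

// The Vision requests we'll perform on every camera frame
var visionRequests = [VNRequest]()

// The web view we'll place into the AR scene later
let webView = UIWebView(frame: CGRect(x: 0, y: 0, width: 640, height: 480)) // size is an assumption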

Lifecycle and setups

Setting up the AR

Besides running the session with our configuration, the only thing I’ve added here is a tap gesture recognizer. This will let us decide when to add the node containing the UIWebView to our scene.
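Here’s roughly how that setup might look, shown inside viewWillAppear (the selector points to the tapped(recognizer:) function discussed further down):

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)

    // Run the session with a world-tracking configuration
    let configuration = ARWorldTrackingConfiguration()
    sceneView.session.run(configuration)

    // Tap gesture so we can decide when to place the web view in the scene
    let tapGesture = UITapGestureRecognizer(target: self, action: #selector(tapped(recognizer:)))
    sceneView.addGestureRecognizer(tapGesture)
}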

Setting up Vision/ML

First, we wrap our trained model in a VNCoreMLModel instance, then create a VNCoreMLRequest parameterized with that model and a completion handler, which in this case is a function defined a few lines further down.

Remember the VNRequest array we declared in the properties? Here we assign an array containing this single request instance to that property.

But why does it have to be an array if it’s just one instance? Because the ‘perform’ function receives an array as a parameter.
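A sketch of that Vision setup, assuming the generated model class is called GestureModel and the completion handler is named classificationCompleteHandler (both names are placeholders):

func setupVision() {
    // Wrap the trained Core ML model for use with Vision
    guard let gestureModel = try? VNCoreMLModel(for: GestureModel().model) else {
        fatalError("Could not load the Core ML model")
    }

    // The request that runs the model; its completion handler is the
    // function the article defines a few lines further down
    let classificationRequest = VNCoreMLRequest(model: gestureModel,
                                                completionHandler: classificationCompleteHandler)

    // perform(_:) takes an array, so we store our single request in one
    visionRequests = [classificationRequest]
}

func classificationCompleteHandler(request: VNRequest, error: Error?) {
    // Handle the classification results here (not covered in this sketch)
}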

tapped(recognizer: UIGestureRecognizer)

When you tap on the ARSCNView, we get the session’s current frame and then add an SCNPlane to our scene with the UIWebView as the contents of its diffuse material.

The position of the node containing this SCNPlane is set by taking the camera’s transform matrix (the current point of view) and translating it by -1.0 along the z axis, so the plane appears a little in front of the phone’s camera. The eulerAngles property is the node’s rotation, and SCNVector3Zero is the same as SCNVector3(0, 0, 0).
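Here’s one way that could look in code (the plane size and the one-metre offset are my own choices):

@objc func tapped(recognizer: UIGestureRecognizer) {
    // The current frame gives us the camera's transform at tap time
    guard let currentFrame = sceneView.session.currentFrame else { return }

    // A plane whose diffuse material contents are the web view
    let plane = SCNPlane(width: 0.32, height: 0.24) // size in metres is an assumption
    plane.firstMaterial?.diffuse.contents = webView

    let node = SCNNode(geometry: plane)
    node.eulerAngles = SCNVector3Zero // no extra rotation

    // Multiply a -1.0 translation on the z axis into the camera transform
    // so the node sits a bit in front of the camera
    var translation = matrix_identity_float4x4
    translation.columns.3.z = -1.0
    let transform = simd_mul(currentFrame.camera.transform, translation)
    node.position = SCNVector3(transform.columns.3.x, transform.columns.3.y, transform.columns.3.z)

    sceneView.scene.rootNode.addChildNode(node)
}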

Check this link for more:

loopCoreMLUpdate

Here we dispatch the work onto our serial queue to keep the main thread free for the camera, so the recognition process doesn’t freeze our UI. Inside that block, we call the function recursively so the analysis keeps running.
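In code, the loop might look something like this (updateCoreML is the function covered next):

func loopCoreMLUpdate() {
    dispatchQueueML.async {
        // Analyze the current frame off the main thread...
        self.updateCoreML()
        // ...then immediately schedule the next pass
        self.loopCoreMLUpdate()
    }
}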

updateCoreML function
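A minimal sketch of what this function presumably does, assuming it grabs the current frame’s pixel buffer and performs our stored Vision requests on it:

func updateCoreML() {
    // The captured image is the raw pixel buffer from the camera
    guard let pixelBuffer = sceneView.session.currentFrame?.capturedImage else { return }

    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    do {
        // Run our VNCoreMLRequest (stored in the visionRequests array)
        try imageRequestHandler.perform(self.visionRequests)
    } catch {
        print(error)
    }
}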