Augmented reality (AR) and machine learning are two of the hottest technologies on the market right now, so this time we are going to build a face recognition app that identifies people in our office and shows basic information when it recognizes someone who has been registered.

In this first post, we are going to talk about how to build our iOS app and set it up to work with a custom-made back end. In the second post, we will cover how to build the back end and put it in charge of receiving the images from the app, implementing our machine learning model, and sending the resulting data back to the app.

What we will need

We are going to build our app using Swift 4 and Xcode 9. We are also going to need a real device with iOS 11 and ARKit support, which could be an iPhone 6S or newer.

Technologies involved

For this project, we want to detect a person’s face in a live video feed within the application. We can accomplish this by using the Vision framework to perform face detection. For each of 25 frames sampled from the video feed, we will ask Vision for a bounding box and use it to crop the frame, giving us 25 images of the user’s face. This will be our training data.

Once we have the training data (images), we will send them to a web application programmed in Python. The back end will be a combination of two elements: a REST API implemented on top of Flask and Turi Create, which is being developed by Apple and makes it easy to build, train, and export CoreML models.

Finally, Docker will allow us to deploy the backend to AWS with minimal effort. The resulting Docker container will be based on Ubuntu 16.04 and run a Gunicorn HTTP server.

Building the iOS app

Proposed Design

Based on the requirements for the project, the design team proposed an app layout with the following main screens:

Step 1: Create the Xcode Project

To start, we’ll need to have Xcode 9 installed on our machines. Open Xcode and create a new project using the template iOS->Augmented Reality App.

After you select the template, click “next” and set up your project’s settings, including your project name, team, identifier, language, etc., and click “next” to create your project in the selected folder.

At this point, if you build and run your app, you will see something like this:

This means that you have successfully created your first AR project with a default object living in a virtual space.

Step 2: Build the User Interface

We are going to have the following screens in the app: a landing screen, a registration screen, a “saving user data” screen, an “uploading” screen for when the data has been successfully saved, and an “identifier” or “people tracker” screen. Our code will be organized as follows:

|-Project
|—Features
|——Feature Name
|————Classes
|————Storyboards
|——Feature Name
|————Classes
|————Storyboards

We are going to have a main folder called “Features” where we are going to add all the features of our app. Inside this folder, we are going to put each feature in its own folder, dividing the classes, controllers, and storyboards like this:

Landing Screen

The landing screen is pretty straightforward: we have a NavigationController containing a ViewController that allows us to push new controllers from it. In the landing screen’s ViewController, we’ll have an ImageView to show a logo, two standard buttons to select what we want to do, and at the bottom of the screen, we’ll have a progress view with a label to notify the users that our CoreML model is being updated.

Registration Screen

To be able to identify a person, the app needs to know what that person looks like. We need to create, therefore, a model to match each person to his or her face. To achieve this, we’ll be taking pictures of the user’s face during the registration process. Our registration form is really simple; just like the landing screen, the registration ViewController is also embedded in a NavigationController. Here, we’ll ask the users to enter their name and position in Gorilla Logic; after that, they will tap the “record video” button and a new ViewController will be presented to start recording their face. Once users finish recording their face, the registration screen appears again, and they tap the “continue” button to start uploading the data to our server. The storyboard should look like this:

Saving User Data Screen

The “saving user data” screen is just a processing screen where the app processes the video, extracts the images from it, and then crops them and packages all the user data to be uploaded to the server. Meanwhile, the user waits for this process to end; when it finishes, the user returns to the registration screen to finish the registration process.

Uploading Screen

When the app has all the required data to register a new user, the registration screen provides a “save” button, and the app uploads all the content to the server. On the back end, the server starts to create the CoreML model with all the data of the previously registered users and the new one. The app then presents a screen to alert the user that the server is updating the model.

Step 3: Record the User’s Face

To track the user’s face, we’ll instantiate a new ViewController to handle all the “face tracking” logic, “FaceTrackerViewController,” where we will record and extract the photos from the camera feed. Because the time it takes to train the model depends directly on the number of images we provide it with, we won’t use a lot of images, but it’s important to keep in mind that the more images you use, the more accurate the model will be. For the purposes of this project, we are going to capture 25 photos per person to train our CoreML model; this amount will give us a good balance of time and accuracy.
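To make the sampling policy concrete, here is a minimal, framework-free sketch of the counting logic (the type and method names are our own illustration, not from the app): keep one face image out of every six detected frames until 25 samples are collected.

```swift
import Foundation

// Illustrative sketch of the sampling policy: keep 1 of every 6 detected
// faces until 25 samples are gathered. Integers stand in for UIImages so
// the logic stays framework-free; all names here are hypothetical.
struct FaceSampleCollector {
    let keepEvery = 6           // keep one frame out of every six
    let requiredSamples = 25    // total face images used for training
    private(set) var samples: [Int] = []
    private var counter = 0

    // Feed each detected face; returns true once enough samples exist.
    mutating func ingest(_ frameID: Int) -> Bool {
        counter += 1
        if counter % keepEvery == 0 && samples.count < requiredSamples {
            samples.append(frameID)
        }
        return samples.count == requiredSamples
    }
}
```

At roughly 30 frames per second, a 1-in-6 stride means about 150 detected frames, or around five seconds of footage, to gather the 25 samples.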

FaceTrackerViewController

Our controller will have a couple of UI elements to handle all the logic to track the user’s face, and it should look like this:

Basically, there is a UIButton to stop the tracking process (we are going to use it later in the project) and a UIProgressView to show how many photos we have captured.

Code

We have to import the following frameworks:

import UIKit
import ARKit
import Vision

The face recognition process combines several pieces: a VNDetectFaceRectanglesRequest, a VNSequenceRequestHandler, an AVCaptureSession, and an AVCaptureVideoPreviewLayer.

let faceDetection = VNDetectFaceRectanglesRequest()
let faceDetectionRequest = VNSequenceRequestHandler()
var faceClassificationRequest: VNCoreMLRequest!
var lastObservation: VNFaceObservation?
var session = AVCaptureSession()
lazy var previewLayer: AVCaptureVideoPreviewLayer? = {
    var previewLayer = AVCaptureVideoPreviewLayer(session: session)
    previewLayer.videoGravity = .resizeAspectFill
    return previewLayer
}()
var sampleCounter = 0
let requiredSamples = 25
var faceImages = [UIImage]()

Let’s configure the AVSession.

func configureAVSession() {
    // Use the front camera when recording the user's own face, and the
    // back camera when identifying other people.
    guard let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                      for: .video,
                                                      position: isIdentifiyngPeople ? .back : .front) else {
        preconditionFailure("A camera is needed to start the AV session")
    }
    guard let deviceInput = try? AVCaptureDeviceInput(device: captureDevice) else {
        preconditionFailure("Unable to get input from the capture device")
    }

    let output = AVCaptureVideoDataOutput()
    output.videoSettings = [String(kCVPixelBufferPixelFormatTypeKey): Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
    output.alwaysDiscardsLateVideoFrames = true

    session.beginConfiguration()
    if session.canAddInput(deviceInput) {
        session.addInput(deviceInput)
    }
    if session.canAddOutput(output) {
        session.addOutput(output)
    }
    session.commitConfiguration()

    let queue = DispatchQueue(label: "output.queue")
    output.setSampleBufferDelegate(self, queue: queue)
    session.startRunning()
}

func configurePreviewLayer() {
    if let layer = self.previewLayer {
        layer.frame = view.bounds
        view.layer.insertSublayer(layer, at: 0)
    }
}

After we set up the AVSession, we have to adopt the AVCaptureVideoDataOutputSampleBufferDelegate protocol and implement its captureOutput method.

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer),
        let attachments = CMCopyDictionaryOfAttachments(kCFAllocatorDefault, sampleBuffer, kCMAttachmentMode_ShouldPropagate) as? [String: Any] else {
            return
    }
    let ciImage = CIImage(cvImageBuffer: pixelBuffer, options: attachments)
    let ciImageWithOrientation = ciImage.oriented(forExifOrientation: Int32(UIImageOrientation.leftMirrored.rawValue))
    detectFace(on: ciImageWithOrientation)
}

Once we capture the data from the camera, we create a CIImage to pass into our face detection method where we search for the face and then crop the image according to the shape of the user’s head. At this point, we just capture the image and save it to the variable ‘faceImages’ before uploading it to our server to be processed.

func detectFace(on image: CIImage) {
    try? faceDetectionRequest.perform([faceDetection], on: image)
    guard let faceObservation = (faceDetection.results as? [VNFaceObservation])?.first else {
        // No face detected; remove any rectangles left on the screen.
        DispatchQueue.main.async {
            self.removeFaceRectangles()
        }
        return
    }

    let croppedImage = image.crop(toFace: faceObservation)
    if isIdentifiyngPeople {
        let handler = VNImageRequestHandler(ciImage: croppedImage, orientation: .up)
        self.lastObservation = faceObservation
        try? handler.perform([self.faceClassificationRequest])
    } else {
        guard let faceImage = croppedImage.uiImage else { return }
        sampleCounter += 1
        // Keep one frame out of every six.
        if sampleCounter % 6 == 0 {
            faceImages.append(faceImage)
            if faceImages.count == requiredSamples {
                DispatchQueue.main.async {
                    self.delegate?.facesIdentified(faces: self.faceImages)
                }
            }
        }
        DispatchQueue.main.async {
            self.progressView.progress = Float(self.faceImages.count) / Float(self.requiredSamples)
        }
    }
}
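The `crop(toFace:)` extension used above is not shown in the post. The key detail is that Vision returns a normalized bounding box (origin at the bottom-left, values from 0 to 1) that has to be scaled to pixel coordinates before cropping. A minimal sketch of that conversion, with hypothetical names, might look like this:

```swift
import Foundation

// Convert a Vision-style normalized bounding box (values in 0...1,
// bottom-left origin) into pixel coordinates for an image of the given
// size. A crop(toFace:) helper could pass this rect straight to
// CIImage.cropped(to:), since CIImage also uses a bottom-left origin.
// The function name and signature are our own illustration.
func pixelRect(forNormalizedBox box: CGRect, imageSize: CGSize) -> CGRect {
    return CGRect(x: box.origin.x * imageSize.width,
                  y: box.origin.y * imageSize.height,
                  width: box.size.width * imageSize.width,
                  height: box.size.height * imageSize.height)
}
```

Because Core Image shares Vision’s bottom-left coordinate convention, no vertical flip is needed here; a UIKit-based crop would additionally have to flip the y-axis.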

After we capture all the images we need, we dismiss the current FaceTrackerViewController and go back to the registration controller where the user is able to submit all the data to the server to be registered in the app.

Step 4: Saving the User’s Data

Saving the user’s data is a very simple process. We already have all the images that we are going to be uploading stored locally, meaning we just have to send them to our server and wait for the server to send us a response.

The UI is not that complicated; we have a NavigationController and a ViewController with a ProgressView, a couple of labels, and a nice icon to show the user once we finish uploading the data.

To handle all the networking logic, we added Alamofire to our project and created a small utility class to hold all our networking methods and logic. The first thing we have to do is register the new user and create a corresponding ID. Later, we will assign the user’s images to that ID in order to track the user once we enable the “identification” logic.

The request to send the images to our server is a standard multipart request, meaning we send the images in a “single request” and wait for the process to execute a completion block to handle the response from the server. After this, the user is notified that the registration process has completed successfully.

class func upload(images: [UIImage], of user: User, onComplete: @escaping (_ result: Result<Void>) -> Void) {
    Alamofire.upload(multipartFormData: { multipartData in
        multipartData.append(String(user.id).data(using: .utf8)!, withName: "id")
        for image in images {
            if let imageData = UIImageJPEGRepresentation(image, 1.0) {
                let fileName = UUID().uuidString + ".jpeg"
                multipartData.append(imageData, withName: "photos", fileName: fileName, mimeType: "image/jpeg")
            }
        }
    }, to: Constants.Endpoints.addUserPhotos) { encodingResult in
        switch encodingResult {
        case .success(let upload, _, _):
            upload.responseJSON { request in
                switch request.result {
                case .success(let value):
                    switch request.response?.statusCode {
                    case StatusCode.success.rawValue:
                        let json = JSON(value)
                        if json["status"].stringValue == ResultStatus.success.rawValue {
                            onComplete(.success)
                        } else {
                            let error = json["error"].stringValue
                            onComplete(.failure(ServiceError.unknown(error)))
                        }
                    case StatusCode.badRequest.rawValue:
                        onComplete(.failure(ServiceError.badRequest))
                    case StatusCode.notFound.rawValue:
                        onComplete(.failure(ServiceError.notFound))
                    default:
                        onComplete(.failure(ServiceError.unknown(request.response.debugDescription)))
                    }
                case .failure(let error):
                    onComplete(.failure(ServiceError.unknown(error.localizedDescription)))
                }
            }
        case .failure(let error):
            onComplete(.failure(ServiceError.unknown(error.localizedDescription)))
        }
    }
}

Step 5: OTA CoreML Model Update

One of the coolest things in the project is being able to update the CoreML model in our app without having to recompile the app. That might sound a little like rocket science, but in simple words, we just ask our server for a new CoreML model, and when the server has one, we download it like a regular file. Once we have it locally, we compile the new CoreML model and replace the previous one in our app. Pretty simple, right?

private func download(_ model: Model) {
    let destination: DownloadRequest.DownloadFileDestination = { url, options in
        guard let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first,
            let modelUrl = URL(string: model.url) else {
                preconditionFailure("Unable to use the documents directory")
        }
        let fileURL = documentsURL.appendingPathComponent(modelUrl.lastPathComponent)
        return (fileURL, [.removePreviousFile, .createIntermediateDirectories])
    }

    sessionManager.download(model.url, to: destination)
        .downloadProgress { progress in
            self.delegate?.modelManager(downloadProgress: progress)
        }
        .response { response in
            if response.error != nil {
                self.delegate?.modelManager(error: ModelManagerError.failedToDownloadModel)
            } else if let temporaryURL = response.destinationURL {
                self.compileNewModel(model: model, url: temporaryURL)
            } else {
                self.delegate?.modelManager(error: ModelManagerError.failedToDownloadModel)
            }
        }
}

For this task, we created a ModelManager to hold all the logic for downloading, compiling, and replacing the current CoreML model. As we mentioned before, we treat the model just like a regular file and download it using Alamofire. The class is then notified when the new model has finished downloading and proceeds to the compile-and-replace part of the process.

After we download the new model, we call our compileNewModel method.

private func compileNewModel(model: Model, url: URL) {
    do {
        let modelUrl = try MLModel.compileModel(at: url)
        let storage = ModelStorage()
        try storage.saveNewModel(model: model, compiledModelURL: modelUrl)
        self.delegate?.modelManager(newModelReady: model, file: storage.getModelFile(model: model))
    } catch {
        self.delegate?.modelManager(error: ModelManagerError.failedToSaveModel)
    }
}

Once all the downloading, compiling and replacing logic is completed, we use our ModelManagerDelegate to notify the landing controller that a new model is available, is downloading, or is ready to use, and then we update our UI to notify the user of the status of the CoreML model.

extension LandingViewController: ModelManagerDelegate {

    func modelManager(newModelReady: Model, file: Faces) {
        updateProgressView.isHidden = true
        updatingModelLabel.isHidden = true
    }

    func modelManager(downloadProgress: Progress) {
        updateProgressView.isHidden = false
        updateProgressView.progress = Float(downloadProgress.fractionCompleted)
        updatingModelLabel.isHidden = false
    }

    func modelManager(error: ModelManagerError) {
        updateProgressView.isHidden = true
        updatingModelLabel.isHidden = true
    }
}

Step 6: Identify People

To identify people, we are going to reuse the previously mentioned ViewController, the “FaceTrackerViewController,” but we are going to modify its logic to handle two states: the “recording when registering a new user” mode and the new “identifying people” mode. We decided to do this because there is already a lot of code we can reuse from the first solution.

Specifically, we are going to add a property to track whether we are recording or identifying people; nothing too complicated, just a regular Bool. We will also draw a rectangle around each person’s face on the screen, keeping references to those rectangles so we can remove them when no face is visible:

var isIdentifiyngPeople = false
var faceRectangles = [UIView]()

Now it’s time to load our CoreML model:

func configureFaceClassificationRequest() {
    guard let model = try? VNCoreMLModel(for: modelFile.model) else {
        preconditionFailure("Unable to instantiate CoreML model")
    }
    faceClassificationRequest = VNCoreMLRequest(model: model, completionHandler: handleClassificationResults)
}

Next, we need the handleClassificationResults handler: not only do we have to track the user’s face using our CoreML model, but we also have to identify who he or she is. To achieve this, we create a method like this:

func handleClassificationResults(request: VNRequest, _: Error?) {
    guard let results = request.results as? [VNClassificationObservation] else { return }

    // Get the best result.
    guard let result = results.sorted(by: { $0.confidence > $1.confidence }).first else { return }

    // Find a user in our model using the identifier of the result.
    // We omit users with IDs 1 and 2 because they are negative results.
    let users = modelDescription.users.filter {
        $0.id == Int(result.identifier) && $0.id != 1 && $0.id != 2
    }
    guard let user = users.first else { return }

    if let lastObservation = self.lastObservation {
        DispatchQueue.main.async {
            self.removeFaceRectangles()
            let box = lastObservation.boundingBox.relativeTo(self.view, flipHorizontal: self.isIdentifiyngPeople)

            let faceViewRectangle = UIView(frame: CGRect(x: 0, y: 0, width: box.width, height: box.height))
            faceViewRectangle.layer.borderColor = UIColor.white.cgColor
            faceViewRectangle.layer.borderWidth = 2.0

            let label = UILabel(frame: CGRect(x: 0, y: faceViewRectangle.frame.maxY, width: box.width, height: 80))
            label.numberOfLines = 2
            label.text = "\(user.name)\n\(user.position)"
            label.textColor = UIColor.white

            let holderFrame = CGRect(x: box.minX, y: box.minY, width: box.width, height: box.height + label.frame.height)
            let holderView = UIView(frame: holderFrame)
            holderView.addSubview(faceViewRectangle)
            holderView.addSubview(label)
            self.view.addSubview(holderView)
            self.faceRectangles.append(holderView)
        }
    }
}

Here, we can see several things happening: first, the method is called when the face tracker identifies a face; then, we sort the results by confidence. After we have the results ordered, we select the first one and search for the user ID corresponding to that result to find out who the user is. Finally, we use the last face observation’s bounding box to draw a box around the user’s face on the live camera feed and add a label containing the user’s name and position.
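The selection logic in the handler can be distilled into a small framework-free function. The type names below are our own stand-ins (the real code works on VNClassificationObservation): pick the highest-confidence result while skipping the reserved negative-class IDs.

```swift
import Foundation

// Framework-free sketch of the handler's selection rule: take the
// highest-confidence classification, ignoring the reserved negative
// class IDs (1 and 2). Types and names here are illustrative only.
struct Classification {
    let identifier: Int
    let confidence: Double
}

func bestUserID(from results: [Classification],
                negativeIDs: Set<Int> = [1, 2]) -> Int? {
    return results
        .filter { !negativeIDs.contains($0.identifier) }
        .max(by: { $0.confidence < $1.confidence })?
        .identifier
}
```

Filtering before taking the maximum matters: if the top-confidence result is one of the negative classes, we want to return no match rather than silently fall through to a weaker positive result.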

Finally, we need a function to remove the previously drawn rectangles; it’s as simple as this:

func removeFaceRectangles() {
    for faceRectangle in self.faceRectangles {
        faceRectangle.removeFromSuperview()
    }
    self.faceRectangles.removeAll()
}

Stay tuned for the second part of the series on how to build the back end and put it in charge of receiving the images from the app, implementing our machine learning model, and sending the resulting data back to the app.

*Please note that at this time we are not able to share the complete code.