Cat filters and rock star sunglasses are just the tip of the iceberg in today’s ocean of face-based augmented reality (AR) mobile applications. Such effects are compute-hungry, however, and today’s users want their novelty images to pop up on their smartphone screens without delay. Luckily for us, Google fully appreciates the need for speed. Google researchers have introduced a new face detection framework called BlazeFace, adapted from the Single Shot MultiBox Detector (SSD) framework and optimized for inference on mobile GPUs. The lightweight face detector runs at an impressive 200–1000+ FPS on flagship smartphones.

Researchers first proposed a compact feature-extractor convolutional neural network (CNN) inspired by MobileNet V1/V2. Most modern CNN architectures, including both MobileNet versions, use 3×3 convolution kernels throughout the model graph, and the computation in their depthwise separable convolutions is dominated by the pointwise (1×1) parts. Researchers therefore found it relatively cheap to increase the depthwise kernel size. They employed 5×5 kernels in the model architecture bottlenecks, and the low cost of the depthwise part left room to add another such layer between the two pointwise convolutions. This accelerated the receptive field size progression and formed the essential higher abstraction level layers of BlazeFace.
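To see why a larger depthwise kernel is comparatively cheap, it helps to count multiply-adds for the two halves of a depthwise separable convolution. The sketch below is only illustrative: the function name and the tensor sizes (a 56×56 feature map with 128 input and 128 output channels) are assumptions chosen for the arithmetic, not values taken from BlazeFace.

```python
def separable_conv_madds(s, c, d, k):
    """Count multiply-adds for one depthwise separable convolution.

    s: spatial size of the (square) input feature map
    c: number of input channels
    d: number of output channels of the 1x1 pointwise convolution
    k: depthwise kernel size (k x k)
    """
    depthwise = s * s * c * k * k  # one k x k filter per input channel
    pointwise = s * s * c * d      # 1x1 convolution mixing channels
    return depthwise, pointwise


# Hypothetical layer sizes, for illustration only.
s, c, d = 56, 128, 128
for k in (3, 5):
    dw, pw = separable_conv_madds(s, c, d, k)
    share = dw / (dw + pw)
    print(f"{k}x{k} depthwise: {dw:,} madds, pointwise: {pw:,} madds, "
          f"depthwise share: {share:.1%}")
```

With these assumed sizes, moving from 3×3 to 5×5 raises the depthwise share of the block from roughly 7% to roughly 16%, so the total cost grows by only about 12% while each layer covers a much larger receptive field.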

Researchers made another key contribution, developing a new GPU-friendly anchor scheme modified from SSD. Anchors are predefined, fixed-size bounding boxes. For each anchor the network predicts adjustments, such as a center offset and a change in dimensions, that turn it into a tight box around the detected object, and the set of anchors determines the granularity of the predictions.
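The general idea of anchors can be sketched in a few lines. The example below lays square anchors over a grid and decodes one predicted adjustment into a box, using one common SSD-style parameterization (center offsets scaled by the anchor size, log-scale width and height). The function names, grid size, and anchor sizes are all hypothetical, and the article does not say which encoding BlazeFace itself uses.

```python
import math


def make_anchors(grid_size, box_sizes):
    """Place square anchors at the center of every cell of a square grid.

    Coordinates are normalized to [0, 1]. Each anchor is (cx, cy, w, h).
    """
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for size in box_sizes:
                anchors.append((cx, cy, size, size))
    return anchors


def decode_box(anchor, offsets):
    """Apply predicted adjustments (dx, dy, dw, dh) to one anchor.

    Assumed SSD-style encoding: the center shifts by a fraction of the
    anchor size, and width/height are scaled by exp(dw) and exp(dh).
    Returns the box as (x_min, y_min, x_max, y_max).
    """
    a_cx, a_cy, a_w, a_h = anchor
    dx, dy, dw, dh = offsets
    cx = a_cx + dx * a_w
    cy = a_cy + dy * a_h
    w = a_w * math.exp(dw)
    h = a_h * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Hypothetical setup: a 4x4 grid with two anchor sizes per cell.
anchors = make_anchors(grid_size=4, box_sizes=(0.25, 0.5))
# Zero offsets leave the anchor unchanged; non-zero offsets move and resize it.
print(decode_box(anchors[0], (0.0, 0.0, 0.0, 0.0)))
print(decode_box(anchors[0], (0.5, 0.0, 0.2, 0.2)))
```

A denser grid or more anchors per cell gives finer-grained predictions at the cost of more outputs to compute and post-process.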

Although numerous object detection tasks could leverage the BlazeFace framework, researchers’ current focus is on the very useful task of efficiently detecting faces via smartphone cameras. They added six facial keypoint coordinates to the model’s output to estimate face rotation for the video processing pipeline, and built separate models for front and rear cameras.
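As a rough illustration of how keypoints can yield a rotation estimate, the sketch below computes an in-plane (roll) angle from the line joining two eye keypoints. The choice of the eyes, the function name, and the coordinate convention are assumptions made for this example. The article does not describe how the BlazeFace pipeline derives rotation from its six keypoints.

```python
import math


def roll_angle_degrees(left_eye, right_eye):
    """Estimate in-plane face rotation from two (x, y) eye keypoints.

    Assumes image coordinates with x to the right and y downward, so a
    positive angle means the face appears rotated clockwise on screen.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical normalized keypoints: level eyes, then a tilted face.
print(roll_angle_degrees((0.3, 0.4), (0.7, 0.4)))
print(roll_angle_degrees((0.3, 0.3), (0.7, 0.5)))
```

An angle like this could be used to rotate the face crop upright before passing it to a later stage of a video pipeline.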

BlazeFace was trained on a dataset of 66,000 images, and performance was evaluated on a geographically diverse dataset consisting of 2,000 images. In front-camera face detection tasks, BlazeFace achieved 98.61% average precision with a 0.6 ms inference time.

The introduction of BlazeFace brings with it a wide range of potential applications. Researchers say the model can be deployed into virtually any face-related computer vision application, including 2D/3D facial keypoints, contour, or surface geometry estimation, facial features or expression classification, and face region segmentation.

It’s all about speed these days, and BlazeFace can be expected to trailblaze a new path for low-latency AR self-expression applications and AR developer APIs on mobile phones.

The paper BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs is available on arXiv.