Multi-face detection using Viola-Jones

Introduction

From my time researching this topic I have come to the realisation that a lot of people don't actually understand it, or only understand it partly. Also, many tutorials do a bad job of explaining in layman's terms what exactly it is doing, or leave out certain steps that would otherwise clear up some confusion. So I'm going to explain it from start to finish in the simplest way possible.

There are many approaches to implement facial detection, and they can be separated into the following categories:

Knowledge Based

Rule based (e.g. X must have eyes, X must have a nose)

Too many rules and variables with this method

Feature Based

Locate and extract structural features in the face

Find a differential between facial and non-facial regions in an image

Appearance Based

Learn the characteristics of a face

Example: CNNs

Accuracy depends on training data (which can be scarce)

Template

Using predefined templates for edge detection

Quick and easy

A trade-off of accuracy for speed

The approach we are going to be looking at is a mix between feature based and template based. One of the easiest and fastest ways of implementing facial detection is the Viola-Jones algorithm.

Haar-like Features

Before learning about Viola-Jones, we need to take a quick look at Haar-like features (which I'll just be calling Haar features from now on) and their inspiration, Haar wavelets. Haar wavelets were proposed by the mathematician Alfréd Haar in 1909 and are used in applications such as signal and image compression in electrical and computer engineering. To put it simply: Haar features are essentially collections of pixels in rectangular shapes. Haar features are conceptually similar to kernels in convolutional neural networks. The difference is that these features are created programmatically; they aren't learned from the raw image data as in deep learning.

But don't worry, you don't need to sit there and write thousands of fancy functions to generate these features, as they are widely available online in the form of XML files. There are thousands of possible features you can use, because all they really are is rectangles with regions for calculating delta values.

The rationale of Haar features is that if you apply a feature to an area in the image and subtract the sum of the unshaded region's pixel values from the sum of the shaded region's pixel values, you get a certain delta value.

Example: if region X, with 100 pixels, has a summed value of 200, and region Y of the same size has a summed value of 150, then the delta value is 50. Simples :)
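Numerically, the worked example above looks like this (a minimal NumPy sketch; the individual pixel values are fabricated so the sums match the example):

```python
import numpy as np

# Pixel values are made up so the region sums match the example above.
region_x = np.full((10, 10), 2.0)   # shaded region: 100 pixels, sum = 200
region_y = np.full((10, 10), 1.5)   # unshaded region: 100 pixels, sum = 150

delta = region_x.sum() - region_y.sum()  # shaded minus unshaded = 50.0
```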

These feature values are then used for training an AdaBoost variant (but more on this later).

Haar feature types

Example of some possible feature shapes

Although there are thousands of possible feature shapes that can be created, the two most common are Edge Features and Line Features.

Edge Features

So let's say, for example, you want to detect part of a face, in this case an eyebrow. Naturally, the pixels on an eyebrow in an image will be darker and then abruptly get lighter (skin). Edge features are great for finding this.

Edge Features

Line Features

Now let's say you want to detect a mouth: naturally, the lips region of your face goes from light to dark to light again. For this, line features prove to be the best.

Inverse Line Features

The cool thing about these features is that they can be used inversely, meaning they apply in both dark-light-dark and light-dark-light formats. So to summarise…
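As a quick sanity check, here is what a line feature's delta looks like on a toy cross-section of a mouth region (pure NumPy; all values are invented):

```python
import numpy as np

# Toy 1-D cross-section of a mouth region:
# light skin, dark lips, light skin again.
strip = np.array([200, 200, 200, 50, 50, 50, 200, 200, 200])

# Three-rectangle line feature: shaded middle third, unshaded outer thirds.
shaded = strip[3:6].sum()                     # 150 (dark band)
unshaded = strip[:3].sum() + strip[6:].sum()  # 1200 (light bands)
delta = shaded - unshaded                     # strongly negative: light-dark-light

# The inverse arrangement (dark-light-dark) simply flips the sign of the delta.
```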

For each feature type:

1. Move across the image

2. Calculate the delta: sum(shaded) - sum(unshaded)

3. Use these values to train an AdaBoost variant model.
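The summary above can be sketched in a few lines (the names edge_feature_delta and scan are mine, not from any library; a real implementation would use integral images to make the sums fast):

```python
import numpy as np

def edge_feature_delta(window):
    """Two-rectangle edge feature: sum(shaded left half) minus sum(unshaded right half)."""
    w = window.shape[1]
    return int(window[:, :w // 2].sum()) - int(window[:, w // 2:].sum())

def scan(image, win_h, win_w, step=1):
    """Move one feature across the image, collecting a delta at every position."""
    deltas = []
    rows, cols = image.shape
    for y in range(0, rows - win_h + 1, step):
        for x in range(0, cols - win_w + 1, step):
            deltas.append(edge_feature_delta(image[y:y + win_h, x:x + win_w]))
    return deltas  # these values would then feed the AdaBoost training step
```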

Using all of these features together on an image will give you a probability value of something being a face. But this is all a simplified view of things; now to get into the nitty-gritty.