Introduction

What we want

The goal of this project is to use our eyes to trigger actions on our computer. This is a very general problem so we need to specify what we want to achieve.

We could, for instance, detect when the eyes look towards a specific corner and work from that. However, that approach is limited and inflexible, and it would require us to hard-code corner combinations. Instead, we are going to use Recurrent Neural Networks to learn to identify complete eye movements.

The data

We won’t work with external datasets; we’ll make our own. This has the advantage of using the same source and processing for both training the model and making the predictions.

Without doubt, the most effective way to extract information from our eyes would be to use a dedicated closeup camera. With such hardware, we could directly track the center of the pupils and do all kinds of fancy stuff.

I didn’t want to use an external camera, so I decided to use the good old 720p webcam from my laptop.

The pipeline

Before we jump directly to the technical aspects, let’s review the steps of the process. Here’s the pipeline I came up with:

Take a picture with the webcam and find the eyes

Pre-process the images and extract important features (did you say neural network?)

Keep a running history of last few frames’ extracted features

Predict the current eye movement based on history

The pipeline we will use to process the images

We’ll go through these steps to see how we can make this work.

Let’s get to it!

Getting a picture of the eyes

Finding the eyes

Straight from the webcam, we start by downsampling the image and converting it to grayscale (the color channels are largely redundant for this task). This will make the next steps much faster, and will help our model run in real time.

For the detection part, we’ll use Haar cascades as they are extremely fast. With some tuning, we can get some pretty good results, but trying to detect the eyes directly leads to many false positives. To get rid of these, we don’t try to find the eyes in the image, but rather the face in the image, and then the eyes in the face.

Once we have the bounding boxes of both eyes we can extract the images from the initial full-sized webcam snapshot, so that we don’t lose any information.
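Here’s a minimal sketch of that last step, assuming the detector returns boxes as (x, y, w, h) and that the eye box is expressed relative to the face crop. The function names and the 4x downsampling factor are hypothetical, picked only for illustration.

```python
import numpy as np

def to_full_res(eye_box, face_box, scale):
    """Map an eye box found inside a face crop of the downsampled frame
    back to pixel coordinates in the full-size frame.

    Boxes are (x, y, w, h); `scale` is full_size / downsampled_size.
    """
    ex, ey, ew, eh = eye_box
    fx, fy, _, _ = face_box
    # Eye coordinates are relative to the face crop, so add the face
    # offset first, then scale up to the full-size frame.
    x = int(round((fx + ex) * scale))
    y = int(round((fy + ey) * scale))
    w = int(round(ew * scale))
    h = int(round(eh * scale))
    return x, y, w, h

def crop(frame, box):
    """Cut a box out of a frame, clipping to the frame borders."""
    x, y, w, h = box
    height, width = frame.shape[:2]
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, width), min(y + h, height)
    return frame[y0:y1, x0:x1]

full_frame = np.zeros((720, 1280), dtype=np.uint8)  # stand-in for a 720p snapshot
face_box = (100, 50, 120, 120)  # found in the (assumed) 4x downsampled frame
eye_box = (20, 30, 30, 20)      # found inside the face crop
eye_full = to_full_res(eye_box, face_box, scale=4)
eye_image = crop(full_frame, eye_full)
```

Cropping from the full-size frame rather than the downsampled one is what keeps the extra detail in the eye region.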

Pre-processing the data

Once we have found both eyes, we need to process them for our dataset. To do that, we can simply resize both to a fixed size (square, 24px) and use histogram normalization to reduce the effect of shadows.
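As a rough illustration, here is what that could look like in plain NumPy. I’m assuming nearest-neighbour resizing and classic histogram equalization as the normalization step; the actual project may well use a different resize filter or normalization, and the function names are made up for this sketch.

```python
import numpy as np

EYE_SIZE = 24  # side of the square eye image, in pixels

def resize_nearest(image, size=EYE_SIZE):
    """Resize a 2-D grayscale image to size x size with nearest-neighbour sampling."""
    height, width = image.shape
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return image[rows[:, None], cols]

def equalize_histogram(image):
    """Spread the grey levels of a uint8 image over the full 0-255 range."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    total = cdf[-1]
    if total == cdf_min:  # flat image: nothing to equalize
        return image.copy()
    # Lookup table mapping each grey level to its equalized value.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[image]

def preprocess_eye(eye_image):
    """Resize to 24x24, then equalize the histogram."""
    return equalize_histogram(resize_nearest(eye_image))

rng = np.random.default_rng(0)
# Low-contrast stand-in for a cropped eye image.
dark_eye = rng.integers(40, 90, size=(80, 120), dtype=np.uint8)
eye = preprocess_eye(dark_eye)
```

After equalization the darkest pixel maps to 0 and the brightest to 255, which is what evens out lighting differences between frames.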

Steps to extract eyes

We could then use the normalized pictures directly as input, but we have the opportunity here to do a little more work that helps a lot. Instead of using the eye images, we compute the difference between the eyes in the current and previous frame. This is a very efficient way to encode motion, which is all we need in the end.
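A minimal sketch of the frame difference, assuming the eye images are 8-bit grayscale arrays. Scaling the result to [-1, 1] is my own choice here, not something the pipeline requires.

```python
import numpy as np

def frame_difference(current, previous):
    """Signed per-pixel difference between two uint8 eye images, scaled to [-1, 1]."""
    # Cast first: subtracting uint8 arrays directly would wrap around.
    diff = current.astype(np.float32) - previous.astype(np.float32)
    return diff / 255.0

previous = np.full((24, 24), 100, dtype=np.uint8)
current = previous.copy()
current[10:14, 8:12] = 30  # a dark blob (say, the pupil) moves into this area

diff = frame_difference(current, previous)
```

Pixels that did not change come out as exactly zero, so a still eye produces an all-zero input and only motion leaves a trace.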

**Note that for all diagrams except the GIF below, I will use eye pictures to represent eye differences, because differences look awful on screen.**

Comparison between normalized frames and frame differences

Now that we have processed both eyes, we have the choice to treat them separately as two representatives of the same class, or use them together as if they were a single image*. I chose the latter because, even though the eyes are supposed to follow the exact same motion, having both inputs will make the model more robust.

*What we are going to do is a bit more clever than simply stitching the images together, though.

Pairing both eyes together

Creating the dataset

Recording

I have recorded 50 samples for each of two separate motions (one that looks like a “gamma”, the other like a “Z”). I have tried to vary the position, scale, and speed of the samples to help the model generalize. I have also added 50 examples of “idle”, which contain generic, pattern-free eye motions as well as still frames.

Motion examples — ‘gamma’, ‘mount’, ‘Z’, ‘idle’

Unfortunately, 150 samples is tiny for such a task, so we need to augment the dataset with new samples.

Data augmentation

The first thing we can do is fix an arbitrary sequence length — 100 frames. From there, we can slow down shorter samples and speed up longer ones. That’s possible because speed does not define the motion.
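One simple way to do this is to pick, for each of the 100 output frames, the nearest frame in the original recording. This is only a sketch: the function name is hypothetical, and I’m assuming the resampling is applied to the eye images themselves (before computing differences), which the text above does not specify.

```python
import numpy as np

SEQ_LEN = 100  # fixed sequence length

def resample_sequence(frames, length=SEQ_LEN):
    """Stretch or squeeze a (T, H, W) sequence to `length` frames
    by picking the nearest original frame for each output position."""
    n = len(frames)
    # Evenly spaced positions over the original sequence, rounded to frame indices.
    idx = np.round(np.linspace(0, n - 1, num=length)).astype(int)
    return frames[idx]

# Stand-in recordings where every pixel of frame t has the value t.
short_sample = np.arange(40)[:, None, None] * np.ones((1, 24, 24))
long_sample = np.arange(250)[:, None, None] * np.ones((1, 24, 24))

slowed = resample_sequence(short_sample)   # 40 frames -> 100 (frames repeat)
sped_up = resample_sequence(long_sample)   # 250 frames -> 100 (frames dropped)
```

Both outputs keep the first and last frames of the recording, so the motion still starts and ends in the same place.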

Also, because sequences shorter than 100 frames should be detected at any time in the 100-frame window, we can add padded examples.

Sliding window padding for samples shorter than 100 frames
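The padding could be sketched like this, assuming we pad with all-zero frames (which, for frame differences, means “nothing moved”) and slide the sample by a fixed step. The step of 10 frames and the function name are arbitrary choices for the example.

```python
import numpy as np

SEQ_LEN = 100

def padded_variants(frames, length=SEQ_LEN, step=10):
    """Place a short (T, H, W) sequence at several offsets inside a zero-filled window."""
    n = len(frames)
    if n > length:
        raise ValueError("sequence is longer than the window")
    variants = []
    for start in range(0, length - n + 1, step):
        window = np.zeros((length,) + frames.shape[1:], dtype=frames.dtype)
        window[start:start + n] = frames
        variants.append(window)
    return variants

sample = np.ones((60, 24, 24), dtype=np.float32)  # stand-in for a 60-frame motion
variants = padded_variants(sample)  # offsets 0, 10, 20, 30 and 40
```

Each recorded sample turns into several training examples this way, all with the same label.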

With these techniques, we can augment our dataset to be around 1000–2000 examples.

The final dataset

Let’s take a step back for a minute, and try to understand our data. We have recorded some samples with corresponding labels. Each of these samples is a sequence of image pairs, with one 24px square image per eye.

Note that we have one dataset for each eye.

Tensor description of the dataset
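To make the shapes concrete, here is how the arrays could be laid out. The sample count is a small placeholder (the augmented dataset would have 1000 to 2000), the trailing channel axis is an assumption, and I’m using the three classes described in the recording section.

```python
import numpy as np

N_SAMPLES = 8            # placeholder; the real dataset is much larger
SEQ_LEN, SIZE = 100, 24  # frames per sample, pixels per side
CLASSES = ["gamma", "Z", "idle"]

# One array per eye: (samples, time steps, height, width, channels).
left_eyes = np.zeros((N_SAMPLES, SEQ_LEN, SIZE, SIZE, 1), dtype=np.float32)
right_eyes = np.zeros_like(left_eyes)

# One integer label per sample, plus its one-hot encoding.
labels = np.zeros(N_SAMPLES, dtype=np.int64)
one_hot = np.eye(len(CLASSES), dtype=np.float32)[labels]
```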

The model

Now that we have a dataset, we need to build the right model to learn and generalize from this data. We could write its specifications as follows:

Our model should be able to extract information from both images at each time step, and combine these features to predict the motion executed with the eyes.

Such a complicated system calls for a powerful machine learning model: a neural network. Let’s see how we can build one that meets our needs. Neural network layers are like LEGO bricks: we simply have to choose the right bricks and put them in the right place.

Visual features — Convolutional Neural Network

To extract information from the images, we are going to need convolutional layers. These are particularly good at processing images to squeeze out visual features. (Psst! We already saw that in part 1.)

We need to treat each eye separately, and then merge the features through a fully connected layer. The resulting convolutional neural network (CNN) will learn to extract relevant knowledge from pairs of eyes.

Convolutional Neural Network — Two parallel convolutional layers extract visual features, which are then merged
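To show the data flow, here is a forward pass of such a two-branch network written in plain NumPy with random weights. Everything about the sizes is an assumption for illustration (4 kernels of 5x5, 2x2 max pooling, a 16-unit merge layer, separate weights per eye); in practice you would build and train this with a deep learning framework.

```python
import numpy as np

def conv2d_relu(image, kernels):
    """Valid cross-correlation of an (H, W) image with (K, kh, kw) kernels, then ReLU."""
    k, kh, kw = kernels.shape
    h, w = image.shape
    out = np.zeros((k, h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = image[i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

def max_pool2(x):
    """2x2 max pooling over the last two axes of a (K, H, W) array (H and W even)."""
    k, h, w = x.shape
    return x.reshape(k, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def eye_pair_features(left, right, params):
    """One convolutional branch per eye, then a fully connected layer merges both."""
    left_feat = max_pool2(conv2d_relu(left, params["left_kernels"])).ravel()
    right_feat = max_pool2(conv2d_relu(right, params["right_kernels"])).ravel()
    merged = np.concatenate([left_feat, right_feat])
    return np.maximum(params["dense_w"] @ merged + params["dense_b"], 0.0)

rng = np.random.default_rng(0)
n_kernels, feat_dim = 4, 16
pooled = n_kernels * 10 * 10  # 24 -> 20 after a 5x5 valid conv -> 10 after pooling
params = {
    "left_kernels": rng.normal(0, 0.1, (n_kernels, 5, 5)).astype(np.float32),
    "right_kernels": rng.normal(0, 0.1, (n_kernels, 5, 5)).astype(np.float32),
    "dense_w": rng.normal(0, 0.1, (feat_dim, 2 * pooled)).astype(np.float32),
    "dense_b": np.zeros(feat_dim, dtype=np.float32),
}
left = rng.normal(size=(24, 24)).astype(np.float32)   # stand-in left-eye input
right = rng.normal(size=(24, 24)).astype(np.float32)  # stand-in right-eye input
features = eye_pair_features(left, right, params)
```

The result is one small feature vector per pair of eye images, which is what the recurrent part of the model consumes at each time step.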

Temporal features — Recurrent Neural Network

Now that we have a simple representation of our images, we need something to process them sequentially. For that, we are going to use a recurrent layer, namely Long Short-Term Memory (LSTM) cells. The LSTM updates its state using both the extracted features at the current time step and its own previous state.

Finally, when we have processed the whole sequence of images, the state of the LSTM is then fed to a softmax classifier to predict the probability of each motion.
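Here is a bare-bones NumPy sketch of that recurrent part: a standard LSTM cell stepped over the sequence, followed by a softmax on the final hidden state. The weights are random and the sizes (16 features, 32 hidden units, 3 classes) are assumptions for the example, so the output probabilities are meaningless until the model is trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, w, b):
    """One LSTM step. `w` has shape (4 * hidden, input + hidden)."""
    hidden = h.shape[0]
    z = w @ np.concatenate([x, h]) + b
    i = sigmoid(z[:hidden])                # input gate
    f = sigmoid(z[hidden:2 * hidden])      # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])            # candidate cell values
    c = f * c + i * g                      # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def classify_sequence(features, params):
    """Run the LSTM over (T, D) features and classify from the final hidden state."""
    hidden = params["lstm_b"].shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in features:
        h, c = lstm_step(x, h, c, params["lstm_w"], params["lstm_b"])
    return softmax(params["out_w"] @ h + params["out_b"])

rng = np.random.default_rng(0)
feat_dim, hidden, n_classes = 16, 32, 3
params = {
    "lstm_w": rng.normal(0, 0.1, (4 * hidden, feat_dim + hidden)),
    "lstm_b": np.zeros(4 * hidden),
    "out_w": rng.normal(0, 0.1, (n_classes, hidden)),
    "out_b": np.zeros(n_classes),
}
sequence = rng.normal(size=(100, feat_dim))  # stand-in for 100 steps of merged features
probabilities = classify_sequence(sequence, params)
```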

Full model

Behold our final neural network, which takes as input a sequence of image pairs, and outputs the probability of each motion. What is crucial here is that we build the model in one single piece, and therefore it can be trained end-to-end via backpropagation.