How long does it usually take you to pick out a new pair of glasses at the store? 10 minutes? 30? When left unsupervised, I’ve admittedly taken over an hour. Head tilt. Half smile. Side shot. Next pair. It’s a big deal: studies have shown that the glasses you wear affect how others perceive your intelligence, success, and attractiveness.

It’s 2016; there must certainly be some sort of technology that has solved this problem. Of course there is! DITTO technologies developed a virtual mirror that allows customers to try on hundreds of products from the comfort of their homes. Center your face, turn left, turn right, and you’re done. You can sit back and evaluate which pair of glasses best suits your face, your style, and your character.

There is, however, one small caveat for those of us who are optically challenged — you have to remove your glasses to use the virtual mirror. In case you’ve never had the experience, it’s even more difficult to choose glasses if you need your original glasses to see. You can only hope that the small, rectangular blobs you see in the mirror have perfectly framed your blurred visage.

Wouldn’t it be great if people could leave their glasses on, and the software automatically removed them? Imagine walking into a retail store and having a virtual mirror remove your glasses and replace them with different products in real-time. Implementing this vision for DITTO is the challenge that I faced as an Insight Fellow.

Tech Background

The task of removing eyeglasses from faces is by no means a new one. A hefty amount of scientific literature documents a variety of image processing algorithms for removing eyeglasses, often with the goal of improving facial recognition technologies. Using some thoughtful math with features such as contrast, edges, and congruency, these techniques typically detect and subtract the image pixels containing the glasses and then synthesize the obscured facial region through smoothing or inference. Despite their ingenuity, these algorithms can fall short at recognizing the glasses and/or reconstructing the face. They also notably struggle to generalize across different skin tones and to correct for the shadows, magnification, and glare caused by the frames and lenses.

So why on earth would anyone think that, in 4 weeks, they could improve upon something that groups of specialists have devoted years to developing? The answer is that a recent surge of algorithms, open-source code and tools, and GPU computing has opened the floodgates for applying deep learning to an endless array (or should I say…tensor) of high-dimensional data problems. The power of deep learning is that you don’t have to design and optimize an algorithm based on the features that you think are important for your task; you only have to provide examples, and the neural network identifies and weights the features that are relevant. Rather than engineering one algorithm to identify glasses, another to remove them, and a third to reconstruct faces, you can simply train a single neural network to do all of the above in one computation. Not only does this require far less time and domain knowledge, but the resulting network can generalize across a much broader range of inputs.

iSee — the mirror within the mirror

The most critical components for the success of a deep learning project are having the right data, having a lot of it, and having hardware that can handle it. While there are a number of cloud computing options, I was fortunate to have access to a PC with a Tesla K40 graphics card that NVIDIA generously donated to Insight. On the data front, DITTO was able to supply a truly unique dataset: thousands of faces with and without glasses (I should note at this point that the faces shown in this article belong to DITTO employees, not their customers). Armed with access to DITTO’s API and the IDs of 20,000 customers, I used their technology to project glasses onto the customers’ faces and thus create a very large, labeled dataset. The only missing piece at this point was designing a neural network to remove the glasses.

While I knew that a convolutional neural network would be the best choice for recognizing an abstract object such as glasses, since it is robust to spatial variance, it wasn’t immediately clear to me how to go about removing the glasses and, more importantly, reconstructing the faces. My original inspiration came from an online article about a man who trained a network on “Blade Runner” and then had it watch “A Scanner Darkly.” The article included a link to Terence Broad’s dissertation, “Autoencoding Video Frames,” which was a total gold mine of relevant information. Another valuable source was an article by Insight alumnus TJ Torres at Stitch Fix, which also centered on the concept of using a convolutional neural network as an autoencoder.

An autoencoder is a network that is trained to reconstruct an input, in our case an image, and in doing so it compresses the relevant features of inputs into a lower dimensional space (labeled “z” below). The cost function used for backpropagation is the mean squared error (MSE) between the input image and the reconstructed image, which is the output. A cartoon of this approach would be something like this:
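The cost function is easiest to see in miniature. Below is a toy linear autoencoder in plain Python/NumPy (a sketch for illustration, not the actual iSee network): an input image is compressed to a low-dimensional latent "z" and reconstructed, and the MSE between input and output is the quantity that backpropagation would minimize.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return np.mean((x - x_hat) ** 2)

# Toy linear autoencoder: 64-pixel "image" -> 8-dim latent z -> 64-pixel output.
n_pixels, n_latent = 64, 8
W_enc = rng.normal(0, 0.1, (n_pixels, n_latent))   # encoder weights
W_dec = rng.normal(0, 0.1, (n_latent, n_pixels))   # decoder weights

x = rng.random(n_pixels)   # flattened input image
z = x @ W_enc              # compressed latent representation ("z")
x_hat = z @ W_dec          # reconstruction
loss = mse(x, x_hat)       # the cost driving backpropagation
```

Training would repeatedly nudge `W_enc` and `W_dec` to shrink this loss, forcing the 8 latent dimensions to capture the features needed to rebuild the image.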

Given enough training, an autoencoder will eventually learn to recreate the original image. For Terence and TJ, this approach was useful for building generative models from the latent or “z” space. For my application, however, I didn’t need a generative model because I already had the outputs that I wanted. I simply needed the network to take an image of a face with glasses as input and construct that same face without the glasses as output. To do this, I started where anyone rapidly iterating on a new project should start: with open-source code. I found code for a convolutional neural network autoencoder in TensorFlow, written by Parag K. Mital, and redefined the cost function to be the difference between the desired output (the face without glasses) and the reconstructed image, while still providing the face with glasses as the input. The network was thus shown tens of thousands of examples of the input (face with glasses) and desired output (face, no glasses) and was trained to perform that computation. As a technical aside, doing so means the network is no longer really an autoencoder; it’s just a convolutional neural network with a symmetric topology and a linearized layer in the middle.
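The change to the cost function is small enough to show in a few lines of toy Python (not the actual TensorFlow code): the reconstruction is scored against a different target image than the input. The `identity_model` below is a hypothetical stand-in for a trained network, used only to make the arithmetic concrete.

```python
import numpy as np

def paired_loss(x_with_glasses, y_no_glasses, model):
    """A standard autoencoder compares the output to the *input*;
    here the target is a different image: the same face, no glasses."""
    y_hat = model(x_with_glasses)              # network still sees the glasses image
    return np.mean((y_no_glasses - y_hat) ** 2)

identity_model = lambda x: x   # hypothetical stand-in for a trained network

x = np.ones((66, 66))    # toy "face with glasses"
y = np.zeros((66, 66))   # toy "same face, glasses removed"
loss = paired_loss(x, y, identity_model)
```

Minimizing this loss over thousands of (with-glasses, without-glasses) pairs is what pushes the network to learn the removal-and-reconstruction mapping in a single computation.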

This method, coined “iSee,” worked quite well. That is, of course, after considerable tuning of the hyperparameters. The images below show the model’s performance on faces of three DITTO employees, which were not included in the training set. The network had never seen their faces before but was still able to reconstruct the area of their faces that had been obfuscated by the glasses. It was also able to remove the glasses, which varied in size and position across the images. In contrast to more traditional image processing techniques found in the literature, this method can work for different facial structures and skin tones as long as similar faces are provided in the training data.

Images reconstructed via iSee after hyperparameter tuning

Preprocessing and hyperparameter tuning — the RAM ceiling

You might notice that the images above have been flattened to grayscale and are low in resolution. This was necessary because the amount of RAM (64GB) on the PC that I used limited the number of weights/parameters that could be held in memory at once, and thus constrained the total size of the network. It’s important to note that the total amount of data was not a limiting factor, as TensorFlow permits batch processing, which makes it possible to work with a large dataset on a home computer. In order to implement the iSee method, I needed a network that was both deep (many layers) and wide (many filters per layer). For the example shown above, I used 6 convolutional layers with 90 3x3 filters per layer, and the original images were cropped and downsampled to 66x66 pixels. The network was trained over 250 epochs in batches of 20 images. As a point of reference, the images below were generated using the same hyperparameters except that only 70 filters were used per layer and the network was trained over only 50 epochs in batches of 25 images.
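As a back-of-the-envelope check on why width and depth eat memory, here is a rough parameter count for convolutional layers shaped like those described above. This is only a sketch of the encoder half; the actual network’s linearized middle layer, decoder, activations, and optimizer state would all add substantially to the footprint.

```python
def conv_params(in_channels, out_channels, k=3):
    """Weights + biases for one k x k convolutional layer."""
    return in_channels * out_channels * k * k + out_channels

# Grayscale (1-channel) input, then 6 layers of 90 filters each.
channels = [1] + [90] * 6
encoder_params = sum(conv_params(c_in, c_out)
                     for c_in, c_out in zip(channels, channels[1:]))
```

Even this modest encoder already carries a few hundred thousand parameters; widening each layer or adding color channels multiplies the count quickly, which is why grayscale, low-resolution inputs were the pragmatic choice under a fixed RAM budget.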

Images reconstructed via iSee prior to hyperparameter optimization

Tuning hyperparameters can be incredibly time-consuming; not only are there many of them, but they are often related to each other. For example, increasing the size of the convolutional filters from 2x2 to 3x3 improved the network’s ability to remove the glasses, but more filters per layer were then needed to produce a high-resolution reconstruction. While there are no universal default hyperparameters, there are a few options available that can help with this process. I used the quick and dirty approach: downsample images (with Python Imaging Library), reduce batch size, and manually tweak parameters. The next level of sophistication would be to preprocess the images more thoughtfully. For me, this would involve identifying the pixels that contain glasses across all images and then only training the network on that smaller region, rather than forcing the network to reconstruct the entire face and background. Products such as SigOpt, which uses an ensemble of optimization algorithms, can be helpful in automating the hyperparameter space search. Finally, as summarized in this blog post by Alex Gude, an Insight alumnus at Lab41 (an In-Q-Tel lab), pruning, clustering, and encoding can all be used to reduce the size and ultimate RAM costs of deep learning algorithms.
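The "quick and dirty" preprocessing step can be sketched without any imaging library: the block-averaging downsampler below is a plain-NumPy stand-in for the PIL resizing mentioned above, shown only to illustrate the idea of shrinking inputs to fit the network.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse RGB to grayscale with the usual luminance weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def downsample(img, factor):
    """Average non-overlapping factor x factor blocks (a simple box filter)."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor   # crop to a multiple of factor
    return img[:h, :w].reshape(h // factor, factor,
                               w // factor, factor).mean(axis=(1, 3))

img = np.random.default_rng(1).random((132, 132, 3))   # stand-in for a photo
small = downsample(to_grayscale(img), factor=2)        # -> 66x66 grayscale
```

Every halving of resolution cuts the input pixels by 4x, which shrinks the network’s first and last layers accordingly and buys room for more filters per layer.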

Conclusion

“It actually works!” exclaimed Ilya Sutskever during a presentation on his InfoGAN work at a scaled machine learning conference in August 2016. As a novice to the field, I find it both exciting and reassuring to hear that even the experts can be surprised by what can be accomplished with deep learning. Deep learning is an empirical science, and it is ripe for commercial and creative applications. The innovation of the iSee method is the idea that synthetic data can be used to train a neural network to identify and remove an abstract object from an image. More work needs to be done to show that the algorithm can successfully remove real glasses, along with their shadows and lens magnification. If it proves robust, the approach can be extended to a wider variety of applications. The doors have opened, the tools are here, and I am delighted to see what our generation of data scientists and engineers will create.