Facial Emotion Recognition: Single-Rule 1–0 DeepLearning

In my attempt to build Artificial Emotional Intelligence, I first turned to Deep Learning.

The main reason was its recent success at cracking Computer Vision tasks: I am currently working on the part that detects emotions from faces (I already have the part that understands the content of our written words).

So I spent the last month and a half taking online courses, reading online books, and learning a Deep Learning tool.

To be honest, that was the easy part.

The real challenge was amassing a decent dataset of faces classified by emotion.

Why is that a challenge? Because Deep Learning algorithms are data-hungry!

To get a decent dataset, I collected face pics from Google Images and cropped the faces with OpenCV (as described here). I was able to collect several thousand pics, but my annotation approach failed because many pics either did not contain a face or did not show the right emotion.

In the end, I was left with only around 600 pics, a useless number for hungry Neural Networks. Looking around the Internet, I was able to crawl a pre-labeled, yet still small, set from the flickr user The Face We Make (TFWM).

In a desperate move (and probably the smartest one), I asked on the MachineLearning sub-reddit and someone saved my life. I was pointed to a collection from a Kaggle competition with over 35K pics labeled with 6 emotions plus a neutral class.

Armed with a larger dataset and my beginner's Deep Learning skills, I modified the two TensorFlow MNIST sample networks to train them with the 35K pics and test them with the TFWM set.

I was thrilled and full of anticipation while the code was running. After all, on the MNIST dataset the simplest code (simple MNIST), a modest softmax regression, achieves 91% accuracy, while the deeper code (deep MNIST), a two-layer Convolutional Network, achieves around 99.2%.

What a huge disappointment when the highest accuracies I got were 14.7% and 27.9% respectively. I then tried changing a few parameters, such as the learning rate and the number of iterations, and switching from Softmax to ReLU in the last layer, but things did not change much.

Somehow I felt cheated, so before spending more time exploring Deep Learning in order to build more complex networks, I decided to try a small experiment.

Having worked with emotions for a while, I have become familiar with the research of psychologist Paul Ekman. In fact, most emotions detected by systems are somehow based on his proposed 6 basic emotions. The work that inspired my experiment is the Facial Action Coding System (FACS), a common standard to systematically categorize the physical expression of emotions.

In essence, FACS can describe any emotional expression by deconstructing it into the specific Action Units (the fundamental actions of individual muscles or groups of muscles) that produced the expression. FACS has proven useful to psychologists and to animators, and I believe most emotion detection systems adapt it. FACS is complex, and to develop a system that uses it from scratch might take a long time.

In my simple experiment, I identified 2 Action Units relatively easy to detect in still images: Lip Corner Puller, which draws the angle of the mouth superiorly and posteriorly (a smile), and Lip Corner Depressor which is associated with frowning (and a sad face).

Fig. 1: A smile or joy, represented by the elevation of the corners of the mouth.

Fig. 2: Sadness, represented by a depression of the corners of the mouth.

To perform my experiment, I considered only two emotions, namely joy and sadness. To compare with the adapted MNIST networks, I created a single-rule algorithm as follows. Using dlib, a powerful toolkit containing machine learning algorithms, I detected the faces in each image with the included face detector. For any detected face, I used the included shape detector to identify 68 facial landmarks. From all 68 landmarks, I identified 12 corresponding to the outer lips.

Fig. 3: A face with 68 detected landmarks. White dots represent the outer lips.
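
A minimal sketch of this landmark step in Python, under some assumptions that are not stated in the post: the pre-trained model file is the commonly distributed shape_predictor_68_face_landmarks.dat, the landmarks follow the usual 68-point ordering in which the outer lips are indices 48 to 59 (0-based), and the helper name outer_lip_points is hypothetical.

```python
# Assumption: standard 68-point ordering, where the 12 outer-lip
# landmarks are indices 48..59 (0-based).
OUTER_LIP_IDX = range(48, 60)


def outer_lip_points(image_path,
                     model_path="shape_predictor_68_face_landmarks.dat"):
    """Return one list of 12 (x, y) outer-lip points per detected face.

    Hypothetical helper; model_path is an assumed file name.
    """
    import dlib  # imported here so the constants load without dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(model_path)
    image = dlib.load_rgb_image(image_path)

    faces = detector(image, 1)  # 1 = upsample the image once
    lips_per_face = []
    for rect in faces:
        shape = predictor(image, rect)  # 68 landmarks for this face
        lips = [(shape.part(i).x, shape.part(i).y) for i in OUTER_LIP_IDX]
        lips_per_face.append(lips)
    return lips_per_face
```

An image without a detectable face simply yields an empty list, so it can be skipped.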

Once I had the outer lips, I identified the topmost and the bottommost landmarks, as well as the landmarks for the corners of the mouth. You can think of these points as constructing a bounding box around the mouth.

Fig. 4: The topmost and bottommost landmarks in white, the corners of the lips in black.
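
One way to pick these four special landmarks is sketched below. The post does not say how the corners were selected, so this sketch assumes they are the leftmost and rightmost outer-lip points; the names MouthBox and mouth_box are hypothetical. Image coordinates are assumed, with y growing downwards, so the topmost point has the smallest y.

```python
from typing import NamedTuple, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates, y grows downwards


class MouthBox(NamedTuple):
    top: Point
    bottom: Point
    left_corner: Point
    right_corner: Point


def mouth_box(outer_lips: Sequence[Point]) -> MouthBox:
    """Pick the four special landmarks from the outer-lip points."""
    top = min(outer_lips, key=lambda p: p[1])     # smallest y
    bottom = max(outer_lips, key=lambda p: p[1])  # largest y
    # Assumption: the mouth corners are the horizontal extremes.
    left = min(outer_lips, key=lambda p: p[0])
    right = max(outer_lips, key=lambda p: p[0])
    return MouthBox(top, bottom, left, right)


# Made-up coordinates, for illustration only.
lips = [(10, 50), (15, 45), (20, 42), (25, 44), (30, 43), (35, 45),
        (40, 50), (35, 56), (30, 60), (25, 62), (20, 60), (15, 56)]
box = mouth_box(lips)
print(box.top, box.bottom, box.left_corner, box.right_corner)
```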

Then the simple rule is as follows. I compute a mouth height (mh) as the difference between the y coordinates of the topmost and bottommost landmarks. I set a threshold (th) as half that height (th = mh/2). Measured down from the topmost landmark, the threshold can be thought of as a horizontal line dividing the bounding box into an upper and a lower region.

Fig. 5: A bounding box defined by the 4 special landmarks and a threshold line dividing it into two regions.

I then compute the two “lip corner heights” as the differences between the y coordinate of the topmost landmark and the y coordinate of each mouth corner landmark. I take the maximum (max) of the “lip corner heights” and compare it to th. If max is smaller than the threshold, the corners of the lips are in the top region of the bounding box, which represents a smile (by the Lip Corner Puller). If not, then we are in the presence of a Lip Corner Depressor action, which represents a sad face.

Fig. 6: Lip corners just on the threshold line, which fails as a smile.
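
The rule described above can be sketched in a few lines. The function name classify_mouth and the label strings are hypothetical, and the sketch assumes image coordinates in which y grows downwards, so that every "lip corner height" is a non-negative distance below the topmost landmark.

```python
def classify_mouth(top_y, bottom_y, left_corner_y, right_corner_y):
    """Single-rule classifier: 'joy' if both lip corners sit in the
    upper half of the mouth bounding box, 'sadness' otherwise."""
    mh = bottom_y - top_y  # mouth height
    th = mh / 2            # threshold: half the mouth height
    # Distance of each corner below the topmost landmark; keep the larger.
    corner_max = max(left_corner_y - top_y, right_corner_y - top_y)
    return "joy" if corner_max < th else "sadness"


# Made-up coordinates, for illustration only.
print(classify_mouth(100, 140, 105, 108))  # corners near the top
print(classify_mouth(100, 140, 130, 125))  # corners near the bottom
print(classify_mouth(100, 140, 120, 110))  # one corner exactly on the line
```

Because the comparison is strict, a corner lying exactly on the threshold line is not counted as a smile, which matches the case in Fig. 6.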

With this simple algorithm in place, I then performed the experiment. For the MNIST networks, I extracted the related faces from the Kaggle set and ended up with 8989 joy and 6077 sad training faces. For testing I had 224 and 212 faces respectively from the TFWM set. After training and testing, the simple MNIST network obtained 51.4% and the deep MNIST network 55% accuracy, an improvement over the 7-class version, but still very bad performance for a two-class problem. I then ran the single-rule algorithm on the same test set. Surprisingly, this single rule obtained an accuracy of 76%, 21 percentage points above the deep MNIST network.
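
To put these numbers in context, a quick back-of-the-envelope check helps. The sketch below uses only the figures quoted above; the variable names are mine. It computes the accuracy of always guessing the larger test class, and expresses the gap between the rule and the deep network both in percentage points and in relative terms.

```python
joy_test, sad_test = 224, 212   # TFWM test faces per class
deep_acc, rule_acc = 0.55, 0.76  # reported accuracies

# Accuracy of a classifier that always answers with the larger class.
majority_baseline = max(joy_test, sad_test) / (joy_test + sad_test)

gap_points = (rule_acc - deep_acc) * 100            # percentage points
relative_gain = (rule_acc - deep_acc) / deep_acc    # relative to deep net

print(f"majority-class baseline: {majority_baseline:.1%}")
print(f"gap: {gap_points:.0f} percentage points")
print(f"relative gain: {relative_gain:.0%}")
```

The baseline comes out at about 51.4%, the same figure the simple network reached, so that network did no better than always answering "joy" would have. Whether it actually collapsed to a single class cannot be determined from the accuracy alone.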

There has been a long debate on whether Deep Learning algorithms are better than custom algorithms built on domain knowledge. Recently Deep Learning has outperformed many such algorithms in Computer Vision and Speech Recognition. I have no doubt about the power of Deep Learning. However, much has been said about how difficult it is to build a good custom algorithm and how easy it is to build a good neural network.

The single-rule algorithm I just described is very simple and far from being a realistic system. Still, this simple algorithm, built in an afternoon, beat something that took me over a month to understand. This is by no means a definitive answer to the debate, but it makes me wonder whether custom algorithms are really ready to be replaced by their Deep Learning counterparts.

Custom algorithms are not only good, but as expressed in a previous post, they also give you the satisfaction of fully understanding what’s going on inside, a priceless feeling.