Teaching a machine to recognize objects (cars, houses, and cats) is a difficult task; teaching it to recognize emotions is another story. If you have been following my posts, you know that I want to teach machines to recognize human emotions. One important way in which machines can detect our feelings is by reading our faces. Teaching a machine to read faces has many challenges, and now that I have started to tackle this problem, I have encountered my first big one.

Deep Learning, a powerful tool used to teach machines, seems promising for the task at hand, but in order to make use of it I needed to find the materials to teach the machine. Let me use an analogy to explain. To learn to recognize objects, or in our specific case facial expressions, a person has to be exposed to many faces. That’s not a big deal, as we see faces everywhere from the second we are born. On the other hand, we don’t really have tools to take a computer into the wild and let it learn. So my big challenge was finding pictures or videos of people showing emotions on their faces, to feed them to the machine and let it learn.

Companies like Google and Facebook, and some big labs in prestigious universities, have access to an enormous amount of data (just think of how many faces people tag on Facebook). However, mere mortals like me have to find less straightforward ways to collect humble amounts of data to teach our machines. So let me start by defining exactly what data I wanted to collect. To teach my machine to recognize emotions from facial expressions, I needed to collect pictures of faces expressing some emotion (angry faces or happy faces), and at the same time I needed to explicitly tell the machine what the emotion is (this face shows anger). To be more exact, what I needed to feed the machine is a collection of data pairs of the form [picture, emotion]. The question now is how to obtain such data.

First, let me quickly tell you how you should not obtain this data. Many, including me, would first think about manually collecting thousands of pics from different sources (personal photos, Facebook, etc.), using a photo app to crop the faces (the learning is more efficient if the pic contains just the face), and manually defining the emotion tag. This is time-consuming and not scalable. Let me explain what I did instead. Many companies offer some automatic way to pull data from their servers. The obvious choice for pics might then be Instagram (not Facebook, as the data is not public). The problem with Instagram is that it’s not easy to specify that you want pics with faces. So in order to get exactly what I needed (faces with emotional expressions), my best choice was Google.

Google offers the Custom Search API, a tool that lets programs pull data based on queries, much like humans would using the Google website. This was perfect for me. To understand why, try the query “scared look” on Google (then go to Images). So now I had an automatic way to get faces expressing emotions, and I did not have to manually identify the emotion (it comes from the query). But wait, what about this picture:

“Big Man With Angry Eyes Points His Gun To Your Face”, obtained using the query “angry look”

The image was obtained with the query “angry look”, and it clearly has an angry face in it, but it also has an upper body, a gun, and many watermarks. This is not good, as it will confuse my machine. How about this picture, obtained with the query “sad person”:

An image obtained with the query “sad person” but without a person in it

It clearly has no sad person; in fact it has no person at all, as it’s just a table. So while in most cases (when using appropriate queries) you will obtain faces with the intended emotion (like the angry man), they will most likely come with extra noise, and sometimes the picture will not have a face at all. Again, the best way to deal with this is not manually, but by using Computer Vision tools to remove the noise automatically.
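Going back to the download step for a moment, here is a minimal sketch of how the queries could be turned into Custom Search requests and then into [picture, emotion] pairs. It is not my exact script. The endpoint and the `key`, `cx`, `q`, `searchType`, `num` and `start` parameters are those of the Custom Search JSON API as I understand them, while `QUERY_LABELS`, `build_search_url` and `labeled_links` are hypothetical names made up for this illustration, and the API key and search engine ID are placeholders you would supply yourself.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

# Hypothetical mapping from search query to emotion label.
# The label comes from the query, so nothing is tagged by hand.
QUERY_LABELS = {
    "angry look": "angry",
    "scared look": "scared",
    "sad person": "sad",
}


def build_search_url(query, api_key, engine_id, start=1):
    """Build the request URL for one page of image results."""
    params = {
        "key": api_key,         # your API key (placeholder)
        "cx": engine_id,        # your search engine ID (placeholder)
        "q": query,
        "searchType": "image",  # ask for images rather than web pages
        "num": 10,              # results per request
        "start": start,         # index of the first result on this page
    }
    return API_ENDPOINT + "?" + urlencode(params)


def labeled_links(response_json, query):
    """Turn one decoded JSON response into [image_url, emotion] pairs.

    Assumes each result sits in the "items" list and carries its
    image address under "link".
    """
    label = QUERY_LABELS[query]
    return [[item["link"], label] for item in response_json.get("items", [])]
```

Each URL can then be fetched with any HTTP client, and the decoded JSON passed to `labeled_links`. Because of the rate limits, a real script would also need to pause between requests or spread them over several days.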

After submitting many queries and downloading a few thousand pics (due to rate limitations this might span a few days), I automatically processed all the pics using the popular Computer Vision library OpenCV (free, in case you wonder). OpenCV comes pre-loaded with a set of nice filters to detect faces and other features (eyes, mouth, …) in pics. The results are magical:

“Big Man With Angry Eyes Without the Gun and All Extra Noise” thanks to OpenCV.

OpenCV automatically detected a square region containing the face, and with additional commands, my program was able to automatically crop and reduce the face to a size and format appropriate to feed to my machine. Now what happened to the image without the face? OpenCV did not detect any face in it, so it was automatically ignored. And voilà, that’s how I could efficiently (and for free) start building a decent dataset of faces to later teach a computer how to detect our emotions.
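For the curious, the detect-and-crop step can be sketched roughly as follows. This is a sketch under assumptions, not my exact program: it uses the frontal-face Haar cascade that ships with OpenCV’s Python package, keeps only the largest detected face, and shrinks it to a 48×48 grayscale image. The 48-pixel size, the detection parameters, and the names `largest_box` and `crop_face` are choices made up for this example.

```python
def largest_box(boxes):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    return max(boxes, key=lambda b: b[2] * b[3], default=None)


def crop_face(image_path, size=48):
    """Return the largest face in the picture as a size x size grayscale
    image, or None when the file is unreadable or contains no face."""
    # Imported here so that largest_box can be used without OpenCV installed.
    import cv2

    # Frontal-face Haar cascade bundled with the opencv-python package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    image = cv2.imread(image_path)
    if image is None:  # not a readable image file
        return None

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Assumed detection settings; they may need tuning per dataset.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    box = largest_box([tuple(b) for b in boxes])
    if box is None:  # no face found: the picture is simply skipped
        return None

    x, y, w, h = box
    return cv2.resize(gray[y:y + h, x:x + w], (size, size))
```

Calling `crop_face` on every downloaded file and saving the results that are not `None` (for instance with `cv2.imwrite`) gives the cleaned-up dataset. Pictures like the table simply drop out because no face is detected in them.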

To conclude, very often (depending on the query) you will find friends like this in the pictures:

Often you will find cartoon faces in the downloaded pictures.

and OpenCV will of course return this beauty:

Whether this is bad or not for the trained machine, I still don’t know. I will find out when I move to the training process. Worst case, I have to manually remove a few faces (and other possible wrongly detected objects). Best case, I have a machine that can tell whether my kids are watching happy cartoons.