Deep Learning: Training a convolutional neural network to recognize The Simpsons characters

As a big Simpsons fan, I have watched (and am still watching) a lot of The Simpsons episodes over the years, multiple times each. I wanted to build a neural network that can recognize the characters. I don’t know yet what the applications of this neural net will be (perhaps computing each character’s presence in each episode).

This project is not especially difficult but it can be time consuming, because I have to manually label many pictures of each character. I didn’t find any database of The Simpsons characters on the Internet, so I am building it myself (I am still labeling pictures when I have time). I think it could be useful to others. The dataset is already available on Kaggle with exploratory code (in the Kernels section).

After learning and using TensorFlow for different projects, I want to use Keras because of its simplicity (compared to using TensorFlow directly) and its capacity for experimentation (here with the TensorFlow backend). Keras is a Deep Learning library written in Python by François Chollet. My approach to this problem is based on convolutional neural networks (CNNs): multi-layered feed-forward neural networks able to learn many features.

You can find the code on the GitHub repo.

Building the image dataset

The dataset currently features 18 classes/characters (the data on Kaggle contains 20 classes, but so far I have used only 18 characters for training). Please check the image below for the characters used. The pictures come in various sizes and scenes, may be cropped from shots containing other characters, and are mainly extracted from episodes (seasons 4 to 24).

The Simpsons characters

The training set includes about 1000 images per character (I am still labeling data to reach this number). The character is not necessarily centered in each image and is sometimes shown with other characters (but it should be the most important part of the picture).

Dataset distribution for 20 characters (on 6/19/2017)

With label_data.py, you can label data from .avi movies: you can get a cropped sub-picture (left or right part) or the full picture, and then label it by entering part of the character name (burns for Charles Montgomery Burns).

To add more data, I also use the Keras model. I capture videos and get 3 pictures for each frame I analyze (left part, right part, full), and then I ask my algorithm to classify each picture. Afterward, I check each picture it has classified. It’s still manual but it’s faster, and it’s an incremental process that gets faster and faster, particularly for “small” characters.
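The three views per frame can be sketched as follows. This is only an illustration: I assume here that the "left part" and "right part" are the left and right halves of the frame, and the function name three_views is hypothetical (it is not taken from label_data.py).

```python
import numpy as np

def three_views(frame):
    """Return the left part, right part and full view of one video frame.

    Assumption: the left and right parts are the two halves of the frame;
    the real labeling script may crop differently.
    """
    height, width = frame.shape[:2]
    half = width // 2
    return {
        "left": frame[:, :half],           # left half of the frame
        "right": frame[:, width - half:],  # right half of the frame
        "full": frame,                     # untouched frame
    }

# Dummy 360x640 RGB frame standing in for a frame read from a video.
frame = np.zeros((360, 640, 3), dtype=np.uint8)
views = three_views(frame)
for name, view in views.items():
    print(name, view.shape)
```

Each of the three views would then be resized and passed to the model, and the predicted label checked by hand.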

Preprocessing

The first preprocessing step is resizing the pictures: all pictures need to have the same size for training. I convert the data to float32 to save some memory and normalize it (divide by 255.). Then, instead of character names, I use numbers and, thanks to Keras, I can quickly convert those categories to vectors:

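import cv2
import keras
pic_size = 64     # every picture is resized to pic_size x pic_size
num_classes = 18  # number of characters used for training
# img is one picture loaded with OpenCV
img = cv2.resize(img, (pic_size, pic_size)).astype('float32') / 255.
...
# y holds the integer label of each picture
y = keras.utils.to_categorical(y, num_classes)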
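To show what this conversion produces, here is a small NumPy sketch of the same idea: character names become integers, and integers become one-hot vectors. The three names and the helper one_hot are illustrative assumptions; in the project itself the conversion is done by keras.utils.to_categorical.

```python
import numpy as np

# Hypothetical subset of character names, for illustration only.
characters = ["bart_simpson", "homer_simpson", "lisa_simpson"]
name_to_index = {name: i for i, name in enumerate(characters)}

# Labels as typed during labeling, converted to integers.
labels = ["lisa_simpson", "bart_simpson", "lisa_simpson"]
y = np.array([name_to_index[name] for name in labels])

def one_hot(y, num_classes):
    """Turn integer labels into one-hot rows (one 1.0 per row)."""
    out = np.zeros((len(y), num_classes), dtype="float32")
    out[np.arange(len(y)), y] = 1.0
    return out

y_one_hot = one_hot(y, len(characters))
print(y)
print(y_one_hot)
```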

I split my dataset into a training set and a testing set: for this, I use sklearn’s train_test_split function.
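To make the split concrete, here is a rough NumPy stand-in for what a shuffled train/test split does. It is not sklearn's implementation, and the 15% test fraction, the seed and the name shuffle_split are assumptions (the values actually used are not stated here).

```python
import numpy as np

def shuffle_split(X, y, test_size=0.15, seed=0):
    """Shuffle the samples, then hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Dummy data: 100 pictures of 64x64 pixels, labels cycling over 18 classes.
X = np.zeros((100, 64, 64, 3), dtype="float32")
y = np.arange(100) % 18
X_train, X_test, y_train, y_test = shuffle_split(X, y)
print(X_train.shape, X_test.shape)
```

In practice train_test_split is the simpler choice, and it can also stratify the split by class, which helps when some characters have few pictures.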

Deep Learning Model(s)

Now, let’s begin the “fun” part: defining our model. Right now, we’ll use a feed-forward network with 4 convolutional layers with ReLU activation, followed by a fully connected hidden layer (see below for a deeper model). This model is similar to the CIFAR example from the Keras documentation. I also use dropout layers to regularize and avoid overfitting. The output layer uses softmax activation to output the probability of each class. I also tried to replace ReLU with ELU (like ReLU but with a mean activation closer to zero) but it didn’t work.
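The activation functions mentioned above can be written in a few lines of NumPy. This sketch only illustrates the formulas (Keras provides its own implementations); the input values are made up.

```python
import numpy as np

def relu(x):
    """ReLU: negative values become 0."""
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    """ELU: like ReLU for x > 0, smooth negative values otherwise."""
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print("mean after ReLU:", relu(x).mean())
print("mean after ELU: ", elu(x).mean())

probs = softmax(np.array([2.0, 1.0, 0.1]))
print("softmax output:", probs, "sum:", probs.sum())
```

On this toy input the mean after ELU is closer to zero than the mean after ReLU, because ELU keeps some negative values instead of clipping them.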

Categorical cross-entropy loss is used, as is often the case. For the optimizer, I use RMSprop, a variant of stochastic gradient descent where we “divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight”.
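The quoted rule can be sketched on a toy problem. This is a minimal NumPy illustration of the RMSprop update, not the Keras implementation, and the learning rate, decay rate (rho) and toy loss are assumptions chosen so the example runs quickly.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.05, rho=0.9, eps=1e-8):
    """One RMSprop update for weights w given their gradient."""
    # Running average of the squared gradients, kept per weight.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Divide the step for each weight by the root of that running average.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Toy problem: minimise f(w) = sum(w**2), whose gradient is 2*w.
w = np.array([5.0, -3.0])
avg_sq = np.zeros_like(w)
start_loss = float(np.sum(w ** 2))
for _ in range(500):
    grad = 2.0 * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
end_loss = float(np.sum(w ** 2))
print("loss before:", start_loss, "loss after:", end_loss)
```

Because each step is scaled by the recent gradient magnitude, both weights move at a similar pace even though their gradients start at different scales.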

Training the model

For the training, the model iterates over batches of the training set (batch size: 32) for 200 epochs.

As I don’t have a huge dataset, I am using data augmentation (which is really simple to use with the Keras library). It means applying a number of random variations to the pictures so the model never sees the exact same picture twice. This helps prevent overfitting and helps the model generalize better.

from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    rotation_range=0,  # randomly rotate images in the range (degrees)
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of height)
    horizontal_flip=True,  # randomly flip images horizontally
    vertical_flip=False)  # do not flip images vertically
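To show what the two active options (horizontal flip and 10% shifts) do to a picture, here is a simplified NumPy sketch. It is not how Keras implements augmentation: the function name is hypothetical and np.roll wraps pixels around the border, whereas Keras fills the uncovered area according to its fill_mode setting.

```python
import numpy as np

def random_flip_and_shift(img, rng, shift_fraction=0.1):
    """Apply a random horizontal flip and a random shift to an HxWxC image."""
    if rng.random() < 0.5:
        img = img[:, ::-1]  # mirror left <-> right
    height, width = img.shape[:2]
    max_dy = int(height * shift_fraction)
    max_dx = int(width * shift_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    # Simplification: pixels pushed out on one side re-enter on the other.
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
picture = np.arange(64 * 64 * 3, dtype="float32").reshape(64, 64, 3)
augmented = random_flip_and_shift(picture, rng)
print(augmented.shape)
```

Calling the function again with the same picture gives a different variation each time, which is why the model rarely sees exactly the same input twice.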

This takes a while on CPU (on my computer), so I run it on a GPU with AWS EC2 (Tesla K80): 8 seconds per epoch. In total, it took 20 minutes (which is really quick for deep learning).

As we can see on the plot, after 200 epochs the model seems to have reached the asymptote, without obvious overfitting. Moreover, the accuracy seems good too.

Loss and Accuracy (Validation and Training) during training

Classification evaluation

Of course, right now, it’s complicated to get a true model accuracy because of the low number of pictures, but as the number of pictures grows, the estimate will become more meaningful. Thanks to sklearn, it’s really easy to print a classification report:

4 convolutional layers net

As you can see, the accuracy (f1-score) is really good: above 90% for every character except Lisa. The precision for Lisa is 82%. Maybe Lisa is mixed up with other characters.

Indeed, Lisa is often mixed up with Bart. Probably because many pictures of Lisa contain Bart too.
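One simple way to check which characters get mixed up is to count the (true, predicted) pairs among the mistakes. The sketch below uses toy labels invented for the example, not results from the model, and the helper name top_confusions is hypothetical.

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=3):
    """Return the n most frequent (true label, predicted label) mistakes."""
    mistakes = Counter(
        (t, p) for t, p in zip(y_true, y_pred) if t != p
    )
    return mistakes.most_common(n)

# Toy labels for illustration only.
y_true = ["lisa", "lisa", "lisa", "bart", "homer", "lisa"]
y_pred = ["bart", "lisa", "bart", "bart", "homer", "homer"]
confusions = top_confusions(y_true, y_pred)
print(confusions)
```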

Adding a threshold to improve the accuracy

In order to improve the precision (which of course decreases the recall, but I will try not to decrease it too much), I thought I could add a threshold.

Before talking about a threshold to improve accuracy, I just want to add a famous graph about recall and precision.

Graph to define recall and precision
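In code, the two definitions come down to counting true positives, false positives and false negatives for one character. This is a plain Python sketch on toy labels made up for the example; sklearn's classification report computes the same quantities for every class.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and f1-score for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Precision: among pictures predicted as `label`, how many are right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: among pictures that really are `label`, how many were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
y_true = ["lisa", "lisa", "lisa", "bart", "bart", "homer"]
y_pred = ["lisa", "lisa", "bart", "bart", "lisa", "homer"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "lisa")
print(precision, recall, f1)
```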

I compute some statistics about good and wrong predictions: the maximum predicted probability, the probability difference between the best two candidates and the standard deviation (std) of the predicted probabilities.

For good predictions: Max: 0.83, Difference Two First: 0.773, STD: 0.21

For wrong predictions: Max: 0.27, Difference Two First: 0.092, STD: 0.07
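For a single prediction, these three values can be computed from the softmax output as in the sketch below. The two probability vectors are invented for the example (with only 4 classes to keep them readable) and the function name prediction_stats is hypothetical.

```python
import numpy as np

def prediction_stats(probs):
    """Max probability, gap between the two best classes, and std."""
    ordered = np.sort(probs)[::-1]  # probabilities, highest first
    return {
        "max": float(ordered[0]),
        "diff_two_first": float(ordered[0] - ordered[1]),
        "std": float(np.std(probs)),
    }

# Made-up softmax outputs: one confident, one hesitant.
confident = prediction_stats(np.array([0.90, 0.05, 0.03, 0.02]))
unsure = prediction_stats(np.array([0.30, 0.28, 0.22, 0.20]))
print(confident)
print(unsure)
```

A confident prediction puts most of the mass on one class, so all three values are high; a hesitant one spreads the mass, so all three are low.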

If the probability of the predicted character (1.) is too low, the standard deviation of the prediction (2.) is too low (wrong predictions have the lower std in the statistics above), or the probability difference between the two most likely characters (3.) is too low, maybe we can say that we don’t want to predict a character at all.

So I plot those 3 values for the test set to find a line (or a hyperplane) to separate good and wrong predictions. I did it for both characters.
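The simplest version of such a separation is one threshold per value. The sketch below refuses to predict when any of the three values falls under its threshold; it rejects on a low standard deviation because, in the statistics above, wrong predictions have the lower std. The threshold values and the function name are placeholders, not values found from the plots.

```python
import numpy as np

def predict_or_reject(probs, min_max=0.5, min_diff=0.2, min_std=0.1):
    """Return the predicted class index, or None if the model is too unsure.

    The three thresholds are hypothetical and would have to be tuned
    on the test-set plots.
    """
    ordered = np.sort(probs)[::-1]
    too_unsure = (
        ordered[0] < min_max                   # best probability too low
        or ordered[0] - ordered[1] < min_diff  # two best classes too close
        or np.std(probs) < min_std             # probabilities too flat
    )
    return None if too_unsure else int(np.argmax(probs))

# Made-up softmax outputs for illustration.
print(predict_or_reject(np.array([0.90, 0.05, 0.03, 0.02])))
print(predict_or_reject(np.array([0.30, 0.28, 0.22, 0.20])))
```

Raising the thresholds rejects more pictures, which should increase precision at the cost of recall.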