Yes officer, I saw the speed limit sign. I just didn’t see you.

This is part 1 of a series about building a deep learning model to recognize traffic signs. It’s intended to be a learning experience, for myself and for anyone who’d like to follow along. There are a lot of resources that cover the theory and math of neural networks, so I’ll focus on the practical aspects instead. I’ll describe my own experience building this model and share the source code and relevant materials. This is suitable for those who already know Python and the basics of machine learning, but want hands-on experience and practice building a real application.

In this part, I’ll talk about image classification and I’ll keep the model as simple as possible. In later parts, I’ll cover convolutional networks, data augmentation, and object detection.

Setup

The source code is available in this Jupyter notebook. I’m using Python 3.5 and TensorFlow 0.12. If you prefer to run the code in Docker, you can use my Docker image that contains many popular deep learning tools. Run it with this command:

docker run -it -p 8888:8888 -p 6006:6006 -v ~/traffic:/traffic waleedka/modern-deep-learning

Note that my project directory is in ~/traffic and I’m mapping it to the /traffic directory in the Docker container. Modify this if you’re using a different directory.

Finding Training Data

My first challenge was finding a good training dataset. Traffic sign recognition is a well-studied problem, so I figured I’d find something online.

I started by googling “traffic sign dataset” and found several options. I picked the Belgian Traffic Sign Dataset because it was big enough to train on, and yet small enough to be easy to work with.

You can download the dataset from http://btsd.ethz.ch/shareddata/. There are a lot of datasets on that page, but you only need the two files listed under BelgiumTS for Classification (cropped images):

BelgiumTSC_Training (171.3MBytes)

BelgiumTSC_Testing (76.5MBytes)

After expanding the files, this is my directory structure. Try to match it so you can run the code without having to change the paths:

/traffic/datasets/BelgiumTS/Training/

/traffic/datasets/BelgiumTS/Testing/

Each of the two directories contains 62 subdirectories, named sequentially from 00000 to 00061. The directory names represent the labels, and the images inside each directory are samples of each label.

Exploring the Dataset

Or, if you prefer to sound more formal: do Exploratory Data Analysis. It’s tempting to skip this part, but I’ve found that the code I write to examine the data ends up being used a lot throughout the project. I usually do this in Jupyter notebooks and share them with the team. Knowing your data well from the start saves you a lot of time later.

The images in this dataset are in an old .ppm format. So old, in fact, that most tools don’t support it. Which meant that I couldn’t casually browse the folders to take a look at the images. Luckily, the Scikit Image library recognizes this format. This code will load the data and return two lists: images and labels.

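import os
import skimage.data

def load_data(data_dir):
    """Load a dataset directory and return two lists: images and labels."""
    # Get all subdirectories of data_dir. Each represents a label.
    directories = [d for d in os.listdir(data_dir)
                   if os.path.isdir(os.path.join(data_dir, d))]
    # Loop through the label directories and collect the data in
    # two lists, labels and images.
    labels = []
    images = []
    for d in directories:
        label_dir = os.path.join(data_dir, d)
        file_names = [os.path.join(label_dir, f)
                      for f in os.listdir(label_dir)
                      if f.endswith(".ppm")]
        for f in file_names:
            # Newer scikit-image releases dropped skimage.data.imread;
            # skimage.io.imread is the replacement there.
            images.append(skimage.data.imread(f))
            # The directory name (e.g. "00032") is the label.
            labels.append(int(d))
    return images, labels

# Path from the directory structure shown earlier.
train_data_dir = "/traffic/datasets/BelgiumTS/Training"
images, labels = load_data(train_data_dir)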

This is a small dataset so I’m loading everything into RAM to keep it simple. For larger datasets, you’d want to load the data in batches.
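For a dataset that doesn’t fit in RAM, one option is to keep only the file paths in memory and read the images one batch at a time. The generator below is a minimal sketch of that idea, not code from the notebook. The names iter_batches, batch_size, and load_fn are ones I made up for illustration, and load_fn stands in for whatever image reader you use.

```python
def iter_batches(items, batch_size, load_fn=lambda x: x):
    """Yield lists of at most batch_size loaded items.

    items: a sequence of things to load (e.g. file paths).
    load_fn: hypothetical loader, applied lazily to each item.
    """
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        yield [load_fn(item) for item in chunk]


# Toy usage: strings stand in for image file paths.
paths = ["img{}.ppm".format(i) for i in range(7)]
batches = list(iter_batches(paths, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Only one batch of images is in memory at a time, at the cost of reading from disk on every pass over the data.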

After loading the images into Numpy arrays, I display a sample image of each label. See code in the notebook. This is our dataset:

The training set consists of 62 classes. The numbers in parentheses are the count of images of each class.
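If you want the per-class counts without plotting anything, a short sketch like this works on the labels list. The example_labels values here are made up so the snippet runs on its own; with the real data you would pass the labels returned by load_data().

```python
from collections import Counter

# Toy labels standing in for the list returned by load_data().
example_labels = [0, 0, 1, 32, 32, 32, 61]

counts = Counter(example_labels)
for label in sorted(counts):
    print("Label {0}: {1} images".format(label, counts[label]))

# Number of distinct classes that actually appear in the data.
num_classes = len(counts)
```

A very uneven distribution is worth knowing about early, because rare classes tend to be harder for the model to learn.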

Looks like a good training set. The image quality is great, and there are a variety of angles and lighting conditions. More importantly, the traffic signs occupy most of the area of each image, which allows me to focus on object classification and not have to worry about finding the location of the traffic sign in the image (object detection). I’ll get to object detection in a future post.

The first thing I noticed from the samples above is that images are square-ish, but have different aspect ratios. My neural network will take a fixed-size input, so I have some preprocessing to do. I’ll get to that soon, but first let’s pick one label and see more of its images. Here is an example of label 32:

Several sample images of label 32

It looks like the dataset considers all speed limit signs to be of the same class, regardless of the numbers on them. That’s fine, as long as we know about it beforehand and know what to expect. That’s why understanding your dataset is so important and can save you a lot of pain and confusion later.

I’ll leave exploring the other labels to you. Labels 26 and 27 are interesting to check. They also have numbers in red circles, so the model will have to get really good to differentiate between them.

Handling Images of Different Sizes

Resizing images to a similar size and aspect ratio

Most image classification networks expect images of a fixed size, and our first model will do as well. So we need to resize all the images to the same size.

But since the images have different aspect ratios, some of them will be stretched vertically or horizontally. Is that a problem? I don’t think so in this case, because the differences in aspect ratios are not that large. My own criterion is that if a person can recognize the images when they’re stretched, then the model should be able to do so as well.

What are the sizes of the images anyway? Let’s print a few examples:

for image in images[:5]:
    print("shape: {0}, min: {1}, max: {2}".format(
        image.shape, image.min(), image.max()))

Output:

shape: (141, 142, 3), min: 0, max: 255
shape: (120, 123, 3), min: 0, max: 255
shape: (105, 107, 3), min: 0, max: 255
shape: (94, 105, 3), min: 7, max: 255
shape: (128, 139, 3), min: 0, max: 255

The sizes seem to hover around 128x128. I could use that size to preserve as much information as possible, but in early development I prefer to use a smaller size because it leads to faster training, which allows me to iterate faster. I experimented with 16x16 and 20x20, but they were too small. I ended up picking 32x32 which is easy to recognize (see below) and reduces the size of the model and training data by a factor of 16 compared to 128x128.

I’m also in the habit of printing the min() and max() values often. It’s a simple way to verify the range of the data and catch bugs early. This tells me that the image colors are the standard range of 0–255.
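To make the resizing step concrete, here is a minimal NumPy-only sketch that shrinks an image to 32x32 by nearest-neighbor sampling and then prints the shape and value range. It’s a simplified stand-in and not necessarily what the notebook does; a proper library resize function will give smoother results. The helper name resize_nearest is hypothetical, and the random array stands in for a real image.

```python
import numpy as np


def resize_nearest(image, out_h, out_w):
    """Resize an HxWxC image by picking the nearest source pixel."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return image[rows[:, None], cols[None, :]]


# Random pixels with the same shape as the first sample printed above.
rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(141, 142, 3)).astype(np.uint8)

image32 = resize_nearest(image, 32, 32)
print("shape: {0}, min: {1}, max: {2}".format(
    image32.shape, image32.min(), image32.max()))
```

Some library resize functions also convert pixel values to floats in the 0 to 1 range, which is exactly the kind of change that printing min() and max() catches.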

Images resized to 32x32

Minimum Viable Model

We’re getting to the interesting part! Continuing the theme of keeping it simple, I started with the simplest possible model: a one-layer network that consists of one neuron per label.

This network has 62 neurons and each neuron takes the RGB values of all pixels as input. Effectively, each neuron receives 32*32*3=3072 inputs. This is a fully-connected layer because every neuron connects to every input value. You’re probably familiar with its equation:

y = xW + b
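To see the shapes involved, here is a small NumPy sketch of that equation for a batch of flattened images. The batch size of 4 and the random values are arbitrary choices for illustration; this is not the TensorFlow code used later.

```python
import numpy as np

batch_size, num_inputs, num_labels = 4, 32 * 32 * 3, 62

rng = np.random.RandomState(0)
x = rng.rand(batch_size, num_inputs)            # flattened images, one per row
W = rng.randn(num_inputs, num_labels) * 0.01    # one column of weights per neuron
b = np.zeros(num_labels)                        # one bias per neuron

y = x.dot(W) + b                                # b is broadcast across the batch
print(x.shape, W.shape, y.shape)  # (4, 3072) (3072, 62) (4, 62)
```

The layer has 3072 * 62 weights plus 62 biases, so 190,526 trainable parameters in total.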

I start with a simple model because it’s easy to explain, easy to debug, and fast to train. Once this works end to end, expanding on it is much easier than building something complex from the start.

Building the TensorFlow Graph

Visualization of a part of a TensorFlow graph

TensorFlow encapsulates the architecture of a neural network in an execution graph. The graph consists of operations (Ops for short) such as Add, Multiply, Reshape, and so on. These ops perform actions on data in tensors (multidimensional arrays).

I’ll go through the code to build the graph step by step below, but here is the full code if you prefer to scan it first:

First, I create the Graph object. TensorFlow has a default global graph, but I don’t recommend using it. Global variables are bad in general because they make it too easy to introduce bugs. I prefer to create the graph explicitly.

graph = tf.Graph()

Then I define Placeholders for the images and labels. The placeholders are TensorFlow’s way of receiving input from the main program. Notice that I create the placeholders (and all other ops) inside the block of with graph.as_default(). This is so they become part of my graph object rather than the global graph.

with graph.as_default():
    images_ph = tf.placeholder(tf.float32, [None, 32, 32, 3])
    labels_ph = tf.placeholder(tf.int32, [None])

The shape of the images_ph placeholder is [None, 32, 32, 3]. It stands for [batch size, height, width, channels] (often shortened as NHWC). The None for batch size means that the batch size is flexible, so we can feed different batch sizes to the model without having to change the code. Pay attention to the order of your inputs because some models and frameworks might use a different arrangement, such as NCHW.
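If you ever need to move between the two layouts, it’s a single transpose. This sketch uses NumPy and a dummy batch of zeros purely to show the axis order; the batch size of 10 is arbitrary.

```python
import numpy as np

# A dummy batch of 10 RGB images in NHWC order.
batch_nhwc = np.zeros((10, 32, 32, 3), dtype=np.float32)

# Move the channel axis to position 1 to get NCHW.
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))
print(batch_nhwc.shape, batch_nchw.shape)  # (10, 32, 32, 3) (10, 3, 32, 32)

# And back again.
restored = np.transpose(batch_nchw, (0, 2, 3, 1))
```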

Next, I define the fully connected layer. Rather than implementing the raw equation, y = xW + b, I use a handy function that does that in one line and also applies the activation function. It expects input as a one-dimensional vector, though. So I flatten the images first.

The ReLU function

I’m using the ReLU activation function here:

f(x) = max(0, x)

It simply converts all negative values to zeros. It’s been shown to work well in classification tasks and trains faster than sigmoid or tanh. For more background, check here and here.
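In NumPy terms, ReLU is a one-liner. This is only an illustration of the formula, with made-up input values; the model itself uses TensorFlow’s built-in version.

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)


values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(values))
```

Negative inputs come out as zero and everything else passes through unchanged.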

# Flatten input from: [None, height, width, channels]
# To: [None, height * width * channels] == [None, 3072]
images_flat = tf.contrib.layers.flatten(images_ph)

# Fully connected layer.
# Generates logits of size [None, 62]
logits = tf.contrib.layers.fully_connected(images_flat, 62,
                                           tf.nn.relu)

Bar chart visualization of a logits vector

The output of the fully connected layer is a logits vector of length 62 (technically, it’s [None, 62] because we’re dealing with a batch of logits vectors).

A row in the logits tensor might look like this: [0.3, 0, 0, 1.2, 2.1, .01, 0.4, ….., 0, 0]. The higher the value, the more likely that the image represents that label. Logits are not probabilities, though: they can have any value, and they don’t add up to 1. The actual absolute values of the logits are not important, just their values relative to each other. It’s easy to convert logits to probabilities using the softmax function if needed (it’s not needed here).
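For reference, this is roughly what the softmax conversion looks like in NumPy. It’s a sketch for a single logits vector, using the first few values from the example row above as toy input.

```python
import numpy as np


def softmax(logits):
    """Convert a 1D logits vector to probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)


logits_row = np.array([0.3, 0.0, 0.0, 1.2, 2.1])
probs = softmax(logits_row)
print(probs, probs.sum())

# Adding a constant to every logit leaves the probabilities unchanged,
# which is why only the relative values matter.
assert np.allclose(probs, softmax(logits_row + 10.0))
```

The largest logit still gets the largest probability, so the predicted label is the same with or without softmax.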

In this application, we just need the index of the largest value, which corresponds to the id of the label. The argmax op does that.

# Convert logits to label indexes.

# Shape [None], which is a 1D vector of length == batch_size.

predicted_labels = tf.argmax(logits, 1)

The argmax output will be integers in the range 0 to 61.
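Here is the same operation in NumPy on two made-up rows, with 5 labels instead of 62 to keep it readable. The second argument of 1 (axis=1 in NumPy) means the maximum is taken across the labels within each row, not across the batch.

```python
import numpy as np

# Two made-up logits rows with 5 labels each (the real model has 62).
toy_logits = np.array([[0.3, 0.0, 1.2, 2.1, 0.4],
                       [1.5, 0.0, 0.2, 0.0, 0.1]])

toy_predicted = np.argmax(toy_logits, axis=1)
print(toy_predicted)  # [3 0]
```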

Loss Function and Gradient Descent

Credit: Wikipedia

Choosing the right loss function is an area of research in and of itself, which I won’t delve into here other than to say that cross-entropy is the most common function for classification tasks. If you’re not familiar with it, there is a really good explanation here and here.

Cross-entropy is a measure of difference between two vectors of probabilities. So we need to convert both the labels and the logits to probability vectors. The function sparse_softmax_cross_entropy_with_logits() simplifies that. It takes the generated logits and the ground-truth labels and conceptually does three things: it treats each label index as a one-hot probability vector, so labels of shape [None] become vectors of shape [None, 62]; it runs softmax on the prediction logits to convert them to probabilities; and it calculates the cross-entropy between the two. This generates a loss vector of shape [None] (1D of length = batch size), which we pass through reduce_mean() to get a single number that represents the loss value.

# Keyword arguments avoid mixing up the order of logits and labels.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels_ph))
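To demystify what this loss computes, here is a NumPy sketch of the same calculation on a tiny made-up batch with 3 labels. It’s meant as an illustration of the math, not a description of how TensorFlow implements it internally. Because a one-hot label vector has a single 1, the cross-entropy reduces to the negative log-probability of the correct label.

```python
import numpy as np


def sparse_softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy for integer labels.

    logits: float array of shape [batch, num_labels]
    labels: int array of shape [batch]
    """
    # Log-softmax, computed in a numerically stable way.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct label in each row.
    return -log_probs[np.arange(len(labels)), labels]


toy_logits = np.array([[2.0, 0.5, 0.1],
                       [0.1, 0.2, 3.0]])
toy_labels = np.array([0, 1])

per_example = sparse_softmax_cross_entropy(toy_logits, toy_labels)
toy_loss = per_example.mean()
print(per_example, toy_loss)
```

The second example is penalized more because its largest logit points at the wrong label. As a sanity check, a model that spreads its guess uniformly over 62 labels has a loss of ln(62), or about 4.13, which is a useful reference point when reading the first loss values during training.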

Choosing the optimization algorithm is another decision to make. I usually use the Adam optimizer because it often converges faster than simple gradient descent. This post does a great job comparing different gradient descent optimizers.

train = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

The last node in the graph is the initialization op, which sets every variable to its initial value (zeros, random values, or whatever its initializer specifies).

init = tf.global_variables_initializer()

Notice that the code above doesn’t execute any of the ops yet. It’s just building the graph and describing its inputs. The variables we defined above, such as init, loss, and predicted_labels, don’t contain numerical values. They are references to ops that we’ll execute next.

Training Loop

This is where we iteratively train the model to minimize the loss function. Before we start training, though, we need to create a Session object.

I mentioned the Graph object earlier and how it holds all the Ops of the model. The Session, on the other hand, holds the values of all the variables. If a graph holds the equation y=xW+b then the session holds the actual values of these variables.

session = tf.Session(graph=graph)

Usually the first thing to run after starting a session is the initialization op, init, to initialize the variables.

session.run(init)

Then we start the training loop and run the train op repeatedly. While not necessary, it’s useful to run the loss op as well to print its values and monitor the progress of the training.

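# images_a and labels_a hold the resized training images and their
# labels as NumPy arrays (see the notebook).
for i in range(201):
    _, loss_value = session.run(
        [train, loss],
        feed_dict={images_ph: images_a, labels_ph: labels_a})
    if i % 10 == 0:
        print("Loss: ", loss_value)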

In case you’re wondering, I set the loop to 201 so that the i % 10 condition is satisfied in the last round and prints the last loss value. The output should look something like this:

Loss: 4.2588

Loss: 2.88972

Loss: 2.42234

Loss: 2.20074

Loss: 2.06985

Loss: 1.98126

Loss: 1.91674

Loss: 1.86652

Loss: 1.82595

...
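The choice of 201 iterations is easy to check without running any training. This small sketch lists the steps at which the logging condition fires; num_steps and logged_steps are just names for this illustration.

```python
num_steps = 201

# Steps at which the `i % 10 == 0` condition is true.
logged_steps = [i for i in range(num_steps) if i % 10 == 0]
print(logged_steps[0], logged_steps[-1], len(logged_steps))  # 0 200 21
```

With range(200), the last printed step would have been 190, so the loss after the final ten updates would never be shown.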

Using the Model

Now we have a trained model in memory in the Session object. To use it, we call session.run() just like in the training code. The predicted_labels op returns the output of the argmax() function, so that’s what we need to run. Here I classify 10 random images and print both the predictions and the ground-truth labels for comparison.

import random

# Pick 10 random images
sample_indexes = random.sample(range(len(images32)), 10)
sample_images = [images32[i] for i in sample_indexes]
sample_labels = [labels[i] for i in sample_indexes]

# Run the "predicted_labels" op.
predicted = session.run(predicted_labels,
                        {images_ph: sample_images})
print(sample_labels)
print(predicted)

Output:

[15, 22, 61, 44, 32, 22, 57, 38, 56, 38]

[14 22 61 44 32 22 56 38 56 38]

In the notebook, I include a function to visualize the results as well. It generates something like this:

The visualization shows that the model is working, but doesn’t quantify how accurate it is. And you might’ve noticed that it’s classifying the training images, so we don’t know yet if the model generalizes to images that it hasn’t seen before. Next, we calculate a better evaluation metric.

Evaluation

To properly measure how the model generalizes to data it hasn’t seen, I do the evaluation on test data that I didn’t use in training. The BelgiumTS dataset makes this easy by providing two separate sets, one for training and one for testing.

In the notebook I load the test set, resize the images to 32x32, and then calculate the accuracy. This is the relevant part of the code that calculates the accuracy.

# Run predictions against the full test set.
predicted = session.run(predicted_labels,
                        feed_dict={images_ph: test_images32})

# Calculate how many matches we got.
match_count = sum([int(y == y_)
                   for y, y_ in zip(test_labels, predicted)])
accuracy = match_count / len(test_labels)
print("Accuracy: {:.3f}".format(accuracy))
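To see the accuracy calculation on concrete numbers, here it is applied to the 10-image sample printed earlier. Keep in mind that those were training images, so this figure only illustrates the arithmetic and says nothing about how the model does on the test set. The name sample_predicted is mine.

```python
# Ground-truth labels and predictions from the 10-image sample shown earlier.
sample_labels = [15, 22, 61, 44, 32, 22, 57, 38, 56, 38]
sample_predicted = [14, 22, 61, 44, 32, 22, 56, 38, 56, 38]

# Count the positions where the prediction equals the label.
match_count = sum(int(y == y_) for y, y_ in zip(sample_labels, sample_predicted))
accuracy = match_count / len(sample_labels)
print("Accuracy: {:.3f}".format(accuracy))  # Accuracy: 0.800
```

Note that the division relies on Python 3 semantics; in Python 2 the same expression would do integer division and print 0.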

The accuracy I get in each run ranges between 0.40 and 0.70, which I attribute to whether the model lands on a local minimum or a global minimum. This kind of variation is expected when running a simple model like this one. In a future post I’ll talk about ways to improve the consistency of the results.

Closing the Session

Congratulations! We have a working simple neural network. Given how simple this neural network is, training takes just a minute on my laptop so I didn’t bother saving the trained model. In the next part, I’ll add code to save and load trained models and expand to use multiple layers, convolutional networks, and data augmentation. Stay tuned!