There are very few examples of sharp turns, and most of the data lies between -2500 and 2500. Since this distribution is roughly normal, mean squared error (MSE) loss (which I used for the regression) should work well.
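A quick way to see this is to histogram the labels (a sketch, assuming the steering labels live in a pandas DataFrame column named wheel-axis, as in the data generator further below):

import matplotlib.pyplot as plt

# Most of the mass sits between -2500 and 2500; sharp turns are rare tails.
df['wheel-axis'].hist(bins=100)
plt.xlabel('steering value')
plt.ylabel('count')
plt.show()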

Additionally, the data was augmented by flipping images horizontally and negating the steering label to match:

if np.random.choice(2, 1)[0] == 1:
    pil_img = pil_img.transpose(Image.FLIP_LEFT_RIGHT)
    label = -1 * label  # Changing the direction of the wheel axis.

This effectively doubled the training data, since every image can now appear in either orientation.

Network Architecture

I picked the famous PilotNet architecture by Nvidia, with some modifications. Here is the original architecture:

[Figure: the original PilotNet architecture. Source: Nvidia blog]

I additionally added Batch Normalization (BN) after each layer (on the channel axis for the convolutional layers) for faster training. I also tried Dropout for regularization, but it didn't seem to affect the accuracy, so I removed it. There has been a lot of debate on whether to apply BN before or after the non-linearity. Since I used ReLU at each layer, placing BN before the ReLU would largely undo the normalization: the ReLU discards the negative half of the zero-mean activations, which increases the mean and reduces the variance of what the next layer actually sees. Hence, I applied BN after the non-linearity, which worked out fine for me.
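In Keras terms, each block is therefore conv → ReLU → BN. A minimal sketch of one such block with the functional API (sizes taken from the first PilotNet layer):

from keras.layers import Input, Conv2D, Activation, BatchNormalization

inp = Input(shape=(66, 200, 3))
x = Conv2D(24, (5, 5), strides=(2, 2))(inp)  # first PilotNet conv layer
x = Activation('relu')(x)                    # non-linearity first ...
x = BatchNormalization()(x)                  # ... then BN; the default axis
                                             # normalizes over the channels

In the final model the ReLU is passed as the layer's activation argument, which is why no separate activation rows appear in the summary below.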

I scaled the input image down to 200 x 66 (the same as PilotNet) to keep the parameter count of the fully connected layers low (the convolutional layers' parameter counts are unaffected by input size). This was important to avoid overfitting: a model with far more parameters than it needs has the capacity to memorize the training set, whereas with lower capacity, gradient descent forces the model to learn the important patterns in the data instead of memorizing it. Having too few parameters is also bad, as the model may not learn anything at all.

The growth in parameters could also have been avoided with max-pooling, but max-pooling is generally used for spatial invariance, which is not desired here: the exact position of road features matters for predicting the steering value.
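The downscaling itself can happen at load time (a sketch; Keras' load_img takes target_size as (rows, cols), and the generator below assumes the images are already stored at this resolution):

from keras.preprocessing import image

# Load a frame and scale it down to 66 x 200 (height x width) in one step.
pil_img = image.load_img(path_to_data + img_name, target_size=(66, 200))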

The input to the model was normalized by dividing the pixel values by 255. There are better ways to normalize input images, such as using the mean and variance of the whole training set, but this also works fine. I used mean squared error loss without any regularization. I arrived at all of these design choices by testing them on the validation set. The rest of the network parameters remain unchanged from PilotNet.
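Side by side, the two options look like this (only the first was used; train_mean and train_std are illustrative names for precomputed training-set statistics):

# Simple per-pixel rescaling to [0, 1], as used here:
x = image.img_to_array(pil_img) / 255.0

# A common alternative (not used): standardize with training-set statistics.
x = (image.img_to_array(pil_img) - train_mean) / train_std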

This is the final architecture, which I used with input_shape = (66, 200, 3):

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 31, 98, 24)        1824
_________________________________________________________________
batch_normalization_1 (Batch (None, 31, 98, 24)        96
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 47, 36)        21636
_________________________________________________________________
batch_normalization_2 (Batch (None, 14, 47, 36)        144
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 5, 22, 48)         43248
_________________________________________________________________
batch_normalization_3 (Batch (None, 5, 22, 48)         192
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 3, 20, 64)         27712
_________________________________________________________________
batch_normalization_4 (Batch (None, 3, 20, 64)         256
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 1, 18, 64)         36928
_________________________________________________________________
batch_normalization_5 (Batch (None, 1, 18, 64)         256
_________________________________________________________________
flatten_1 (Flatten)          (None, 1152)              0
_________________________________________________________________
dense_1 (Dense)              (None, 100)               115300
_________________________________________________________________
batch_normalization_6 (Batch (None, 100)               400
_________________________________________________________________
dense_2 (Dense)              (None, 50)                5050
_________________________________________________________________
batch_normalization_7 (Batch (None, 50)                200
_________________________________________________________________
dense_3 (Dense)              (None, 10)                510
_________________________________________________________________
batch_normalization_8 (Batch (None, 10)                40
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11
=================================================================
Total params: 253,803
Trainable params: 253,011
Non-trainable params: 792

I used Keras with the TensorFlow backend for all the experiments as well as the final training.
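For reference, here is a Sequential definition that reproduces the summary above (a reconstruction from the summary; the original code may differ in minor details):

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense

def build_model(input_shape=(66, 200, 3)):
    model = Sequential()
    # PilotNet convolutional stack; ReLU inside each layer, BN right after it.
    model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='relu',
                     input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(BatchNormalization())
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(BatchNormalization())
    model.add(Flatten())
    # PilotNet fully connected head.
    model.add(Dense(100, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(50, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(10, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1))  # linear output: the steering value
    return model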

Training

Adjacent images in the dataset are highly correlated, so it was important to shuffle the data for training. On top of that, the dataset was re-shuffled after every epoch so that every batch is unique across epochs.

total data: 162495, training set: 140800, validation set: 21695

batch_size: 128, train_steps: 1100, val_steps: 170

Since Keras' flow_from_directory doesn't support regression targets, I had to write my own data generator with data augmentation:

import numpy as np
from PIL import Image
from keras.preprocessing import image

INPUT_NORMALIZATION = 255.0
OUTPUT_NORMALIZATION = 655.35  # picked this number to compare results with the data source model

img_shape = (66, 200, 3)
batch_size = 128

def generator(df, batch_size):
    img_list = df['img']
    wheel_axis = df['wheel-axis']
    # Pre-allocate an empty batch.
    batch_img = np.zeros((batch_size,) + img_shape)
    batch_label = np.zeros((batch_size, 1))
    index = 0
    while True:  # Keras generators loop forever; epochs are bounded by steps.
        for i in range(batch_size):
            label = wheel_axis.iloc[index]
            img_name = img_list.iloc[index]
            # path_to_data points at the image directory (defined elsewhere).
            pil_img = image.load_img(path_to_data + img_name)
            # Data augmentation: flip half the images horizontally.
            if np.random.choice(2, 1)[0] == 1:
                pil_img = pil_img.transpose(Image.FLIP_LEFT_RIGHT)
                label = -1 * label
            batch_img[i] = image.img_to_array(pil_img)
            batch_label[i] = label
            index += 1
            if index == len(img_list):
                # End of an epoch, hence reshuffle.
                df = df.sample(frac=1).reset_index(drop=True)
                img_list = df['img']
                wheel_axis = df['wheel-axis']
                index = 0
        yield batch_img / INPUT_NORMALIZATION, batch_label / OUTPUT_NORMALIZATION

Deciding on the minibatch size was also tricky. With a very small batch size, the computed gradients are noisy estimates of the true gradient, so training becomes noisy; with a very large batch size, the batch may not fit in memory. I chose a minibatch size of 128.

I trained the network with the stochastic gradient descent (SGD) optimizer, using Nesterov momentum and learning-rate decay:

from keras.optimizers import SGD

sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)
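Putting the pieces together, training looked roughly like this (a sketch; train_df and val_df are illustrative names for the shuffled training and validation splits):

model = build_model()
model.compile(optimizer=sgd, loss='mse', metrics=['mae'])
model.fit_generator(generator(train_df, batch_size),
                    steps_per_epoch=1100,
                    validation_data=generator(val_df, batch_size),
                    validation_steps=170,
                    epochs=41)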

The model was trained for 41 epochs on a CPU, taking ~30 hours, and reached a validation mean squared error of 0.1166 and a validation mean absolute error of 0.2429 (at the 36th epoch), which corresponds to a mean error of about 160 (= 0.2429 × OUTPUT_NORMALIZATION) on the scale of 20k.