Introduction

Image classification is often called the ABCs of deep learning, to the point where classifying MNIST digits has become the field’s “Hello World” exercise. It is the most widely known and publicized aspect of deep learning and the best catered for, with hundreds of datasets on Kaggle and hundreds of tutorials covering a wide range of applications. The yearly ImageNet Large Scale Visual Recognition Challenge is famous as the arena where the now industry-standard VGGNet, ResNet, and Inception architectures made their mark as state of the art.

While the classification of dissimilar classes of objects, such as apples from animals, is facile, intra-class (fine-grained) classification is significantly trickier. The difficulties arise when classifying highly similar subjects that differ only in small visual details, which can lead to severe under- or overfitting. A recent example of how fragile classifiers can be was an adversarial study on Google’s Inception network, in which the classifier was fooled into labelling a tabby cat as guacamole after only a few small pixel changes.

I recently took part in the AI for SEA hackathon organized by Grab, where the aim was to build an effective yet computationally fast classifier for the Stanford Cars-196 dataset, under a theme of autonomous driving. While distinguishing between different classes of vehicles may be facile, it becomes significantly more difficult to distinguish between intra-class subjects of different models or brands. In this tutorial, we will attempt to replicate our work to improve the baseline performance of a fast MobileNetV2 classifier at distinguishing between the make and model of different automobiles in the Stanford Cars-196 dataset.

Implementation

We’ve covered image classification before in our Malaria Classification example and we’ll be reusing much of the code; the reader is encouraged to revisit that for familiarization with the code. A pre-sorted version of the Stanford Cars-196 dataset available on Kaggle was utilized for our model. All of our code was split between two Notebooks, run on Google’s Colaboratory environment. Let’s define our model architecture.

Classifier Architecture. Notebook 1 covers background removal preprocessing, while Notebook 2 covers other steps together with the training of the network.

Our model consists of the base MobileNetV2 model, with the top layers replaced by two densely connected layers (of size 1024 and 196, respectively), separated by a 50% dropout layer to prevent overfitting. The network was pre-loaded with ImageNet weights, and training was done using an Adam optimizer at a learning rate of 0.0002.
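The architecture described above can be sketched in Keras roughly as follows. This is a minimal sketch rather than the notebook's actual code: the function name build_classifier, the 224x224 input size, the global average pooling layer, and the ReLU/softmax activations are assumptions that the text does not state.

```python
import tensorflow as tf


def build_classifier(num_classes=196, weights="imagenet", input_shape=(224, 224, 3)):
    """MobileNetV2 base with a 1024-unit dense layer, 50% dropout, and a softmax head."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape,
        include_top=False,  # drop the original ImageNet classification layers
        weights=weights,    # "imagenet" downloads pre-trained weights; None skips that
    )
    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)  # assumed pooling step
    x = tf.keras.layers.Dense(1024, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)              # 50% dropout between the dense layers
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_classifier() with the default arguments fetches the ImageNet weights on first use; passing weights=None builds the same graph with random initialization, which is handy for a quick shape check.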

Initially, we trained the MobileNetV2 architecture, pre-loaded with ImageNet weights, on the dataset for 50 epochs as a baseline, and observed a poor validation accuracy of below 40%. We postulated that this poor accuracy was due to a variety of factors, including:

Unoptimized weights: High subject variation within the ImageNet vehicle class, as well as the small size of that class relative to others (for example, ImageNet possesses 374K images of vehicles versus 2799K images of animals).

Constrained dataset: The Stanford Cars-196 dataset is small, amounting to roughly a hundred images per class.

Subject similarity: The overall profile of different vehicles of a shared vehicle class is similar, making them difficult to distinguish. This is difficult even for humans, and hence not surprising at all.

Background noise: A large amount of background is present in all images. Some of these backgrounds contain auxiliary vehicles, which may lead to misleading features being learned by our classifier.

With our problems defined, let’s address each of these points.

Background removal

You’ll notice that each image features an excessive amount of background noise, including partial views of trees, auxiliary vehicles, or people. It would hence be in the interest of the classifier to remove these distractions and focus only on the image features that we want the network to learn. While bounding boxes are provided with the dataset, generating them can be time-intensive or even infeasible in commercial applications. An automated solution would be ideal.

It was decided that a YOLO-based cropping detector would be more scalable under real-world data collection circumstances. YOLO is itself a trained neural network object detector (the full YOLOv3 model is built on a Darknet-53 backbone): it locates the objects in an image and generates a bounding box for each, which we then use to crop out the background around the primary subject. Such a detector was implemented in Notebook 1, and its outputs were collected and stored on Google Drive as inputs for Notebook 2. The method called from the main thread is crop_path():

    detector.crop_path(
        '../content/car_data/test/',
        '../content/cropped_car_test/',
        dirs,
        params={
            'detect_largest_box': True,
            'smallest_detected_area': 0.5,
        }
    )

To better understand how the solution works, let’s look at the key components of the Detector class in detail:

    import os

    from imageai.Detection import ObjectDetection
    # Image helpers; this import path is assumed (Keras), as the original listing omits its imports.
    from keras.preprocessing.image import img_to_array, load_img


    class Detector:
        def __init__(self, model):
            execution_path = os.getcwd()
            detector = ObjectDetection()

            # Select the YOLOv3 variant whose weights were downloaded beforehand.
            if model == 'yolo':
                detector.setModelTypeAsYOLOv3()
            elif model == 'yolo-tiny':
                detector.setModelTypeAsTinyYOLOv3()
            else:
                raise ValueError('Model ' + model + ' not found. You should download the model and put it into "modules" directory.')

            detector.setModelPath(os.path.join(execution_path, '../content/' + model + '.h5'))
            detector.loadModel()

            # Restrict detections to the car class only.
            custom_objects = detector.CustomObjects(car=True)

            self.detector = detector
            self.custom_objects = custom_objects
            self.execution_path = execution_path

        def get_box(self, detections, size, params={'detect_largest_box': False, 'smallest_detected_area': None}):
            detect_largest_box = params['detect_largest_box']
            smallest_detected_area = params['smallest_detected_area']

            area = 0
            points = None

            if detect_largest_box:
                # Keep the detection whose box ([x1, y1, x2, y2]) covers the largest area.
                for detection in detections:
                    current_area = (detection['box_points'][2] - detection['box_points'][0]) * (detection['box_points'][3] - detection['box_points'][1])
                    if current_area > area:
                        area = current_area
                        points = detection['box_points']
            else:
                points = detections[0]['box_points']

            img_area = size[0] * size[1] if smallest_detected_area is not None else None

            # Reject boxes that cover too small a fraction of the image.
            if (img_area is not None) and (area / img_area < smallest_detected_area):
                return None
            else:
                return points

        def crop_image(self, source_image, input_type, **kwargs):
            image = None
            input_image = None
            if input_type == 'file':
                image = load_img(source_image)
                input_image = source_image
            elif input_type == 'array':
                image = source_image
                input_image = img_to_array(source_image)

            # Run the detector, keeping only car-class detections.
            detections = self.detector.detectCustomObjectsFromImage(
                custom_objects=self.custom_objects,
                input_type=input_type,
                input_image=input_image,
                minimum_percentage_probability=10,
                extract_detected_objects=False,
            )

            box = self.get_box(detections, image.size, **kwargs)

            # If no box passed the checks, box is None and PIL's crop() returns an uncropped copy.
            cropped_img = image.crop(box)
            return cropped_img

During the initialization of the detector, we need to specify the type of model used (which is downloaded into our session by a separate command before initialization), and set our detector to only recognize objects belonging to the car class, else we’d be detecting irrelevant objects in the process.

Following initialization, the crop_path() method eventually calls the crop_image() method, which loads each image of a directory and calls the YOLO detector’s detectCustomObjectsFromImage() method to generate a set of bounding boxes for car-class detections. These are then passed to the get_box() method.
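Since crop_path() itself is not listed here, the following is only a rough sketch of what such a directory walker might look like. The name crop_tree and its signature are hypothetical, and crop_fn stands in for whatever wraps crop_image() and saves the result.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def crop_tree(src_root, dst_root, class_dirs, crop_fn):
    """Mirror src_root/<class>/<image> into dst_root, applying crop_fn(src, dst) to each image."""
    written = []
    for class_dir in class_dirs:
        out_dir = Path(dst_root) / class_dir
        out_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted((Path(src_root) / class_dir).iterdir()):
            if src.suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skip anything that is not an image file
            dst = out_dir / src.name
            crop_fn(src, dst)  # hypothetical hook: detect, crop, and save
            written.append(dst)
    return written
```

Keeping the class subfolder names intact in the output means the cropped images can be consumed by the same directory-based loaders as the originals.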

Finally, the get_box() method iterates over every detected bounding box and selects the largest one. This is based on the assumption that the primary car subject in an image should possess the largest bounding box. The selected box is discarded if it covers less than the smallest_detected_area fraction of the image (0.5 in our call). Let’s take a look at a quick example:

Audi TTS Coupe, pre- and post-background cropping.
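The selection rule used by get_box() can also be checked on toy data without running a detector. The sketch below is a standalone restatement, not the notebook's code: select_largest_box and the made-up detections are illustrative, with boxes given as [x1, y1, x2, y2] as in the box_points entries above.

```python
def select_largest_box(detections, image_size, min_area_fraction=None):
    """Return the largest [x1, y1, x2, y2] box, or None if it is missing or too small."""
    best_box, best_area = None, 0
    for detection in detections:
        x1, y1, x2, y2 = detection["box_points"]
        area = (x2 - x1) * (y2 - y1)
        if area > best_area:
            best_box, best_area = detection["box_points"], area

    if best_box is None:
        return None
    if min_area_fraction is not None:
        image_area = image_size[0] * image_size[1]
        if best_area / image_area < min_area_fraction:
            return None  # the largest car still covers too little of the image
    return best_box


# Toy detections on a 100x100 image: a small background car and the main subject.
detections = [
    {"name": "car", "box_points": [0, 0, 20, 10]},
    {"name": "car", "box_points": [10, 20, 90, 90]},
]
print(select_largest_box(detections, (100, 100), min_area_fraction=0.5))      # [10, 20, 90, 90]
print(select_largest_box(detections[:1], (100, 100), min_area_fraction=0.5))  # None
```

The second call shows the area threshold at work: when only the small background car is detected, no crop box is returned.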

Segmentation & feature extraction

To understand the position and nature of intra-class differences within our dataset, let’s take a look at an example image of a Chrysler:

Locations of visually distinct features of a Chrysler Sedan.

From the figure, we can conclude that the intra-class differences revolve around stylistic differences at the front and rear of a vehicle. It would hence be fruitful to focus on these areas, to assist the classifier in extracting features specific to each model and thereby improve performance.

To achieve this, we isolate these areas from the rest of the raw image using an image segmentation function, defined in Notebook 2. Briefly, this works by dividing the image into two sets of halves: left and right halves split along the width, and top and bottom halves split along the height. A quartering function was also evaluated but found to perform worse, which was attributed to each quarter possessing incomplete feature information.

    import os

    import cv2

    # dirnames is expected to hold the class subfolder paths (defined elsewhere in Notebook 2).
    for dirname in dirnames:
        files = os.listdir(dirname)
        # Each subfolder has a differing number of files in it, so we iterate with an int tracker
        i = 0
        for f in files:
            filename = os.path.join(dirname, f)
            print("Generating half crops for file " + str(filename))

            im = cv2.imread(filename)
            h, w = im.shape[:2]
            midh = h // 2
            midw = w // 2

            crop1 = im[0:h, 0:midw]   # left half
            crop2 = im[0:h, midw:w]   # right half
            crop3 = im[0:midh, 0:w]   # top half
            crop4 = im[midh:h, 0:w]   # bottom half

            # Create the new filenames
            filesave1 = os.path.join(dirname, str(i) + 'crop1.jpg')
            filesave2 = os.path.join(dirname, str(i) + 'crop2.jpg')
            filesave3 = os.path.join(dirname, str(i) + 'crop3.jpg')
            filesave4 = os.path.join(dirname, str(i) + 'crop4.jpg')

            # Save images. cv2.imwrite expects the BGR channel order that cv2.imread returns,
            # so the colors of the crops are preserved.
            cv2.imwrite(filesave1, crop1)
            cv2.imwrite(filesave2, crop2)
            cv2.imwrite(filesave3, crop3)
            cv2.imwrite(filesave4, crop4)

            i = i + 1

Using the OpenCV (cv2) Python package makes image manipulation facile. By defining the midpoint in height and width, we can easily crop each image appropriately and save the four crops alongside the original image, increasing the dataset size by 400%.

2010 Chevrolet Malibu Sedan, pre- and post- segmentation cropping.
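As a quick sanity check of the slicing, the same four crops can be taken from a dummy NumPy array. This is a minimal sketch; half_crops is an illustrative helper name, not a function from the notebooks.

```python
import numpy as np


def half_crops(image):
    """Return the left, right, top, and bottom halves of an H x W x C array."""
    h, w = image.shape[:2]
    mid_h, mid_w = h // 2, w // 2
    return [
        image[:, :mid_w],  # left half
        image[:, mid_w:],  # right half
        image[:mid_h, :],  # top half
        image[mid_h:, :],  # bottom half
    ]


dummy = np.zeros((100, 160, 3), dtype=np.uint8)  # stand-in for a 160x100 photo
crops = half_crops(dummy)
print([c.shape for c in crops])
# [(100, 80, 3), (100, 80, 3), (50, 160, 3), (50, 160, 3)]

# One original plus four crops per image: the dataset grows to 5x its size,
# i.e. the 400% increase mentioned above.
images_per_original = 1 + len(crops)
```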

Note, not only are we helping our network focus on the features that we desire to be learned, but we are also augmenting the size of our dataset in the process, further improving classifier robustness by introducing it to “new” examples of the same image class.

Just-In-Time pre-processing

Before each batch of images was fed into our network, it underwent a set of data-augmentation preprocessing methods from the Keras and TensorFlow libraries, designed to improve the performance and robustness of our model. The Keras library allows us to utilize traditional augmentation methods, such as:

Rescaling

Shear-based transformations

Zoom-based transformations

Translations

Flipping
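One way to configure the augmentations listed above is through Keras' ImageDataGenerator. The sketch below is illustrative only: the specific ranges are assumed values, not the settings used in our notebooks.

```python
import tensorflow as tf

# All numeric ranges below are assumed, illustrative values.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,       # rescaling: map pixel values into [0, 1]
    shear_range=0.2,         # shear-based transformations
    zoom_range=0.2,          # zoom-based transformations
    width_shift_range=0.1,   # horizontal translations
    height_shift_range=0.1,  # vertical translations
    horizontal_flip=True,    # flipping
)
```

Such a generator is typically connected to a folder of class subdirectories through its flow_from_directory() method, so that each batch is augmented on the fly.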

While the Keras-based methods focus on geometric transformations, the TensorFlow library provides more advanced capabilities to modify the colorspace:

Hue randomization

Saturation randomization

Brightness randomization

Contrast randomization
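The four color randomizations listed above map onto functions in the tf.image module. The following is a minimal sketch under assumed settings: random_color_jitter is a hypothetical helper name, and the deltas and bounds are illustrative rather than the values from our notebooks.

```python
import tensorflow as tf


def random_color_jitter(image):
    """Randomly vary hue, saturation, brightness, and contrast of a float RGB image in [0, 1]."""
    image = tf.image.random_hue(image, max_delta=0.08)
    image = tf.image.random_saturation(image, lower=0.7, upper=1.3)
    image = tf.image.random_brightness(image, max_delta=0.05)
    image = tf.image.random_contrast(image, lower=0.7, upper=1.3)
    # The adjustments can push values slightly out of range, so clip back to [0, 1].
    return tf.clip_by_value(image, 0.0, 1.0)


sample = tf.random.uniform((224, 224, 3))  # stand-in for one rescaled image
jittered = random_color_jitter(sample)
```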

This was done to ensure that the classifier was exposed to a variety of colors for each vehicle class. Similarly, different lighting conditions should be represented in our dataset to simulate images taken under high-contrast or low-visibility conditions. Note that tf.enable_eager_execution() should be called at the beginning of the Notebook to avoid memory leaks, or you’ll quickly find your session RAM reaching 12 GB within 2 epochs.

With our pre-processing functions defined, the base model weights were made trainable from layer 75 onwards and trained with an Adam optimizer at a learning rate of 0.0002. Fine-tuning was then permitted from layer 30 onwards, using an RMSProp optimizer at a learning rate of 2E-5. The model was trained for 50 epochs, followed by 10 epochs of fine-tuning.
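The two-phase schedule can be sketched as follows. This assumes that the layer counts refer to indices into the base network's layers list; unfreeze_from and optimizer_for_phase are hypothetical helper names, not functions from our notebooks.

```python
import tensorflow as tf


def unfreeze_from(base_model, first_trainable_layer):
    """Freeze layers below the given index, make the rest trainable, and count the latter."""
    base_model.trainable = True
    for index, layer in enumerate(base_model.layers):
        layer.trainable = index >= first_trainable_layer
    return sum(1 for layer in base_model.layers if layer.trainable)


def optimizer_for_phase(phase):
    """Phase 1: Adam at 2e-4. Phase 2 (fine-tuning): RMSprop at 2e-5."""
    if phase == 1:
        return tf.keras.optimizers.Adam(learning_rate=2e-4)
    return tf.keras.optimizers.RMSprop(learning_rate=2e-5)


# Intended use (base and model are assumed to come from the classifier definition):
#   unfreeze_from(base, 75); model.compile(optimizer=optimizer_for_phase(1), ...); fit for 50 epochs
#   unfreeze_from(base, 30); model.compile(optimizer=optimizer_for_phase(2), ...); fit for 10 epochs
```

The model needs to be recompiled after changing which layers are trainable, which is why each phase pairs an unfreeze step with a fresh compile.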

The figure below shows the improvements in accuracy and loss over the first 50 epochs.

Not Bad.

As can be observed, we can now achieve a validation accuracy of close to 90%, up from below 40% for the baseline, without any significant changes to our network architecture. Such a classifier could then easily be ported to a mobile device, via TensorFlow Lite, for on-road vehicle classification.
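For reference, converting a trained Keras model for TensorFlow Lite can be sketched in a few lines. This is a minimal sketch using the standard converter; to_tflite is an illustrative helper name, and we did not benchmark the converted model as part of this work.

```python
import tensorflow as tf


def to_tflite(model):
    """Convert a Keras model into a TensorFlow Lite flatbuffer (returned as bytes)."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    return converter.convert()
```

The returned bytes can be written to a .tflite file and bundled with a mobile application.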

While the performance of our classifier is good, it’s still too early to be jumping for joy. Some of the issues include:

Our cropped images may still have some excess noise in the background, especially other vehicles. To further improve performance, an adaptive cropping mechanism would be preferable.

Our model may be exhibiting some slight overfitting to our training data, meaning that any test data should come from a similar domain for best performance. An easy solution to this problem would simply be to acquire more data.

The feature extraction process could be focused by specifying the front and back sections of the vehicles either manually or via an auxiliary classifier.

We hope you’ve enjoyed this article. Please check out GradientCrescent for more content. Coming soon: a discussion of strategies for countering deepfakes.