This blog post series will be updated as I take a second pass through the fast.ai lessons. These are my personal notes, an effort to understand things clearly and explain them well. Nothing new, only livening up this blog.

Lesson 1 review

We used three lines of code to build an image classifier:

The data under PATH is organised into a train folder and a valid folder. Each of these contains one folder per classification label (e.g. cats and dogs) holding the corresponding images.

The training output: [ epoch number , training loss , validation loss , accuracy ]

0 0.157528 0.228553 0.927593

Choosing a good learning rate

The learning rate decides how quickly we zoom/hone in on a solution. This involves finding the minimum of a function that might have many parameters.

We start at a random point, then find the gradient to determine which way is up or down. The distance we move towards the minimum is proportional to the gradient: the steeper it is, the further away we assume we are. We take the gradient at a point and multiply it by a number (the learning rate, or step size).

If the learning rate is too small, it will take a very long time to get to the bottom. If the learning rate is too big, it could oscillate away from the bottom. If you are training a neural net and find that the loss is shooting off to infinity, the learning rate is too high.
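To make the three cases concrete, here is a minimal sketch of plain gradient descent on the toy function f(x) = x**2. The function, the starting point and the three learning rates are my own choices for illustration, not something from the lesson.

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimise f(x) = x**2 starting from x0 with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        grad = 2 * x        # derivative of x**2 at the current point
        x = x - lr * grad   # step against the gradient, scaled by lr
    return x

too_small = gradient_descent(lr=0.001)   # barely moves away from 5.0
reasonable = gradient_descent(lr=0.1)    # ends up close to the minimum at 0
too_big = gradient_descent(lr=1.1)       # every step overshoots and grows

print(too_small, reasonable, too_big)
```

With the tiny rate we are still near the starting point after 20 steps, while with the big rate the distance from the minimum grows on every step.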

We use a learning rate finder ( learn.lr_find ) to find an appropriate learning rate. With each mini-batch (i.e. the number of images we look at each time so that we use the parallel processing power of the GPU effectively, generally 64 or 128 images at a time), we gradually increase the learning rate multiplicatively. Eventually the learning rate becomes so big that the loss starts getting worse.

We plot the loss against the learning rate to find the lowest point. We then go back one order of magnitude and pick that as our learning rate, because that is where the loss is still clearly decreasing (0.01 in this case).
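The idea can be sketched on the same kind of toy problem. This is not how learn.lr_find is implemented; it is a stand-in under my own assumptions (a quadratic loss, one update per "mini-batch", and a stop rule of 4x the best loss) that only shows the mechanics: grow the rate multiplicatively, record the loss, stop when it blows up, then back off by a factor of ten.

```python
def lr_range_test(start_lr=1e-5, end_lr=10.0, n_steps=100, x0=5.0):
    """Toy learning rate finder on f(x) = x**2."""
    ratio = (end_lr / start_lr) ** (1 / (n_steps - 1))  # multiplicative increase
    x, lr = x0, start_lr
    lrs, losses = [], []
    best = float("inf")
    for _ in range(n_steps):
        x = x - lr * 2 * x        # one gradient step at the current rate
        loss = x ** 2
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > 4 * best:       # the loss is clearly getting worse: stop
            break
        lr *= ratio
    return lrs, losses

lrs, losses = lr_range_test()
lowest = losses.index(min(losses))   # position of the lowest recorded loss
suggested_lr = lrs[lowest] / 10      # go back one order of magnitude
print(suggested_lr)
```

On a real network the loss curve is noisy, so you would read the value off the plot rather than take the exact minimum.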

Math notations in python

The learning rate is the key number to set. fast.ai picks the rest of the hyperparameters for you. There are some more things we can tweak to get slightly better results.

This learning rate finder technique sits on top of the Adam optimiser. Momentum and Adam are other ways of improving gradient descent.

The most important thing you can do to make the model better is to give it more data. Since these models have millions of parameters, if you train them for a while, they start to “overfit”.

Overfitting: the model starts to pick up on the specific details of the images in the training set rather than learning something general that can be transferred to the validation set.

We can either collect more data or use Data augmentation.

Data Augmentation

This refers to randomly changing the images in ways that shouldn’t impact their interpretation, such as horizontal flipping, zooming, and rotating.

We can do this by passing aug_tfms (augmentation transforms) to tfms_from_model with a list of functions that randomly change the image however we wish. For photos that are largely taken from the side (e.g. most photos of dogs and cats, as opposed to photos taken from the top down, such as satellite imagery) we can use a pre-defined list of functions such as transforms_side_on . We can also specify random zooming of images up to a specified scale by adding the max_zoom parameter.

To see the effect, we build the data object 6 times and plot the same cat each time. Let’s look at some cat pictures with data augmentation applied.

Scenarios:

You want to use different types of data augmentation for different types of images (flip horizontally, flip vertically, zoom in, zoom out, vary contrast and brightness, and many more). For example, if you want to recognise letters and digits, you don’t want to flip horizontally, as they would take on a different meaning. You don’t want to flip vertically for cats and dogs, as the images are mostly upright. For icebergs in satellite images you may want to flip them upside down, as it doesn’t matter which side the satellite was on when taking the image.

transforms_side_on: used for images taken from the side. It slightly varies the zoom, rotates the photos slightly, and varies contrast and brightness.

It is not exactly creating new data, but allows the convolutional neural net to learn how to recognize cats or dogs from different angles.
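Here is a small sketch of the "which flips are safe" decision. It is not the fastai transform code: the function random_flip and the toy image (a list of pixel rows) are made up for illustration.

```python
import random

def random_flip(image, horizontal=True, vertical=False, rng=random):
    """Randomly flip an image stored as a list of pixel rows."""
    if horizontal and rng.random() < 0.5:
        image = [row[::-1] for row in image]   # mirror left-right
    if vertical and rng.random() < 0.5:
        image = image[::-1]                    # mirror top-bottom
    return image

image = [[1, 2, 3],
         [4, 5, 6]]

# Side-on photos (cats and dogs): horizontal flips only.
side_on = random_flip(image, horizontal=True, vertical=False)

# Top-down photos (satellite imagery): any flip is fine.
top_down = random_flip(image, horizontal=True, vertical=True)

print(side_on, top_down)
```

The pixels are only rearranged, never invented, which is why augmentation is not the same as collecting new data.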

tfms: contains the data augmentation.

The data object includes augmentation. Initially the augmentation doesn’t do anything because of precompute=True .

In the above picture, each layer has activations that look for things like the middle of flowers or the eyeballs of birds (circled in red). The later layers of these convolutional neural networks have activations (numbers) that, for instance in the above picture, specify the location and level of confidence (probability) of the eyeball of a bird.

We have pre-trained networks that have learnt to recognise features (certain kinds of things, e.g. gradients, edges, circles, etc.). We take the second-to-last layer, which has all the information necessary to recognise these kinds of things, for example the level of “eyeballness”, “fluffy earness” etc. For every image we save these activations and call them pre-computed activations. We can then create a new classifier that takes advantage of them: we can quickly train a simple linear model based on the pre-computed activations. That is what precompute=True means.

This is why when you train your model for the first time, it takes longer — it is pre-computing these activations.

Although we are trying to show a different version of the cat each time, we had already pre-computed the activations for one particular version of the cat, i.e. we are not re-calculating the activations for the altered versions. When precompute=True , data augmentation does not work. We have to set learn.precompute=False for data augmentation to work.

The bad news is that accuracy is not improving. The good news is that the training loss ( trn_loss ), a measure of how large the model’s error is, is decreasing. The validation loss ( val_loss ) is not decreasing, but we are not overfitting. Overfitting would mean that the training loss is much lower than the validation loss. In other words, when your model is doing a much better job on the training set than it is on the validation set, your model is not generalizing.
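A rough sketch of why cached activations and augmentation do not mix. Everything here is hypothetical: frozen_features stands in for the expensive pre-trained layers, and the tiny linear model stands in for the new classifier on top.

```python
calls = {"n": 0}   # counts how often the expensive frozen layers run

def frozen_features(image):
    """Stand-in for the frozen pre-trained layers (hypothetical)."""
    calls["n"] += 1
    return [sum(image) / len(image), max(image)]   # two toy activations

images = [[0.1, 0.9, 0.4], [0.8, 0.7, 0.9], [0.2, 0.1, 0.3]]
labels = [0, 1, 0]

# precompute=True: run the frozen layers once per image and cache the result.
cached = [frozen_features(img) for img in images]

# Train only the small linear model on the cached activations.
weights = [0.0, 0.0]
for epoch in range(5):
    for feats, label in zip(cached, labels):
        pred = sum(w * f for w, f in zip(weights, feats))
        error = pred - label
        weights = [w - 0.1 * error * f for w, f in zip(weights, feats)]

print(calls["n"])   # the images were only looked at once
```

Because the images are never passed through the network again, an augmented version of the cat would never be seen. That is the reason for switching precompute off.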

Cycle_len parameter. What is it?

cycle_len=1 enables Stochastic Gradient Descent with Restarts (SGDR). The basic idea is that as you get closer and closer to the spot with the minimal loss, you may want to start decreasing the learning rate (taking smaller steps) in order to get to exactly the right spot. The idea of decreasing your learning rate as you train is called learning rate annealing.

Stepwise annealing: you train a model with a certain learning rate for a while, and when it stops improving, you manually drop the learning rate, pick another one, and repeat the process by hand.

Cosine annealing: this turns out to be a better approach. Simply pick some kind of functional form. A really good one turns out to be one half of the cosine curve, which starts with a high learning rate at the beginning, then drops quickly as you get closer to the minimum.
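Written out, the half-cosine form looks something like the sketch below. The function name and the choice of 100 iterations per cycle are my own; the formula is the usual half-cosine from the maximum rate down towards zero.

```python
import math

def cosine_annealed_lr(lr_max, iteration, iterations_per_cycle):
    """Half a cosine curve: lr_max at the start of the cycle, near 0 at the end."""
    progress = iteration / iterations_per_cycle   # goes from 0.0 towards 1.0
    return lr_max * 0.5 * (1 + math.cos(math.pi * progress))

schedule = [cosine_annealed_lr(0.01, i, 100) for i in range(100)]
print(schedule[0], schedule[50], schedule[-1])
```

The rate is changed on every mini-batch (iteration), not once per epoch.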

During training it is possible for gradient descent to get stuck at local minima rather than the global minimum.

At local minima the loss is worse and with a slightly different data set it won’t generalize. At global minimum the model will generalize better.

Note that annealing is not necessarily the same as restarts

We are not starting from scratch each time, but we are ‘jumping’ a bit to ensure we are in the best minima.

However, we may find ourselves in a part of the weight space that isn’t very resilient i.e small changes to the weights may result in big changes to the loss. We want to encourage our model to find parts of the weight space that are both accurate and stable. Therefore, from time to time we increase the learning rate (this is the ‘restarts’ in ‘SGDR’), which will force the model to jump to a different part of the weight space if the current area is “spikey”. Here’s a picture of how that might look if we reset the learning rates 3 times (in this paper they call it a “cyclic LR schedule”):

By increasing the learning rate suddenly, gradient descent may “hop” out of the local minima and find its way toward the global minimum. Doing this is called stochastic gradient descent with restarts (SGDR), an idea shown to be highly effective in this paper.

The number of epochs between resets of the learning rate is set by cycle_len , and the number of times this happens is referred to as the number of cycles, which is what we're actually passing as the 2nd parameter to fit() . So here's what our actual learning rates look like:

The learning rate is restored to its original value after each epoch.

The learning rate is reset at the start of each epoch to the original value you entered as a parameter, then decreases again over the epoch as described above in cosine annealing.

Each time the learning rate drops to its minimum point (every 100 iterations in the figure above), we call it a cycle.
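Putting annealing and restarts together gives a saw-tooth of half-cosines. Below is a minimal sketch assuming 3 cycles of 100 iterations each and a maximum rate of 0.01. The function sgdr_schedule is mine, not part of fastai.

```python
import math

def sgdr_schedule(lr_max, n_cycles, iters_per_cycle):
    """Cosine annealing within each cycle, reset to lr_max at each restart."""
    lrs = []
    for _ in range(n_cycles):
        for i in range(iters_per_cycle):
            progress = i / iters_per_cycle
            lrs.append(lr_max * 0.5 * (1 + math.cos(math.pi * progress)))
    return lrs

lrs = sgdr_schedule(0.01, n_cycles=3, iters_per_cycle=100)

# A restart is any iteration where the rate jumps up instead of decaying.
restarts = [i for i in range(1, len(lrs)) if lrs[i] > lrs[i - 1]]
print(restarts)
```

With three cycles there are two jumps back up to the maximum rate, one at the start of the second cycle and one at the start of the third.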

Can we get the same effect by using random starting points? Before SGDR was created, people used to create “ensembles” where they would relearn a whole new model ten times in the hope that one of them would end up being better. In SGDR, once we get close enough to the optimal and stable area, resetting will not actually “reset”, and the weights keep getting better. So SGDR will give you better results than just randomly trying a few different starting points.

We pick the highest learning rate, 1e-2 (0.01), for SGD to use. We change the learning rate every single mini-batch. How often we reset it is set by the cycle_len parameter: cycle_len=1 means reset it after every epoch.

Our main goal is to generalize and not end up in a narrow optimum. In this method, are we keeping track of the minima and averaging them and ensembling them? We are not currently doing that, but if you wanted it to generalize even better, you could save the weights right before the resets and take the average. But for now, we are just going to pick the last one (at iteration 1000).

There is a parameter called cycle_save_name which you can add as well as cycle_len , which will save a set of weights at the end of every learning rate cycle and then you can ensemble them.

Our validation loss isn’t improving much. So there’s probably no point further training the last layer on its own.

Saving and loading the model

From time to time, save your weights: call learn.save and pass a filename, e.g. 224_lastlayer .

Pre-computed activations and resized images are saved in the data folder in tmp files. Deleting the tmp folder is the fast.ai equivalent of turning it off and on again.

Models are saved in the models folder when learn.save is called.

What if you wanted to retrain a model from scratch? There is generally no reason to delete the pre-computed activations, because they are computed without any training.

Fine-tuning and differential learning rate annealing

So far nothing we have done has changed the pre-trained filters. We have used a pre-trained model that knows how to find edges and gradients (layer1), corners and curves (layer2), then repeating patterns and text (layer3), and eventually eyeballs (layer4 and 5). We have not retrained any of those activations, or more specifically the weights in the convolutional kernels. All we have done is add some new layers on top and learn how to mix and match pre-trained features.

Images like satellite images, CT scans, etc. have totally different kinds of features altogether compared to ImageNet images, so you need to re-train many layers. For dogs and cats, the images are similar to what the model was pre-trained on, but we may still find it helpful to slightly tune some of the later layers.

Now that we have a good final layer trained, we can try fine-tuning the other layers. To tell the learner that we want to unfreeze the remaining layers, just call unfreeze() . This tells the learner we want to start changing the convolutional filters.

learn.unfreeze()

A frozen layer is a layer that is not trained i.e it is not updated.

unfreeze() unfreezes all the layers.

Layer one, which detects edges and gradients, and layer two, which detects curves and corners, don’t need much learning. They don’t need to change much, while the much later layers do need to change. This is universally true when fine-tuning for other image recognition tasks.

What we do is create an array of learning rates.

lr=np.array([1e-4,1e-3,1e-2])

The earlier layers (as we’ve seen) have more general-purpose features. Therefore we would expect them to need less fine-tuning for new datasets. For this reason we will use different learning rates for different layers: the first few layers will be at 1e-4 for basic geometric features and layers closest to the pixels, the middle layers at 1e-3 for the middle sophisticated convolutional layers, and 1e-2 as before for the layers we add on top (fully connected layers). We refer to this as differential learning rates, although there’s no standard name for this technique in the literature that we’re aware of.
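To see what "a different learning rate per group of layers" means mechanically, here is a toy sketch. The three groups, their weights and the gradients are made up. In the real library the array lr is simply handed to the learner and the grouping is done for you.

```python
# One entry per layer group, mirroring lr = [1e-4, 1e-3, 1e-2].
layer_groups = [
    {"name": "early conv layers", "weights": [1.0, 1.0], "lr": 1e-4},
    {"name": "middle conv layers", "weights": [1.0, 1.0], "lr": 1e-3},
    {"name": "new top layers", "weights": [1.0, 1.0], "lr": 1e-2},
]

def sgd_step(groups, grads):
    """One SGD update where each layer group uses its own learning rate."""
    for group, group_grads in zip(groups, grads):
        group["weights"] = [w - group["lr"] * g
                            for w, g in zip(group["weights"], group_grads)]

# Pretend every weight received the same gradient of 1.0.
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
sgd_step(layer_groups, grads)

changes = [1.0 - group["weights"][0] for group in layer_groups]
print(changes)   # the early layers moved the least, the top layers the most
```

With identical gradients, the early layers change 100 times less than the layers we added on top, which is exactly the behaviour we want for filters that are already good.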

Why 3? Actually they are 3 ResNet blocks but for now, think of it as a group of layers.

How are differential learning rates different from grid search? There is no similarity to grid search. Grid search is where you try lots of hyperparameter values to find which is best. Differential learning rates do not search for anything: for the entire training run, each group of layers simply uses a different learning rate.

What if I have bigger images than the model was trained with? With this library and the modern architectures we are using, we can use any size we like.

Can we unfreeze just specific layers? We are not doing it yet, but if you wanted to, you can do learn.unfreeze_to(n) which will unfreeze layers from layer n onwards. It almost never helps because, using differential learning rates, the optimizer can learn just as much as it needs to. The one place it is helpful is if you are using a really big, memory-intensive model and you are running out of GPU memory: the fewer layers you unfreeze, the less memory and time it takes.

Note: you can’t unfreeze one specific layer.

Earlier we said 3 is the number of epochs, but it is actually the number of cycles. In this case learn is doing 3 cycles of 1 epoch ( cycle_len=1 ).

If cycle_len=2 , it will do 3 cycles where each cycle is 2 epochs (i.e. 6 epochs).

Then why did it do 7 epochs? It is because of cycle_mult=2 , which doubles the length of each cycle (1 epoch + 2 epochs + 4 epochs = 7 epochs).
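The epoch arithmetic is easy to check with a few lines. The helper total_epochs is my own and only reproduces the counting described above.

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Total epochs when each cycle is cycle_mult times longer than the last."""
    epochs, length = 0, cycle_len
    for _ in range(n_cycle):
        epochs += length
        length *= cycle_mult
    return epochs

print(total_epochs(3, cycle_len=1))                 # 3 cycles of 1 epoch
print(total_epochs(3, cycle_len=2))                 # 3 cycles of 2 epochs
print(total_epochs(3, cycle_len=1, cycle_mult=2))   # 1 + 2 + 4 epochs
```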

Using differential learning rate we have a model that is 99.05% accurate.

If the cycle length is too short, it starts going down to find a good spot, then pops out, goes down again trying to find a good spot, pops out again, and so on, such that it never actually gets to find a good spot. Earlier on, you want it to do that because it is trying to find a spot that is smoother, but later on, you want it to do more exploring. That is why cycle_mult=2 seems to be a good approach.

We are introducing more and more hyper parameters having told you that there are not many. You can get away with just choosing a good learning rate, but then adding these extra tweaks helps get that extra level-up without any effort. In general, good starting points are:

n_cycle=3, cycle_len=1, cycle_mult=2

n_cycle=3, cycle_len=2 (no cycle_mult )

Why do smoother surfaces correlate to more generalized networks?

The X-axis is showing how good this is at recognizing dogs vs. cats as you change this particular parameter. To be generalizable means that we want it to work when we give it a slightly different dataset. A slightly different dataset may have a slightly different relationship between this parameter and how cat-like vs. dog-like it is. It may instead look like the red line. In other words, if we end up at X or Z, then it is not going to do a good job on this slightly different dataset. If we end up at Y, it will still do a good job on the red dataset.

Let’s take a look at pictures we predicted incorrectly

When we do the validation set, all of our inputs to our model must be square. The GPU does not go very quickly if you have different dimensions for different images. It needs to be consistent so that every part of the GPU can do the same thing.

To make it square, we just pick out the square in the middle. As you can see below, it is understandable why this picture was classified incorrectly.

The dog’s head was not identified

We will use Test Time Augmentation (TTA). At inference time (or test time), it makes predictions not just on the images in your validation set, but also on a number of randomly augmented versions of them (by default, it uses the original image along with 4 randomly augmented versions). It then takes the average prediction from these images and uses that as our final prediction. To use TTA on the validation set, we can use the learner’s TTA() method.

The accuracy improved to 99.25%. The neural net sees multiple augmentations of the same picture, which makes the accuracy go up.

NOTE: TTA is for the validation/test set. When training we are not doing TTA.
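The averaging step can be sketched with a toy "model" that only looks at the centre of its input, like our centre crop. Everything here is made up for illustration: the 1-D "image", the scoring function, and the use of fixed shifts instead of random augmentations so that the result is reproducible.

```python
def centre_crop_score(image):
    """Toy model: average of the 3 middle values, read as P(dog)."""
    mid = len(image) // 2
    crop = image[mid - 1:mid + 2]
    return sum(crop) / len(crop)

def shift(image, k):
    """Toy augmentation: roll the image k positions to the right."""
    k %= len(image)
    return image[-k:] + image[:-k] if k else list(image)

def tta_predict(image, shifts=(-2, -1, 1, 2)):
    """Average the prediction on the original and 4 augmented versions."""
    versions = [image] + [shift(image, k) for k in shifts]
    preds = [centre_crop_score(v) for v in versions]
    return sum(preds) / len(preds)

# The "dog" (high values) sits at the edges, outside the centre crop.
image = [0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.9]
plain = centre_crop_score(image)
with_tta = tta_predict(image)
print(plain, with_tta)
```

The plain prediction misses the dog entirely, while some of the shifted versions bring it into view, so the averaged prediction is higher.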

Why not add a border or padding to make it square? It does not help much with neural nets, as the image of the cat does not change. Zooming would work. Reflection padding, whereby you add borders on the outside that reflect the image, making it bigger, works well with satellite imagery. Generally speaking, when using TTA plus data augmentation, the best thing to do is to use images that are as large as possible. If you crop, you tend to lose, for example, the dog’s face.

Data augmentation for non-image datasets? No one seems to know. It seems like it would be helpful, but there are very few examples. In natural language processing, people have tried replacing words with synonyms, for instance, but on the whole the area is under-researched and under-developed.

Can we use a sliding window to generate other images, for example 3 image parts from one picture of a dog? For training that would not be better, because we would not get much more variation: you would have just three standard ways of looking at the data, and you want to give the model as many ways to look at the data as possible. Having fixed crop locations plus random contrast, brightness and rotation changes might be better for TTA.

Analyzing results

Confusion Matrix

A quick way to evaluate a classification algorithm is to use a confusion matrix. It helps with identifying which classes you are having trouble with.

preds = np.argmax(probs, axis=1)

probs = probs[:,1]

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y, preds)

plot_confusion_matrix(cm, data.classes)

We have 987 cats that we predicted correctly and 13 that we got wrong, and 993 dogs that we predicted correctly and 7 that we got wrong.
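For intuition, this is roughly what a confusion matrix counts. The sketch below is plain Python on made-up labels, following the same layout as sklearn (rows are the true class, columns are the predicted class). The second matrix just plugs in the numbers quoted above.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are the true class, columns are the predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Made-up labels: 0 = cat, 1 = dog.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
cm = confusion_counts(y_true, y_pred, n_classes=2)
print(cm)

# The numbers from our cats-vs-dogs run: the diagonal holds the correct ones.
cm_run = [[987, 13],
          [7, 993]]
accuracy = (cm_run[0][0] + cm_run[1][1]) / sum(sum(row) for row in cm_run)
print(accuracy)
```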

Steps to train a world-class image classifier

1. Enable data augmentation ( side_on or top_down depending on what you are doing), and precompute=True
2. Use lr_find() to find the highest learning rate where loss is still clearly improving
3. Train the last layer from precomputed activations for 1–2 epochs
4. Turn off precompute ( precompute=False ), which allows us to use data augmentation, and train for 2–3 epochs with cycle_len=1
5. Unfreeze all layers
6. Set earlier layers to a 3x-10x lower learning rate than the next higher layer. Rule of thumb: 10x for ImageNet-like images, 3x for satellite or medical imaging
7. Use lr_find() again (Note: if you call lr_find having set differential learning rates, it prints out the learning rate of the last layers.)
8. Train the full network with cycle_mult=2 until over-fitting

Let’s do it again

This challenge is to determine the breed of a dog in an image.

Use the kaggle CLI to download the data. It is an unofficial Kaggle command line tool, useful for downloading data when using cloud VM instances such as AWS or Paperspace. Make sure you accept the competition rules before using the CLI by clicking the download button first. If you log in to Kaggle through another linked account, you have to use the forgot-password flow and choose the third option to set up a new password and link your two accounts.

$ kg download -u <username> -p <password> -c dog-breed-identification -f <name of file>

Where dog-breed-identification is name of the competition, you can find the name of competition at end of URL of competition after /c/ part, https://www.kaggle.com/c/dog-breed-identification .

Here is the actual command:

$ kg download -u gerald -p mypassword -c dog-breed-identification

Once the file download is complete, we can extract the files using following commands.

#To extract .7z files

7z x -so <file_name>.7z | tar xf -

#To extract .zip files

unzip <file_name>.zip

structure of the dogbreeds folder

This is different from our previous dataset. Instead of a train folder with a separate folder for each breed of dog, it has a CSV file with the correct labels.

The imports

from fastai.imports import *

from fastai.torch_imports import *

from fastai.transforms import *

from fastai.conv_learner import *

from fastai.model import *

from fastai.dataset import *

from fastai.sgdr import *

from fastai.plots import *

PATH = "data/dogbreeds/"

sz = 224

arch = resnext101_64

bs = 58

We will read the CSV file with Pandas, which is used for structured data analysis.

label_csv = f'{PATH}labels.csv'

n = len(list(open(label_csv))) - 1 # header is not counted (-1)

val_idxs = get_cv_idxs(n) # random 20% data for validation set

n

10222

val_idxs

array([2882, 4514, 7717, ..., 8922, 6774, 37])

len(val_idxs) #20% of 10222

2044

n = len(list(open(label_csv)))-1 : Open the CSV file, create a list of rows, then take the length. -1 because the first row is a header. Hence n is the number of images/rows we have.

val_idxs = get_cv_idxs(n) : “get cross validation indexes”. By default this returns a random 20% of the rows (indexes) to use as a validation set. You can also pass val_pct to get a specific percentage, e.g. val_idxs = get_cv_idxs(n, val_pct=1.0) gets 100%, but 20% is the default.
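If you are curious what such a helper does, a rough stand-in is below. This is not the fastai implementation of get_cv_idxs: random_val_idxs is a hypothetical function of mine that shuffles the row indexes with a fixed seed and keeps the first 20%.

```python
import random

def random_val_idxs(n, val_pct=0.2, seed=42):
    """Pick a reproducible random val_pct of the row indexes 0..n-1."""
    rng = random.Random(seed)     # fixed seed so the split is repeatable
    idxs = list(range(n))
    rng.shuffle(idxs)
    return idxs[:int(val_pct * n)]

val_idxs = random_val_idxs(10222)
print(len(val_idxs))   # 20% of 10222, rounded down
```

Fixing the seed matters: you want the same images in the validation set every time you rerun the notebook, otherwise results are not comparable.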

Each row of the CSV consists of the image name (id) and the label.

Below is a pandas DataFrame that counts how many dogs there are of each breed.

There are 120 rows representing 120 breeds.

Going through the steps:

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test', # we need to specify where the test set is if you want to submit to Kaggle competitions

val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

Enabling data augmentation:

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

Call tfms_from_model and pass aug_tfms=transforms_side_on , since these are probably side-on photos.

max_zoom: we will zoom into the image by up to 1.1 times.

data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', test_name='test', val_idxs=val_idxs, suffix='.jpg', tfms=tfms, bs=bs)

ImageClassifierData.from_csv: last time we used from_paths (which says the names of the folders are the names of the labels), but since the labels are in a CSV file, we call from_csv instead and pass f'{PATH}labels.csv' , the CSV file that contains the labels. PATH is the folder that contains all the data, and the train folder contains the training data. test_name specifies where the test set is, in case you submit to Kaggle later.

val_idxs: there is no validation folder, but we still want to track how good our performance is locally. This separates out the images at these indexes and puts them in a validation set.

suffix=’.jpg’ — File names have .jpg at the end, but CSV file does not. So we will set suffix so it knows the full file names.

Get the training data set in the data object using trn_ds which contains the file names. Below is an example of a filename (fnames) .

fn = PATH + data.trn_ds.fnames[0]; fn

img = PIL.Image.open(fn); img

We need to check the size of the image. Then we need to know how to deal with them depending on whether they are too large or too small. Most of ImageNet models are trained on either 224 by 224 or 299 by 299 images.

Create a dictionary comprehension:

size_d = {k: PIL.Image.open(PATH + k).size for k in data.trn_ds.fnames}

Go through all the files and create a dictionary that maps the name of each file to the size of that image.

row_sz, col_sz = list(zip(*size_d.values()))

This takes the sizes in the dictionary and splits them into rows and columns. We then turn them into numpy arrays as shown below:

row_sz = np.array(row_sz); col_sz = np.array(col_sz)
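The zip(*...) trick is worth seeing on a tiny example. The file names and sizes below are invented. The only thing carried over from the notebook is the pattern itself.

```python
# Hypothetical sizes, one pair of numbers per file, as PIL's .size would give.
size_d = {
    "train/a.jpg": (500, 375),
    "train/b.jpg": (500, 333),
    "train/c.jpg": (400, 300),
}

# zip(*pairs) "unzips" a list of pairs into one tuple per position.
row_sz, col_sz = list(zip(*size_d.values()))

print(row_sz)   # first number of every pair
print(col_sz)   # second number of every pair
```

One caveat on naming: PIL reports size as (width, height), so the first tuple holds the widths even though the notebook calls it row_sz.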

Here are the first five row sizes:

row_sz[:5]

array([500, 500, 500, 500, 500])

Plotting with matplotlib, images against the number of pixels:

From the histogram, most images are around 500 pixels.

Plotting only those smaller than 1000 pixels to zoom in on the diagram:

4599 images lie within 451 pixels.

How many images should be in the validation set? The size of the validation set depends on the size of your dataset. It should not always be 20%. If you train the same model multiple times and you are getting very different validation set results, then your validation set is too small.

The dog seems to be at the centre of the image and takes up the largest part of the frame, so we don’t need cropping. This would be different for medical imaging, where the tumor might be on one side of the frame, thus requiring zooming.

Initial Model