Introduction

What we are doing

Last time we looked at how to create new images to train on using image augmentation. However, you might now have too many images to load them all at once like you normally would. In this post, I'll show you a really neat technique that reduces both the memory required for training and the amount of overfitting. In essence, this technique involves loading in and training on only a small random selection of images at a time. You can find the code we will be writing in this post here.

Why do this?

The idea of not using our full training set on every epoch may seem like a bad idea, but there are actually several benefits to this technique.

Firstly, as I've already mentioned, it drastically reduces the amount of memory required for training: you no longer have to load all your data at the start of training but instead only have one batch of images loaded at a time, meaning the number of images in memory never needs to exceed your batch size.

Secondly, because new images are reloaded every epoch, you can perform image augmentation on the fly. That isn't something I've done in this code, but it allows for a near-infinite amount of unique training data.

Thirdly, because the images change every epoch, the model sees each individual image less frequently during training, which makes it much harder for the model to overfit. Finally, as I load n/2 images for each of our two classes (so the total number of images is n), each batch of training data has an equal number of images from both classes. However, I must stress this only helps with slightly imbalanced datasets. If you have, say, 40 images of cats, 200 images of dogs, and a batch size of 20, each cat image will be seen far more frequently than each dog image, so the model will overfit to the cat images.

TLDR: You get lower memory usage, less overfitting, some help with imbalanced classes, and the chance to perform on-the-fly image augmentation

The code!

Imports

1–10 We import the libraries we will be using

Cleanup function

12 We create a subroutine that will be called at the end of the code to perform cleanup. The way I've structured this program means there are several places where the need for a piece of code won't make sense until later on, so you're just going to have to bear with me. Our subroutine for generating batches of images picks random images from the augmented folder, so the subroutine for loading the validation data (which is coming next) moves the images it picks into a temporary folder where load_images can't pick them up.

This is the subroutine that moves the contents of that temporary folder back to the augmented folder. It takes the arguments directories (a list of directories to move the held-back files back to) and tmp_dirs (a list of all the folders within the temporary folder)
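
As a rough sketch, val_cleanup might look something like the following. The folder names and the tiny demonstration at the bottom are mine, not from the real script, and shutil.move stands in for whatever move command the code actually runs:

```python
import os
import shutil
import tempfile

def val_cleanup(directories, tmp_dirs):
    # Move every file held in the temporary validation folders
    # back into the corresponding augmented-image folders.
    for src, dst in zip(tmp_dirs, directories):
        for name in os.listdir(src):
            shutil.move(os.path.join(src, name), os.path.join(dst, name))

# Tiny demonstration with throwaway folders.
root = tempfile.mkdtemp()
aug = os.path.join(root, "augmented_cats")
tmp = os.path.join(root, "test", "cats")
os.makedirs(aug)
os.makedirs(tmp)
open(os.path.join(tmp, "img_0.jpg"), "w").close()  # a "held back" image

val_cleanup([aug], [tmp])
restored = os.listdir(aug)
```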

13 We run a command to move all of the contents of the first directory in tmp_dirs to the first directory in directories

14 Same as 13 but with the second directory

Function for loading the validation data

16 We create a subroutine to create our validation data. We need this because the way our train data is loaded means we can't use sklearn.model_selection.train_test_split. It takes the same directories and tmp_dirs arguments as val_cleanup, plus the argument num, which is how many pieces of data we wish to generate

17 We redefine num as num/2 rounded down: num is the total number of pieces of data to generate, but there are two classes, so we want num/2 pieces for each class. We have to round because if, for example, num is 11, then 11/2 = 5.5; as this is validation data, I prefer to round down and generate slightly less

18 We create a list to store our images in

20 We store a list of all the files in the first directory in our directories list in the dir_1_files variable

21 Same as 20 but for the second directory

22 We rename dir_1_files to files as this makes the code that follows read nicer

24 We create the test directory

25 We create a directory in the test directory with the name of the first directory in tmp_dirs

27 We create a for loop to generate num amount of data

28 We pick a random number between 0 and the length of our files list minus 1 (minus 1 because lists are zero-indexed, so the last index is the length minus 1), remove the filename at that index from the list, and store it in the file variable

29 We load the image in the directory of the first item in directories and the filename of file in its default size

30 We convert the image to a numpy array

31 We divide the contents of the numpy array by 255 (for data normalisation) and then append it to the imgs list

33 We move the file to our temporary holding folder as mentioned earlier so it isn’t used by our train data generator

35 We redefine files as dir_2_files

36–44 Same as above but for the second directory

46 We return our list of images
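
Putting lines 16–33 together, a simplified sketch of the validation loader might look like this. The real subroutine decodes each image with load_img, img_to_array, and /255 normalisation; here that step is stubbed out (the filename is appended instead) so the pick-and-hold-back logic can be shown on its own, and the demonstration folders at the bottom are invented:

```python
import os
import random
import shutil
import tempfile

def load_val(directories, tmp_dirs, num):
    # Pick num validation images at random (num//2 per class) and move
    # them into temporary folders so the training generator can't see
    # them. Image decoding is stubbed: we append the filename instead.
    num = num // 2  # round down: slightly less validation data is fine
    imgs = []
    for directory, tmp_dir in zip(directories, tmp_dirs):
        files = os.listdir(directory)
        os.makedirs(tmp_dir, exist_ok=True)
        for _ in range(num):
            file = files.pop(random.randint(0, len(files) - 1))
            imgs.append(file)  # stand-in for load_img + img_to_array + /255
            shutil.move(os.path.join(directory, file),
                        os.path.join(tmp_dir, file))
    return imgs

# Demonstration with throwaway folders holding fake "images".
root = tempfile.mkdtemp()
dirs = [os.path.join(root, d) for d in ("cats", "dogs")]
tmps = [os.path.join(root, "test", d) for d in ("cats", "dogs")]
for d in dirs:
    os.makedirs(d)
    for n in range(10):
        open(os.path.join(d, f"img_{n}.jpg"), "w").close()

val = load_val(dirs, tmps, 4)
remaining = [len(os.listdir(d)) for d in dirs]
```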

Creating the generator

48 We create a subroutine that will be a Python generator giving us our batches of train data. It takes the arguments directories and num, both of which you should be familiar with by now

49 Same as before but ceiling (always round up) as I prefer to generate slightly more data than slightly less

54 We put the rest of our code in an infinite loop as we want to be able to generate data infinitely

57–59 We check whether either of the lists has too few remaining files to create the next batch, and if so we “refill” both lists by setting them back to their original state. This means images are now being reused, but at least we can continue to provide data

76 This is where the magic that makes this a generator rather than a normal subroutine comes in. By using yield instead of return we create what's called a generator. Generators can be iterated over the same way you would a string, a list, or a range() object, but they don't support indexing (using square brackets to get a certain element) because the yielded data is, in essence, thrown away every time new data is requested. Generators can be accessed either with next(some_generator) or with for i in some_generator. You can think of a generator as being called, running its code up to the yield line, and then pausing until it is called again, at which point it resumes where it left off. Therefore all the code before the while True block runs only once, and all our local variables keep their state.

TLDR on generators: we can iterate over a generator, and it runs its code to produce each piece of data as we iterate, preserving its local variables between batches
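
Here is a stripped-down sketch of a generator in this style. It is not the actual load_images code: real image loading is skipped (it yields filenames from in-memory lists rather than decoded arrays), but it shows the one-time setup, the refill check, and how yield preserves local state between batches:

```python
import math
import random

def load_images(directories, num):
    # Endlessly yield batches of num filenames, half per class.
    # `directories` here maps a class name to a list of filenames
    # rather than to a real folder on disk.
    num = math.ceil(num / 2)  # round up: slightly more train data is fine
    # Everything before the while True runs exactly once; the pools
    # keep their state between yields.
    pools = {name: list(files) for name, files in directories.items()}
    while True:
        # "Refill" both pools if either has too few files left for a
        # batch, reusing images from this point on.
        if any(len(pools[name]) < num for name in directories):
            pools = {name: list(files) for name, files in directories.items()}
        batch = []
        for name in directories:
            for _ in range(num):
                idx = random.randint(0, len(pools[name]) - 1)
                batch.append(pools[name].pop(idx))
        yield batch

data = {"cats": [f"cat_{i}" for i in range(6)],
        "dogs": [f"dog_{i}" for i in range(6)]}
gen = load_images(data, 4)
b1 = next(gen)
b2 = next(gen)
b3 = next(gen)  # pools are empty after this; the next call refills them
```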

Function for creating Y data

78 We create a subroutine to generate our Y data. As we know we will have num images for each of the two classes, this code is fairly simple. It takes the single argument num.

79 We create a list to store our data in

81–82 We add [1,0] (0 one-hot encoded) to the y_data list num times

83–84 We add [0,1] (1 one-hot encoded) to the y_data list num times

86 We return our y_data list
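
A minimal sketch of this subroutine (the name gen_y_data is my own label; the walkthrough doesn't name it):

```python
def gen_y_data(num):
    # Build one-hot labels: num copies of class 0 followed by num
    # copies of class 1, matching the order the X data is generated in.
    y_data = []
    for _ in range(num):
        y_data.append([1, 0])  # class 0, one-hot encoded
    for _ in range(num):
        y_data.append([0, 1])  # class 1, one-hot encoded
    return y_data

labels = gen_y_data(2)
```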

Specifying the model architecture

This is exactly the same code as the model architecture code in part 2 of this series

Initialising variables

102 We create a variable that is rather descriptively called i to store the current epoch in

103 We create a variable called epochs to store the total number of epochs we wish to perform in

105 We store the directories that our (augmented) images are stored in, in the dirs variable

106 We store the names of the temporary holding directories that we are going to create in the tmp_dirs variable

Creating validation data

108 We use a neat Python trick called a list comprehension, which lets you create a list of items with [transform for item in list if condition]. In this particular list comprehension we omit the condition, as we simply want a list of the number of files in each directory in dirs, which we then pass to sum (which adds the numbers up to give us the total number of images). This total is a useful metric for sanity checking, and we also need it for the next line

109 We create a 20:80 test:train split by multiplying total_images from the previous line by 0.2, rounding, and storing the result in the variable that will hold the number of validation images

110 We create a variable to store the number of training images to generate, 32 is used as this will be our batch_size when we call model.fit
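
Lines 108–110 might look roughly like this; the throwaway folders and file counts are invented purely for the demonstration:

```python
import os
import tempfile

# Throwaway folders standing in for the augmented-image directories.
root = tempfile.mkdtemp()
dirs = [os.path.join(root, d) for d in ("cats", "dogs")]
for d in dirs:
    os.makedirs(d)
for n in range(60):
    open(os.path.join(dirs[0], f"cat_{n}.jpg"), "w").close()
for n in range(40):
    open(os.path.join(dirs[1], f"dog_{n}.jpg"), "w").close()

# List comprehension: one file count per directory, summed into a total.
total_images = sum([len(os.listdir(d)) for d in dirs])
val_size = round(total_images * 0.2)  # 20:80 test:train split
train_size = 32                       # matches the batch size passed to model.fit
```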

113 We generate validation Y data for val_size/2 (rounded down) images per class, so that we have the same amount of Y data as X data

115 We generate train Y data for train_size/2 (rounded up) images per class. This data stays constant from batch to batch, so it is more efficient to generate it once now rather than every time we have a new batch of X data

117–9 We tell the user the amount of validation data, the number of images available for use as train data and the number of epochs

Training

121 We put all our training code in a try block so we can handle exceptions (such as KeyboardInterrupts) elegantly

122 We iterate over load_images; this amounts to an infinite loop, as the while True in our load_images code means it never stops generating data

123 If the number of epochs completed is greater than or equal to the target number of epochs (i.e. we've finished training)

124 We tell the user what we are doing

125 We break out of the for loop

Sorry not sorry

127 We convert the list that load_images returned to a numpy array

129 We train the model as we normally would, but this time with the additional argument initial_epoch, which tells Keras which epoch to start counting from (making the output look nicer). We also pass i+2 as the epochs argument so that training starts at epoch i and runs until epoch i+2, i.e. we train for two epochs

130 Now that we have trained for two epochs, we increment i by 2
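
The shape of this two-epochs-at-a-time loop can be sketched as follows, with model.fit replaced by a stub that just records its epoch arguments so the bookkeeping can be shown without Keras:

```python
# Stub standing in for model.fit: it records the epoch arguments
# it was called with instead of actually training.
fit_calls = []

def fit_stub(initial_epoch, epochs):
    fit_calls.append((initial_epoch, epochs))

i = 0        # current epoch
epochs = 6   # total number of epochs we wish to perform

# Stands in for `for x_train in load_images(...)` — effectively infinite.
for _ in range(100):
    if i >= epochs:
        break  # we've finished training
    fit_stub(initial_epoch=i, epochs=i + 2)  # train for two epochs
    i += 2  # two more epochs done
```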

Cleanup

132 Now that training has finished (i.e. now that we've broken out of the for loop), we run our val_cleanup function so that the validation images are back in the right places, ready for future training runs

134 In the event of a KeyboardInterrupt (which is probably an attempt to stop training early)

135 Make it clear to the user that the KeyboardInterrupt has been registered in case cleanup takes a little while

138 In the event of any other exception (KeyboardInterrupts are handled separately because they aren't caught by except Exception)

142 Now that all the validation images are back where they should be, we re-raise the error so that it is still seen
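
The overall try/except structure might be sketched like this; the failing train function is invented, and the final re-raise is replaced by capturing the exception so the example runs to completion:

```python
cleanup_ran = []

def val_cleanup():
    # Stand-in for moving the validation images back where they belong.
    cleanup_ran.append(True)

def train():
    # Stand-in for the training loop; here it always fails.
    raise ValueError("something broke mid-training")

caught = None
try:
    train()
except KeyboardInterrupt:
    # Handled on its own because KeyboardInterrupt does not inherit
    # from Exception, so `except Exception` would miss it.
    print("KeyboardInterrupt registered, cleaning up...")
    val_cleanup()
except Exception as err:
    val_cleanup()
    caught = err  # the real script re-raises here so the error is still seen
```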

Conclusion

This series on CNNs has not covered all the code I've written for CNNs, but merely the code that is most useful and/or hardest to understand. I would encourage you to take a look at my metrics code (which adds meaning to your percentage accuracy by showing which images your model is misclassifying), visualisation code (which allows you to see the outputs of the conv layers) and heatmap code (which allows you to visualise the activations shown by the visualisation code as a heatmap of which areas of the image the network is most interested in).

Please share this post on social media if you enjoyed it or found it useful. If there are any inaccuracies in this article, please let me know. Feel free to leave feedback in the comments so I know how to improve moving forward. If you have any questions or problems with the code, tell me in the comments. If you want help with Keras more generally, use the Keras Google group and I or someone else will help you.