Data pipelines are one of the most important parts of any machine learning or deep learning training process. An efficient data pipeline has the following advantages.

Allows the use of multi-processing

Allows you to generate batches

Allows you to do data augmentation

Makes the code neat

No need to write boilerplate code

Hopefully after reading this article you will learn how to construct and use a data pipeline in Keras.

Keras provides data generator classes for different data types for constructing the data pipeline. In this post I will be writing about the ImageDataGenerator class.

The documentation can be found here: https://keras.io/preprocessing/image/

There are two steps in creating the generator.

Instantiate ImageDataGenerator with required arguments

Use appropriate flow command to construct the generator which will yield tuples of (x,y). These are batches of data and the method supports multiprocessing.

# Import the required libraries
import matplotlib.pyplot as plt
from PIL import Image
import os
import numpy as np
from skimage import io
from keras.preprocessing.image import ImageDataGenerator
from matplotlib import cm
from mpl_toolkits.axes_grid1 import ImageGrid
import math
# Jupyter notebook magic, only valid inside a notebook
%matplotlib inline

I’ve written a grid plot utility function that plots neat grids of images and helps in visualization. It accepts either a list of images or a NumPy array.

def show_grid(image_list, nrows, ncols, label_list=None, show_labels=False, savename=None, figsize=(10, 10), showaxis='off'):
    # Convert a batch array of shape (N, H, W, C) into a list of images
    if not isinstance(image_list, list):
        if image_list.shape[-1] == 1:
            image_list = [image_list[i, :, :, 0] for i in range(image_list.shape[0])]
        elif image_list.shape[-1] == 3:
            image_list = [image_list[i, :, :, :] for i in range(image_list.shape[0])]
    fig = plt.figure(None, figsize, frameon=False)
    grid = ImageGrid(fig, 111,  # similar to subplot(111)
                     nrows_ncols=(nrows, ncols),  # creates an nrows x ncols grid of axes
                     axes_pad=0.3,  # pad between axes in inches
                     share_all=True)
    for i in range(nrows * ncols):
        ax = grid[i]  # the ImageGrid object works like a list of axes
        ax.imshow(image_list[i], cmap='Greys_r')
        ax.axis(showaxis)
        if show_labels:
            ax.set_title(label_list[i])  # use the labels passed in rather than global variables
    if savename is not None:
        plt.savefig(savename, bbox_inches='tight')

Let us start with the datagenerator.

batch_size = 32
datagen_args = dict(rotation_range=20,
                    width_shift_range=0.2,
                    height_shift_range=0.2,
                    rescale=1./255)
datagen = ImageDataGenerator(**datagen_args)
datagenerator = datagen.flow_from_directory('./Dataset/dtd/images',
                                            target_size=(128, 128),
                                            batch_size=batch_size,
                                            interpolation="lanczos",
                                            shuffle=True)

We start with the first line of the code, which specifies the batch size. We have set it to 32, which means that one batch will have 32 images stacked together in a NumPy array. For the TensorFlow backend the shape of this array would be (batch_size, image_y, image_x, channels).

A few arguments are specified in the dictionary for the ImageDataGenerator constructor. They are explained below.

rotation_range : Int. Degree range for random rotations.

height_shift_range : Shifts the image along the height dimension. It supports various inputs. For a float, the image is shifted by a fraction of the total height if < 1, or by pixels if >= 1.

width_shift_range : Shifts the image along the width dimension.

rescale : Rescaling factor. Defaults to None. If None or 0, no rescaling is applied, otherwise we multiply the data by the value provided (after applying all other transformations).

fill_mode : One of {“constant”, “nearest”, “reflect” or “wrap”}. Default is ‘nearest’. Points outside the boundaries of the input are filled according to the given mode.
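To make the shift and rescale arguments concrete, here is a small sketch in plain Python. It only mirrors the rules described above (a value below 1 is read as a fraction of the dimension, a value of 1 or more as a pixel count). The helper names max_shift_in_pixels and rescale_pixel are made up for illustration and are not part of Keras, and the actual shift applied to each image is drawn at random within this bound.

```python
def max_shift_in_pixels(shift_range, dimension):
    """Largest shift, in pixels, implied by a float shift range."""
    if shift_range < 1:
        # Values below 1 are interpreted as a fraction of the dimension.
        return shift_range * dimension
    # Values of 1 or more are interpreted as a number of pixels.
    return shift_range


def rescale_pixel(value, rescale=1.0 / 255):
    """Multiply a pixel value by the rescale factor; None or 0 means no rescaling."""
    if not rescale:
        return value
    return value * rescale


# With a 128-pixel-wide image, a range of 0.2 allows shifts of up to about 25.6 pixels.
print(max_shift_in_pixels(0.2, 128))
# A rescale factor of 1/255 maps 8-bit pixel values into the range 0 to 1.
print(rescale_pixel(255))
```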

Apart from the above arguments there are several others available. These allow you to augment your data on the fly when feeding it to your network. Please refer to the documentation for more details.

These arguments are passed on to the ImageDataGenerator constructor to create the datagen object. The next step is to use the flow_from_directory method of this object.

The directory structure should be as follows.

The data directory should contain one folder per class, named after the class, and each folder should hold the images of that class. The total number of samples is then the number of classes multiplied by the number of examples per class.
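As an illustration of this layout, the sketch below builds a tiny throwaway directory tree with the standard library and counts the files per class. The class names (banded, dotted) and file names are placeholders chosen for the example, and the files are empty stand-ins rather than real images.

```python
import tempfile
from pathlib import Path


def count_images_per_class(data_dir):
    """Return {class_name: number_of_files} for a one-folder-per-class layout."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts


with tempfile.TemporaryDirectory() as root:
    # Build root/banded/*.jpg and root/dotted/*.jpg with three files each.
    for class_name in ("banded", "dotted"):
        class_dir = Path(root) / class_name
        class_dir.mkdir()
        for i in range(3):
            (class_dir / f"{class_name}_{i:04d}.jpg").touch()
    counts = count_images_per_class(root)

# Total samples = sum over classes of the examples in each class.
total_samples = sum(counts.values())
print(counts, total_samples)
```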

The arguments for the flow_from_directory function are explained below.

directory : String, path to the target directory. It should contain one subdirectory per class.

classes : Optional list of class subdirectories (e.g. ['dogs', 'cats']). If you want only a few of the classes in the directory, just specify those as a list. The order matters and class indices are assigned as per the list order.

class_mode : One of “categorical”, “binary”, “sparse”, “input”, or None. Default: “categorical”. This determines the type of label that is returned by the generator. ‘categorical’ is for multiple classes; labels are one-hot encoded. ‘binary’ is for two classes. ‘sparse’ returns 1D integer labels. ‘input’ is useful for autoencoders, where you require the input image to be the label. None causes the generator to return no label.

target_size : Tuple of integers (height, width), default: (256, 256). The dimensions to which all images found will be resized.

interpolation : Interpolation method used to resample the image if the target size is different from that of the loaded image. “lanczos” is useful when you are downscaling your images, which is why I have used it in this example.

For the tutorial I am using the Describable Textures Dataset (DTD). It contains 47 classes and 120 examples per class. The images are of variable size, and the target_size argument of flow_from_directory resizes them so that every batch has a uniform shape.

Next we look at some of the properties and functions available for the datagenerator that we just created. ‘samples’ gives you the total number of images available in the dataset. ‘class_indices’ gives you a dictionary mapping class names to integers.

‘filenames’ gives you a list of all filenames in the directory.
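A common use of class_indices is to invert it, so that integer labels can be turned back into class names. The sketch below assumes a dictionary of the shape that class_indices returns; the three entries are illustrative values, not the real output for the dataset, and class_mapping is just a name picked for the inverted dictionary.

```python
# Illustrative stand-in for datagenerator.class_indices (class name -> integer).
class_indices = {"banded": 0, "blotchy": 1, "braided": 2}

# Invert the mapping: integer -> class name.
class_mapping = {index: name for name, index in class_indices.items()}

# Turn a few integer labels into readable class names.
integer_labels = [2, 0, 1]
label_names = [class_mapping[i] for i in integer_labels]
print(label_names)
```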

The datagenerator object is a generator and yields (x, y) pairs on every step. In Python, next() applied to a generator returns the next item it yields, which here is one batch of images together with their labels.
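If generators are new to you, the toy example below shows the same pattern in plain Python. It is only a stand-in to illustrate next() and batching, not how Keras implements its iterators, and batch_generator is a hypothetical name.

```python
def batch_generator(samples, labels, batch_size):
    """Yield (x, y) batches forever, wrapping around at the end of the data."""
    start = 0
    while True:
        end = start + batch_size
        yield samples[start:end], labels[start:end]
        # Go back to the beginning once the data is exhausted.
        start = end if end < len(samples) else 0


gen = batch_generator(list(range(10)), list("abcdefghij"), batch_size=4)

# Each call to next() returns one (x, y) batch.
x, y = next(gen)
print(x, y)
```

Note that the last batch before wrapping around can be smaller than batch_size when the number of samples is not a multiple of it.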

As expected, x and y are both NumPy arrays. The image batch is a 4D array of shape (32, 128, 128, 3), i.e. 32 samples of dimension (128, 128, 3). The labels are one-hot encoded vectors with shape (32, 47). One-hot encoding means you encode the class numbers as vectors whose length equals the number of classes. Each vector has zeros for all classes except the class to which the sample belongs. So for a three-class dataset, the one-hot vector for a sample from the second class would be [0,1,0].

Now that we have one batch and its labels, let us visualize it and check whether everything is as expected.

y_int = np.argmax(y, axis=-1)  # convert the one-hot labels to integer class indices
show_grid(x, 4, 8, label_list=y_int, show_labels=True, figsize=(20, 10), savename='./Images/image_grid.png')
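The encoding can be written out in a few lines of plain Python. This is a sketch for illustration only; one_hot and decode are hypothetical helper names, and class indices are taken to start at 0, so the second class has index 1.

```python
def one_hot(class_index, num_classes):
    """Return a list of zeros with a single 1 at class_index."""
    vector = [0] * num_classes
    vector[class_index] = 1
    return vector


def decode(vector):
    """Return the index of the largest entry, i.e. the encoded class."""
    return vector.index(max(vector))


# Second class (index 1) in a three-class dataset.
encoded = one_hot(1, 3)
print(encoded, decode(encoded))
```

For a whole batch of labels, np.argmax(y, axis=-1) performs the same decoding in one call.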

We see that the images are rotated randomly as expected and the filling is nearest which repeats the nearest pixel value from the valid frame. The images are also shifted randomly in the horizontal and vertical directions. All of them are resized to (128,128) and they retain their color values since the color mode is ‘rgb’.

Moving on let’s compare how the image batch appears in comparison to the original images. For this we set shuffle equal to False and create another generator. This allows us to map the filenames to the batches that are yielded by the datagenerator. There is a reset() method for the datagenerators which resets it to the first batch. So whenever you would want to correlate the model output with the filenames you need to set shuffle as False and reset the datagenerator before performing any prediction.

# Shuffle has been set to False
dgen_no_shuffle = datagen.flow_from_directory('./Dataset/dtd/images', target_size=(128, 128), batch_size=32, interpolation="lanczos", shuffle=False)
# We get the third batch
dgen_no_shuffle.reset()  # resets the generator to the first batch
for i in range(3):
    x1, y1 = next(dgen_no_shuffle)
y1_int = np.argmax(y1, axis=-1)
# Plot the batch images w.r.t. the dataset images.
plt.figure(figsize=(20, 20))
idx = 1
for i in range(8):
    plt.subplot(4, 4, idx)
    idx += 1
    plt.imshow(x1[i].reshape(128, 128, 3))
    plt.subplot(4, 4, idx)
    plt.imshow(io.imread(os.path.join(dgen_no_shuffle.directory, dgen_no_shuffle.filenames[(dgen_no_shuffle.batch_index - 1) * 32 + i])))
    idx += 1
plt.savefig('./Images/visual_original_comp.png', bbox_inches='tight')

We can see that the original images are of different sizes and orientations. We get augmented images in the batches.
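The index arithmetic used above to pair batch images with file names can be isolated in a small helper. This is a sketch that assumes the file order is preserved (shuffle set to False) and that batches are numbered from 0; filenames_for_batch and the file names are hypothetical.

```python
def filenames_for_batch(filenames, batch_number, batch_size):
    """Return the file names behind the given batch (0-based) when order is preserved."""
    start = batch_number * batch_size
    return filenames[start:start + batch_size]


# Ten placeholder file names, batches of four.
filenames = [f"img_{i}.jpg" for i in range(10)]

# The third batch (batch_number=2) covers the remaining two files.
third_batch_files = filenames_for_batch(filenames, 2, 4)
print(third_batch_files)
```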

Next let’s move on to how to train a model using the datagenerator.

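# model is a compiled Keras model and validation_generator is a second
# generator; both are assumed to have been created beforehand.
epochs = 25
hist = model.fit_generator(datagenerator,
                           # true division, so that ceil rounds a partial last batch up
                           steps_per_epoch=math.ceil(datagenerator.samples / batch_size),
                           epochs=epochs,
                           validation_data=validation_generator,
                           validation_steps=math.ceil(validation_generator.samples / batch_size),
                           verbose=1,
                           workers=8)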

I’ll explain the arguments being used.

steps_per_epoch : Integer. Total number of steps (batches of samples) to yield from the generator before declaring one epoch finished and starting the next epoch. It should typically be equal to ceil(num_samples / batch_size). This ensures that the model sees all the examples once per epoch.

epochs : Integer. Number of epochs to train the model. An epoch is an iteration over the entire data provided, as defined by steps_per_epoch.

workers : Integer. Maximum number of processes to spin up when using process-based threading. If unspecified, workers will default to 1. If 0, will execute the generator on the main thread.

use_multiprocessing : Boolean. If True, use process-based threading. If unspecified, use_multiprocessing will default to False. Note that because this implementation relies on multiprocessing, you should not pass non-picklable arguments to the generator as they can’t be passed easily to children processes.

validation_data : This can be a generator or a Sequence object for the validation data, a tuple (x_val, y_val), or a tuple (x_val, y_val, val_sample_weights), on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data.

validation_steps : Only relevant if validation_data is a generator. Total number of steps (batches of samples) to yield from the validation_data generator before stopping at the end of every epoch. It should typically be equal to the number of samples of your validation dataset divided by the batch size. Optional for Sequence: if unspecified, will use len(validation_data) as the number of steps.
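The step count is worth checking by hand, because integer division silently drops a partial last batch. The sketch below uses the dataset size from this post (47 classes of 120 images, so 5640 samples) and a batch size of 32; steps_for is a hypothetical helper name.

```python
import math


def steps_for(num_samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return math.ceil(num_samples / batch_size)


num_samples = 47 * 120  # 5640 images
batch_size = 32

steps = steps_for(num_samples, batch_size)
floored_steps = num_samples // batch_size
# Samples that would never be seen in an epoch if the floored count were used.
missed = num_samples - floored_steps * batch_size
print(steps, floored_steps, missed)
```

Wrapping an integer division in math.ceil does not help, since the value has already been rounded down before ceil sees it.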

The workers and use_multiprocessing arguments together control parallel data loading: workers sets how many workers are used, and use_multiprocessing chooses whether they are processes rather than threads.

Using the datagenerator to get predictions from a model.

# This ensures you start from the first batch.
# The datagen shuffle is set to False.
# This allows you to match the predictions with the generator filenames.
dgen_no_shuffle.reset()
# true division, so that the partial last batch is also predicted
y = model.predict_generator(dgen_no_shuffle, steps=math.ceil(dgen_no_shuffle.samples / batch_size), workers=8)

It accepts the same multiprocessing arguments.
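Once the predictions are in file order, pairing them with the file names is a matter of zipping the two sequences. The sketch below uses made-up probabilities and file names in place of real model output; pair_predictions and class_mapping are hypothetical names.

```python
def pair_predictions(filenames, probabilities, class_mapping):
    """Return (filename, predicted class name) pairs, one per sample."""
    results = []
    for name, probs in zip(filenames, probabilities):
        # Index of the highest probability is the predicted class.
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((name, class_mapping[best]))
    return results


# Placeholder data standing in for generator.filenames and model output.
filenames = ["banded/a.jpg", "dotted/b.jpg"]
probabilities = [[0.9, 0.1], [0.2, 0.8]]
class_mapping = {0: "banded", 1: "dotted"}

results = pair_predictions(filenames, probabilities, class_mapping)
print(results)
```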

The last section of this post will focus on train, validation and test set creation. There are two options.

The first is to make three separate directories and create three different data generators. The second is to use the ‘validation_split’ argument in the ImageDataGenerator constructor. In this case you have one directory for the train and validation sets, but you still have a separate directory for the test set. Additionally, you have to use the subset argument of the flow_from_directory function. These arguments are explained below.

validation_split: Float. Fraction of images reserved for validation (strictly between 0 and 1). So if the value of 0.2 is used then 20% samples will be reserved for the validation set and remaining 80% for the training set.

subset: Subset of data ( "training" or "validation" ) if validation_split is set in ImageDataGenerator

The code would look something like the following.
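A minimal sketch of this setup, assuming the same DTD directory, augmentation settings, and import path used earlier in this post. The names make_train_val_generators, train_generator, validation_generator, and expected_split are illustrative, and the per-class split arithmetic reflects my understanding of how flow_from_directory divides the files rather than a documented guarantee.

```python
datagen_args = dict(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1.0 / 255,
    validation_split=0.2,  # reserve 20% of the images for validation
)


def make_train_val_generators(data_dir, batch_size=32):
    """Create training and validation generators from a single directory."""
    # Imported here so the settings above can be inspected without loading Keras.
    from keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(**datagen_args)
    flow_args = dict(target_size=(128, 128), batch_size=batch_size,
                     interpolation="lanczos")
    train_generator = datagen.flow_from_directory(
        data_dir, subset="training", shuffle=True, **flow_args)
    validation_generator = datagen.flow_from_directory(
        data_dir, subset="validation", shuffle=False, **flow_args)
    return train_generator, validation_generator


def expected_split(samples_per_class, num_classes, validation_split):
    """Expected (training, validation) counts, assuming a per-class split."""
    val = num_classes * int(validation_split * samples_per_class)
    return num_classes * samples_per_class - val, val


# 47 classes with 120 images each and a 0.2 split.
print(expected_split(120, 47, 0.2))
```

Calling make_train_val_generators('./Dataset/dtd/images') would return the two generators. Both come from the same ImageDataGenerator, so the augmentation settings apply to the validation images as well.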

Since I specified a validation_split value of 0.2, 20% of the samples, i.e. 1128 images, were assigned to the validation generator. The training and validation generators were identified in the flow_from_directory function with the subset argument.

The code is available at the following GitHub repository.

https://github.com/msminhas93/KerasImageDatagenTutorial

Thank you for reading the post. If you found it useful, please like, share and subscribe. Happy learning!