Prepare your own data set for image classification in Machine learning Python

There is large amount of open source data sets available on the Internet for Machine Learning, but while managing your own project you may require your own data set. Today, let’s discuss how can we prepare our own data set for Image Classification.

Collect Image data

The first and foremost task is to collect data (images). One can use camera for collecting images or download from Google Images (copyright images needs permission). There are many browser plugins for downloading images in bulk from Google Images. Suppose you want to classify cars to bikes. Download images of cars in one folder and bikes in another folder.

Process the Data

The downloaded images may be of varying pixel size but for training the model we will require images of same sizes. So let’s resize the images using simple Python code. We will be using built-in library PIL.

data set for image classification in Machine learning Python

Resize



from PIL import Image import os def resize_multiple_images(src_path, dst_path): # Here src_path is the location where images are saved. for filename in os.listdir(src_path): try: img=Image.open(src_path+filename) new_img = img.resize((64,64)) if not os.path.exists(dst_path): os.makedirs(dst_path) new_img.save(dst_path+filename) print('Resized and saved {} successfully.'.format(filename)) except: continue src_path = <Enter the source path> dst_path = <Enter the destination path> resize_multiple_images(src_path, dst_path)

The images should have small size so that the number of features is not large enough while feeding the images into a Neural Network. For example, a colored image is 600X800 large, then the Neural Network need to handle 600*800*3 = 1,440,000 parameters, which is quite large. On the other hand any colored image of 64X64 size needs only 64*64*3 = 12,288 parameters, which is fairly low and will be computationally efficient. Now since we have resized the images, we need to rename the files so as to properly label the data set.

Rename



import os def rename_multiple_files(path,obj): i=0 for filename in os.listdir(path): try: f,extension = os.path.splitext(path+filename) src=path+filename dst=path+obj+str(i)+extension os.rename(src,dst) i+=1 print('Rename successful.') except: i+=1 path=<Enter the path of objects to be renamed> obj=<Enter the prefix to be added to each file. For ex. car, bike, cat, dog, etc.> rename_multiple_files(path,obj)

Since, we have processed our data. Merge the content of ‘car’ and ‘bikes’ folder and name it ‘train set’. Pull out some images of cars and some of bikes from the ‘train set’ folder and put it in a new folder ‘test set’. Now we have to import it into our python code so that the colorful image can be represented in numbers to be able to apply Image Classification Algorithms.

Import Images in form of array

from PIL import Image import os import numpy as np import re def get_data(path): all_images_as_array=[] label=[] for filename in os.listdir(path): try: if re.match(r'car',filename): label.append(1) else: label.append(0) img=Image.open(path + filename) np_array = np.asarray(img) l,b,c = np_array.shape np_array = np_array.reshape(l*b*c,) all_images_as_array.append(np_array) except: continue return np.array(all_images_as_array), np.array(label) path_to_train_set = <Enter the location of train set> path_to_test_set = <Enter the location of test set> X_train,y_train = get_data(path_to_train_set) X_test, y_test = get_data(path_to_test_set) print('X_train set : ',X_train) print('y_train set : ',y_train) print('X_test set : ',X_test) print('y_test set : ',y_test)

Woah! You made it. Your image classification data set is ready to be fed to the neural network model. Feel free to comment below.