How to Create Your Own Image Dataset for Deep Learning

Bridging the Gap Between Introductory Learning and Real-World Application

Photo by Beata Ratuszniak on Unsplash

Motivation

There are plenty of MOOCs out there that claim to make you a deep learning/computer vision expert by walking you through the classic MNIST problem. That’s like saying I’d be an expert programmer for knowing how to type print("Hello World"). Real expertise is demonstrated by using deep learning to solve your own problems. However, building your own image dataset is a non-trivial task in itself, and most online courses cover it far less thoroughly.

The goal of this article is to help you gather your own dataset of raw images, which you can then use for your own image classification/computer vision projects.

Requirements:

Python: You’ll need a working version of Python on your machine. (I’m using 3.7.4)

Linux/Unix Terminal: We will be running the image downloader from the command line. If you are using Mac or Linux, the standard terminal should be fine. (I’m running Ubuntu 18.04.) For Windows, you may need to set up the Windows Subsystem for Linux or find a third-party terminal app.

Steps

Believe it or not, downloading a bunch of images can be done in just a few easy steps.

One: Install google_images_download using pip (this is the package that provides the googleimagesdownload command used below):

pip install google_images_download

Two: Download Google Chrome and Chromedriver

You will want to make sure that you get the version of Chromedriver that corresponds to the version of Google Chrome that you are running.

To check the version of Chrome on your machine: open up a Chrome browser window, click the menu button in the upper right-hand corner (three stacked dots), then click on ‘Help’ > ‘About Google Chrome’.

Once you have Chromedriver downloaded, make sure that you note where the ‘chromedriver’ executable file is stored. We will need to know its location for the next step.
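If you would rather check the version match from a script than by eye, here is a minimal sketch. It assumes that both programs print their version when run with --version, and that the binaries are called google-chrome and chromedriver on your machine; those names are my assumptions (they vary by platform), and the helper functions are hypothetical names made up for this example.

```python
import re
import subprocess


def major_version(version_output):
    """Return the major version number found in a `--version` string."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and Chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def installed_versions_match(chrome_cmd="google-chrome",
                             driver_cmd="chromedriver"):
    """Run both binaries with --version and compare their major versions.

    The default command names are assumptions; pass full paths if needed.
    """
    chrome = subprocess.run([chrome_cmd, "--version"],
                            capture_output=True, text=True).stdout
    driver = subprocess.run([driver_cmd, "--version"],
                            capture_output=True, text=True).stdout
    return versions_match(chrome, driver)
```

Calling installed_versions_match() should return True when the two major versions agree. If your Chromedriver lives somewhere else, pass its path as driver_cmd.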

Three: Use the command line to download images in batches

As an example, let’s say that I want to build a model that can differentiate between lizards and snakes. That means I’d need a dataset that has images of both lizards and snakes. I’d start by using the following command to download images of lizards:

$ googleimagesdownload -k "lizard" -s medium -l 500 -o dataset/train -i lizards -cd ~/chromedriver

This command scrapes up to 500 medium-sized images (-s medium, -l 500) from Google Images using the keyword ‘lizard’ (-k) and saves them to dataset/train/lizards/ (-o sets the output directory, -i the subfolder inside it). The -cd argument points to the location of the ‘chromedriver’ executable file we downloaded earlier.
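Since the snake images need the same command with a different keyword and folder, it can be handy to script the repetition. The sketch below only assembles the same flags used in the command above and hands them to subprocess. The function names, the default values, and the snake keyword/folder pairing are my own illustrative choices, not part of the tool.

```python
import os
import subprocess


def build_command(keyword, folder, limit=500, size="medium",
                  output_dir="dataset/train", chromedriver="~/chromedriver"):
    """Build the googleimagesdownload argument list for one class."""
    return [
        "googleimagesdownload",
        "-k", keyword,
        "-s", size,
        "-l", str(limit),
        "-o", output_dir,
        "-i", folder,
        # Expand "~" ourselves, since no shell is involved here.
        "-cd", os.path.expanduser(chromedriver),
    ]


def download_classes(classes, **options):
    """Run one download per keyword/folder pair; stop on the first failure."""
    for keyword, folder in classes.items():
        subprocess.run(build_command(keyword, folder, **options), check=True)
```

For example, download_classes({"lizard": "lizards", "snake": "snakes"}, limit=10) would run a small trial download for both classes, assuming googleimagesdownload is on your PATH.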

(Note: It may take a few minutes to run for 500 images, so I’d recommend testing it with 10–15 images first to make sure it’s working as expected.)
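One rough way to confirm that a trial run worked is to count the image files that landed in each class folder. This is a minimal sketch using only the standard library. The list of extensions is an assumption about what the scraper saves, and count_images is a hypothetical helper name.

```python
from pathlib import Path

# Assumed set of image extensions; extend it if your downloads differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(root):
    """Map each class subfolder of `root` to the number of image files in it."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts
```

Running count_images("dataset/train") after the lizard download should return a dictionary with a "lizards" entry. The count may come in below the limit you asked for if some downloads failed.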

If you open up the output folder you should see something like this: