In A.I., data is power.

A.I. algorithms, and the art of plugging them into each other, are becoming common knowledge through university courses, online training, and even people watching YouTube videos. Artificial Intelligence is open source, and it should be. What you can do to protect your company from competition is build proprietary datasets.

There are plenty of datasets open to the public: Kaggle, other corporate and academic datasets, and many federal and municipal data sources. We use these for lots of projects, but so can everyone else. What you want is to build something “special” that your competitors don’t have. For example, you can’t beat Google at search because they know what people search for and you don’t. Their advantage is the size and depth of their dataset, rather than market share alone.

We often come across requirements to build an artificial intelligence solution where the client needs an image dataset to move forward, but has no images to start from. They can’t simply use an off-the-shelf solution or API: the typical 1,000 object categories in off-the-shelf convolutional networks are not as broad as one would hope, and a classifier that differentiates between 2 classes can be a lot more powerful than one juggling 1,000. There are simply fewer chances to make the wrong prediction when the number of “output classes” (types of things the system can see) is small, and so these specialized models tend to work well.

What I want to walk you through today is one way that we build up these custom image datasets. Let’s talk about the case where there are only 2 classes: infected leaves and healthy leaves. The idea is to use A.I. to distinguish between healthy and sickly leaves in a field somewhere.

To start, we install Node.js and images-scraper, and we limit the images we will scrape to non-https URLs. We make sure the URL ends in ‘.jpg’ and that it is well formed in general (a sketch of this check appears below). What we need next is a set of keywords to scrape, and so we use the following keyword list to start off with:

keywordList = ['healthy', 'thriving', 'growing', 'living', 'beautiful', 'nourishing', 'tasty', 'green']
baseKeyword = 'leaf'
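
As for the URL filtering mentioned above, it can be as simple as a few checks on each candidate URL. Here is a minimal Python sketch; the function name and the exact rules are our illustration, not part of images-scraper:

from urllib.parse import urlparse

def isScrapeableUrl(url):
    # Keep only well-formed, plain-http URLs that end in '.jpg'
    parsed = urlparse(url)
    if parsed.scheme != 'http' or not parsed.netloc:
        return False
    return parsed.path.lower().endswith('.jpg')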

From here we generate combinations of the keyword list with the base keyword. For example:

('healthy leaf', 'thriving leaf', 'growing leaf', 'living leaf', 'beautiful leaf')
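
In code, this combination step is one line, using the variables defined above:

# Pair each modifier with the base keyword to form the search terms
searchTerms = [keyword + ' ' + baseKeyword for keyword in keywordList]
# ['healthy leaf', 'thriving leaf', 'growing leaf', ...]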

Next, we pass each of the combinations into a separate Node.js thread. The scraper collects the scraped images into a base directory with subfolders for each keyword combination. We then run scripts to remove duplicate and empty files.
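
The cleanup scripts are straightforward. Here is a rough Python sketch of the duplicate-and-empty-file pass; the directory layout and the choice of hashing are assumptions on our part:

import os
import hashlib

def cleanDirectory(baseDir):
    seenHashes = set()
    for root, dirs, files in os.walk(baseDir):
        for name in files:
            path = os.path.join(root, name)
            # Remove zero-byte files left behind by failed downloads
            if os.path.getsize(path) == 0:
                os.remove(path)
                continue
            # Hash file contents to catch the same image saved under two names
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest in seenHashes:
                os.remove(path)
            else:
                seenHashes.add(digest)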

Here is a full example of how we scrape the images for infected leaves.

keywordList = ['sick', 'damaged', 'infected', 'dying', 'bacteria', 'virus', 'sickly', 'wilting']
baseKeyword = 'leaf'

import lemayScraper as ls
ls.scrapeImages(keywordList, baseKeyword)
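
We won’t publish the internals of lemayScraper here, but a minimal sketch of what a wrapper like scrapeImages might do, assuming a Node.js scraper script (here a hypothetical scrape.js) that takes a search term and an output folder as arguments, looks something like this:

import os
import subprocess

def scrapeImages(keywordList, baseKeyword, outDir='images'):
    processes = []
    for keyword in keywordList:
        term = keyword + ' ' + baseKeyword
        subfolder = os.path.join(outDir, term.replace(' ', '_'))
        os.makedirs(subfolder, exist_ok=True)
        # One scraper job per keyword combination; 'scrape.js' is a stand-in name
        processes.append(subprocess.Popen(['node', 'scrape.js', term, subfolder]))
    # Wait for all of the scraping jobs to finish
    for p in processes:
        p.wait()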

At this point we have 2 folders, one containing thousands of images of healthy leaves (and a lot of junk) and the other containing thousands of images of infected leaves (and a lot more junk). The next task is to browse the images by hand and delete the images that are not relevant to leaves at all (say, a baby holding an umbrella), and then go through the images again and remove images that are the wrong type (a drawing of a leaf, a 3D render of a leaf, etc.). Finally, the human operator combs through the images and puts in as much effort as they feel is needed for a first pass. In later stages we may choose to crop images, or do other image cleanup. At this stage the goal is simply to make sure that bad data does not filter into the training data.