How to find datasets for Artificial Intelligence training

Recent advances in Deep Learning are only possible with the availability of large datasets

The recent proliferation of machine learning is due to a number of things. Algorithmic improvements? Definitely. Increased hardware capabilities? Without a doubt. But let’s not forget that the algorithms and hardware aren’t useful without data. We are now generating more data than ever before.

Figure 1: Daily data creation, courtesy of Mikhal Khoso

We, as a society, are creating more than 2.5 exabytes of data each day! That’s 2.5 million terabytes of video, email, photos, Facebook posts, and everything else. It’s simply astonishing. Without data, machine learning algorithms would be stuck in an AI winter, relegated to the halls of academia where researchers could only tinker without application.

It cannot be overstated that machine learning algorithms need data. They are only able to produce astonishing results if they have data to find patterns in. The purpose of this article is to curate a list of sources of public datasets that are freely available.

But before we get into data sources, let’s do a quick overview of the different kinds of machine learning algorithms. It’s difficult to create buckets that everyone agrees on, but within the Artificial Intelligence community, it’s generally understood that there are three broad classes of learning algorithms.

Learning Algorithms

Reinforcement Learning

This is a form of machine learning that tries to solve a problem by figuring out an optimal move for a given scenario. To put it a bit more formally (but not overly so), Reinforcement Learning (RL) is used in cases where an agent is able to explore a space (either known or unknown) and figure out optimal rules and policies for how to act in that space.

Generally, games come to mind when discussing RL. A lot of the work that Google’s DeepMind or OpenAI does is in the world of RL with games. AlphaGo, AlphaZero and AlphaStar are examples of RL agents. Of course, there are countless others, but in each case the agent explores the possible moves in the game space and comes up with a policy that lets it determine the optimal move for any situation.

There are datasets available to help train RL agents, but generally the training is done with some sort of a simulation and trial and error. The more accurate the simulation, the better the agent will be able to learn. Currently, the OpenAI Gym is a world class platform for building, training and testing RL agents to solve a variety of problems.
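To make the trial-and-error idea concrete, here is a minimal sketch of tabular Q-learning, one classic RL technique. It is not the method behind AlphaGo or any of the agents named above, and it does not use OpenAI Gym. The five-cell corridor "simulation", the function names, and the hyperparameters are all assumptions made up for illustration: the agent starts in cell 0 and is rewarded only when it reaches cell 4.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right

def step(state, action):
    """Toy simulator (hypothetical): returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy(q_row, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore occasionally, otherwise exploit the current estimates.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q[state], rng)
            next_state, reward, done = step(state, action)
            # Move the estimate toward reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

def policy(q):
    """Greedy action for every non-goal cell."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]

print(policy(train()))
```

After training, the greedy policy should choose "step right" in every cell, which is the optimal move in this toy space. Real problems swap the corridor for a far richer simulation, but the loop of act, observe reward, and update stays the same.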

Unsupervised Learning

Sometimes we have data and no specific objective. Generally, this form of machine learning is called unsupervised learning. In unsupervised learning, an algorithm is given a dataset and has to figure out some traits of that data. Typically, things like “clustering” come to mind, where an algorithm automatically groups together similar points within the dataset.

If you feed an algorithm thousands of pictures of dogs, cats and birds, an unsupervised learning algorithm might be able to identify that there are 3 distinct clusters of pictures. The problems arise when the algorithm decides to cluster pictures differently than how we would think to.

As another example, consider an algorithm that can cluster users by how likely they are to buy a particular product. These sorts of algorithms are widely used by retailers to segment customers automatically. If you’ve ever seen “those who like ______ also like _____,” you can be sure that some form of unsupervised learning is afoot.
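The clustering idea above can be sketched with k-means, one common clustering algorithm. This is a toy illustration only: the customer records are invented, the function names are hypothetical, and the deterministic "farthest point" seeding is a simplification chosen so that the example gives the same result on every run.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids

def k_means(points, k, iterations=10):
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

# Invented customers: (visits per month, average basket size in dollars).
customers = [(1, 10), (2, 12), (1, 11),      # occasional, small baskets
             (10, 15), (11, 14), (12, 16),   # frequent, small baskets
             (2, 90), (3, 95), (1, 92)]      # occasional, large baskets
labels, centroids = k_means(customers, k=3)
print(labels)
```

Note that the algorithm is never told what the three groups mean. It only reports which customers sit close together, and it is up to us to decide whether those groups are useful.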

Supervised Learning

Finally, we have probably the most common form of learning algorithm that’s currently being used, which is called supervised learning. In supervised learning, we have some form of input (it could be images, data points about a user, audio and so on) and are trying to predict some output: what is in the image, what group the user falls into, what words are in the audio file, etc.

It’s safe to say that most image recognition, handwriting recognition, value prediction and similar tasks fall into the supervised learning class.
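As a minimal sketch of the input-to-output idea, here is a k-nearest-neighbors classifier, one of the simplest supervised methods. The labeled examples (animal height and weight) are invented for illustration, and the function names are hypothetical. Real image or audio tasks use far richer inputs and models, but the shape of the problem is the same: learn from labeled examples, then predict the label of something new.

```python
from collections import Counter

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(training_set, query, k=3):
    """Label `query` by majority vote among its k nearest labeled examples."""
    nearest = sorted(training_set,
                     key=lambda example: squared_distance(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented labeled data: (height in cm, weight in kg) -> animal.
training_set = [((25, 4), "cat"), ((23, 5), "cat"), ((28, 6), "cat"),
                ((60, 30), "dog"), ((55, 25), "dog"), ((65, 35), "dog")]

print(knn_predict(training_set, (26, 5)))    # a small animal
print(knn_predict(training_set, (58, 28)))   # a larger animal
```

The labels are what make this supervised: without the "cat" and "dog" tags on the training examples, the algorithm would have nothing to predict.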

In this article, we will give some sources to find data to feed the supervised learning algorithms. The more data we have, the better the algorithms are at finding patterns and making predictions. Fortunately, there are many sources of data that are freely available to train your supervised learning models.

Public Datasets

Here, we outline a few freely available datasets that can be used for a variety of supervised learning tasks.

CIFAR

The CIFAR collection contains two datasets. CIFAR-10 has 60,000 images that map to 10 different classes. CIFAR-100 also has 60,000 images, but they map to 100 different classes.
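For readers who want to work with the raw files, here is a small sketch of how one CIFAR record can be turned into pixels. It assumes the layout described in the dataset's documentation for the Python version: each image is 32x32 and stored as 3,072 values, with all red values first, then green, then blue, each in row-major order. The function name and the synthetic record are hypothetical; check the layout against the files you actually download.

```python
SIDE = 32                      # images are 32x32 pixels (assumed layout)
PLANE = SIDE * SIDE            # 1,024 values per color channel

def row_to_pixels(row):
    """Convert a flat 3,072-value record into a 32x32 grid of (r, g, b) tuples."""
    if len(row) != 3 * PLANE:
        raise ValueError("expected 3072 values")
    red, green, blue = row[:PLANE], row[PLANE:2 * PLANE], row[2 * PLANE:]
    return [[(red[y * SIDE + x], green[y * SIDE + x], blue[y * SIDE + x])
             for x in range(SIDE)]
            for y in range(SIDE)]

# Synthetic record standing in for one real image: pure red everywhere.
fake_row = [255] * PLANE + [0] * PLANE + [0] * PLANE
image = row_to_pixels(fake_row)
print(image[0][0])
```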

ImageNet

This is one of the largest public datasets available. ImageNet contains over 14 million images in over 20,000 categories. Many innovative neural network architectures are developed using the ImageNet data as a benchmark.

OpenImages

This is an initiative put forth by Google. They have URLs of over 9 million images that map to 6,000 categories. OpenImages continues to be updated.

YouTube-8M

This is a massive dataset built from YouTube videos. The videos are labeled from a vocabulary of over 4,000 entities, and precomputed frame-level features are provided alongside the labels.

CelebFaces

Who doesn’t love celebrities? This dataset contains more than 200,000 celebrity face images, each with 40 attribute annotations.

COCO

COCO is a large-scale object detection, segmentation, and captioning dataset. This is a joint effort between many of the powerhouses in AI: Google Brain, Facebook AI Research, Microsoft and others. While many datasets focus on classifying a single class within a picture, COCO has pixel-level segmentation, which can be useful for a variety of tasks.

COIL100

The Columbia University Image Library contains images of 100 objects, each photographed from many angles through a full 360-degree rotation. This can be useful for many deep learning tasks where parts of an image might be occluded and need to be reconstructed.

Conclusion

This list is by no means exhaustive, but hopefully it gives an idea of what’s possible with public datasets. When faced with an AI challenge, it’s useful to do a quick search to see if datasets exist that match the problem we’re working on. Even if there’s not an exact match, more often than not, there will be one (or more) datasets that can be used as a starting point to help make even better models.

Often, not only do researchers release these great datasets, they also release the actual network architecture and weights used in their examples. With transfer learning, one can save a lot of model training time by using the existing model and its parameters as a starting point. As an example, if you had a task to detect whether a picture contained a particular kind of computer that you can’t find in any dataset, you could start out with a model pretrained on ImageNet and use a small set of images for the final training to solve your specific problem.
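The "existing parameters as a starting point" idea can be sketched in miniature. The example below is a toy, not a real image model: a three-parameter logistic regression stands in for the network, a grid of points stands in for a large dataset like ImageNet, and six hand-picked points stand in for the small task-specific set. All data, names, and hyperparameters are assumptions made up for illustration. The model is first trained on the large "source" task, and those learned weights are then used to initialize a short training run on the small "target" task.

```python
import math

def predict(weights, x):
    """Logistic model: probability that point x belongs to class 1."""
    w0, w1, b = weights
    return 1.0 / (1.0 + math.exp(-(w0 * x[0] + w1 * x[1] + b)))

def train(weights, data, steps, lr=0.1):
    """Full-batch gradient descent on log loss, starting from `weights`."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            error = predict(w, x) - y
            grad[0] += error * x[0]
            grad[1] += error * x[1]
            grad[2] += error
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w

def accuracy(weights, data):
    hits = sum((predict(weights, x) > 0.5) == bool(y) for x, y in data)
    return hits / len(data)

# Large "source" task (toy stand-in for ImageNet): is x0 + x1 positive?
source = [((i, j), int(i + j > 0))
          for i in range(-3, 4) for j in range(-3, 4) if i + j != 0]
# Small "target" task with a related but shifted boundary: is x0 + x1 > 1?
target = [((2, 1), 1), ((3, 0), 1), ((0, 3), 1),
          ((0, 0), 0), ((1, -1), 0), ((-1, 1), 0)]

pretrained = train([0.0, 0.0, 0.0], source, steps=300)   # long run, done once
fine_tuned = train(pretrained, target, steps=50)         # short run, little data
print(accuracy(fine_tuned, target))
```

With a real network the same pattern applies at a much larger scale: load the released weights, then continue training on your own small dataset, often updating only the final layers.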