This article was written in collaboration with Jennifer Prendki, Founder and CEO of Alectio.

The Big Data Labeling Crisis

The field of computer vision reached a tipping point when the size and quality of available datasets finally met the needs of theoretical machine learning algorithms. The release of ImageNet, a fully labeled dataset of 14 million images, played a critical role. When ImageNet was released, it was close to impossible for most companies to generate such a large, clean, and (most importantly) labeled dataset for computer vision. The reason was not that collecting the data itself was challenging (in fact, data collection and storage had already gotten much easier and cheaper), but that obtaining and validating such a large volume of labels was slow, tedious, and expensive.

Data scientists know all too well that data preparation takes up most of their time, yet many people — including seasoned engineers — do not fully grasp the challenges of data labeling. Labeling is the step that encodes human knowledge so it can be integrated into an algorithm, and getting it right is critical.

Some industries are lucky in that respect because the data they work with is “naturally” labeled. For example, in e-commerce, the label comes directly from the customer (e.g., did this customer buy this specific product? What rating did the reviewer give it?). However, if you are working on a machine translation or computer vision use case, the burden of labeling the data falls entirely on you. That’s why self-driving car companies, like Voyage, are putting a lot of effort, time, and money into getting high-quality labels.

An Industry-Wide Crisis

If you think this dynamic will improve anytime soon, think again. With the amount of data collected worldwide doubling every two years (or even faster, according to some sources), it is easy to see why manually labeling data is not scalable. The good news is that machine learning teams can now outsource their labeling needs to companies that focus specifically on generating high-quality labels. At Voyage, we work closely with Scale to achieve this.

In addition, more and more research is trying to solve the “Big Data Labeling Crisis” by adopting semi-automated, human-in-the-loop approaches. The underlying idea, inspired by the Pareto principle, is to use a machine learning algorithm to generate labels for the easy cases and have a human annotator handle the difficult ones. Yet, even today in 2020, most labeling is still done manually, meaning each individual record has to be reviewed by a human. That job could be marking tweets as proper or improper content (in content moderation), transcribing utterances (for a speech-to-text application), or drawing bounding boxes around relevant objects in an image (in computer vision). It should come as no surprise that this process is both time-consuming and expensive.
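The human-in-the-loop idea above can be sketched as a simple confidence threshold: the model auto-labels records it is sure about and queues the rest for human annotators. This is a minimal sketch on synthetic data with scikit-learn; the 0.9 threshold is an assumed cut-off, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: a small labeled seed set plus a pool of unlabeled records.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_seed, y_seed = X[:500], y[:500]
X_pool = X[500:]

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

# Route each unlabeled record: auto-label confident predictions,
# queue the low-confidence ones for a human annotator.
probs = model.predict_proba(X_pool)
confidence = probs.max(axis=1)
THRESHOLD = 0.9  # assumed cut-off, tuned per application in practice

auto_labeled = confidence >= THRESHOLD
print(f"auto-labeled: {auto_labeled.sum()}, sent to humans: {(~auto_labeled).sum()}")
```

In a real pipeline the auto-generated labels would still be spot-checked, since systematic model errors would otherwise leak straight into the training set.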

Labeling for Self-Driving Cars

Self-driving cars require an enormous amount of labeled data, not simply to ensure the models are accurate, but because there is no room for error: getting the correct label is safety-critical. At Voyage, our perception module is responsible for observing and understanding the environment. One of the many layers within our perception module relies on a deep neural network to make sense of the world around us: not only detecting and classifying objects, but also predicting where they are going. The performance of this system depends on the data it is trained on, and achieving state-of-the-art performance relies on a state-of-the-art dataset. Over the last few years, advancements in neural network architectures, as well as improvements in the sensors themselves, have made this problem more approachable. Data collection itself is not a bottleneck in developing self-driving cars, but structuring this data is an expensive and time-consuming process, and incorrectly labeled data can seriously degrade the performance of the system.

When Voyage and Alectio were introduced, we immediately saw the benefits of working together. At Voyage, we have been building our dataset for years, and as the data grows, so does our training time. Understanding which data was contributing the most to our model’s performance, and which data might be causing biases in our model, was critical to the next phase of our development. We decided to partner with Alectio to explore an approach called Active Learning in order to help us answer these questions.
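One common flavor of the active learning mentioned above is pool-based uncertainty sampling: repeatedly train on the labeled set, then query labels for the records the model is least sure about. This is a minimal sketch on synthetic data with scikit-learn, not Voyage's or Alectio's actual pipeline; the seed size, query size, and round count are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical pool-based active learning via uncertainty sampling.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=50, replace=False))  # small seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

for _ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    # Query the records the model is least confident about.
    uncertainty = 1 - probs.max(axis=1)
    query = np.argsort(uncertainty)[-20:]
    newly = [unlabeled[i] for i in query]
    labeled.extend(newly)  # in practice: send these records to annotators
    unlabeled = [i for i in unlabeled if i not in set(newly)]

print(len(labeled))  # 50 seed + 5 rounds * 20 queries = 150
```

The payoff is that only the queried records ever need human labels, which is exactly the lever for cutting both labeling cost and training time.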

Not All Data is Created Equal

There is no question that having more data is better, up to a certain point. Building a simple learning curve will make this clear: the more records a machine learning model sees in its training dataset, the more information it has to learn from. Still, many machine learning engineers have failed to realize a critical truth: not all data is equally valuable to a model. In fact, it is fair to assume that a significant portion of collected data is completely useless. Imagine, for example, that you are building a model to interpret parking signs, but your training set is a collection of all street signs. Most of this data will not contribute to your task. Things get even trickier if your dataset contains duplicates or pseudo-duplicates; in that case, many records carry the exact same information, but only the first one is truly valuable.
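A learning curve like the one mentioned above can be built by training on progressively larger slices of the data and scoring each model on a held-out set. This is a minimal sketch on synthetic data with scikit-learn (the slice sizes are arbitrary, and this is not the authors' setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on growing subsets and record held-out accuracy at each size.
sizes = [50, 200, 800, len(X_train)]
scores = []
for n in sizes:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    scores.append(model.score(X_test, y_test))

# Accuracy typically climbs steeply at first, then flattens:
# each additional record contributes less and less new information.
print(dict(zip(sizes, scores)))
```

The flattening tail of this curve is precisely where "more data" stops paying for its labeling cost.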

Smart Data > Big Data

At this point, you might be inclined to ask, “If not all data is equally valuable, then how do I find the valuable stuff?” This challenge is even harder than it looks, since we need to identify the most valuable records without labeling the data in the first place. To illustrate this problem, we designed an experiment. The goal of our study was to demonstrate that picking the right data can critically impact the performance of a machine learning model.