Iteration => Accuracy & Consistency

The axiom of garbage in, garbage out can be masked during training. Even when fed random noise, such as random labels or unstructured pixels, certain models can overfit to the point of reaching 0% training error (Understanding Deep Learning Requires Rethinking Generalization).

This is because recent high-capacity models like deep neural networks can memorize massive datasets. Although these models make no errors on the training set, at test time they perform no better than random guessing.
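To see this effect in miniature, here is a minimal sketch (using scikit-learn, not the paper's exact setup) that fits a high-capacity network to pure noise: training accuracy approaches 100% while test accuracy stays at chance.

```python
# A minimal sketch illustrating memorization: a high-capacity model fit to
# pure noise can reach near-zero training error while test accuracy stays
# at chance level.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 64))     # unstructured "pixels"
y_train = rng.integers(0, 2, size=1000)   # random labels, no signal
X_test = rng.normal(size=(1000, 64))
y_test = rng.integers(0, 2, size=1000)

model = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=2000)
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # approaches 1.0
print("test accuracy:", model.score(X_test, y_test))     # ~0.5, i.e. chance
```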

Therefore, iteration and rigorous QA/QC processes are essential to a proper data labeling workflow.

"Quality evaluation methods can be classified in three main families: (i) automatic, (ii) by direct inspection of the job provider and (iii) methods using the crowd itself as evaluator." — Worker Ranking Determination in Crowdsourcing Platforms using Aggregation Functions

Since, in most cases, automated evaluation without human input is either impossible or offers only weak quality guarantees, we'll discuss how to implement QA/QC methods from the latter two categories to improve confidence in the quality of your training data:

Test questions
Direct inspection
Consensus

Test questions and direct inspection are QA/QC methods that fall into category (ii), where the job provider, or data scientist, is directly responsible for evaluating quality. Test questions are a standard technique: a set of data is correctly labeled by the data scientist and then distributed randomly among labelers to test their accuracy. Direct inspection is the process of visually inspecting your labeled data to gauge accuracy.
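As a rough illustration, here is a hypothetical sketch of how test questions can be scored: a small set of gold labels is mixed into the labeling queue, and each labeler's accuracy is computed only on those items. The asset IDs and data structures are illustrative, not tied to any particular platform.

```python
# Hypothetical sketch of scoring labelers against "test questions":
# gold labels created by the data scientist are mixed into the queue,
# and each labeler's accuracy is measured on the test items they answered.
from collections import defaultdict

# gold: asset_id -> correct label (the hidden test questions)
gold = {"img_001": "cat", "img_002": "dog", "img_003": "cat"}

# submissions: (labeler_id, asset_id, label) tuples from the workforce
submissions = [
    ("alice", "img_001", "cat"),
    ("alice", "img_002", "dog"),
    ("bob",   "img_001", "dog"),
    ("bob",   "img_003", "cat"),
]

correct = defaultdict(int)
attempted = defaultdict(int)
for labeler, asset, label in submissions:
    if asset in gold:                      # only score the test questions
        attempted[labeler] += 1
        correct[labeler] += int(label == gold[asset])

for labeler in attempted:
    print(labeler, correct[labeler] / attempted[labeler])
```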

Visual screening is a basic capability everyone should have, both to preprocess data and to review labels for accuracy after labeling. In his article, Why You Need To Improve Your Training Data, And How To Do It, Pete Warden recommends randomly browsing through your data. This simple practice can reveal valuable information about your dataset, such as an "unbalanced number of examples in different categories, corrupted data (for example PNGs labeled with JPG file extensions), incorrect labels, or just surprising combinations."
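Alongside manual browsing, a quick scripted pass can surface the same kinds of issues Warden mentions. The sketch below assumes a hypothetical one-folder-per-class layout; it counts examples per category and flags files whose bytes don't match their extension.

```python
# A rough sketch of an automated sanity pass that complements random visual
# browsing: count examples per category and flag files whose contents don't
# match their extension (e.g. PNGs saved as .jpg).
# The directory layout (one folder per class) is an assumption.
from collections import Counter
from pathlib import Path

DATA_DIR = Path("dataset")  # hypothetical: dataset/<class_name>/<image files>

counts = Counter(p.parent.name for p in DATA_DIR.glob("*/*") if p.is_file())
print("examples per category:", counts)  # reveals unbalanced classes

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
for path in DATA_DIR.glob("*/*.jpg"):
    with open(path, "rb") as f:
        header = f.read(8)
    if header == PNG_MAGIC:
        print("PNG stored with a .jpg extension:", path)
```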

While most open source tools do not provide this essential feature, Labelbox is a repository of labeled data that allows you to visually browse and manage your data in one place.

While the QA/QC methods of category (ii) are extremely useful, they have two inherent drawbacks. First, they are inherently unscalable, since the job provider, or data scientist, has only finite resources to evaluate the accuracy of crowdsourced labels. Second, in order to perform these methods, the correct answers must already be known.

Consensus, on the other hand, is both inherently scalable and useful when the correct answers are unknown. Consensus requires multiple annotators to provide labels for the same piece of data. Consensus then computes the Intersection over Union (IoU) of those labels to average out individual labelers' idiosyncrasies and extract a stronger signal.

In other words, the answers to the same question are compared to determine the rate of agreement. High agreement is indicative of a high-quality dataset, while low agreement typically points to poor data quality, though it can also indicate ambiguous examples.
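As a simplified illustration of the idea (not Labelbox's exact implementation), the sketch below computes pairwise IoU between annotators' bounding boxes on the same asset and uses the mean as an agreement score, flagging low-agreement assets for review. The box format and threshold are assumptions.

```python
# Simplified sketch of consensus scoring for bounding boxes: compute the
# pairwise Intersection over Union (IoU) between annotators' boxes for the
# same asset and treat the mean as an agreement score.
from itertools import combinations

def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Three annotators label the same object on one image.
boxes = [(10, 10, 50, 50), (12, 11, 52, 49), (30, 30, 90, 90)]

scores = [iou(a, b) for a, b in combinations(boxes, 2)]
agreement = sum(scores) / len(scores)
print("mean pairwise IoU:", round(agreement, 3))

# Low agreement flags the asset for review: either a labeling error
# or a genuinely ambiguous example.
if agreement < 0.5:   # threshold is arbitrary for this sketch
    print("flag for review")
```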

Labelbox offers a built-in consensus tool so you can monitor your quality metrics in real-time. Read more about how the Labelbox Consensus tool works here.