Using simple techniques, we found annotation errors on more than 20% of images in popular open source datasets like VOC or COCO. By manually correcting those errors, we got an average error reduction of 5% for state-of-the-art computer vision models (and up to 8.3% for one dataset). This uplift can be the difference between a computer vision application that equals or even exceeds human performance, and an application that will never go into production. See below for getting access to our platform and visualizing the labelling errors on the popular datasets that we used.

Research community vs Industrial applications

Computer vision research papers come out by the dozen every day. The research community is impressively dynamic and constantly pushes the limits of what is possible. However, its main focus is on neural network design and engineering, and on all the tricks that can make models perform better on a given dataset. This is crucial to understand, and it is normal: it is intrinsic to the scientific process, whose objective is to build better algorithms and therefore to compare them on fixed datasets.

This is not the case in industrial applications, where the dataset is not fixed.

In his blog post Software 2.0 more than two years ago, Andrej Karpathy, Director of AI at Tesla, laid the groundwork for a fundamental shift in how we write software, stating that “the 2.0 programmers manually curate, maintain, massage, clean and label datasets.” But what’s at stake with this 2.0 programmer job? How do those jobs ultimately affect the performance of AI projects? And how should those jobs be done?

A significant performance boost

We took several popular datasets (VOC 2012, COCO 2017 and Udacity Self Driving Car) and started by training models on the original datasets with six different neural network architectures (see below).

We then used the error spotting tool on the Deepomatic platform to detect errors and to correct them.

Finally, we trained new models using the same neural network architectures and parameters to see to what extent it was possible to improve performance in this way. Thus, contrary to the research community, we did not work with a fixed dataset, but with fixed models.

We reduced the prediction error by 5% on average and by up to 8.3% for one dataset.

In the table below, we report the error reduction on the three datasets that we used for each one of the six neural network architectures we included in our study.

Architecture                  VOC 2012   COCO 2017   Udacity – Self Driving Car
Yolo v2                       6.5%       0.5%        -2.1%
Yolo v3                       11.9%      9.0%        21.5%
SSD Inception v2              5.8%       1.4%        10.5%
SSD MobileNet v2              3.4%       0.9%        7.5%
Faster RCNN – ResNet 50 v1    2.1%       -2.0%       6.7%
RFCN – ResNet 101 v1          2.5%       -1.3%       5.9%
Average Error Reduction       5.4%       1.4%        8.3%
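As a quick sanity check, the "Average Error Reduction" row and the overall figure of about 5% can be recomputed from the per-architecture values in the table. The short Python sketch below only uses the numbers reported above; the variable names are ours, and the overall figure is assumed to be the plain mean of the three dataset averages.

```python
from statistics import mean

# Error reductions (in %) copied from the table above, in row order:
# Yolo v2, Yolo v3, SSD Inception v2, SSD MobileNet v2,
# Faster RCNN - ResNet 50 v1, RFCN - ResNet 101 v1.
error_reduction_pct = {
    "VOC 2012": [6.5, 11.9, 5.8, 3.4, 2.1, 2.5],
    "COCO 2017": [0.5, 9.0, 1.4, 0.9, -2.0, -1.3],
    "Udacity": [-2.1, 21.5, 10.5, 7.5, 6.7, 5.9],
}

# Average over the six architectures, per dataset.
per_dataset = {name: mean(values) for name, values in error_reduction_pct.items()}

# Assumption: the overall figure is the mean of the three dataset averages.
overall = mean(list(per_dataset.values()))

for name, value in per_dataset.items():
    print(f"{name}: {value:.1f}%")
print(f"Overall: {overall:.1f}%")
```

The per-dataset averages come out at 5.4%, 1.4% and 8.3%, matching the last row of the table.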

What is the Error Reduction metric?

The error reduction metric quantifies the improvement in performance between two models. For each neural network architecture that we used, we trained two models:

- model 1 on training set 1, before any error cleaning
- model 2 on training set 2, after cleaning errors

We then evaluated the two models on the same set of images and computed the Mean Average Precision for each of them. The Mean Average Precision (mAP) is a popular computer vision metric for evaluating object detection algorithms (read this blogpost to understand better how it works). Finally, we computed the error reduction between model 2 and model 1 with:

Error Reduction = 1 - (1 - mAP2) / (1 - mAP1)

For instance, if the mAP is 0.90 for model 1 and 0.91 for model 2, the Error Reduction is 10% (we have reduced the missing mAP by 10%).

We chose this metric because it reflects the difficulty of improving the mAP when starting from a very high value. Increasing the mAP by 0.01 does not mean the same thing when the initial value is 0.3 as when it is 0.9: the corresponding Error Reduction is 1.4% in the first case and 10% in the second.
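The metric is simple enough to write down as a few lines of Python. This is a minimal sketch of the formula given above; the function name and the input checks are our own additions for illustration.

```python
def error_reduction(map_before: float, map_after: float) -> float:
    """Relative reduction of the 'missing mAP' (1 - mAP) between two models.

    map_before: mAP of model 1 (trained on the original dataset).
    map_after:  mAP of model 2 (trained on the cleaned dataset).
    """
    if not 0.0 <= map_before < 1.0:
        raise ValueError("map_before must be in [0, 1)")
    if not 0.0 <= map_after <= 1.0:
        raise ValueError("map_after must be in [0, 1]")
    return 1.0 - (1.0 - map_after) / (1.0 - map_before)


# The two cases discussed in the text: the same +0.01 mAP gain.
print(f"{error_reduction(0.90, 0.91):.1%}")  # 10.0%
print(f"{error_reduction(0.30, 0.31):.1%}")  # 1.4%
```

Note that the value is negative when model 2 scores a lower mAP than model 1, which is how the negative entries in the results table should be read.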

How to spot the labelling errors?

Labelling images with bounding boxes is a tedious task and, as with all painful and repetitive tasks, we humans make mistakes when we do it on large volumes. Even when the task is quite simple and unambiguous, it’s hard to get below 4 or 5% labelling errors for a dataset of a few thousand images.

Rather than fighting against this natural phenomenon by putting annotation teams under pressure, we have developed a product that automatically escalates potential labelling errors. Here are the results for the three datasets that we used:

                        VOC 2012   COCO 2017   Udacity – Self Driving Car
Training Set Images     17 177     94 439      11 992
Training Set Labels     20         80          9
Training Set Objects    49 834     686 385     78 230
Manual Correction       21.1%      23.6%       25%

This is huge! More than 20% of all boxes were corrected in each of the three datasets, mostly by removing or adding boxes. This means that more than 20% of the information used when training a neural network is flawed.
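This post does not detail how the platform selects the candidates it escalates. One common heuristic, shown here purely as an illustration and not as the Deepomatic algorithm, is to compare the predictions of a trained model with the existing annotations for one label: a confident prediction that overlaps no annotation may be an object that was never labelled, and an annotation that no prediction overlaps may be a box that should be checked or removed. The function names, the box format and both thresholds below are assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spot_candidates(
    annotations: List[Box],
    predictions: List[Tuple[Box, float]],  # (box, confidence score)
    iou_threshold: float = 0.5,    # hypothetical matching threshold
    score_threshold: float = 0.8,  # hypothetical confidence threshold
) -> Tuple[List[Box], List[Box]]:
    """Return (possibly_missing, possibly_wrong) boxes for one image and one label.

    possibly_missing: confident predictions that match no annotation.
    possibly_wrong:   annotations that match no prediction at all.
    Both lists are only candidates and still need human review.
    """
    confident = [box for box, score in predictions if score >= score_threshold]
    all_predicted = [box for box, _ in predictions]
    possibly_missing = [
        p for p in confident
        if all(iou(p, a) < iou_threshold for a in annotations)
    ]
    possibly_wrong = [
        a for a in annotations
        if all(iou(a, p) < iou_threshold for p in all_predicted)
    ]
    return possibly_missing, possibly_wrong


annotations = [(10, 10, 50, 50), (200, 200, 240, 240)]
predictions = [
    ((12, 11, 49, 52), 0.95),      # overlaps the first annotation
    ((100, 100, 150, 150), 0.90),  # confident, but nothing is annotated there
    ((300, 300, 320, 320), 0.30),  # low confidence, ignored
]
missing, wrong = spot_candidates(annotations, predictions)
print("Possibly missing annotations:", missing)
print("Possibly wrong annotations:", wrong)
```

In this toy example the second prediction is flagged as a possibly missing annotation and the second annotation as possibly wrong. Ranking the candidates by confidence score would be a natural next step so that reviewers see the most likely errors first.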

Top errors for the bus label in the VOC dataset (green pictures are valid errors) – The pink dotted boxes are objects that have not been labelled but that our error spotting algorithm highlighted.

Proportion of corrected errors per label (COCO 2017 dataset)

On the Deepomatic platform, our users work with datasets of tens of thousands of images, sometimes up to a few million. We have had an intuition for several years about the importance of dataset quality to achieve the performance required for production applications, but today is the first time we are confirming this impact and getting a first measure of it.

Create an account on the Deepomatic platform with the voucher code “SPOT ERRORS” to visualize the detected errors. Once you are on a dataset, click on the label that you want and use the slider at the top right corner of the page to switch modes (we call it smart detection). You can then access three tabs and the errors are listed in the False Positive and False Negative tabs (similarly to the screenshot above).