The MNIST dataset of handwritten digits has been used as a standard machine learning benchmark for over two decades. It has a training set of 60,000 examples and a test set of 10,000 examples. In today's big data era, however, this relatively small test set has come under suspicion, with many researchers concerned that repeated reuse of the MNIST test data leads models to overfit it.

To address this, a pair of researchers from New York University and Facebook AI Research recently added 50,000 test samples to the dataset. Facebook Chief AI Scientist Yann LeCun, who co-developed MNIST, tweeted his approval: “MNIST reborn, restored and expanded.”

MNIST was derived from the NIST (National Institute of Standards and Technology) dataset, whose segmented characters each occupy a 128×128 pixel raster and are labeled with one of 62 classes corresponding to “0”–“9”, “A”–“Z” and “a”–“z”. The set of images in the MNIST database is a combination of two original NIST databases: Special Database 1 and Special Database 3, which consist of digits written by high school students and employees of the United States Census Bureau, respectively.

Increasingly impressive published performance on MNIST raised researchers’ concerns that models were overfitting to the small test set, throwing MNIST itself into question: Why trust any new conclusions drawn from this dataset? How quickly do machine learning datasets become useless?

In the paper Cold Case: The Lost MNIST Digits, the researchers reconstruct the MNIST dataset by tracing each MNIST digit back to its original NIST source image and metadata, and augment the test set with 50,000 additional samples.

To create their reconstructed dataset, QMNIST, the researchers combined existing reconstruction techniques with a resampling algorithm in which the code computes the exact overlap between the input and output image pixels.
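The paper does not publish this routine in the article above, but the idea of exact-overlap resampling can be sketched as area averaging: when downscaling a 128×128 raster to a smaller grid, each output pixel is a weighted mean of the input pixels it covers, with weights equal to the exact geometric overlap. Below is a minimal NumPy illustration of that idea (function names and the separable row/column formulation are our own, not the authors' code):

```python
import numpy as np

def overlap_weights(n_in, n_out):
    # W[j, i] = length of overlap between output cell j and input cell i,
    # measured in input-pixel units, then divided by the cell width so
    # each row of W sums to 1 (an exact area average).
    scale = n_in / n_out
    W = np.zeros((n_out, n_in))
    for j in range(n_out):
        lo, hi = j * scale, (j + 1) * scale
        for i in range(int(np.floor(lo)), int(np.ceil(hi))):
            W[j, i] = min(hi, i + 1) - max(lo, i)
    return W / scale

def resample(img, n_out):
    # Separable 2D resampling: apply the overlap weights to rows,
    # then to columns.
    Wr = overlap_weights(img.shape[0], n_out)
    Wc = overlap_weights(img.shape[1], n_out)
    return Wr @ img @ Wc.T
```

Because every output pixel is an exact area average, the mean intensity of the image is preserved, even when the scale factor (e.g. 128/28) is not an integer and output pixels straddle input pixel boundaries.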

The researchers recorded MNIST and QMNIST results for various methods, including k-nearest neighbors (KNN), support vector machines (SVM), multilayer perceptrons (MLP), and convolutional networks, re-examining MNIST performance results by taking advantage of the 50,000 newly reconstructed test examples.
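The re-examination itself is conceptually simple: train a classifier once, then score it separately on the original 10,000 test digits and on the 50,000 new ones, and compare. As one illustrative stand-in for the classifiers above (not the paper's actual experiments), here is a tiny NumPy k-nearest-neighbors predictor that could be scored on two such test splits:

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    # Squared Euclidean distance from every test vector to every
    # training vector, shape (n_test, n_train).
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k nearest training points for each test point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_y[nearest]
    # Majority vote among the k nearest labels.
    return np.array([np.bincount(v).argmax() for v in votes])
```

Running the same `knn_predict` on the original and the extended test images, and comparing the two accuracies, is exactly the kind of check the authors advocate: a large gap would suggest the model (or the field's collective hyperparameter tuning) has overfit the original test set.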

LeCun believes it may be time for researchers to update their character recognition models: “If you used the original MNIST test set more than a few times, chances are your models overfit the test set. Time to test them on those extra samples.”

The paper Cold Case: The Lost MNIST Digits is on arXiv.