In ML there is the saying: garbage in, garbage out. But what does it really mean to have good or bad data? In this post, we will explore data redundancies in the training set of Fashion-MNIST and how they affect test set accuracy.

What is Data Redundancy?

We leave the more detailed explanation for a future post, but let’s start with an example of redundant data. Imagine you’re building a classifier to distinguish between images of cats and dogs. You already have a dataset of 100 cats and are looking for a dataset of dog pictures. Two friends of yours, Robert and Tom, each offer their dataset.

Robert offers you 100 pictures of his dog Bella, all taken last week.

Tom offers you pictures of 100 different dogs he collected over the last year.

Which dataset of dogs would you pick?

Of course, it depends on the exact goal of the classifier and the production environment it will run in. But I hope that in the majority of cases you would agree that the dataset from Tom, with 100 different dogs, makes more sense.

Why?

Let’s assume every image contributes a certain amount of information to our dataset. Images with the same content (the same dog) add less additional information than images with new content (different dogs). One could say that similar images are semantically redundant.

There are papers, such as “The 10% You Don’t Need”, exploring this in more detail.

Remove Redundant Data

Papers such as “The 10% You Don’t Need” follow a two-step procedure to find and remove less informative samples. First, they train an embedding. Then they apply agglomerative clustering on the embeddings and remove nearest neighbors. As the clustering metric, they use the common cosine distance. You could also normalize the features to unit norm and use the L2 distance instead.
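The paper’s exact implementation isn’t reproduced here, but a minimal sketch of this removal step, assuming scikit-learn’s AgglomerativeClustering and an illustrative distance threshold of 0.1, could look like this:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def deduplicate(embeddings, threshold=0.1):
    """Keep one representative per cluster of near-duplicate embeddings.

    embeddings: (n_samples, dim) array of features.
    threshold: cosine-distance threshold (illustrative value); samples
               closer than this are treated as redundant.
    """
    clustering = AgglomerativeClustering(
        n_clusters=None,              # let the threshold decide cluster count
        metric="cosine",              # `affinity="cosine"` on scikit-learn < 1.2
        linkage="average",            # "ward" only supports Euclidean distance
        distance_threshold=threshold,
    )
    labels = clustering.fit_predict(embeddings)
    # Keep the first sample of each cluster; the rest are near-duplicates.
    _, keep_indices = np.unique(labels, return_index=True)
    return np.sort(keep_indices)
```

Setting `distance_threshold` instead of `n_clusters` lets the number of clusters follow from the data: anything closer than the threshold gets merged, and we keep one representative per cluster.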

There are two problems we need to solve:

How do we get a good embedding?

Agglomerative clustering is slow: O(n³) time and O(n²) space complexity.

The authors solve the first problem by training the embedding using the provided labels. Think of training a classifier and then removing the last layer to get good features.
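As a minimal PyTorch sketch of this idea (the resnet18 backbone and 10-class setup are placeholders, not the paper’s actual setup):

```python
import torch
import torchvision

# Assume a classifier has already been trained on the labeled dataset.
model = torchvision.models.resnet18(num_classes=10)
# ... training loop omitted ...

# Drop the final classification layer to expose the features.
embedder = torch.nn.Sequential(*list(model.children())[:-1])
embedder.eval()

with torch.no_grad():
    x = torch.randn(8, 3, 224, 224)   # dummy batch of 8 images
    feats = embedder(x).flatten(1)    # embeddings of shape (8, 512)
```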

The second problem needs more creativity. Let’s assume we have very good embeddings that separate the individual classes. We can then process each class independently. For a dataset with 50k samples and 10 classes, we would run the clustering 10 times on 5k samples each. Since the time and space complexities are O(n³) and O(n²), this is a significant speedup: running the clustering 10 times on n/10 samples costs 10 · (n/10)³ = n³/100, roughly a 100× reduction in compute, and each run needs only a 100th of the memory.
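Concretely, the per-class split is just a loop around the `deduplicate` sketch from above (a hypothetical helper, not from the paper):

```python
import numpy as np

def deduplicate_per_class(embeddings, labels, threshold=0.1):
    """Run deduplication independently for each class."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]            # samples of class c
        kept_local = deduplicate(embeddings[idx], threshold)
        keep.extend(idx[kept_local])              # map back to global indices
    return np.sort(np.array(keep))
```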

The Startup Approach of WhatToLabel

At WhatToLabel we want to make the use of machine learning more efficient by focusing on the most important data. We help ML engineers filter and analyze their training data.

We use the same two-step approach.

[Figure: A high-level overview of our data selection algorithms]

First, we want to get a good embedding, and a lot of our effort focuses on this part. We have pre-trained models that we use as a base and fine-tune on a specific dataset using self-supervision. This allows us to work with unlabeled data. We will explain the self-supervision part in another blog post. The pre-trained model has a ResNet50-like architecture. However, the embedding output has only 64 dimensions. High-dimensional embeddings introduce various issues, such as higher computation and storage costs and less meaningful distances due to the curse of dimensionality.
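We don’t show our actual model here, but as an illustrative sketch of such a compact embedding head, assuming a 2048-d ResNet50 feature vector projected down to 64 dimensions and L2-normalized (so cosine and L2 distances rank neighbors the same way):

```python
import torch

class EmbeddingHead(torch.nn.Module):
    """Hypothetical projection head: backbone features -> compact 64-d embedding."""

    def __init__(self, in_dim=2048, out_dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x):
        z = self.proj(x)
        # Unit-normalize so that L2 and cosine distance agree up to scale.
        return torch.nn.functional.normalize(z, dim=1)
```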

Thanks to the way we train our embedding, we still get high accuracy:

[Figure: Comparison of different embeddings on the ImageNet 2012 validation set]

Second, we want a fast algorithm for data selection based on the embedding; agglomerative clustering is too slow. We explore local neighborhoods by building a graph and then iteratively running algorithms on it. We use two types of algorithms: destructive ones, which start with the full dataset and then remove samples, and constructive ones, which build a new dataset from scratch by adding relevant samples one by one. A constructive sketch is shown below.
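The exact algorithms are not public, but a minimal constructive sketch on a k-NN graph, with an illustrative cosine-distance threshold, could look like this:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def constructive_filter(embeddings, threshold=0.1, k=5):
    """Greedily build a filtered dataset from scratch (illustrative only)."""
    # Build a k-NN graph once; each point is its own first neighbor,
    # so we request k + 1 neighbors and skip column 0 below.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    dist, idx = nn.kneighbors(embeddings)

    kept = np.zeros(len(embeddings), dtype=bool)
    for i in range(len(embeddings)):
        # Neighbors of i that are closer than the redundancy threshold.
        close = idx[i, 1:][dist[i, 1:] < threshold]
        # Add sample i only if no already-kept sample is redundant with it.
        if not kept[close].any():
            kept[i] = True
    return np.where(kept)[0]
```

A destructive variant would instead start with every sample marked as kept and iteratively drop the most redundant ones.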

When we combine the two steps, we get a fast filtering solution that works even without labels.

We have developed a product that does exactly that, which you can check out at whattolabel.com, or you can build your own pipeline following this two-step filtering procedure.