Data Valuation

There’s been some interesting research on data valuation. The idea is to look at a dataset and rank its data points by their value with respect to a specific model or predictive task.

Data Shapley

Data Shapley: Equitable Valuation of Data for Machine Learning [2] by Ghorbani and Zou came out earlier this year. The authors took the Shapley value from game theory [3] (which earned Lloyd Shapley the 2012 Nobel Prize in Economics) and applied it to data in machine learning. They proposed that an equitable valuation method should satisfy the following conditions:

1. If a data point does not change the performance of your ML system when added to any subset of the training data, it should get a value of zero.

2. If two data points make exactly equal contributions when added to any subset of the training data, they should get the same value (symmetry).

3. If the value of a machine learning system can be decomposed into a sum of subsystems, the value of the data should be additive across those subsystems.

The only valuation that satisfies all three of these properties? The Shapley value.

To compute the Shapley value of a given data point, you take every subset of the remaining data, train a model on that subset with and without the point, and measure the difference in performance. The Shapley value is a weighted average of these marginal contributions across all subsets.
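As a sketch of that definition, here the model and task are abstracted into a hypothetical performance function `perf` that maps a subset of point ids to a test score (the toy scores below are made up for illustration):

```python
from itertools import combinations
from math import comb

def exact_shapley(points, perf):
    """Exact Shapley value of each training point.

    points: list of hashable data-point ids
    perf:   maps a frozenset of ids to the model's test performance
    """
    n = len(points)
    values = {}
    for i in points:
        others = [p for p in points if p != i]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # marginal contribution of i to this subset
                total += weight * (perf(s | {i}) - perf(s))
        values[i] = total
    return values

# Hypothetical performance function: point 0 carries the signal,
# points 1 and 2 are redundant with each other.
def toy_perf(s):
    if 0 in s:
        return 0.9
    return 0.5 if s else 0.0

vals = exact_shapley([0, 1, 2], toy_perf)
```

Note how the efficiency property falls out: the values sum to the performance of the full dataset, and the two interchangeable points get identical values by symmetry.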

But computing this value exactly requires a number of model trainings that grows exponentially with the number of training data sources. That’s why the paper proposed the following approximations, all designed to save time and compute:

1. Monte Carlo Estimation:

We can estimate the value of sample i by looking at only some subsets of the data. Averaging the marginal contributions over randomly sampled permutations gives an unbiased estimate that converges to the true Shapley value.

2. Truncation:

Within each sampled permutation, stop computing marginal contributions once the model’s performance stops improving appreciably (helpful for large datasets). Combined with Monte Carlo sampling, this is TMC-Shapley.
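A minimal sketch of both tricks together, again abstracting training and evaluation into a hypothetical `perf` function (the toy scores and tolerance are made up):

```python
import random

def tmc_shapley(points, perf, n_iters=300, tol=1e-3, seed=0):
    """Truncated Monte Carlo (TMC) Shapley estimate.

    Samples random permutations of the data; walks each permutation,
    crediting every point with the marginal change in performance when
    it is added; truncates the walk once performance is within `tol`
    of the score on the full dataset.
    """
    rng = random.Random(seed)
    full_score = perf(frozenset(points))
    values = {p: 0.0 for p in points}
    for t in range(1, n_iters + 1):
        perm = list(points)
        rng.shuffle(perm)
        coalition, prev = frozenset(), perf(frozenset())
        for p in perm:
            if abs(full_score - prev) < tol:
                marginal = 0.0  # truncated: remaining points add ~nothing
            else:
                coalition = coalition | {p}
                score = perf(coalition)
                marginal = score - prev
                prev = score
            values[p] += (marginal - values[p]) / t  # running mean over iterations
    return values

# Same hypothetical performance function as before: point 0 carries
# the signal, points 1 and 2 are redundant.
def toy_perf(s):
    if 0 in s:
        return 0.9
    return 0.5 if s else 0.0

vals = tmc_shapley([0, 1, 2], toy_perf)
```

With enough sampled permutations the estimates land close to the exact values, at a fraction of the subsets visited.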

3. Gradient Shapley:

Treating the value of the ML system as the performance gained from a single epoch of training rather than from training the model to convergence. This is G-Shapley.
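As a loose sketch of the idea (not the paper’s implementation), here is a value function that scores a coalition by one epoch of SGD on a toy 1-D linear model; the data, probe point, and learning rate are all hypothetical. A function like this can stand in as the performance function in a Shapley computation:

```python
def one_epoch_value(subset, data, lr=0.1):
    """G-Shapley-style value function (sketch): score a coalition by
    the performance reached after ONE epoch of SGD, instead of
    training to convergence. Toy model: 1-D regression y ~ w * x.
    """
    w = 0.0
    for idx in sorted(subset):          # one pass over the coalition
        x, y = data[idx]
        grad = 2.0 * (w * x - y) * x    # d/dw of (w*x - y)^2
        w -= lr * grad
    x_probe, y_probe = 1.0, 2.0         # hypothetical held-out probe point
    return -((w * x_probe - y_probe) ** 2)  # higher is better

# Hypothetical training data: (x, y) pairs consistent with y = 2x.
data = [(1.0, 2.0), (1.0, 2.0), (-1.0, -2.0)]
```

The point of the trick: one cheap epoch per coalition replaces a full training run, so the many evaluations the Shapley computation demands become affordable.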

4. Group Shapley:

Instead of valuing data at the sample level, group-level valuation values a batch of data together. If what you really want is to rank your data sources by value, you don’t need sample-level granularity.
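With only two data sources, the exact group-level computation collapses to two marginal contributions per source. A sketch, using a hypothetical performance function and made-up sources A and B:

```python
def two_source_shapley(group_a, group_b, perf):
    """Exact Shapley values when the 'players' are two whole data
    sources rather than individual samples. With two players, each
    source's value averages just two marginal contributions."""
    a, b = frozenset(group_a), frozenset(group_b)
    empty = frozenset()
    phi_a = 0.5 * ((perf(a) - perf(empty)) + (perf(a | b) - perf(b)))
    phi_b = 0.5 * ((perf(b) - perf(empty)) + (perf(a | b) - perf(a)))
    return phi_a, phi_b

# Hypothetical performance: sample 0 (in source A) carries the signal.
def toy_perf(s):
    if 0 in s:
        return 0.9
    return 0.5 if s else 0.0

phi_a, phi_b = two_source_shapley([0, 1], [2], toy_perf)
```

The source holding the informative sample gets most of the credit, and the two values still sum to the full-data performance.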

So does any of this work?

The following results from the paper show the effect of removing training data on prediction accuracy.

The data was valued in four ways; the highest-value data was removed on the left, and the lowest-value data on the right.

LOO is Leave-One-Out: a simpler valuation method that trains on the whole dataset, then on everything but a single data point, and values that point by the effect its exclusion had. Besides being computationally heavy for large datasets, this method can’t really capture relationships, such as redundancy, between different data points.
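A sketch of LOO, using a hypothetical performance function with two exact-duplicate points to show the failure mode: each duplicate gets zero value because the other covers for it.

```python
def leave_one_out(points, perf):
    """Leave-one-out value: how much performance drops when a single
    point is excluded from the full training set."""
    full = frozenset(points)
    base = perf(full)
    return {p: base - perf(full - {p}) for p in points}

# Hypothetical performance: points 0 and 1 are exact duplicates that
# carry the signal; either one alone is enough.
def toy_perf(s):
    if 0 in s or 1 in s:
        return 0.9
    return 0.5 if s else 0.0

loo = leave_one_out([0, 1, 2], toy_perf)
```

Here LOO assigns every point a value of zero, even though points 0 and 1 jointly carry all the signal; a Shapley-style valuation would split the credit between them instead.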

Random assigns values to the data at random, as a baseline.

As you can see, the effect of removing the highest-value and lowest-value data is dramatically greater than removing data at random. In the breast cancer example, prediction accuracy went up more than 16% with more than 40% of the data removed!

But what about overfitting?

If I’m removing data and getting better results, doesn’t that mean I’m removing data I perform worse on, and isn’t that bad?

Well, no. The idea is that you only remove the data from your training set, while performance is measured on a held-out test set, which is (hopefully) representative of your data distribution. This is standard practice in ML.

Valuation doesn’t punish outliers; it rewards them. If you’re confident in the quality and coverage of your test set, this type of valuation will rank rare samples that appear in both the training set and the test set higher, not lower. And there is no risk of overfitting from removing redundant, irrelevant, or mislabeled data.