This is a summary of the core algorithm from Generalization in Adaptive Data Analysis and Holdout Reuse, which lets you reuse a holdout set without having to worry about overfitting. It’s also been published in Science as The reusable holdout: Preserving validity in adaptive data analysis. While the work has generated a fair few articles, none that I’ve found actually talk about how it works - which is a shame, because despite all the heavy theory it’s fairly simple.

Thresholdout

Thresholdout (§4 in the arXiv paper) is an algorithm that wraps your holdout set. It comes with a query budget $B$ and a threshold $T$. Each time you query Thresholdout, it calculates your model's performance on the training set and on the holdout set, and draws random variables $\alpha$ and $\beta$.

If the performances on the two sets are within $T + \alpha$ of each other (case 1), Thresholdout returns the exact performance on the training set. If they differ by more than $T + \alpha$ (case 2), it returns the holdout performance plus the random noise $\beta$, and reduces the query budget $B$ by 1.

You can repeat the above until the budget $B$ reaches zero, at which point Thresholdout tells you you’ve exhausted your holdout set.
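The loop above can be sketched in Python. This is a sketch, not the paper's exact algorithm: the paper uses Laplace noise with particular scales, and the class name, default parameters, and noise scales here are my own illustrative choices.

```python
import numpy as np

class Thresholdout:
    """Sketch of the Thresholdout mechanism (names, defaults, and
    noise scales are illustrative, loosely following the paper)."""

    def __init__(self, train, holdout, T=0.04, sigma=0.01, budget=1000, seed=0):
        self.train = list(train)
        self.holdout = list(holdout)
        self.T = T            # base threshold
        self.sigma = sigma    # noise scale
        self.budget = budget  # remaining over-threshold queries
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        """Answer one query phi, a function mapping a sample to [0, 1]."""
        if self.budget <= 0:
            raise RuntimeError("holdout budget exhausted")
        train_perf = np.mean([phi(x) for x in self.train])
        holdout_perf = np.mean([phi(x) for x in self.holdout])
        alpha = self.rng.laplace(scale=4 * self.sigma)  # threshold noise
        if abs(train_perf - holdout_perf) <= self.T + alpha:
            return train_perf                      # case (1): exact training perf
        self.budget -= 1                           # case (2): spend budget
        beta = self.rng.laplace(scale=self.sigma)  # output noise
        return holdout_perf + beta

# Hypothetical usage: standard-normal "data", query = fraction positive.
rng = np.random.default_rng(1)
data = rng.normal(size=200)
t = Thresholdout(data[:100], data[100:], budget=50)
acc = t.query(lambda x: float(x > 0))
```

Note that only over-threshold queries (case 2) cost budget: a query whose training and holdout performances agree reveals almost nothing about the holdout, so it is free.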

The idea is that every time you hit case (2), you learn something about how the holdout set differs from the training set. Just how much you learn depends on how heavily the threshold and the reported holdout performance have been randomized: the greater the variance, the less you learn. So by choosing the distributions of the random variables carefully, you can bound the probability of being able to overfit on the holdout before your budget runs out (theorem 22).

Budget

Two theoretically guaranteed budget formulas are on offer. If $n$ is the size of your holdout set and $\tau$ is the generalization error you'll tolerate, then you should set your budget to

$$B = \tau^2 n \quad\text{ or }\quad B \approx \tau^4 n^2$$

These are pretty punishing bounds. If you've got a thousand-element holdout set and you want a generalization error of less than $\tau = 0.1$, the first formula gives a budget of just $B = 10$ queries.
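To make the arithmetic concrete, here are both formulas worked through for that example:

```python
tau, n = 0.1, 1000           # tolerated generalization error, holdout size

B_linear = tau**2 * n        # first formula: about 10 queries
B_quadratic = tau**4 * n**2  # second formula: about 100 queries
```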

Fortunately, in practice it seems you can do better. In the implementation (see below), Dwork et al. use different random variable distributions from the ones they use in the proof, and claim that the new distributions give a more reasonable budget. Unfortunately, they don't say how to calculate that budget explicitly.

Implementation

If you're looking for an implementation, code for the §5 experiments was released. Somewhat confusingly, it mixes the code for the classifier in with the code for Thresholdout. It also differs from the algorithm proven in the paper (see §3 of the supplementary materials), and doesn't make any mention of budget or generalization error.

Notes

Both in the arXiv paper and in Science's supplementary material, case (2) sets the variable $\hat T$ to $T + \gamma$. This seems unnecessary, because $\hat T$ is set to $T + \gamma$ at the start of each step as well. Either way, it doesn't affect the algorithm's function - it just replaces one random sample with another.

Finally, my usual disclaimer: this post is a poor imitation of the full work. If you've got the time and the mathematical background, it's well worth sitting down and reading the whole thing.