Statisticians have spent a lot of time attempting to do complicated inference for various machine learning models. In fact, there's a simple, naive way to do this in complete generality: use a paired t-test to compare the performance of two models on your test set!

Here are the details: suppose you have $n$ $(x, y)$ pairs, drawn iid from some true $(X, Y)$ distribution. "Machine learning" is the problem of trying to estimate $y$ using $x$, given the example $(x, y)$ pairs. Ultimately you produce a function $f(x)$ that's supposed to be a reasonable estimate for $y$. Typically, one has a loss function $L(y, f(x))$ that describes how good an estimate is. You can compare estimators by their expected loss, $L(f) = E[L(Y, f(X))]$. (The expectation we'll consider here is taken over new $(X, Y)$, where the $(x, y)$ points you used to fit $f$ are considered fixed. But there are other reasonable definitions, including taking the expectation over the $(x, y)$ data pairs, or taking it just over the $y$'s and only looking at the loss at the $n$ $x$ points.)

Now, let's say we have a test set of $m$ $(x, y)$ pairs (separate from the $n$ you used to fit your model). If you wish to compare two models you fit, $f$ and $g$, you can look at $L(y, f(x)) - L(y, g(x))$ for each of the $m$ $(x, y)$ pairs. Because $f$ and $g$ are treated as fixed once they've been fit, this gives you $m$ iid random variables, and you can use a t-test to test whether their mean is zero. Put in words, this paired t-test tests the null hypothesis that $f$ and $g$ have the same expected loss against the alternative that they don't (or against one-sided alternatives if you wish).
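Here is a minimal sketch of that test using only the Python standard library. The function name `paired_loss_test`, the toy data, and the two hand-written "models" are all made up for illustration. To avoid extra dependencies, the p-value uses a normal approximation to the t distribution, which is only reasonable when $m$ is large; `scipy.stats.ttest_rel` computes the exact paired t-test if you have SciPy available.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_loss_test(losses_f, losses_g):
    """Paired test on per-example test losses of two already-fitted models.

    Returns (mean loss difference, t statistic, two-sided p-value).
    Assumption: the p-value uses a normal approximation to the
    t distribution, so it is only trustworthy for a large test set.
    """
    if len(losses_f) != len(losses_g):
        raise ValueError("need one loss per test point for each model")
    diffs = [lf - lg for lf, lg in zip(losses_f, losses_g)]
    m = len(diffs)
    d_bar = mean(diffs)
    # stdev uses the (m - 1) denominator, as the t-test does.
    # Note: this divides by zero if every difference is identical.
    std_err = stdev(diffs) / math.sqrt(m)
    t_stat = d_bar / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return d_bar, t_stat, p_value


# Toy test set (hypothetical): y = 2x + noise.
rng = random.Random(0)
m = 2000
xs = [rng.uniform(-1, 1) for _ in range(m)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]


def f(x):
    """Hypothetical fitted model that got the slope right."""
    return 2.0 * x


def g(x):
    """Hypothetical fitted model that got the slope wrong."""
    return 1.0 * x


# Squared-error loss of each model on each test point.
losses_f = [(y - f(x)) ** 2 for x, y in zip(xs, ys)]
losses_g = [(y - g(x)) ** 2 for x, y in zip(xs, ys)]

d_bar, t_stat, p_value = paired_loss_test(losses_f, losses_g)
print(f"mean loss difference (f - g): {d_bar:.3f}")
print(f"t statistic: {t_stat:.2f}, two-sided p-value: {p_value:.3g}")
```

A negative mean difference favors $f$. The test only ever sees the two lists of per-point losses, so it doesn't care what kind of models produced them.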

There's a lot you can do with this simple setup. For example, $f$ might be a model that includes some set of features, and $g$ might include these same features except one. So you can test the "significance" of individual features, just like you can with linear or logistic regression. You can also compare a model under different tuning parameters, two different kinds of models (neural nets vs. random forests, for example), or last month's model vs. this month's.

This approach isn't a panacea, though. For one, it's quite wasteful in terms of how it's using data, and as a result is very low-powered. After being used to fit $f$ or $g$, the $n$ training data points are thrown away, and only the $m$ test points are used. Contrast this to the maximum-likelihood/likelihood-ratio-test (e.g. linear or logistic regression) approach to inference, where you can use the same $n$ training points for both fitting the model and also for inference. Fundamentally the reason you can do this for these parametric models is that you can easily understand the "degrees of freedom" of adding an additional feature, something that isn't possible in general machine-learning models.

A natural modification, then, is to use $k$-fold cross validation instead of just a single train/test split. But this doesn't quite work. First of all, doing this requires changing our expected loss definition. Previously, we were considering the $n$ $(x, y)$ points as fixed, but in the cross-validation setup we can't do that anymore since each point is considered random in its fold. That's not a big deal in and of itself; it just changes the hypothesis we're testing to how good a particular model is when fit on a random fraction $1 - 1/k$ of the data (80% for $k = 5$).

The problem, though, is that changing this setup and making all the $(x, y)$ points random makes the loss differences for held-out points no longer iid, even in the same fold. That's because the loss difference now depends on all the random $(x, y)$ pairs in the training folds (through $f$ and $g$), in addition to the $(x, y)$ pair in the held-out fold. For example, if the loss difference of the first point in a held-out fold abnormally favors $f$ over $g$, it suggests that the $(x, y)$ pairs in the training folds may have made $f$ an abnormally better model than $g$, and hence that the loss differences in the rest of the fold may also abnormally favor $f$. (In the train/test split approach, we were considering the training points as fixed, effectively conditioning on $f$ and $g$ in the loss expectation and avoiding this problem.) So the independence of the points no longer holds, and consequently we can't use the same t-test trick.
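To see roughly what this dependence costs, here is a back-of-the-envelope calculation; the notation is introduced just for this sketch. Let $d_1, \dots, d_m$ be the loss differences in a single held-out fold, and suppose they have finite variance. Given the training folds they are iid, so unconditionally they share a common variance $\sigma^2$ and a common pairwise correlation $\rho$, with $\rho \ge 0$ because their covariance is the variance of the conditional mean given the training folds. Then

$$\operatorname{Var}(\bar d) = \frac{1}{m^2}\left(m\sigma^2 + m(m-1)\rho\sigma^2\right) = \frac{\sigma^2}{m}\bigl(1 + (m-1)\rho\bigr), \qquad E[s^2] = \sigma^2(1 - \rho),$$

where $\bar d$ is the sample mean and $s^2$ is the usual sample variance. The t-test plugs in $s^2/m$, which has expectation $\sigma^2(1 - \rho)/m$, so when $\rho > 0$ it understates the variance of $\bar d$ by a factor of about $(1 + (m-1)\rho)/(1 - \rho)$. Even a small $\rho$ matters once $m$ is large. This covers only one fold; pooling across folds adds further dependence, since the training sets overlap.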

One idea I've had about this is to look at leave-one-out cross validation (LOOCV) instead of $k$-fold CV. With LOOCV, the loss differences for different points will all be exchangeable. So potentially, by finding the right central limit theorems for exchangeable sequences and putting the right conditions on the model you're considering, you could have a powerful and general approach to hypothesis tests for machine learning models.
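For concreteness, here is a sketch of how the LOOCV loss differences themselves could be computed. It deliberately stops short of a p-value, since the right test for these exchangeable differences is exactly the open question. The two fitting procedures (`fit_slope`, `fit_mean`) and the toy data are hypothetical stand-ins for whatever models you're comparing.

```python
import random


def fit_slope(train):
    """Hypothetical procedure f: least-squares line through the origin."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    slope = sxy / sxx
    return lambda x: slope * x


def fit_mean(train):
    """Hypothetical procedure g: ignore x, predict the training mean of y."""
    y_bar = sum(y for _, y in train) / len(train)
    return lambda x: y_bar


def squared_loss(y, y_hat):
    return (y - y_hat) ** 2


def loocv_loss_differences(data, fit_f, fit_g, loss):
    """Refit both procedures n times, each time leaving one point out,
    and return the loss difference on each left-out point."""
    diffs = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # all points except the i-th
        f = fit_f(train)
        g = fit_g(train)
        diffs.append(loss(y, f(x)) - loss(y, g(x)))
    return diffs


# Toy data (hypothetical): y = 2x + noise.
rng = random.Random(1)
data = []
for _ in range(50):
    x = rng.uniform(-1, 1)
    data.append((x, 2.0 * x + rng.gauss(0, 0.5)))

diffs = loocv_loss_differences(data, fit_slope, fit_mean, squared_loss)
mean_diff = sum(diffs) / len(diffs)
print(f"mean LOOCV loss difference (f - g): {mean_diff:.3f}")
```

Every point is treated identically here: each is held out exactly once, and each model is fit on the other $n - 1$ points. That symmetry is where the exchangeability comes from. The differences are still not independent, so feeding them straight into a t-test would not be justified without further argument.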
