When you are a student, one of the first problems you learn about is the two-sample test. So you might think that this problem is old news. But it has had a revival: there is a lot of recent research activity on this seemingly simple problem. What makes the problem still interesting and challenging is that, these days, we need effective high dimensional versions.

I’ll discuss three recent innovations: kernel tests, energy tests, and cross-match tests.

This post will have some ML ideas that are mostly ignored in statistics and as well as some statistics ideas that are mostly ignored in ML.

(By the way, much of what I’ll say about two-sample tests also applies to the problem of testing whether two random variables and are independent. The reason is that testing for independence really amounts to tasting whether two distribution are the same, namely, the joint distribution for and the product distribution .)

1. The Problem

We observe two independent samples:

and

(These are random vectors.) The problem is to test the null hypothesis

versus the alternative hypothesis

The various tests that I describe below all define a test statistic which s a function of the data . As usual, we reject if for some critical value . We choose such that, if is true then

where is the distribution of when is true.

Some of the test statistics can be pretty complicated. This raises the question of how we choose . In other words, how do we find the distribution of under ?

There are four ways:

We may be able to find explicitly. We may be able to find the large sample limit of and use this as an approximation. Bootstrapping. Permutation testing.

The coolest, most general and most amazing approach is number 4, the permutation approach. This might surprise you, but I think the permutation approach to testing is one of the greatest contributions of statistics to science. And, as far as I can tell, it is not well known in ML.

Since the permutation approach can be used for any test, I’ll describe the permutation method before describe any of the test statistics.

2. The Amazing Permutation Method

Let be any test statistic. It doesn’t matter what the statistic is or how complicated it is. Let us organize the data as follows:

0

0



0

1

1



1

We have added group labels to indicate the two samples. The test statistic , whatever it is, is a function of the data vectors and the group labels .

If is true, then the entire data vector is an iid sample from and the group labels are meaningless. It is as if someone randomly wrote down some 0’s and 1’s.

Now we randomly permute the group labels and recompute the test statistic. This changes the values of but (under ) it does not change the distribution of . We then repeat this process times (where is some large number like 10,000). Let us denote the values of the test statistics from the permuted data by . Since the assignment of the 0’s and 1’s is meaningless under , the ranks of are uniformly distributed. That is, if we order these values, then the original statistics is equally likely to be anywhere in this ordered list.

Finally, we define the p-value

where is the indicator function. So is just the fraction of times is larger than . Because the ranks are uniform, is uniformly distributed (except for a bit of discreteness). If we reject when , we see that the type I error, that s the probability of a false rejection, is no more than .

There are no approximations here. There are no appeals to asymptotics. This is exact. No matter what the statistic is and no mater what the distribution is, if we use the permutation methodology, we get correct type I error control.

Now we can discuss the test statistics.

3. Kernel Tests

For a number of years, a group of researchers has used kernels and reproducing kernel Hilbert spaces (RKHS) to define very general tests. Some key references are Gretton, Fukumizu, Harchaoui, Sriperumbudur (2012), Gretton, Borgwardt, Rasch, Sch{ö}lkopf and Smola (2012) Sejdnovic, Gretton, Sriperumbudur and Fukumizu (2012).

To start with we choose a kernel . It will be helpful to have a specific kernel in mind so let’s consider the Gaussian kernel

The test statistic is

There is some deep theory justifying this statistic (see above references) but the intuition is clear. Think of as a measure of similarity between and . Then compares the typical similarity of a pair of points from the same group versus the similarity of a pair of points from different groups.

The aforementioned authors derive many interesting properties of including the limiting distribution. But, for practical purposes, we don’t need to know the distribution. We simply apply the permutation method described earlier.

If you are paying close attention, you will have noticed that there is a tuning parameter so we should really write the statistic as . How do we choose this?

We don’t have to. We can define a new statistic

This new statistic has no tuning parameter. The null distribution is a nightmare. But that doesn’t matter. We simply apply the permutation method to the new statistic .

4. Energy Test

Szekely and Rizzo (2004, 2005) defined a test (which later they turned into a test for independence: see Szekely and Rizzo 2009) based on estimating the following distance between and :

where and . The test statistic is a sample estimate of this distance. The resulting test has lots of interesting properties.

Now, is based on the Euclidean metric and obviously this can be generalized to other metrics . Sejdnovic, Gretton, Sriperumbudur and Fukumizu (2012) show that, under certain conditions on , the resulting test corresponds exactly to some kernel test. Thus we can consider the energy test as a type of kernel test.

5. The Cross-Match Test

There is a long tradition of defining two sample tests based on certain “local coincidences.” I’ll describe just one: the cross-match test due to Paul Rosenbaum (2005).

The test works as follows. Ignore the labels and treat the data as one sample of size . Now form a non-bipartite matching. That is, put the data into non-overlapping pairs. For example, a typical pairing might look like this:

(For simplicity, assume there are an even number of observations.) However, we don’t pick just any matching. We choose the matching that minimizes the total distances between the points in each pair. (Any distance can be used.)

Now we look at the group labels. Some pairs will have labels or while some will have labels of the form or . Let be the number of pairs of type , let be the number of pairs of type , and let be the number of pairs of type or .

Define . Under , it is as if someone randomly sprinkled 0’s and 1’s onto the data and we can easily compute the exact distribution of under which is

where .

In fact, this distribution can be accurately approximated by a Normal with mean and variance . (Note that approximating a known, exact, distribution which contains no nuisance parameters is quite different than using a test whose null distribution is exact only in the limit.)

So the nice thing about Rosenbaum’s test is that it does not require the permutation methods. This might not seem like a big deal but it is useful if you are doing many tests.

6. Which Test Is Best?

There is no best test, of course. More precisely, the power of the test depends on and . No test will uniformly dominate another test.

If you are undecided about which test to use, you can always combine them into one meta-test. Then you can use the permutation method to get the p-value.

Having said that, I think there is room for more research about the merits of various tests under different assumptions, especially in the high-dimensional case.

7. References

Gretton, A. and Fukumizu, K. and Harchaoui, Z. and Sriperumbudur, B.K. (2012). A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 673–681.

Gretton, A., Borgwardt, K.M., Rasch, M.J., Sch{ö}lkopf and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13, 723–773.

Rosenbaum, P.R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B, 67, 515-530.

Sejdnovic, D., Gretton, A., Sriperumbudur, B. and Fukumizu, K. (2012). Hypothesis testing using pairwise distances and associated kernels. link

Sz{é}kely, G.J. and Rizzo, M.L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93, 58-80.

Szekely, G.J. and Rizzo, M.L. (2004). Testing for equal distributions in high dimension. InterStat, 5.

Szekely, G.J. and Rizzo, M.L. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3, 1236-1265.

—Larry Wasserman