From their scientific origins at the Guinness Brewery to modern web-scale marketing campaigns, AB tests continue to stand as a pillar upon which "intelligent" business decisions are made. Assessing the conclusiveness of such tests inescapably brings forth the thorny concept of "statistical significance". Indeed, it is now more and more publicly acknowledged that hypothesis testing is the point where confusion often starts to set in... As a matter of fact, there has been a lot of high-profile activity recently around the so-called "reproducibility crisis" in whose throes we currently appear to be...

In addition to its inherently perilous conceptual subtleties, real-world hypothesis testing must deal with the fact that AB tests are constantly subjected to numerous and somewhat uncontrollable changes (deployment of new features, seasonality, special-case scenarios...). Correlations, perceived or real, between all these effects often lend an unfortunate murkiness to business decisions reached on the basis of "statistical significance".

Perversely, the purpose of this note is to exemplify one essential (and not widely recognized) factor that contributes to the hazards of hypothesis testing: slow convergence due to finite-size effects. Note that we will do so by placing ourselves in the most idealized situation, where everything about the AB test is known in advance and nothing changes during the experiment, as described in the first section below. The calculator presented in the second section provides an implementation of these ideas and helps build some quantitative insight...

Description of the idealized AB test

In order to see the importance of finite-size effects, let us consider a series of N repeated events where each individual event can have only two possible outcomes: "success" with some probability p* and "failure" with probability 1 − p*. For concreteness, one may think of a coin flip in which p* represents the probability of observing a specific side (p* ≠ 1/2 for a biased coin). Alternatively, business intelligence readers may think of p* as the probability of clicking on a web page. The statistical properties of such Bernoulli trials are well known: the distribution of the empirical success rate converges to a Gaussian of mean p* and of standard deviation proportional to N^(-1/2) (decreasing as the inverse square root of the number of events).
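This N^(-1/2) shrinkage is easy to check numerically. A minimal sketch (function names are illustrative, not from any particular library): repeat the N-event experiment many times and measure the spread of the empirical success rate, which should drop by a factor of 10 when N grows by a factor of 100.

```python
import random
import statistics

def empirical_rate(p_star, n, rng):
    """Run n Bernoulli trials with success probability p_star; return the success fraction."""
    return sum(rng.random() < p_star for _ in range(n)) / n

# Repeat the experiment many times and watch the spread of the
# empirical rate shrink like N**-0.5 as N grows.
rng = random.Random(0)
p_star = 0.5
for n in (100, 10_000):
    rates = [empirical_rate(p_star, n, rng) for _ in range(300)]
    print(n, round(statistics.stdev(rates), 4))
```

For a fair coin, the standard deviation is √(p*(1 − p*)/N) = 0.5/√N, i.e. about 0.05 at N = 100 and 0.005 at N = 10,000.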

Now that we know how our events are generated, let us further assume that (perhaps through previous experience) we are already aware of some "baseline success rate" p0 against which we would like to test the performance of our experiment. The traditional technique consists in measuring our "empirical success rate" p and in feeding it along with p0 into a statistical significance test in order to extract a p-value. According to common practice, the statistical test is the so-called "one-proportion test". If the resulting p-value is smaller than 0.05, we can declare the test conclusive, meaning that the difference between p and p0 is indeed "statistically significant". Here, it is crucial to realize that, because of the finiteness of the number of events N, p is a random variable which is only on average equal to the "true success rate" p*. Unfortunately, one is rarely in a position to run multiple realizations of the experiment and must usually be satisfied with a single value p of the empirical success rate, whose deviation away from p* depends on the sample size N.
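A minimal sketch of the one-proportion test under the usual normal approximation (here the two-sided version with p0 in the standard error; an actual calculator may use a different convention, e.g. one-sided). The short demo at the bottom also shows how strongly the resulting p-value fluctuates across identical experiments:

```python
import math
import random

def one_proportion_pvalue(successes, n, p0):
    """Two-sided one-proportion z-test of an observed success count
    against baseline rate p0 (normal approximation to the binomial)."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# Because p_hat fluctuates from one realization to the next, so does
# the p-value: identical experiments can land on either side of 0.05.
rng = random.Random(0)
p_star, p0, n = 0.51, 0.50, 5_000
pvals = sorted(
    one_proportion_pvalue(sum(rng.random() < p_star for _ in range(n)), n, p0)
    for _ in range(20)
)
print(pvals[0], pvals[-1])  # smallest and largest of 20 identically generated p-values
```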

The crux of the problem: finite-size effects & slow convergence

One important consequence is that the p-value obtained by the single test described above cannot be taken at face value since it is itself a random variable with a complicated statistical distribution. Therefore, an interesting question becomes:

Given a "baseline rate" p0 and a "true success rate" p* ≥ p0, how many events N do we need until our single test yields a significant conclusion (as it should) with a probability of at least 0.9? In other words, how many events do we need in order for the probability of observing a falsely non-significant result to drop below 0.1? This kind of analysis is usually referred to as an analysis of the "power" of the hypothesis test.
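This question can be answered by brute force. A hedged sketch (assuming a two-sided test at the 0.05 level; `estimated_power` is an illustrative name, not the calculator's actual implementation): simulate the full experiment many times and count how often the single test comes out significant.

```python
import math
import random

def estimated_power(p_star, n, p0, alpha=0.05, reps=200, seed=0):
    """Monte Carlo estimate of the probability that a single two-sided
    one-proportion z-test at sample size n declares significance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        p_hat = sum(rng.random() < p_star for _ in range(n)) / n
        z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
        if math.erfc(abs(z) / math.sqrt(2)) < alpha:
            hits += 1
    return hits / reps

# The probability of (correctly) declaring significance grows only slowly with n.
for n in (1_000, 5_000, 20_000):
    print(n, estimated_power(0.51, n, 0.5))
```

The same function covers the opposite situation mentioned below: with p* at or slightly below p0, its output is the probability of a falsely significant result.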

The calculator below allows you to explore this question and discover a surprisingly slow convergence leading to rather high values of N... Note that the same tool also allows one to investigate the opposite situation, where p* ≤ p0. In this case, we may be interested in knowing how many events it takes for the experiment to yield a significant result (clearly the wrong conclusion) with a probability of no more than 0.01.

Slow convergence exposed: see for yourself

Guide: Valid probabilities can only take values between 0 and 1 exclusive. For example, you can start with p* = 0.51 and p0 = 0.5 to see that one needs to wait about 20,000 events before the probability of observing a significant result reaches 0.9. Business intelligence readers may be more interested in numbers such as p* = 0.0042 and p0 = 0.004, typical of so-called click-through rates. In this case, it takes almost 1 million events even though there is a true relative improvement of 5% in performance.
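The orders of magnitude quoted above can be cross-checked against the textbook sample-size formula for a one-proportion test. A sketch under stated assumptions (a one-sided test at the 0.05 level and the normal approximation; the calculator's exact convention is not spelled out here, so the numbers may differ somewhat):

```python
import math
from statistics import NormalDist

def required_events(p_star, p0, alpha=0.05, power=0.9):
    """Smallest N for which a one-sided one-proportion z-test detects the
    difference p_star - p0 with the requested power (normal approximation)."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    numerator = (z(1 - alpha) * math.sqrt(p0 * (1 - p0))
                 + z(power) * math.sqrt(p_star * (1 - p_star))) ** 2
    return math.ceil(numerator / (p_star - p0) ** 2)

print(required_events(0.51, 0.5))        # on the order of twenty thousand
print(required_events(0.0042, 0.004))    # several hundred thousand
```

Note the scaling: for a fixed relative improvement, N grows roughly like 1/(p* − p0)^2, which is why tiny click-through rates demand such enormous sample sizes.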




