A Standard Problem: Determining Sample Size

Recently, I was tasked with a straightforward question: "In an A/B test setting, how many samples do I have to collect in order to obtain significant results?" As usual in statistics, the answer is not quite as straightforward as the question, and it depends quite a bit on the framework. In this case, the A/B test was supposed to check whether the effect of a treatment on the success rate p had the assumed size e. The success rate had to be estimated in both the test and the control group, i.e. p_test and p_control. In short, the test hypotheses were thus

H_0: p_test = p_control   vs.   H_1: p_test = p_control + e

Now, for each statistical test, we aim at minimizing (or at least controlling for) the following types of error:

Type I: Even though H_0 is true, the test decides for H_1.

Type II: Even though H_1 is true, the test decides for H_0.

Since I can never remember stuff like this, I immediately started looking for a simple mnemonic, and I found this one:

The null hypothesis is often represented as H_0. Although mathematicians may disagree, where I live 0 is an even number, as evidenced by the fact that it is both preceded and followed by an odd number. Even numbers go together well. An even number and an odd number do not go together well. Hence the null hypothesis (even) is rejected by the type I error (odd), but accepted by the type II error (even).

The test statistic

In the given setup, the test now runs as follows: Calculate the 2x2 contingency table of successes and failures in both groups, and apply Fisher's exact test. If the test is negative, i.e. does not reject the null hypothesis, we have to repeat the experiment. However, we are quite sure that the null hypothesis is wrong and would like to prove that with as little effort as possible.
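In R, this step is a one-liner with the base function fisher.test; the counts below are made up purely for illustration:

```r
## 2x2 table of successes and failures in the test and the control group
## (counts are made up for illustration)
tab <- matrix(c(120, 380,    # test group: successes, failures
                100, 400),   # control group: successes, failures
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("test", "control"),
                              outcome = c("success", "failure")))
fisher.test(tab, alternative = "greater")  # H_1: p_test > p_control
```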

The basic question in this situation is: "How many observations do I need to collect in order to avoid both errors of Type I and Type II to an appropriate degree of certainty?" The "appropriate degree of certainty" is parametrized by the probability of a Type I error (the significance level) and the probability of a Type II error (one minus the power). The default choices for these values are 0.05 for the significance level and 0.8 for the power: in 5% of cases we reject a true H_0, and in 20% of cases we stick with H_0 even though H_1 is true. Quite clearly, only the power of the test (and not the significance level) depends on the difference between the parameters p_test and p_control.
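In the normal-approximation formulas used below, these two probabilities enter through the corresponding standard normal quantiles:

```r
## standard normal quantiles for the default choices
qnorm(1 - 0.05)  # significance level 0.05, one-sided: z ~ 1.645
qnorm(0.80)      # power 0.8: z ~ 0.842
```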

Existing functions in R

Are there already pre-defined functions to calculate the minimal required sample size? A bit of digging around yields a match in the package Hmisc: the function bsamsize implements a method developed by Fleiss, Tytun and Ury (an example call is shown after the list below). However, according to the documentation, the function is written only for the two-sided test case and does not include the continuity correction. I disagree with both decisions:

- The continuity correction term can grow quite large, and it is always positive (see (5) in the cited paper). Neglecting this term will therefore always result in an underestimation of the necessary number of observations, and may lead to unsuccessful experiments.

- The two-sided case is not the norm, but rather the exception. When testing p_control vs. p_test, the counterhypothesis will almost always read "p_test > p_control", since the measures taken are assumed to have a positive effect, if any.
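For reference, this is how the Hmisc function is called (two-sided, no continuity correction; the rates here are illustrative):

```r
library(Hmisc)
## sample size per group for p_control = 0.5 vs. p_test = 0.6,
## equal allocation, alpha = 0.05, power = 0.8
bsamsize(p1 = 0.5, p2 = 0.6, fraction = 0.5, alpha = 0.05, power = 0.8)
```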

A new R function: calculate_binomial_samplesize

After these considerations, I decided to write my own function. Below is the code: the function allows for switching the continuity correction off, and for differentiating between the one-sided and the two-sided case. In the two-sided case without continuity correction, it coincides with Hmisc::bsamsize, as can be seen from the example provided.
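A minimal sketch of what such a function might look like follows; the parameter names, defaults, and exact formula (normal approximation per Fleiss, Tytun and Ury with equal group sizes, plus an additive correction term) are my assumptions, not necessarily the original implementation:

```r
## Sketch: minimal sample size per group for testing p_control vs. p_test.
## Parameter names and the additive continuity correction term (2 / delta,
## always positive) are assumptions based on Fleiss, Tytun and Ury.
calculate_binomial_samplesize <- function(p_control, p_test,
                                          sig.level = 0.05, power = 0.80,
                                          one.sided = TRUE, continuity = TRUE) {
  k       <- if (one.sided) 1 else 2      # number of test sides
  z.alpha <- qnorm(1 - sig.level / k)     # quantile for the Type I error
  z.beta  <- qnorm(power)                 # quantile for the Type II error
  delta   <- abs(p_test - p_control)      # assumed effect size e
  p.bar   <- (p_control + p_test) / 2     # pooled success rate under H_0

  ## sample size per group, normal approximation without correction
  n <- (z.alpha * sqrt(2 * p.bar * (1 - p.bar)) +
        z.beta  * sqrt(p_control * (1 - p_control) +
                       p_test    * (1 - p_test)))^2 / delta^2

  ## continuity correction: an always-positive additive term
  if (continuity) n <- n + 2 / delta

  ceiling(n)
}

## two-sided, no continuity correction: should coincide with
## Hmisc::bsamsize(p1 = 0.5, p2 = 0.6, alpha = 0.05, power = 0.8)
calculate_binomial_samplesize(p_control = 0.5, p_test = 0.6,
                              one.sided = FALSE, continuity = FALSE)
## [1] 388
```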