“Without data you’re just another person with an opinion.”

This famous W. Edwards Deming quote captures why A/B tests are crucial to making good business decisions. At ManoMano we expose several million DIY and gardening products to several million users and run dozens of A/B tests every day to improve the customer experience on our website.

Example of product ranking algorithm A/B test at ManoMano on the garden sheds category page

However, running A/B tests and interpreting their results can be tricky and may lead to erroneous conclusions if done the wrong way. The purpose of this blog post is not to say what you should do when running A/B tests, but rather what you should not do. Here are 10 common errors we all make when interpreting frequentist A/B test results.

1 — Looking at the whole population when only a part of it is impacted

Example: you want to test your search engine relevance but you look at the whole population when analyzing your A/B test results, instead of only the customers who used the search engine. Although this is not scientifically wrong, reaching statistical significance will take longer because you are adding noise to the analyzed data.
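To see why, here is a minimal simulation sketch with hypothetical numbers (only 30% of visitors use search, and the uplift only exists for them), comparing how often the effect is detected when analysing searchers only versus the whole population:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def z_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

# Hypothetical traffic: per group, 12,000 visitors use search, 28,000 never see the change.
n_search, n_other = 12_000, 28_000
cr_search_a, cr_search_b = 0.030, 0.033   # +10% relative uplift, searchers only
cr_other = 0.020                          # non-searchers are unaffected

n_sims = 2_000
hits_search_only, hits_whole_pop = 0, 0
for _ in range(n_sims):
    conv_sa = rng.binomial(n_search, cr_search_a)
    conv_sb = rng.binomial(n_search, cr_search_b)
    conv_oa = rng.binomial(n_other, cr_other)
    conv_ob = rng.binomial(n_other, cr_other)
    hits_search_only += z_test_pvalue(conv_sa, n_search, conv_sb, n_search) < 0.05
    hits_whole_pop += z_test_pvalue(conv_sa + conv_oa, n_search + n_other,
                                    conv_sb + conv_ob, n_search + n_other) < 0.05

print(f"Tests reaching significance, searchers only : {hits_search_only / n_sims:.0%}")
print(f"Tests reaching significance, whole population: {hits_whole_pop / n_sims:.0%}")
```

With the same traffic, the analysis restricted to searchers detects the uplift more often, because the unaffected visitors only dilute the measured effect.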

Takeaway n°1: to reach statistical significance faster, observe results only on users who have interacted with the tested feature (here, the search engine).

2 — Running tests without any business intuition

A variant of this error is running a test with too many variations (A/B/C/…/n). For example, if you use an α = 5% significance threshold and test 20 different scenarios, on average one of them will come out positive purely by chance. This is an illustration of the multiple comparison problem. Business intuition is therefore crucial when deciding which A/B tests to launch. To illustrate this point, one can tweak W. Edwards Deming’s quote:

“Without d̶a̶t̶a̶ an opinion, you’re just another person with a̶n̶ ̶o̶p̶i̶n̶i̶o̶n̶ data” .

from xkcd

Takeaway n°2: use your intuition (or even better, do user research) to decide which A/B test to launch.

3 — Segmenting the population to reach statistical significance

This is another illustration of the multiple comparison problem: “My A/B test is not significant, so I will segment my data by country * device (for example) to get significant results.” You have to be very careful when doing post-test segmentation. Indeed, the more segments you compare, the greater the chance of getting a false positive among the results.

For the country * device example, we have n = 15 segments at ManoMano (5 countries * 3 devices: France/mobile, France/desktop, Spain/tablet, etc.). Let’s compute the probability of getting at least one significant result purely by chance on one of these segments:
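Treating the segments as roughly independent and assuming there is no real effect anywhere, the computation is a one-liner:

```python
alpha = 0.05
n_segments = 15   # 5 countries x 3 devices

# Probability that at least one segment comes out "significant" purely by chance,
# assuming (roughly) independent segments and no real effect anywhere.
p_at_least_one = 1 - (1 - alpha) ** n_segments
print(f"{p_at_least_one:.1%}")   # ~54%
```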

We have more than a 50–50 chance of getting at least one significant result on our segmented population, purely by chance. Drawing conclusions and taking actions from post-test segmentation is therefore very risky. Some techniques exist to mitigate the problem, such as the Bonferroni correction.

Takeaway n°3: don’t segment your population to reach statistical significance.

4 — Looking at several metrics to reach statistical significance

Yet another illustration of the multiple comparison problem: “My A/B test shows no significant result on the conversion rate, nor on the average basket, nor on the bounce rate. But it’s significant on the number of products per basket!” If you look at enough metrics, you’ll eventually find one showing a significant result just by chance:

A/B test results illustration

Takeaway n°4: stick to the metric the test was designed for.

5 — Stopping the test when reaching statistical significance

Statistical significance must not dictate when you stop a test. You need to wait until the pre-computed sample size is reached before stopping. Use an A/B test duration calculator to determine the sample size you need. For more details on this bias, read this problem illustration. You can also simulate an A/A test here to see how frequently a test reaches statistical significance at some point during its run, even when it is not significant at the end:

Observed significance of an A/A test experiment, depending on the sample size, using James Luterek’s tool.
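You can reproduce this behaviour with a small simulation. The sketch below (hypothetical 3% conversion rate, a "peek" every 500 users per group) repeatedly runs A/A tests and counts how often the running p-value drops below 0.05 at some point, even though there is no real difference:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeked_pvalues(n_users, conversion_rate, peek_every=500):
    """p-values of one A/A test, recomputed each time we 'peek' at the results."""
    a = rng.binomial(1, conversion_rate, n_users).cumsum()
    b = rng.binomial(1, conversion_rate, n_users).cumsum()
    ns = np.arange(peek_every, n_users + 1, peek_every)
    p_a, p_b = a[ns - 1] / ns, b[ns - 1] / ns
    p_pool = (a[ns - 1] + b[ns - 1]) / (2 * ns)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / ns))
    z = (p_b - p_a) / se
    return 2 * norm.sf(np.abs(z))

# No real difference exists, yet many A/A tests look significant at SOME point.
n_sims = 1_000
early_winners = sum((peeked_pvalues(20_000, 0.03) < 0.05).any() for _ in range(n_sims))
print(f"A/A tests significant at least once while peeking: {early_winners / n_sims:.0%}")
```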

Takeaway n°5: even if your test is statistically significant, keep the test alive until it is finished.

6 — Keeping the test alive until reaching statistical significance

Again, statistical significance must not dictate when you stop or extend a test. Don’t keep waiting for a test to become significant: it may never happen, especially if you have already reached the sample size computed before the test, which means the test has sufficient statistical power to conclude.
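For reference, here is a minimal sketch of how that required sample size can be computed up front, using statsmodels and hypothetical values for the baseline conversion rate and the minimum detectable effect:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.030   # hypothetical control conversion rate
relative_mde = 0.10   # smallest uplift worth detecting: +10% relative

effect = proportion_effectsize(baseline_cr * (1 + relative_mde), baseline_cr)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size: {n_per_group:,.0f} users per group")
```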

Takeaway n°6: stop your test once you have reached the required sample size.

7 — Considering (1 - p-value) as the probability of B being better than A

This is a very common mistake. Having a p-value of 2% doesn’t mean that B has a 98% chance of being better than A. This interpretation is mathematically wrong because the answer also depends on the base rate, i.e. the percentage of the tests you run that actually have a positive effect (only God knows this number!). This number reflects the quality of your business intuition.

Let’s suppose we are God and we know that the base rate at ManoMano is 20%, meaning that 20 out of every 100 tests we run actually have a positive effect.

Of those 20 truly positive tests, 80% reject the null hypothesis (that’s the statistical power), i.e. 16 tests.

Of the 80 remaining tests with no real effect, 5% reject the null hypothesis anyway (that’s the significance threshold), i.e. 4 tests.

Conclusion: given an 80% statistical power, a 5% significance threshold and a 20% base rate, when a test comes out positive (p-value < 0.05) we only have a 16 / (16 + 4) = 80% chance that it is actually positive, not 95%.

In this configuration (statistical power = 80%, significance threshold = 5%), and knowing that your A/B test results are significant, the probability that your test is actually positive varies strongly with the base rate:
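The computation is a direct application of Bayes’ rule, following the same reasoning as the 16 / (16 + 4) example above; here is a small sketch with a few illustrative base rates:

```python
def prob_actually_positive(base_rate, power=0.80, alpha=0.05):
    """P(the change truly works | the test is significant), by Bayes' rule."""
    true_positives = power * base_rate          # real effects that reach significance
    false_positives = alpha * (1 - base_rate)   # no-effect tests that reach it anyway
    return true_positives / (true_positives + false_positives)

for base_rate in (0.05, 0.10, 0.20, 0.50):
    print(f"base rate {base_rate:.0%} -> "
          f"P(actually positive | significant) = {prob_actually_positive(base_rate):.0%}")
```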

Takeaway n°7: (1 - p-value) is not the probability that the test is positive. If you still want to calculate this probability, use a Bayesian A/B testing approach.

8 — Considering that the observed increment is the increment brought by the feature

The increment observed during the test is what lets you compute statistical significance, but it is wrong to treat this observed increment as the actual increment brought by the feature: estimating that precisely would usually require many more users or sessions.

If stating that B is better than A is not enough, communicate confidence intervals rather than raw increments. To illustrate, let’s take an example A/B test:

The p-value is 0.014, so at a 95% confidence level this A/B test is positive, meaning the variation is significantly better than the control. The observed relative increment in conversion rate can easily be computed:

But it is incorrect to consider this observed increment as the real increment brought by the feature. You should instead compute a confidence interval for each group using the standard error:

CI = p ± Zα * √(p(1 − p) / n)

where p is the observed conversion rate of the group, n is the group’s sample size, and Zα is the z-value corresponding to the confidence level α (95% in our case). You can find z-values for usual confidence levels here.

Using this formula, you can finally compute the 95% confidence interval (CI) of the conversion rate for each of the two groups:
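Here is a minimal sketch of that computation with hypothetical figures (50,000 users per group, 1,200 vs 1,320 conversions, i.e. a roughly +10% relative lift):

```python
from math import sqrt
from scipy.stats import norm

def conversion_ci(conversions, users, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    p = conversions / users
    z = norm.ppf(1 - (1 - confidence) / 2)       # ≈ 1.96 for a 95% CI
    margin = z * sqrt(p * (1 - p) / users)
    return p - margin, p + margin

# Hypothetical results: 50,000 users per group
low_a, high_a = conversion_ci(conversions=1_200, users=50_000)   # control
low_b, high_b = conversion_ci(conversions=1_320, users=50_000)   # variation
print(f"A: 95% CI = [{low_a:.2%}, {high_a:.2%}]")
print(f"B: 95% CI = [{low_b:.2%}, {high_b:.2%}]")
```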

It’s also possible to calculate a confidence interval on the relative increment (PctDiff), but it’s more complicated. See section 3.3.2 of this survey if you want more details.

Takeaway n°8: when your test is significantly positive, prefer to communicate confidence intervals rather than raw increments.

9 — Ignoring A/B tests results when they go against your intuition

There is no point in launching A/B tests, other than feeding your confirmation bias, if you (and your organization) are not ready to update your product with the winning version. Intuition is crucial when choosing what to test, but it should not override an A/B test’s unbiased results.

Takeaway n°9: agree with stakeholders on a decision threshold and the associated action(s) before launching the test.

10 — Forgetting to check that your A/B testing system is reliable

To guarantee the reliability of your A/B test results, it is essential that your A/B testing system is calibrated and functional. One way to ensure this reliability is to continuously run an A/A test and check that there is no significant difference between the two populations:

ManoMano’s continuous A/A test, which allowed us to quickly detect a cache bug in August that invalidated all our tests running between August 20 and August 22.
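A minimal sketch of such a check, using a two-proportion z-test on hypothetical counts from the two A/A buckets:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical cumulative counts from the two buckets of a continuously running A/A test
conversions = [1_510, 1_482]
visitors = [62_000, 61_800]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
if p_value < 0.05:
    print(f"Alert (p = {p_value:.3f}): the A/A buckets differ, check the assignment "
          "and tracking pipeline before trusting any running A/B test.")
else:
    print(f"A/A test looks healthy (p = {p_value:.3f}).")
```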

Takeaway n°10: continuously perform an A/A test to detect reliability issues.

Conclusion

As you can see, the risk of making mistakes when analysing A/B test results is very high, while the decisions taken following those tests are crucial for your company. That’s why you should be very skeptical about the A/B test results communicated to you, especially when they come from someone with a personal interest in the results being positive (someone who wants to sell you something, for example). At ManoMano, we have a trusted committee responsible for analysing the results of all internal and external A/B tests, with no personal stake in the conclusions.

Acknowledgments

Thanks to all my ManoMano colleagues, especially those who took the time to review this article: Florent Martineau, Hugo Epicier, Alexandre Cazé, Raphaël Siméon, Yohan Grember, Marin De Beauchamp, Jacques Peeters, Clément Caillol, Grégoire Paris, Pierre Trollé, Chloé Martinot, Enguerran Chevalier, Charles Goddet.

Join us

We are looking for new colleagues at ManoMano, take a look at our job offers!