How Etsy Handles Peeking in A/B Testing

Etsy relies heavily on experimentation to improve our decision-making process. We leverage our internal A/B testing tool when we launch new features, polish the look and feel of our site, or even make changes to our search and recommendation algorithms. For years, we at Etsy have prided ourselves on our culture of continuous experimentation. However, as our experimentation platform scales and the velocity of experimentation increases rapidly across the company, we also face a number of new challenges. In this post, we investigate one of these challenges: how to peek at experimental results early in order to increase the velocity of our decision-making without sacrificing the integrity of our results.

The Peeking Problem

In A/B testing, we’re looking to determine if a metric we care about (e.g., the percentage of visitors who make a purchase) is different between the control and treatment groups. But when we detect a change in the metric, how do we know if it is real or due to random chance? We can look at the p-value of our statistical test, which indicates the probability that we would see a difference at least as large as the one detected, assuming there is no true difference between the groups. When the p-value falls below the significance level threshold, we say that the result is statistically significant and we reject the hypothesis that the control and treatment are the same.

So we can just stop the experiment when the hypothesis test for the metric we care about has a p-value of less than 0.05, right? Wrong. In order to draw the strongest conclusions from the p-value in the context of an A/B test, we have to fix the sample size of the experiment in advance and make a decision based on the p-value only once. Peeking at data regularly and stopping an experiment as soon as the p-value dips below 0.05 increases the rate of Type I errors, or false positives, because each additional look is another chance for a false positive, and these chances compound to increase the overall probability that you’ll see a false result.

Let’s look at an example to gain a more concrete view of the problem. Suppose we run an experiment where there is no true change between the control and experimental variant and both have a baseline target metric of 50%. If we are using a significance level of 0.1 and there is no peeking (in other words, the sample size needed before a decision is made is determined in advance), then the rate of false positives is 10%. However, if we do peek and we check for significance at every observation, then after 500 observations, there is over a 50% chance of incorrectly stating that the treatment is different from the control (Figure 1).

Figure 1: Chances for accepting that A and B are different, with A and B both converting at 50%.
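The effect described above can be reproduced with a small simulation. The following is a minimal sketch, not the code behind Figure 1: it assumes a pooled two-proportion z-test, 500 visitors per arm, and peeking that starts after the tenth visitor in each arm. The function names and all of these settings are illustrative assumptions.

```python
import math
import random

Z_CRIT = 1.6449  # two-sided critical value for a significance level of 0.10


def z_stat(succ_a, succ_b, n):
    """Pooled two-proportion z statistic with n visitors in each arm."""
    pooled = (succ_a + succ_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (succ_a - succ_b) / n / se


def simulate(n_sims=1000, n_per_arm=500, rate=0.5, first_look=10, seed=0):
    """Return (fixed-horizon, peeking) false positive rates for A/A tests."""
    rng = random.Random(seed)
    fixed_hits = peek_hits = 0
    for _ in range(n_sims):
        a = b = 0
        crossed = False
        for n in range(1, n_per_arm + 1):
            # Both arms convert at the same true rate, so any "win" is false.
            a += rng.random() < rate
            b += rng.random() < rate
            # Peeking: test after every new pair of visitors.
            if not crossed and n >= first_look and abs(z_stat(a, b, n)) > Z_CRIT:
                crossed = True
        # Fixed horizon: test only once, at the planned sample size.
        fixed_hits += abs(z_stat(a, b, n_per_arm)) > Z_CRIT
        peek_hits += crossed
    return fixed_hits / n_sims, peek_hits / n_sims


if __name__ == "__main__":
    fixed, peeking = simulate()
    print(f"fixed horizon: {fixed:.2f}, peeking: {peeking:.2f}")
```

The fixed-horizon rate should land near the nominal 10%, while the peeking rate comes out several times higher, even though the two variants are identical by construction.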

At this point, you might already have figured that the simplest way to solve the problem would be to fix a sample size in advance and run an experiment until the end before checking the significance level. However, this requires a strictly enforced separation between the design and analysis of experiments, which can have large repercussions throughout the experimental process. In the early stages of an experiment, we may miss a bug in the setup or in the feature being tested that will invalidate our results later. If we don’t catch these bugs early, our experimental process slows down unnecessarily, leaving less time for iterations and real site changes. Another setup issue is that it can be difficult to predict, prior to the experiment, the effect size product teams would like to detect, which makes it hard to choose the right sample size in advance. Even assuming we set up our experiment perfectly, there are downstream implications. If an experiment is impacting a metric in a negative way, we want to be aware as soon as possible so we don’t negatively affect our users’ experience. These considerations become even more pronounced when we’re running an experiment on a small population or in a less trafficked part of the site, where it can take months to reach the target sample size. Across teams, we want to be able to iterate quickly without sacrificing the integrity of our results.

With this in mind, we need to come up with a statistical methodology that will give reliable inference while still providing product teams the ability to continuously monitor experiments, especially our long-running experiments. At Etsy, we tackle this challenge from two sides: user interface and statistical procedures. We made a few user interface changes to our A/B testing tool to prevent our stakeholders from drawing false conclusions, and we implemented a flexible p-value stopping point in our platform, which takes inspiration from the sequential testing concept in statistics.

It is worth noting that the peeking problem has been studied by many, including industry veterans [1, 2], developers of large-scale commercial A/B testing platforms [3, 4] and academic researchers [5]. Moreover, it is hardly a challenge exclusive to A/B testing on the web. The peeking problem has troubled the medical field for a long time; for example, medical scientists could peek at the results and stop a clinical trial early because of initial positive results, leading to flawed interpretations of the data [6, 7].

Our Approach

In this section, we dive into the approach that we have designed and adapted to address the peeking problem: transitioning from traditional, fixed-horizon testing to sequential testing, and preventing peeking behaviors through user interface changes.

Sequential Testing with Difference in Converting Visits

Sequential testing, which has been widely used in clinical trials [8, 9] and has recently gained popularity for web experimentation [10], guarantees that if we end the test when the p-value falls below an adjusted threshold, the overall false positive rate will be no more than a predefined level α. It does so by computing the probabilities of false positives at each potential stopping point using dynamic programming, assuming that our test statistic is normally distributed. Since we can compute these probabilities, we can then adjust the test’s p-value threshold, which in turn changes the false positive chance, at every step so that the total false positive rate stays below the level that we desire. Therefore, sequential testing enables concluding experiments as soon as the data justifies it, while also keeping our false positive rate in check.
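To make the threshold adjustment concrete, here is a minimal sketch that uses Monte Carlo simulation in place of the dynamic programming mentioned above. It assumes a fixed number of equally sized looks, a normally distributed test statistic, and one constant z threshold shared by all looks (a Pocock-style boundary). The function names and default values are illustrative assumptions.

```python
import math
import random


def constant_z_threshold(n_looks=5, alpha=0.05, n_sims=20000, seed=0):
    """Estimate the z threshold that keeps the overall two-sided false
    positive rate at alpha when testing at every one of n_looks looks."""
    rng = random.Random(seed)
    max_abs_z = []
    for _ in range(n_sims):
        total = 0.0
        worst = 0.0
        for k in range(1, n_looks + 1):
            # Under the null, each block of data adds a standard normal
            # increment; the z statistic after k blocks is total / sqrt(k).
            total += rng.gauss(0.0, 1.0)
            worst = max(worst, abs(total) / math.sqrt(k))
        max_abs_z.append(worst)
    max_abs_z.sort()
    # The (1 - alpha) quantile of the largest |z| seen across all looks.
    index = min(int((1 - alpha) * n_sims), n_sims - 1)
    return max_abs_z[index]


def nominal_p_value(z):
    """Two-sided p-value threshold that corresponds to a z threshold."""
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    z = constant_z_threshold()
    print(f"per-look z threshold: {z:.2f}, per-look p threshold: {nominal_p_value(z):.4f}")
```

With five looks, the per-look p-value threshold comes out well below the overall 0.05 target, which is the price paid for being allowed to stop at any of the looks.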

We investigated a few methods, including O’Brien-Fleming, Pocock and sequential testing using the difference in successful observations. We ultimately settled on the last approach. Using the difference in successful observations, we look at the raw difference in converting visits between variants and stop an experiment when this difference becomes large enough. The difference threshold is only valid until we reach a predetermined total number of converted visits. This method is good at detecting small changes and does so quickly, which makes it most suitable for our needs. Nevertheless, we did consider some drawbacks of this method as well. Traditional power and significance calculations use the proportion of successes, whereas looking at the difference in converted visits does not take into account total population size. Because of this, with high baseline target metrics we are more likely to reach the total number of converted visits before we see a large enough difference in converted visits. This means we are more likely to miss a true change in these cases. Furthermore, it requires extra setup when an experiment is not evenly split across variants. We chose to use this method with a few adjustments for these shortcomings so we could increase our speed of detecting real changes between experimental groups.

Our implementation of this method is influenced by the approach described by Evan Miller. This method sets a threshold for the difference between the control and treatment converted visits based on the minimum detectable effect and the target false positive and false negative rates. If the experiment reaches or passes the threshold, we allow the experiment to end early. If this difference is not reached, we assess our results using the standard approach of a power analysis. The combination of these methods creates a continuous p-value threshold curve, and we can safely stop an experiment when the p-value is under the curve. This threshold is lower near the beginning of an experiment and converges to our significance level as the experiment reaches our targeted power. This allows us to detect changes more quickly with low baselines while not missing smaller changes for experiments with high baseline target metrics.

Figure 2: Example of a p-value threshold curve.
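The difference-in-converted-visits rule described above can be sketched in a few lines. This is only an illustration of the basic idea, assuming an evenly split experiment and the 2√N boundary from Evan Miller's published description of a one-sided test at roughly 5% significance. It does not include the adjustments or the p-value threshold curve described in this post, and the function name and the "T"/"C" stream encoding are hypothetical.

```python
import math


def sequential_decision(conversions, n_total, boundary=2.0):
    """Walk through converting visits in arrival order.

    conversions: iterable of "T" (treatment) or "C" (control), one entry
                 per converting visit.
    n_total:     total number of converted visits fixed in advance.
    boundary:    multiplier on sqrt(n_total); 2.0 is an assumed value.
    """
    threshold = boundary * math.sqrt(n_total)
    treatment = control = 0
    for arm in conversions:
        if arm == "T":
            treatment += 1
        else:
            control += 1
        # Stop early once treatment leads by a large enough margin.
        if treatment - control >= threshold:
            return "treatment wins", treatment + control
        # The threshold is only valid up to n_total converted visits.
        if treatment + control >= n_total:
            return "no winner", treatment + control
    return "keep running", treatment + control


if __name__ == "__main__":
    print(sequential_decision("TTCTTTCT" * 20, n_total=400))
```

Because the rule looks only at counts of converting visits, it needs no per-look p-value computation, which is part of what makes it simple to run continuously.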

To validate this approach, we tested it on results from experimental simulations with various baselines and effect sizes using mock experimental conditions. Before implementing, we wanted to understand:

What effect will this have on false positive rates?
What effect does early stopping have on reported effect size and confidence intervals?
How much faster will we get a signal for experiments with true changes between groups?

We found that when using a p-value curve tuned for a 5% false positive rate, our early stopping threshold does not materially increase the false positive rate and we can be confident of a directional change.

One of the downsides of stopping experiments early, however, is that with an effect size under ~5%, we tend to overestimate the impact and widen the confidence interval. To accurately attribute increases in metrics to experimental wins, we developed a haircut formula to apply to the effect size in metrics for experiments that we decide to end early. Furthermore, we offset some of this overestimation by setting a standard of running experiments for at least 7 days to account for different weekend and weekday trends.

Figure 3: Reported Vs. True Effect Size

We tested this method with a series of simulations and saw that for experiments which would take 3 weeks to run assuming a standard power analysis, we could save at least a week in most cases where there was a real change between variants. This helped us feel confident that even with a slight overestimation of effect size, it was worth the time savings for teams with low baseline target metrics who typically struggle with long experimental run times.

Figure 4: Day Savings From Sequential Testing

UI Improvements

In our experimental testing tool, we wanted stakeholders to have access to the metrics and calculations we measure throughout the duration of the experiment. In addition to the p-value, we care about power and the confidence interval. First, power. Teams at Etsy often have to coordinate experiments on the same page, so it is important for teams to have an idea of how long an experiment will have to run assuming no early stopping. We do this by running an experiment until we reach a set power.
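As an illustration of how a run time can be derived from a target power, here is a minimal sketch using the textbook sample size formula for comparing two proportions. It assumes an evenly split experiment and steady daily traffic; the function names, default values, and traffic figure are hypothetical and not taken from the tool.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.10, power=0.80):
    """Visitors needed in each arm to detect a relative lift over the
    baseline conversion rate with a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_power(baseline, relative_lift, daily_visitors_per_arm, **kwargs):
    """Days of traffic needed to reach the sample size above."""
    needed = sample_size_per_arm(baseline, relative_lift, **kwargs)
    return math.ceil(needed / daily_visitors_per_arm)


if __name__ == "__main__":
    # Hypothetical metric: 10% baseline, 10% relative lift, 1,000 visitors
    # per arm per day.
    print(days_to_power(0.10, 0.10, 1000, alpha=0.05, power=0.80))
```

The required sample size grows with the inverse square of the effect size, which is why small expected changes on low-traffic pages translate into very long run times.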

Second, the confidence interval (CI) is the range of values within which we are confident the true value of a particular metric falls. In the context of A/B testing, for example, if we ran the experiment millions of times, the 90% CIs we computed would contain the true value of the effect size about 90% of the time. There are three things that we care most about in relation to the confidence interval of an effect in an experiment:

Whether the CI includes zero, because this maps exactly to the decision we would make with the p-value; if the 90% CI includes zero, then the p-value is greater than 0.1. Conversely, if it doesn’t include zero, then the p-value is less than 0.1;
The smaller the CI, the better estimate of the parameter we have;
The farther away from zero the CI is, the more confident we can be that there is a true difference.
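The correspondence between the CI and the p-value can be checked numerically. The sketch below assumes a normal approximation with the same unpooled standard error for both the interval and the test, which is what makes the mapping exact; the function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_ci_and_p(conv_control, n_control, conv_treatment, n_treatment, level=0.90):
    """Return ((ci_low, ci_high), p_value) for the difference in
    conversion rates (treatment minus control)."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    if se == 0:
        raise ValueError("standard error is zero; not enough variation in the data")
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return ci, p_value


if __name__ == "__main__":
    ci, p = diff_ci_and_p(1000, 10000, 1100, 10000)
    print(f"90% CI: ({ci[0]:.4f}, {ci[1]:.4f}), p-value: {p:.3f}")
```

In this example the 90% CI lies entirely above zero and the p-value is below 0.1; shrinking the observed difference until the interval spans zero pushes the p-value above 0.1 at the same point.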

Previously in our A/B testing tool UI, we displayed statistical data as shown in the table below on the left. The “observed” column indicates results for the control and there is a “% Change” column for each treatment variant. When hovering over a number in the “% Change” column, a popover table appears, showing the observed and actual effect size, confidence level, p-value, and the number of days until we could expect to have enough data to power the experiment based on our expected effect size.

Figure 5: User interface before changes.

However, always displaying numerical results in the “% Change” column could lead to stakeholders peeking at data and making an incorrect inference about the success of the experiment. Therefore, we added a row in the hover table to show the power of the test (assuming some fixed effect size), and made the following changes to our user interface:

Show a visualization of the C.I. and color the bar red when the C.I. is entirely negative to indicate a significant decrease, green when the C.I. is entirely positive to indicate a significant increase, and grey when the C.I. spans 0.
Display different messages in the “% Change” column and hover table to indicate different stages the experiment metric is currently in, depending on its power, p-value and calculated flexible p-value threshold. In the “% Change” column, possible messages include “Waiting on data”, “Not enough data”, “No change” and “+/- X %” (to show significant increase/decrease). In the hover table, possible headers include “metric is not powered”, “there is no detectable change”, “we’re confident we detected a change”, and “directional change is correct but magnitude might be inflated” when early stopping is reached but the metric is not powered yet.

Figure 6: User interface after changes.
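One way to picture how the messages depend on power, p-value and the flexible p-value threshold is the sketch below. The message strings come from the list above, but the branching, the function name, and the parameters are a guess made for illustration; only the pairing of the “magnitude might be inflated” header with early stopping on an unpowered metric is stated in this post.

```python
def describe_metric(has_data, p_value, threshold, power, target_power, pct_change):
    """Return (cell text, hover header) for one metric.

    threshold is the current value of the flexible p-value threshold.
    The decision order below is an assumption, not the tool's logic.
    """
    if not has_data:
        return "Waiting on data", None
    powered = power >= target_power
    significant = p_value < threshold
    if significant:
        cell = f"{pct_change:+.1f} %"
        if powered:
            return cell, "we're confident we detected a change"
        # Early stopping reached before the metric is powered.
        return cell, "directional change is correct but magnitude might be inflated"
    if powered:
        return "No change", "there is no detectable change"
    return "Not enough data", "metric is not powered"


if __name__ == "__main__":
    print(describe_metric(True, 0.001, 0.01, 0.4, 0.8, 2.34))
```

Showing a numeric change only once the p-value clears the threshold means that a stakeholder who checks the page early sees a status message rather than a number that invites a premature conclusion.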

Even after making these UI changes, deciding when to stop an experiment and whether or not to launch it is not always simple. In general, some things we advise our stakeholders to consider are:

Do we have statistically significant results that support our hypothesis?
Do we have statistically significant results that are positive but aren’t what we anticipated?
If we don’t have enough data yet, can we just keep it running or is it blocking other experiments?
Is there anything broken in the product experience that we want to correct, even if the metrics don’t show anything negative?
If we have enough information on the main metrics overall, do we have enough information to iterate? For example, if we want to look at impact on a particular segment, which could be 50% of the traffic, then we’ll need to run the experiment twice as long as we had to in order to look at the overall impact.

We hope that these UI changes will help our stakeholders make better informed decisions while still letting them uncover cases where they have changed something more dramatically than expected and thus can stop the experiment sooner.

Further Discussion

In this section, we discuss a few more issues we examined while designing Etsy’s solutions to peeking.

Trade-off Between Power and Significance

There is a trade-off between Type I (false positive) and Type II (false negative) errors: for a fixed sample size, if we decrease the probability of one of the errors, the probability of the other will increase. For a more detailed explanation, please see this short post. This translates into a trade-off between the p-value threshold and power, because if we require stronger evidence to reject the null hypothesis (i.e., a smaller p-value threshold), then there is a smaller chance that we will be able to correctly reject a false null hypothesis, i.e., decreased power. The different messages we display on the user interface balance this issue to some degree. In the end, it is just a choice that we have to make based on our priorities and focus in experimentation.
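The trade-off is easy to see numerically. The sketch below assumes a two-sided z-test on two proportions with a fixed, evenly split sample, and uses the usual normal approximation for power; the function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def power_two_proportions(p_control, p_treatment, n_per_arm, alpha):
    """Approximate power of a two-sided z-test at significance level alpha.

    The small chance of rejecting in the wrong direction is ignored.
    """
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treatment * (1 - p_treatment) / n_per_arm
    )
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_treatment - p_control) / se - z_alpha)


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        power = power_two_proportions(0.10, 0.11, 10000, alpha)
        print(f"alpha = {alpha:.2f} -> power = {power:.2f}")
```

With the sample size and true effect held fixed, each tightening of the significance level lowers the power, so fewer false positives are bought with more false negatives.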

Weekend vs. Weekday Data Sample Size

At Etsy, the volume of traffic and intent of visitors varies from weekdays to weekends. This is not a concern for the sequential testing approach that we ultimately chose. However, it would be an issue for some other methods that require equal daily data sample size. During our research, we looked into ways to handle the inconsistency in our daily data sample size. We found that the GroupSeq package in R, which enables the construction of group sequential designs and has various alpha spending functions available to choose among, is a good way to account for this.
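To illustrate what an alpha spending function does with uneven daily sample sizes, here is a minimal sketch. It does not call GroupSeq (which is an R package); it only evaluates, in Python, two commonly used Lan-DeMets spending functions at the information fraction reached after each day. The daily visitor counts and function names are made up for the example.

```python
import math
from statistics import NormalDist


def obrien_fleming_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: very little alpha is spent early."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Pocock-type spending: alpha is spent more evenly over time."""
    return alpha * math.log(1 + (math.e - 1) * t)


def spending_schedule(daily_visitors, spend, alpha=0.05):
    """Cumulative alpha allowed after each day, where the information
    fraction t is the share of the planned sample seen so far."""
    total = sum(daily_visitors)
    seen = 0
    schedule = []
    for visitors in daily_visitors:
        seen += visitors
        schedule.append(spend(seen / total, alpha))
    return schedule


if __name__ == "__main__":
    # Hypothetical week: five weekdays followed by a busier weekend.
    week = [900, 950, 1000, 980, 1020, 1500, 1600]
    for name, spend in (("OBF", obrien_fleming_spend), ("Pocock", pocock_spend)):
        print(name, [round(a, 4) for a in spending_schedule(week, spend)])
```

Because the allowed alpha is tied to the fraction of data collected rather than to the calendar, days with more traffic simply advance further along the spending curve.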

Other Types of Designs

The sequential sampling method that we have designed is a straightforward form of a stopping rule modified to best suit our needs and circumstances. However, there are other types of sequential approaches that are more formally defined, such as the Sequential Probability Ratio Test (SPRT), which is utilized by Optimizely’s New Stats Engine [4], and the Sequential Generalized Likelihood Ratio test, which has been used in clinical trials [11]. There has also been debate in both academia and industry about the effectiveness of Bayesian A/B testing in solving the peeking problem [2, 5]. It is indeed a very interesting problem!

Final Thoughts

Accurate interpretation of statistical data is crucial in making informed decisions about product development. When online experiments have to be run efficiently to save time and cost, we inevitably run into dilemmas unique to our context, and peeking is just one of them. In researching and designing solutions to this problem, we examined some more rigorous theoretical work. However, the characteristics and priorities of online experimentation make it difficult to apply. Our approach outlined in this post, even though simple, addresses the root cause of the peeking problem effectively. Looking forward, we think the balance between statistical rigor and practical constraints is what makes online experimentation intriguing and fun to work on, and we at Etsy are very excited about tackling more interesting problems awaiting us.

This work is a collaboration between Callie McRee and Kelly Shen from the Analytics and Analytics Engineering teams. We would like to thank Gerald van den Berg, Emily Robinson, Evan D’Agostini, Anastasia Erbe, Mossab Alsadig, Lushi Li, Allison McKnight, Alexandra Pappas, David Schott and Robert Xu for helpful discussions and feedback.

References
