Approaches for anonymous audiences

A/B tests

A standard approach for evaluating competing variants in an online setting is A/B testing. As the name implies, the technique is an experiment that compares the performance of two options, “A” and “B.” These could be ad banners or webpage formatting styles, for example. Option “A” generally represents what is currently in use and acts as a control against which option “B” is compared, though this need not be the case. In practice, any number of alternative options can be tested at the same time.

During an A/B test, each variant is presented to an equal share of viewers to explore its performance. After the test concludes, the best-performing option is identified and used exclusively, exploiting the knowledge gained from the test.

One drawback of A/B testing is that it incurs “regret”: during the test, some viewers are shown inferior options when they might have had a better interaction with the best one. For example, imagine an A/B test that compares several ad banners and finds that one banner leads to more conversions than the others. In this context, regret refers to the conversions lost from viewers who were shown an inferior banner and did not click, but who would have clicked on the best banner had it been presented to them.
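To make that cost concrete, here is a minimal sketch that computes the expected regret of an even split. The conversion rates and viewer counts are hypothetical numbers invented for illustration; in a real test the rates are unknown and are exactly what the experiment estimates.

```python
# Hypothetical conversion rates for three banners (illustrative only).
true_rates = {"A": 0.04, "B": 0.06, "C": 0.05}
best_rate = max(true_rates.values())

test_viewers = 9000                           # viewers during the test
per_banner = test_viewers // len(true_rates)  # even split: 3000 each

# Every impression of a suboptimal banner forfeits
# (best_rate - banner_rate) expected conversions.
expected_regret = sum((best_rate - rate) * per_banner
                      for rate in true_rates.values())
print(round(expected_regret, 1))  # 60 lost from "A" + 30 from "C" = 90.0
```

Under these assumed rates, the even split is expected to cost about 90 conversions over the course of the test.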

During an A/B test (left of the dashed line), equal percentages of viewers are presented with each of the options (1–4). After the test (right of the dashed line), the best option (1) is used exclusively. Since suboptimal variants (2–4) were shown during the test, the exploration phase necessarily incurred regret.

With regular use of A/B testing, regret quietly erodes a company’s bottom line over time, through things like lost conversions or less-than-ideal user experiences. For any company that regularly runs tests to find an optimal variant, minimizing that regret could provide a significant advantage.

Bandit algorithms

Bandit algorithms can reduce the regret incurred by A/B tests because they continuously balance exploration with exploitation. After every new sample, what was learned is used to make a better choice the next time around. Over time, the better-performing options are used more often than the underperformers, and eventually the best option wins out.

A bandit algorithm begins with equal percentages of viewers being presented with each of the options (1–4). As it learns from each experience, the algorithm begins showing the best option (1) more frequently, and the lesser options (2–4) less frequently, leading to less overall regret.

An additional advantage of bandit algorithms is that they can adjust to new options, because there is no “testing phase.” If a new choice is added to the set of available variants, the bandit will begin to explore that option and exploit it if it performs better than the existing variants.

How the balance of exploration and exploitation is achieved depends on the particular bandit algorithm. One of the simplest bandit algorithms is the ε-Greedy algorithm, which uses a parameter (“ε”) to control the percentage of time that a random option is used — corresponding to exploration. The remainder of the time, the option that has historically performed the best is used — corresponding to exploitation.
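A minimal sketch of ε-Greedy selection follows. The function names and the reward bookkeeping are illustrative, not from any particular library; rewards here are assumed to be numeric scores such as 1 for a click and 0 otherwise.

```python
import random

def epsilon_greedy(counts, rewards, epsilon=0.1):
    """Return the index of the option to show next.

    With probability epsilon, explore: pick a random option.
    Otherwise, exploit: pick the option with the best average reward.
    """
    if random.random() < epsilon:
        return random.randrange(len(counts))
    averages = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(averages)), key=averages.__getitem__)

def record(counts, rewards, choice, reward):
    """Update running totals after observing a viewer's response."""
    counts[choice] += 1
    rewards[choice] += reward

# Example: option 1 has the best average reward (3.0 / 10), so with
# epsilon=0 (pure exploitation) it is always the one chosen.
counts, rewards = [10, 10, 10], [1.0, 3.0, 2.0]
print(epsilon_greedy(counts, rewards, epsilon=0.0))  # 1
```

A typical ε of 0.1 means roughly 10% of viewers see a randomly chosen option, keeping estimates of every option fresh while the other 90% see the current best.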

Another bandit is the Thompson Sampling algorithm, a probability matching algorithm: it tries to match the probability that a particular option is selected to the probability that that option is the best one. To accomplish this, each option is treated as having an unknown intrinsic probability of producing a positive user interaction, and the algorithm maintains a probability distribution over that unknown rate. To make a selection for a viewer, a value is drawn from each option’s distribution, and the option with the highest draw is shown. After observing the response, the estimate of that option’s distribution is updated for the next selection.
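For binary outcomes such as clicks, a common choice is a Beta distribution over each option’s unknown click rate. A minimal sketch using only Python’s standard library (the function names are illustrative):

```python
import random

def thompson_select(successes, failures):
    """Draw one sample from each option's Beta posterior and
    return the index of the option with the highest draw."""
    draws = [random.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def thompson_update(successes, failures, choice, converted):
    """Update the chosen option's distribution with the observed outcome."""
    if converted:
        successes[choice] += 1
    else:
        failures[choice] += 1

# With strong evidence that option 1 converts far better, its posterior
# draws dominate and it is selected almost every time.
successes, failures = [5, 900], [900, 100]
print(thompson_select(successes, failures))
```

Note how uncertainty drives exploration: an option with few observations has a wide distribution, so it occasionally produces the highest draw and gets shown, while a well-measured poor performer almost never does.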

There are other bandit algorithms as well, such as the Softmax algorithm, which can be used to minimize negative interactions, at the cost of using the best option less often. Choosing which bandit algorithm to use depends on what is being tested and the priorities of the tester.
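A minimal sketch of Softmax selection, where each option is chosen with probability proportional to exp(average_reward / temperature). The temperature parameter and function name here are illustrative; lower temperatures concentrate choices on the best performers, higher temperatures spread them out.

```python
import math
import random

def softmax_select(counts, rewards, temperature=0.1):
    """Choose an option with probability proportional to
    exp(average_reward / temperature)."""
    averages = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    top = max(averages)  # subtract the max for numerical stability
    weights = [math.exp((a - top) / temperature) for a in averages]
    return random.choices(range(len(weights)), weights=weights)[0]

# The worst option (average 0.1) gets a vanishingly small weight, so
# viewers are almost never shown it, limiting negative interactions.
random.seed(0)
picks = [softmax_select([10, 10, 10], [1.0, 5.0, 3.0], temperature=0.05)
         for _ in range(200)]
print(picks.count(1))  # the best option dominates
```

Unlike ε-Greedy, whose exploration is uniformly random, Softmax weights its exploration by observed performance, which is why it can keep the worst options away from viewers at the cost of showing the single best option somewhat less often.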