This content originally appeared on Medium.
What is Stats Accelerator and why use it?
When a customer sets up an A/B/n test using Optimizely, they choose what percentage of visitors (or samples) will be exposed to each of the test’s variations. This is the test’s traffic allocation. The traffic allocation stays the same, unless a customer manually changes it — for example, to divert traffic to a variation that seems to be performing better than the others.
For several years, we only allowed customers to manually adjust traffic. Over time, we heard from several of them that our setup was inconvenient and unscalable because their goals required frequently adjusting traffic. Their business needs could be classified into two scenarios:
- Identifying a statistically significant variation quickly: In this scenario, a customer running an A/B/n test wants to know if a variation can lead to actionable insight. That is, is a variation different from the baseline with respect to a given metric? We call this difference the lift or improvement. If different, find the one furthest from the baseline as quickly as possible — while controlling false discovery rate (FDR) under a predetermined threshold — so this can inform them on how to optimize their business.
- Optimizing reward for a period of time: In this scenario, customers want to maximize impact, such as revenue. For example, a customer with a Black Friday promotion may wish to experiment on different headlines on their landing page to maximize profit. They’re not concerned with implementing any permanent changes; they only want to direct visitors to the variation that yields the highest conversion rates right now, and will remove the variations from their website after the promotion ends.
To help customers achieve these two goals, we launched Stats Accelerator with two modes: Accelerate Learnings to address the first scenario, and Accelerate Impact to address the second. Both modes dynamically change a test’s traffic allocation, but their strategies and intent are vastly different.
We’ve partnered with Kevin Jamieson, from the Computer Science & Engineering Department at the University of Washington, who has developed an iterative algorithm that discovers significant variations as quickly as possible by directing more visitors to the variations that have a better chance of reaching statistical significance. This prioritizes finding the variation whose effect size — that is, the observed difference between it and the baseline — is greatest while keeping FDR below a given threshold. We’ve implemented his algorithm and call it Accelerate Learnings.
For simplicity, let’s assume that all variations have a true mean that is greater than the baseline for the rest of this section. This implies that all variations will eventually be found to have statistically significant improvements. We’re interested in first finding the variation that is furthest from the baseline in as few samples as possible. Loosely speaking, this algorithm allocates more traffic to the variation with the highest confidence bound on each iteration. The highest confidence bound is the upper endpoint of the confidence interval that is furthest from 0. At the next iteration, it observes the resulting variations’ confidence intervals and empirical means and makes decisions based off of that. Once a variation reaches statistical significance, it is taken out of consideration and the process repeats, where the algorithm focuses on the remaining inconclusive variations to determine the next best one.
Variations whose true means that are close to the baseline’s true mean require more samples to detect this effect size, whereas those furthest away require fewer. To identify the variations whose true means are furthest from that of the baseline, the algorithm uses the confidence interval by interpreting the upper confidence bound as the highest possible value it’ll observe of that variation. It picks the one with the highest upper confidence bound and guesses that this will be the best performing one. If it’s right, the allocated samples will continue to reinforce the empirical mean it observed and the variation will reach statistical significance faster. If it’s wrong, the updated confidence interval and empirical mean would reveal this, and the algorithm refocuses on other variations. It does so according to the same heuristic used when starting the round that just completed: it assumes that the variation whose upper confidence bound is greatest shows the most potential to reach statistical significance first.
This adaptive algorithm reduces the time to find the first best statistically significant variation compared to one using uniform allocation. Check out the Experiments Section of the paper, where the results indicate that this speedup can be at least two to three times faster.
As an example, suppose a customer runs Accelerate Learnings and is looking for the best variation. At time t = t0, the results are:
Figure 1. An illustration of what Accelerate Learnings observes before updating the test’s allocation at time t0. The dashed line indicates where improvement is 0. For each variation, the circle is the observed improvement and the line represents its confidence interval. Anything above the dotted line means the improvement is positive.
Accelerate Learnings allocates more traffic to Var1 compared to Var2 and Var3 because it looks like it can reach statistical significance with fewer samples than the others. This is suggested by its confidence interval, whose upper bound is the furthest from 0.
And at time t = t0 + 1 — the next time we run the algorithm — we observe the results:
Figure 2. What is observed after we run Accelerate Learnings for one iteration.
Var1 reaches statistical significance, so we no longer consider it; we allocate new visitors to the remaining inconclusive variations to identify the next best statistically significant variation as soon as possible.
To optimize for a Bernoulli metric, we use Thompson sampling (refer to the Glossary below for explanations of these terms). For each variation, we use the test’s current observed number of unique conversions and number of visitors to characterize a beta distribution. We sample these distributions a number of times and allocate traffic according to the their win ratio.
For example, suppose we have the following results on comparing unique conversions when we are updating the traffic allocation:
Figure 3. The observed results of the experiment under a Bernoulli metric.
We characterize beta distributions, denoted as Beta(𝛼, 𝛽), based on these values, where 𝛼 is the number of unique conversions and 𝛽 is the number of visitors for each variation. For example, the Original would be characterized by Beta(189, 323).
Figure 4. The resulting Beta distributions characterized by the results in Figure 3.
We sample these distributions N times — say 10,000 — recording the distribution that yields the highest value in each round. This simulates what might happen if we had 10,000 visitors allocated to the Original, 10,000 to Variation #1 and 10,000 to Variation #2. The ratio of times each variation wins determines the percentage of new visitors allocated to it:
Figure 5. Result of Thompson Sampling where N=10000.
On the next iteration, we run Thompson Sampling again with the updated observations.
The above algorithm works for binary metrics, but not for numeric metrics. For numeric metrics, such as revenue or number of events per user, we implemented an Epsilon-Greedy bandit (see the Glossary below for more details). This bandit distributes a small (epsilon) fraction of the traffic uniformly amongst all variations for exploration, and then allocates the larger (1- epsilon) fraction to the variation with the largest observed mean for exploitation.
The reason why there needs to be a portion of traffic designated for exploratory purposes is because the empirical means may not be close to their true means. Both Epsilon-Greedy and Thompson Sampling in Accelerate Impact are common strategies that address the exploitation-exploration dilemma (see Glossary for definitions of exploitation and exploration). Thompson Sampling favors early exploration due to uncertainties at the early stages of a test whereas Epsilon-Greedy focuses more on exploitation, funneling most traffic to the variation with the highest empirical mean. If a customer only makes use of the information they’ve observed, then they would direct all new visitors to the variation that has the highest empirical mean. But how do they know if that variation indeed has the highest true mean? They would need more samples to reduce this uncertainty. This is where exploration becomes essential because that reveals more evidence about the underlying means. Hence, a balance between the two is necessary.
Accelerate Learnings also has a form of exploration and exploitation as well. In terms of exploitation, it uses the upper confidence bounds of each variation, allocating more traffic to the highest bound. And if its decision was suboptimal, it will know from the updated results of the test. As a result, it will explore other variations.
When should I use Accelerate Impact vs Accelerate Learnings?
Accelerate Learnings provides customers with actionable insight, where the best significant variation can be implemented quickly in their business. It focuses on the statistical uncertainties of the variations, using the confidence intervals to update the test’s allocation in order to reduce time to finding the best significant variation.
Accelerate Impact should be used for promotions, sales, or anything that is temporary, where the intent is to drive more traffic to a variation that has the highest metric value. It only focuses on the variations’ empirical mean with the intent of maximizing some payoff by funneling more traffic to the winning variation.
From customers’ desire to automate traffic allocation and their different business needs comes two distinct algorithms, each focusing on a different goal.
Coming soon: a blog post on how we tackled Simpson’s Paradox
In our next blog post, we’ll discuss how these fare in the wild. Specifically, we’ll examine how time-variation — such as the seasonality of visitor behavior — combines with traffic allocation changes to cause Simpson’s Paradox. More on that effect, and how we tackled it, in an upcoming post. Stay tuned.
- Arm: This is bandit lingo. An arm is synonymous with a variation.
- Bernoulli metric: A measurable action that is either a 0 or 1, such as whether a visitor converted or not.
- Effect Size: In an A/B/n test, this is the observed difference between the baseline and variation.
- Epsilon-Greedy Bandit: Check out “Epsilon-Greedy strategy” under “Bandit Strategies” here.
- Exploitation: Bandit lingo. Making a decision given the current information.
- Exploration: Bandit lingo. Gathering more information.
- Exploration-exploitation problem: Bandit lingo. Do we direct all traffic to a variation that we’ve observed has the highest empirical mean (exploitation)? If this isn’t the best variation, then how do we direct some traffic to the other variations to find the one with the highest true mean (exploration)? What is the optimal trade-off between these two states?
- False Discovery Rate: This is the average number of incorrect rejections of the null hypothesis over all discoveries made in a set of A/B/n tests. Check out this wonderful article that talks about why we use this within Stats Engine.
- Multi-armed bandit: The type of problem where the strategy is to optimize a specific function by deciding how to allocate resources.
- Numeric metric: A measurable event that isn’t a 0 or 1 (read: not a Bernoulli metric). This includes metrics such as revenue, the number of clicks per visitor, and so on.
- Sample: An observation. If we’re measuring unique conversions per visitor, a sample would be a visitor.
- Statistical significance: The likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance. Check out this link to see how confidence intervals and statistical significance are related.
- Thompson Sampling: A heuristic to choose how to allocate traffic while addressing the exploitation-exploration problem in a multi-armed bandit scenario. Check out this link for an in-depth introduction to this topic.
- Traffic allocation: The distribution of new visitors to a test’s variations.
- Upper confidence bound: Bandit lingo. Also known as UCB. The upper value of the confidence interval.