Optimizely Blog

Tips & Tricks for Building Your Experimentation Program

This is part two of a series on Stats Accelerator. In the first part, we explained the when, why, and how of Stats Accelerator. In today’s installment we will discuss a major roadblock to successful productionization of bandits in the A/B testing context and how we eventually overcame it. This is a high-level overview. For a more detailed treatment, see the technical white paper.

Last fall, Optimizely announced Stats Accelerator, a pair of multi-armed bandit algorithms for accelerating experimentation. These algorithms are designed to either optimize rewards for a period of time or identify a statistically significant variant as quickly as possible by intelligently changing the allocation of traffic between variations (or arms, in machine learning lingo) of the experiment.

However, when underlying conversion rates or means are varying over time (e.g. due to seasonality), dynamic traffic allocation can cause substantial bias in estimates of the difference between the treatment and the control, a phenomenon known as Simpson’s Paradox. This bias can completely invalidate statistical results on experiments using Stats Accelerator, breaking usual guarantees on false discovery rate control.

To prevent this, we developed Epoch Stats Engine, a simple modification to our existing Stats Engine methodology, which makes it robust to Simpson’s Paradox. At its core is a stratified estimate of the difference in means between the control and the treatment. Because it requires no estimation of the underlying time variation and is also compatible with other central limit theorem-based approaches such as the t-test, we believe it lifts a substantial roadblock to combining traditional hypothesis testing and bandit approaches to A/B testing in practice.

What is time variation?

A fundamental assumption behind many A/B tests is that the underlying parameter values we are interested in do not change over time. When this assumption is violated, we say there is time variation. In the context of A/B experiments using Stats Accelerator, Simpson’s Paradox can only occur in the presence of time variation, so a precise understanding of time variation will be useful going forward.

Let’s take a step back. In each experiment we imagine that there are underlying quantities that determine the performance of each variation; we are interested in measuring these quantities, but they cannot be observed. These are parameters. For example, in a conversion rate web experiment we imagine that each variation of a page has some true ability to induce a conversion for each visitor. If we express this ability in terms of the probability that each visitor will convert, then these true conversion probabilities for each variation would be the parameters of interest. Since parameters are unobserved, we must compute point estimates from the data to infer parameter values and decide whether the control or treatment is better. In the conversion rate example, the observed conversion rates would be point estimates for the true conversion probabilities of each variation.

Noise and randomness will cause point estimates to fluctuate. In the basic A/B scenario, we view these fluctuations as centered around a constant parameter value. However, when the parameter values themselves fluctuate over time, reflecting underlying changes in the true performance of the control and treatment, we say that there is time variation.

Figure 1a: The traditional, no time variation view of the parameter values behind an A/B test.

Figure 1b: An example of time variation impacting underlying parameter values in an A/B test.

The classic example is seasonality. For example, it is often reasonable to suspect that visitors to a website during the workweek behave differently from visitors on the weekend. Conversion rate experiments may therefore see higher (or lower) observed conversion rates on weekdays than on weekends, reflecting time variation in the true conversion rates that stems from real differences in how weekday and weekend visitors behave.

Time variation can also manifest as a one-time effect. A landing page with a banner for a 20%-off sale valid for the month of December may generate a higher-than-usual number of purchases per visitor for December, but then drop off after January arrives. This would be time variation in the parameter representing the average number of purchases per visitor for that page.

Time variation can take different forms and affect your results in different ways. Whether time variation is cyclic (seasonality) or transient (a one-time effect) suggests different ways to interpret your results. Another key distinction is how the time variation affects different arms of the experiment. Symmetric time variation occurs when parameters vary across time in such a way that all arms of the experiment are affected equally (in a sense to be defined shortly). Asymmetric time variation covers a broad swath of scenarios where this is not the case. Optimizely’s Stats Engine currently has a feature to detect strong asymmetric time variation and will reset your statistical results accordingly to avoid misleading conclusions, but handling asymmetric time variation in general requires strong assumptions and/or a wholly different type of analysis. This remains an open area of research.

In what follows, we will restrict ourselves to the symmetric case with an additive effect for simplicity. Specifically, we imagine the parameters for the control and treatment θC(t) and θT(t) may be written as θC(t) = μC + f(t) and θT(t) = μT + f(t), so that each can be decomposed into non-time-varying components μC and μT and the common source of the time variation f(t). The underlying lift may therefore be written as the non-time-dependent quantity μT – μC = θT(t) – θC(t).

Generally, symmetric time variation of this sort occurs when the source of the time variation is not associated with a specific variation but rather with the entire population of visitors. For example, visitors during the winter holiday season are more inclined to purchase in general. Therefore, a variation that induces more purchases will tend to maintain its difference over the control even with a higher overall holiday-influenced conversion rate for both the variation and the control.

In general, most A/B testing procedures such as the t-test and sequential testing are robust to this type of time variation. This is because we are often less interested in estimating the individual parameters of the control and treatment, and more interested in the difference between the parameters. Therefore, if both parameters are impacted in an additive manner by the same amount, then such time variation will be cancelled out once differences are taken and any subsequent inference will be relatively unaffected. Using the notation above, this can be seen in the fact that the difference in the time-varying parameters θT(t) – θC(t) = μT – μC does not contain the time-varying factor f(t).
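To make the cancellation concrete, here is a minimal Python sketch of the additive decomposition. The values of μC and μT and the weekly sinusoidal shape of f(t) are hypothetical choices for illustration only; they do not come from any real experiment.

```python
import math

MU_C = 0.10  # hypothetical non-time-varying component of the control
MU_T = 0.12  # hypothetical non-time-varying component of the treatment


def f(t: float) -> float:
    """Hypothetical shared time variation: a weekly cycle of amplitude 0.03."""
    return 0.03 * math.sin(2 * math.pi * t / 7)


def theta_c(t: float) -> float:
    """Time-varying control parameter: mu_C + f(t)."""
    return MU_C + f(t)


def theta_t(t: float) -> float:
    """Time-varying treatment parameter: mu_T + f(t)."""
    return MU_T + f(t)


# Both parameters move from day to day, but their difference does not.
lifts = [theta_t(day) - theta_c(day) for day in range(14)]
```

Every entry of `lifts` equals μT – μC (here 0.02, up to floating-point rounding), which is why procedures that only look at the difference are unaffected by f(t) as long as traffic allocation stays fixed.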

As it turns out though, the innocuous-seeming case of symmetric time variation can become a completely different beast when dynamic traffic allocation is introduced to the equation.

Simpson’s paradox

If the traffic split in an experiment is adjusted in sync with underlying time variation, then a disproportionate amount of high- or low-performing traffic may be allocated to one arm relative to the other, biasing our view of the true difference between the two arms. This is a form of Simpson’s paradox, the phenomenon of a trend appearing in several different groups of data which then disappears or reverses when the groups are aggregated together. This bias can completely invalidate experiments on any platform by tricking the stats methodology into declaring a larger or smaller effect size than what actually exists. Let’s motivate this with an example.

Consider a two-month conversion rate experiment with one control and one treatment. In the month of November, the true conversion rates for the control and treatment are at 10% and 20%, respectively. For the month of December, they rise to 20% and 30%. In each month, the difference in conversion rates is 10 percentage points (pp).

If traffic is split 50% to treatment and 50% to control (or any other fixed proportion, for that matter) for the entire duration of the experiment, then it is clear that the final estimate of the difference between the two variations should be close to 10 pp. What happens if traffic is split 50/50 in November but then changes to 75% to control and 25% to treatment in December? For simplicity, let’s assume that there are 1000 total visitors to the experiment in each month. A simple calculation shows that:

Control:

Total visitors: 500 + 750 = 1250

Percent from high-converting regime: 750 / 1250 = 60%

Treatment:

Total visitors: 500 + 250 = 750

Percent from high-converting regime: 250 / 750 = 33%

So both control and treatment have equal numbers of visitors from low-converting November, but the treatment has far fewer visitors than the control from high-converting December. This imbalance clues us in that there will be bias, and doing the math confirms this: the pooled conversion rate for the control will be around 16% (200 expected conversions out of 1250 visitors), and the pooled conversion rate for the treatment will be around 23% (175 out of 750), a difference of only about 7 pp rather than the 10 pp that we would normally expect.
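The arithmetic behind these numbers can be checked with a short Python sketch. It works with expected conversions (rate times visitors) rather than simulated data, and the dictionary and function names are illustrative only.

```python
# True conversion rates from the example: (control, treatment) per month.
RATES = {"November": (0.10, 0.20), "December": (0.20, 0.30)}
# Visitors per arm: 50/50 in November, 75/25 control/treatment in December.
VISITORS = {"November": (500, 500), "December": (750, 250)}


def pooled_rate(arm: int) -> float:
    """Expected conversion rate for one arm when both months are pooled."""
    conversions = sum(RATES[m][arm] * VISITORS[m][arm] for m in RATES)
    visitors = sum(VISITORS[m][arm] for m in VISITORS)
    return conversions / visitors


control = pooled_rate(0)                 # 200 / 1250 = 0.16
treatment = pooled_rate(1)               # 175 / 750, roughly 0.233
naive_difference = treatment - control   # roughly 0.073 instead of 0.10
```

Changing the December entry of `VISITORS` back to `(500, 500)` restores a pooled difference of 0.10, which shows that the bias comes from the allocation change and not from the time variation alone.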

This phenomenon is also laid clear in a continuous view, such as an Optimizely customer would witness:

Figure 2a: Movement of observed conversion rates under time variation with constant traffic allocation

Figure 2b: Movement of observed conversion rates under time variation with changing traffic allocation. Simulation begins at 50% allocation to treatment initially, switching to 90% allocation to treatment after 2,500 visitors.

In this example, the diminished point estimate might cause a statistical method to fail to report significance when it otherwise would with an unbiased point estimate closer to 10 pp. But other adverse effects are also possible. When there is no true effect (e.g. we ran an A/A test), Simpson’s Paradox can cause the illusion of a significant positive or negative effect, leading to inflated false positives. Or, when the time variation is especially strong or the traffic allocation syncs up well with the time variation, this bias can be so drastic as to reverse the sign of the estimated difference between the control and treatment parameters (as seen in Figure 2b), completely misleading experimenters as to the true effect of their proposed change.
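Both failure modes can be reproduced with a few lines of Python. The sketch below again uses expected conversions instead of random draws, and the two scenarios (rates, sample sizes, and allocation shifts) are made-up numbers chosen only to illustrate the effect.

```python
def pooled_difference(epochs):
    """Expected naive (pooled) treatment-minus-control difference.

    Each epoch is a tuple:
    (control_rate, treatment_rate, n_control, n_treatment).
    """
    conv_c = sum(rc * nc for rc, rt, nc, nt in epochs)
    conv_t = sum(rt * nt for rc, rt, nc, nt in epochs)
    n_c = sum(nc for rc, rt, nc, nt in epochs)
    n_t = sum(nt for rc, rt, nc, nt in epochs)
    return conv_t / n_t - conv_c / n_c


# Hypothetical A/A test: no true effect, rates rise from 10% to 20%,
# and traffic shifts to 90% treatment in the high-converting period.
aa_test = [(0.10, 0.10, 500, 500), (0.20, 0.20, 100, 900)]

# Hypothetical true lift of +2 pp in both periods, with traffic shifting
# to 90% control in the high-converting period.
reversal = [(0.10, 0.12, 500, 500), (0.20, 0.22, 900, 100)]

aa_bias = pooled_difference(aa_test)        # positive, although the true lift is 0
reversed_sign = pooled_difference(reversal)  # negative, although the true lift is +0.02
```

In the first scenario the pooled estimate shows a lift of several percentage points where none exists; in the second, a genuinely better treatment looks worse than the control.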

Epoch Stats Engine

Since Simpson’s Paradox manifests as a bias in the point estimate of the difference in means of the control and treatment, mitigating Simpson’s Paradox requires coming up with a way to remove such bias. In turn, removing bias requires accounting for one of the two factors causing it: time variation or dynamic traffic allocation. Since time variation is unknown and therefore must be estimated, but traffic allocation is directly controlled by the customer or Optimizely and therefore known, we opted to focus on the latter.

Our solution for Simpson’s Paradox follows from the observation that bias due to Simpson’s Paradox cannot occur over periods of constant traffic allocation. Therefore, we may derive an unbiased point estimate by making estimates within periods of constant allocation (called epochs) and then aggregating those individual within-epoch estimates to obtain one unbiased across-epoch estimate. If we can show that this quantity is compatible with the sequential testing methodology underneath Stats Engine’s hood, then we have a plug-and-play solution that works seamlessly with experiments using Stats Accelerator. As we will see, this estimator is also simple enough to be applied to other statistical tests such as the t-test.

Let’s get into the math a bit. Suppose there are K(n) total epochs by the time the experiment has seen n visitors. Within each epoch k, denote by nk,C and nk,T the sample sizes of the control and treatment respectively, and by X̄k,C and X̄k,T the sample means of the control and treatment respectively. Letting nk = nk,C + nk,T, the epoch estimator for the difference in means is

$$T_n = \sum_{k=1}^{K(n)} \frac{n_k}{n} \left( \bar{X}_{k,T} - \bar{X}_{k,C} \right)$$

Because the dependence across epochs induced by the data-dependent allocation rule is restricted to changes in the relative balance of traffic between the control and treatment, the within-epoch estimates are orthogonal and the variance for Tn is well-estimated by the sum of the estimated variances of each within-epoch component:

$$\widehat{\mathrm{Var}}(T_n) = \sum_{k=1}^{K(n)} \left( \frac{n_k}{n} \right)^2 \left( \frac{\hat{\sigma}_C^2}{n_{k,C}} + \frac{\hat{\sigma}_T^2}{n_{k,T}} \right)$$

where σ̂C and σ̂T are consistent estimates for the standard deviations of the data-generating processes for the control and treatment arms.

At a high level, Tn is a stratified estimate where the strata represent data belonging to individual epochs of fixed allocation. At a low level, this is a weighted estimate of within-epoch estimates where the weight assigned to each epoch is proportional to the total number of visitors within that epoch. This is also why we surface the epoch estimate as “Weighted improvement” on the results page for experiments using Stats Accelerator.
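A minimal Python sketch of this stratified estimate follows. The `Epoch` container and function names are hypothetical, not Optimizely's implementation. The variance helper uses the textbook variance of a weighted sum of independent within-epoch differences with arm-level standard deviations; treat that form as an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Epoch:
    n_c: int       # control visitors in this epoch
    n_t: int       # treatment visitors in this epoch
    mean_c: float  # control sample mean in this epoch
    mean_t: float  # treatment sample mean in this epoch


def epoch_estimate(epochs):
    """Weighted sum of within-epoch differences, weights n_k / n."""
    n = sum(e.n_c + e.n_t for e in epochs)
    return sum((e.n_c + e.n_t) / n * (e.mean_t - e.mean_c) for e in epochs)


def epoch_variance(epochs, sigma_c, sigma_t):
    """Sum over epochs of squared weight times the within-epoch variance."""
    n = sum(e.n_c + e.n_t for e in epochs)
    return sum(
        ((e.n_c + e.n_t) / n) ** 2
        * (sigma_c ** 2 / e.n_c + sigma_t ** 2 / e.n_t)
        for e in epochs
    )


# The November/December example, using the true rates as sample means.
epochs = [Epoch(500, 500, 0.10, 0.20), Epoch(750, 250, 0.20, 0.30)]
estimate = epoch_estimate(epochs)             # 0.10: the bias is removed
variance = epoch_variance(epochs, 0.4, 0.4)   # 0.4 is an illustrative sigma
```

Applied to the two-month example, each epoch contributes a within-epoch difference of 10 pp with weight 1000/2000, so the epoch estimate recovers 10 pp where the pooled estimate gave roughly 7 pp.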

Figure 3: Calculation of an epoch stratified estimate (computed at 15,000 visitors) of the difference in true conversion probabilities in an experiment with a traffic allocation change and one-time symmetric time variation both occurring at 10,000 visitors.

It’s worth repeating that the epoch estimate is guaranteed to be unbiased since each within-epoch estimate is unbiased. In addition, we provide rigorous guarantees that the epoch estimate is fully compatible with the sequential test employed at Optimizely and is generally valid for use in other central limit theorem-based methods (such as the t-test). See the full write-up for more details.
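As an illustration of the central limit theorem-based use case, the sketch below plugs an epoch estimate into an ordinary two-sided, fixed-horizon z-test. This is not the sequential test used by Stats Engine; the function name, the tuple layout, and the stratified variance form are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def epoch_z_test(epochs, sigma_c, sigma_t):
    """Two-sided fixed-horizon z-test built on the epoch estimate.

    Each epoch is (n_control, n_treatment, mean_control, mean_treatment).
    Returns (estimate, z statistic, p-value).
    """
    n = sum(nc + nt for nc, nt, mc, mt in epochs)
    estimate = sum((nc + nt) / n * (mt - mc) for nc, nt, mc, mt in epochs)
    variance = sum(
        ((nc + nt) / n) ** 2 * (sigma_c ** 2 / nc + sigma_t ** 2 / nt)
        for nc, nt, mc, mt in epochs
    )
    z = estimate / sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate, z, p_value


# Illustrative inputs: the two-month example with sigma = 0.4 for both arms.
estimate, z, p_value = epoch_z_test(
    [(500, 500, 0.10, 0.20), (750, 250, 0.20, 0.30)], 0.4, 0.4
)
```

The only change relative to a textbook z-test is that the numerator and the variance are computed per epoch and then combined, instead of being computed on pooled data.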

Performance on simulated data

We simulated data with time variation and ran four different Stats Engine configurations on that data:

  1. Standard Stats Engine
  2. Epoch Stats Engine
  3. Standard Stats Engine with Accelerate Learnings
  4. Epoch Stats Engine with Accelerate Learnings

Specifically, we generated 600,000 draws from 7 Bernoulli arms with one control and 6 variants, with 1 truly higher-converting arm and all others converting at the same rate as the control. The conversion rate for the control starts out at 0.10 and then undergoes cyclic time variation rising as high as 0.15. In each of these plots, we plot visitors on the horizontal axis and either false discovery rate (FDR) or true discovery rate (TDR) on the vertical axis, averaged over 1000 simulations.
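A data-generating process of this kind could be sketched in Python as follows. The cycle length, the size of the winning arm's lift, the index of the winning arm, and the cosine shape of the cycle are all assumptions, since the description above does not specify them.

```python
import math
import random

N_ARMS = 7          # one control (index 0) plus 6 variants
BASE_RATE = 0.10    # control conversion rate at the start
PEAK_RATE = 0.15    # highest control rate reached by the cycle
PERIOD = 100_000    # ASSUMPTION: visitors per cycle
WINNER = 1          # ASSUMPTION: index of the higher-converting arm
WINNER_LIFT = 0.02  # ASSUMPTION: additive lift of the winning arm


def control_rate(visitor: int) -> float:
    """Cyclic rate oscillating between BASE_RATE and PEAK_RATE."""
    amplitude = (PEAK_RATE - BASE_RATE) / 2
    return BASE_RATE + amplitude * (1 - math.cos(2 * math.pi * visitor / PERIOD))


def arm_rate(arm: int, visitor: int) -> float:
    """Symmetric, additive time variation: every arm shares the same cycle."""
    lift = WINNER_LIFT if arm == WINNER else 0.0
    return control_rate(visitor) + lift


def draw(arm: int, visitor: int, rng: random.Random) -> int:
    """One Bernoulli draw (1 = conversion) for the given arm and visitor."""
    return 1 if rng.random() < arm_rate(arm, visitor) else 0


rng = random.Random(0)  # seeded so a run is reproducible
sample = [draw(0, visitor, rng) for visitor in range(1000)]
```

A full replication would feed draws like these to each allocation policy and Stats Engine configuration and repeat the run many times before averaging the results.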

Figure 4: Average false discovery rate over time for the simulation scenario described above.

The FDR plot shows that Epoch Stats Engine does exactly what we designed it to do: protect customers from false discoveries due to Simpson’s Paradox. The non-epoch bandit policy’s FDR exceeds the configured FDR level (0.10) by up to 150%, while the epoch-enabled bandit policy shows proper control of FDR at levels comparable to those achieved by Stats Engine without the bandit policy enabled. This is the main goal: to bring FDR levels of bandit-enhanced A/B testing down to levels comparable to no bandit enabled at all.

Figure 5: Average true discovery rate over time for the simulation scenario described above.

The TDR plot shows that we do not lose much power by switching from standard to Epoch Stats Engine. First, we observe a large gap between the bandit allocation runs and the fixed allocation runs, reflecting the fact that the speedup due to bandit allocation is preserved under Epoch Stats Engine. Furthermore, we observe little difference in time to significance between the epoch and non-epoch scenarios under fixed allocation, while we observe a small gap in time to significance between the epoch and non-epoch scenarios under the bandit policy. This gap can be ascribed to the fact that the non-epoch Stats Engine running under dynamic allocation experiences high sensitivity to time variation, especially just after crossing an epoch boundary, thereby creating the scalloped shape of the blue curve. Its higher TDR is paid for with higher FDR.
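For readers who want to compute these two metrics on their own simulations, here is a small Python sketch. It assumes FDR is the average share of declared winners that are not truly different from the control, and TDR is the average share of truly different arms that were declared winners; the exact definitions behind the plots are given in the technical paper, so treat these as assumptions.

```python
def discovery_proportions(declared, truly_different):
    """False and true discovery proportions for a single simulation run."""
    declared, truly_different = set(declared), set(truly_different)
    false_found = declared - truly_different
    true_found = declared & truly_different
    # Convention: a run with no declared winners has false proportion 0.
    fdp = len(false_found) / len(declared) if declared else 0.0
    tdp = len(true_found) / len(truly_different) if truly_different else 0.0
    return fdp, tdp


def average_rates(runs, truly_different):
    """Average the per-run proportions over all simulation runs."""
    props = [discovery_proportions(run, truly_different) for run in runs]
    fdr = sum(p[0] for p in props) / len(props)
    tdr = sum(p[1] for p in props) / len(props)
    return fdr, tdr


# Four hypothetical runs; arm 1 is the only truly better arm.
runs = [{1}, {1, 2}, set(), {3}]
fdr, tdr = average_rates(runs, truly_different={1})
```

Evaluating these averages at each visitor count, rather than once at the end, gives curves of the kind shown in Figures 4 and 5.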

Conclusion

At Optimizely we pride ourselves on simultaneously pushing the envelope in A/B testing while always prioritizing statistical rigor to ensure customers make the most impactful decisions.

As we saw in the previous blog post, the latest iteration on this theme produced Stats Accelerator, a marriage of multi-armed bandit techniques with false discovery rate control. In Epoch Stats Engine, we’ve surmounted a major real-world obstacle to safe productionizing of this technology and have developed a solution which is

  1. proven to work well with sequential testing in both theory and simulation,
  2. simple to compute and understand, and
  3. widely applicable to other methods and platforms, not just sequential testing and Optimizely

Interested in a deeper dive? Check out the technical paper!

Did you find this interesting? Come join my team! We’re hiring a statistician in San Francisco.
