“Is it done yet?” Getting real about calling a test

As a Web Strategist and A/B testing consultant, I’ve worked with everything from Fortune 100 companies to freshly minted start-ups. Regardless of the length of their customer list or the volume of their website traffic, they all ask me the same question: “How long should I run this A/B test? How long will it take to get accurate results?” It’s a natural question, and one that becomes even more pressing if you’re trying to reach a particular quarterly goal for your testing program. And while I’m not a statistician, I’m not afraid to say that you’ll very rarely be able to say conclusively, in advance, how long a test will need to run.

Of course, test duration calculators do exist. But even the “best” calculators ask you to supply inputs that are difficult to know in advance. For example, “What’s your expected improvement percentage from this test?” That’s a difficult question to answer. Changes you expected to have only a small effect can end up having a very significant one. The reality is that user behavior is somewhat unpredictable.
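
To see why that one guess matters so much, here is a minimal sketch of the kind of arithmetic a duration calculator performs. It assumes the standard normal-approximation formula for comparing two conversion rates, a 5% two-sided significance level, and 80% power. The function name, the 3% baseline rate, and the lift values are illustrative assumptions, not taken from any particular tool.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative lift.

    Uses the usual normal approximation for comparing two proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical site converting at 3%: the guessed lift drives the answer.
for lift in (0.05, 0.10, 0.20, 0.50):
    needed = visitors_per_variation(0.03, lift)
    print(f"{lift:.0%} expected lift -> {needed:,} visitors per variation")
```

Because the required traffic grows roughly with the inverse square of the expected difference, halving your guess about the lift roughly quadruples the estimate. That is why a calculator’s output is only as good as the guess you feed it.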

But don’t fret. There are guidelines that can serve you better than any number cruncher could.

With that in mind, here are my guidelines for both estimating test length and answering the ever-important question, “Is this test done yet?”

Guidelines for Test Length Estimation and Calling a Test “Done”

100 Conversions: Make sure each variation has 100 conversions before calling a test “done.” Obviously, the more conversions the better. Some sites get many thousands of conversions per day and others maybe only a few. But even for sites with very low conversion numbers, you’re unlikely to be able to draw statistically meaningful results without at least 100 conversions per variation.* (*See “Bigger Variation Differences” below.)

  • Recommendation: Run the test as long as you can to maximize conversions. If you can’t get enough conversions for your primary event (e.g., a purchase), measure a secondary conversion metric (e.g., an add to cart) to get a sense of whether you’re moving in the right direction.

One Week or More: At a minimum, you should let a test run for a full work week and weekend. (For retail store websites that do a lot of external marketing, I usually recommend two weeks and two weekends. This helps negate the effects of any one marketing campaign that may have been running.) User browsing patterns can differ significantly on weekends, and your site may attract an entirely different type of user at different times of the week. You want to make sure you capture these potential differences by letting your test span the entire week.

  • Recommendation: Run each test for anywhere from one week to a maximum of a month (particularly for sites/experiments with low traffic volume). If you don’t see statistically significant results after a month, you are unlikely to see them even if you give the experiment more time. (I’ve seen exceptions but these are rare.)
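
The two rules of thumb so far (at least 100 conversions per variation, and whole weeks up to about a month) can be combined into a rough planning estimate. This is a minimal sketch, not a statistical guarantee: the function name is hypothetical, it assumes traffic is split evenly across variations, and it treats “a month” as 28 days.

```python
from math import ceil


def estimate_test_days(daily_visitors, conversion_rate, variations,
                       min_conversions=100, max_days=28):
    """Estimate run length in days, rounded up to whole weeks.

    Returns (days, within_cap). within_cap is False when the estimate
    exceeds max_days (assumed here to be four weeks).
    """
    # Assumes an even traffic split across all variations.
    visitors_per_variation_per_day = daily_visitors / variations
    conversions_per_day = visitors_per_variation_per_day * conversion_rate
    if conversions_per_day <= 0:
        raise ValueError("need a positive number of conversions per day")
    raw_days = ceil(min_conversions / conversions_per_day)
    weeks = max(1, ceil(raw_days / 7))  # never less than one full week
    days = weeks * 7
    return days, days <= max_days


# Made-up traffic figures for a two-variation test at a 2% conversion rate.
print(estimate_test_days(2000, 0.02, 2))
print(estimate_test_days(300, 0.02, 2))
```

When the second value comes back False, the estimate runs past the one-month mark, which is usually a cue to measure a secondary conversion metric or test a bolder change rather than simply wait.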

Bigger Variation Differences: The more extreme the differences in the variations, the more quickly you will be able to call your test. For example, a complete redesign is likely to affect users more than changing the text from light green to dark green. While the slight difference in text color may have an effect, you’re going to need a lot of time (or traffic) to show that conclusively.

  • Recommendation: Start with small, iterative changes that are easy to implement. If you don’t see statistically significant results even after running a test for a month, try combining these changes into a single variation and testing it against the original.
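
To make the point about variation size concrete, here is a minimal sketch of a two-proportion z-test, one common way to check whether a difference in conversion rates is statistically significant. Your testing tool may use a different method. The function name and the visitor and conversion counts below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Pooled rate under the assumption that both variations convert equally.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up counts, 5,000 visitors per variation in both cases.
subtle = two_proportion_p_value(150, 5000, 160, 5000)  # 3.0% vs 3.2%
bold = two_proportion_p_value(150, 5000, 210, 5000)    # 3.0% vs 4.2%
print(f"subtle change p-value: {subtle:.3f}")
print(f"bold change p-value:   {bold:.4f}")
```

With the same traffic, the bold change clears the conventional 0.05 threshold while the subtle one does not. The subtle change may still be real, but it would need far more visitors to show up conclusively.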

Regardless of which A/B testing tool you use, you’ll likely find that your previous testing experience, your understanding of your site’s audience, and the relative significance of the changes you’re testing will be the best guide you have when it comes to determining test length.

3 thoughts on “‘Is it done yet?’ Getting real about calling a test”

    • Hi Geromme,

      Good question. Many major sites with tons of traffic will never allocate more than 10% of their traffic to the experiment (and split among the variations from there). However, if you have a smaller site or you’re measuring a conversion event that will not get a lot of conversions, you’ll want to allocate more. Remember that if you have low traffic volume and you also allocate a smaller percentage of that volume to the experiment, you will simply need to run the experiment that much longer.

      In short, it’s definitely NOT necessary to do a 50/50 split, but if your traffic volume is low, you’ll want to allocate as much as you can.
