
So far in this series, we’ve talked about the hidden challenges that can stop experimentation from getting off the ground, like getting tricked by statistics, or choosing the wrong metrics. Many of you have already overcome these hurdles and built a budding testing program. So in today’s post, I’ll be talking about the next pitfall: getting stuck at “good enough.”

Here’s a pop quiz to get us started:

  • How many experiments did your company run last year?
  • What percentage of those yielded meaningful insights?
  • What’s your goal for each of those numbers this year?

Depending on your size and maturity, those numbers can range anywhere from zero to over 1,000 experiments, with anywhere from 5% to 50% yielding insights. Most often, though, the answer I hear is some version of, “uh…we don’t really know.” Take a moment to try answering these questions. If you don’t have the data, or if you haven’t set a goal, then you’re in danger of losing steam before reaping the biggest benefits of experimentation.

I’ve seen many teams take the challenging leap from “zero to one”, bringing data-driven decisions into their culture for the first time. They might run a one-off A/B test, see some success, and start to build a roadmap. But a year or two later, they hit a plateau, running 1-2 experiments a month with mixed results. That might seem good enough, but it’s not even in the ballpark of a real culture of experimentation.

Amazon’s Jeff Bezos likes to say, “Our success is a function of how many experiments we do per year, per month, per week, per day.” In five years at Microsoft, we grew from a few experiments a month to over 300 per week. Netflix runs hundreds of experiments a year, and Airbnb has passed 500 per month. For each of these companies, experimentation isn’t just an ingrained cultural practice—it’s a core competitive differentiator. They don’t just measure the number of experiments they run; they set aggressive goals to increase them every year.

Measuring Success

Not every company needs to run 1000 experiments at a time. But to build a culture of experimentation, you do need a quantitative measure of success. We all know this when it comes to one-off experiments – “what’s measured is moved” – but we forget it all too often in the context of our larger program.

So what’s the right measure of success? Most top experimentation teams use these two:

  • Velocity: the number of experiments started every month
  • Insight Rate: the percentage of experiments that drive significant change to an important metric

Velocity captures the quantity of experimentation. It tells you how many ideas in your organization are validated with data, and how many people in your organization are adopting an experimental mindset. The goal you set here should depend on the size of your team. If 500/week sounds insane, try setting an achievable goal like “1 experiment per quarter per 3 developers”. Then once you hit that goal, double it.

Insight Rate captures the quality of your experiments. It tells you how often you’re actually getting value out of these experiments. Note that many teams will choose to measure “win rate”, but that misses the point. It can be just as valuable to prevent risky launches and to learn from bad ideas. For example – in just the last few months, I’ve seen three different companies do major redesigns, only to find that the new flow dropped their revenue by >3%. In each case, the experiment helped them catch the drop and build a plan to overcome it.

Finally, make sure you have some way of measuring and reporting on these metrics. This could be as simple as a spreadsheet or whiteboard, or a purpose-built solution for program reporting. The important thing is that you measure these numbers, share them regularly, and set ambitious goals.
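As a concrete illustration, here is a minimal Python sketch of that kind of program reporting, assuming you keep a simple log of experiments with a start date and an outcome. The field names and sample records are hypothetical placeholders; adapt them to however you track experiments today, even if that’s a spreadsheet export.

# Minimal sketch of program-level reporting: compute monthly velocity and
# insight rate from a simple experiment log. The record fields (name,
# started, outcome) are hypothetical -- adapt them to your own tracking.
from collections import Counter
from datetime import date

experiments = [
    {"name": "new-onboarding-flow", "started": date(2024, 1, 8),  "outcome": "win"},
    {"name": "checkout-redesign",   "started": date(2024, 1, 22), "outcome": "loss"},
    {"name": "pricing-page-copy",   "started": date(2024, 2, 5),  "outcome": "flat"},
    {"name": "search-ranking-v2",   "started": date(2024, 2, 19), "outcome": "win"},
]

# Velocity: experiments started per month.
velocity = Counter(e["started"].strftime("%Y-%m") for e in experiments)

# Insight rate: share of experiments that moved an important metric,
# counting both wins and losses as insights (flat results teach less).
insights = sum(1 for e in experiments if e["outcome"] in ("win", "loss"))
insight_rate = insights / len(experiments)

for month, count in sorted(velocity.items()):
    print(f"{month}: {count} experiments started")
print(f"Insight rate: {insight_rate:.0%}")

Even this much gives you something to put on a wall or in a weekly email, which is what keeps the goal visible.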

Dashboard

Boosting Velocity

If you’ve made it this far, you might be thinking, “It’s not enough to just set a goal! How do I actually run so many experiments? Come to think of it, what do we even mean by ‘experiment’ here?” To be fair: I don’t mean to imply that just by measuring your velocity, the number will magically go up. But I have seen first-hand that if you go looking for opportunities to experiment, you’ll find them everywhere. At Bing, we doubled experiment velocity year over year; every time we thought we’d hit a limit, we’d discover a whole new area to test in and push on.

When many people hear the word “experiment”, they mentally substitute “A/B test” and bring in assumptions about when and where it might apply. Think bigger! There are many kinds of experiments, and they can fit into any stage of the development lifecycle.

Experiment steps

Isn’t this an awful lot of work? Well, sort of – but I find it striking how some of the simplest experiments are also the most neglected. For example, most product teams still aren’t A/B testing with feature flags. If you’re launching a new feature, try rolling it out to just 50% of your users, and measure that half against a control group that doesn’t have the feature yet. Compared to the work of building the feature itself, this test requires almost no extra effort, but it can yield a wealth of insight about the impact you’re having and mitigate the risk of a bad user reaction.
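To make that concrete, here is a minimal sketch of the 50/50 split a feature flag gives you, assuming you key the assignment on a stable hash of the user ID so each user sees the same experience on every visit. The experiment name, salt, and function names are hypothetical, not any particular tool’s API.

# Minimal sketch of a 50% feature-flag rollout with a deterministic split.
# Hashing the user ID together with an experiment name keeps each user in
# the same group on every visit. Names here are hypothetical placeholders.
import hashlib

def assignment(user_id: str, experiment: str = "new-checkout-flow") -> str:
    """Return 'treatment' or 'control' for a given user, split 50/50."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in [0, 100)
    return "treatment" if bucket < 50 else "control"

def show_new_feature(user_id: str) -> bool:
    # Log the assignment alongside your metrics so you can compare the
    # exposed half against the control group after launch.
    return assignment(user_id) == "treatment"

print(show_new_feature("user-42"))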

Or take the opposite example. Many teams are competent at feature flagging and staged rollouts, but they do nothing to promote their feature or drive adoption after it launches. Some of the cheapest and most impactful experiments take an existing piece of your interface and subtly change it to drive discovery. If you reach a feature through a button, experiment on that button: move it around, change the wording, or draw attention with a “new” badge or temporary popup. These experiments are cheap to run and have a major impact on adoption. One major news site, for example, added a duration timestamp to its videos and saw a 2x increase in time spent watching videos, a major revenue driver.

One common theme in these missed opportunities is that product teams will get used to one kind of testing at the expense of others. Product managers and designers may rely on client-side testing tools to make visual changes, but miss the opportunity to run server-side experiments on deeper functionality. And developers may rely on home-grown feature flagging to roll out code, but their team might miss the opportunity for remote configuration or multivariate testing after launch. Look for ways to test in more places, moving up or down your technology stack and across multiple teams to unlock more opportunities.

Server-side vs client side experimentation graph

Improving Quality

As experiment velocity rises, there’s a natural tendency for quality to suffer. You grab the low-hanging fruit, and it takes more effort to find big wins. New teams come onboard, and they repeat mistakes you may have already overcome. To some extent this tradeoff is inevitable, but there are some simple tips you can apply to keep up quality.

As a start, try testing more variants. The best experiments usually aren’t literal “A/B tests”. Instead, they involve more creativity and risk. When we come up with a single change, we prematurely narrow our opportunity. In particular, we often test things that we’re “sure” can win, rather than taking a risk on something bolder.

Try testing something completely different. Next time your designer comes up with two alternatives, don’t just pick one to test – ask them to come up with three more, and then test them all. Even if you don’t build and run all the options, just going through the exercise of brainstorming alternatives can unlock a big new idea.
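If it helps to picture it, here is a minimal sketch of how multi-variant assignment might look, extending the same deterministic bucketing idea as the earlier rollout example. The variant names are hypothetical placeholders for whatever designs your team comes up with.

# Minimal sketch of splitting traffic evenly across several variants plus
# a control, keyed on a stable hash so assignment is consistent per user.
import hashlib

VARIANTS = ["control", "design-a", "design-b", "design-c", "design-d"]

def assign_variant(user_id: str, experiment: str = "homepage-redesign") -> str:
    """Return one of the variants, split evenly across all users."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

print(assign_variant("user-42"))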

Multiple variations illustration

My colleague Hazjier has studied the correlation between the number of variations and win rate, and the data is remarkable. Studying tens of thousands of real experiments run on Optimizely, he found that over 75% of experiments had just two variations. But when he looked at the win rate of those variations, these narrower experiments performed the worst. He found that testing four variants against a control could nearly double the win rate compared to just one variant, taking the overall insight rate (wins + losses) to a best-in-class rate of over 50%.

Variation impact graph

What if you don’t have enough traffic to test so many variations? My earlier post walks through the tradeoffs and options there. But when in doubt, if you don’t test more variants, at least be bolder. Make larger changes, deeper in your stack, closer to the core actions you’re looking to drive among your users.

Conclusion

These are just a few tips to drive velocity and quality. But I want to reiterate that what matters most is setting a goal for your program and regularly measuring it. We can’t all test at Google’s scale, but we can all experiment more and better. Remember, the goal isn’t just to drive up conversion rates; it’s to adopt a whole different way of running a business. Don’t settle for a plateau of ad-hoc experimentation. If there’s one thing that separates a testing program from a culture of experimentation, it’s the constant search for ways to embed data and hypothesis thinking into every stage of product development.