July 19, 2019

Using Experimentation to Measure and Validate Backend Performance Improvements

Experimentation is highly effective as a learning tool when there is a clearly defined outcome that you want to achieve (such as improving conversion rate) and some uncertainty around the specific way to do it. But sometimes you know what you need to change and you just want to be able to measure how that change will affect the metrics that are most important to you. Experimentation works very well for this too in that it provides a way to validate and illustrate the impact of invisible changes that could otherwise go unnoticed.

As an engineer on the Optimizely backend team, I care deeply about performance and latency, as one should in all backend applications. Recently in doing some routine development, I discovered a costly database query that was being executed on many requests. This was a blocking query, meaning the application had to wait for it to return before doing anything else. I found a way to skip the query most of the time by caching the data it was fetching inside of our web processes.

I was confident that my cached approach was going to be faster than waiting for the existing query. At the same time, it seemed important to deploy this as an experiment so we could accurately measure the difference in approaches and also add better instrumentation for request latency events to enable future backend performance experimentation.

The Experiment

I used a Full Stack feature flag to control my new code, setting up the experiment to be remotely configured once it was deployed to each environment. This de-risked the deploy as I was able to test out the change prior to enabling it for actual customers.

def _has_cached_whitelist_feature():

  fullstack_client = fullstack_service.get_optly_object()

  request = request_context.get_request()

  if fullstack_client and request and 

request.registry.get('request_id'):

    # we will get significance faster if we activate on the unique Request ID

    # since it really only has to match for the duration of the request

    return fullstack_client.is_feature_enabled('cached_whitelist', request.registry.get('request_id'))

  return False

Control and variation feature toggles

Measuring Success

Overall, we were looking to see a decrease in latency per request. To measure this, I added a new event called api_response_time_ms that measures the time on the backend server between receiving the request and finishing building the response.

fullstack_client.track(
    'api_response_time_ms',
    self.request.registry.get('request_id'),
    attributes=attributes,
    event_tags=tags
)

The new event only applied to API requests (not page loads) and we made a conscious decision not to measure any in-transit time as a part of this, because we wanted this metric to measure the part that we control to validate if we’re improving things on our end. This means that the metric we are sending is not a true end-to-end latency metric, but it is “good enough” to provide a baseline for the parts that we can change. I also added attributes for segmenting results by response size and route since I wasn’t sure how it might affect the latency of some routes differently and I wanted to be able to explore the data later. So the full instrumentation looks like this:

    start_time = self.request.registry.get('request_start_time')

    render_time_ms = math.floor((time.time() - start_time) * 1000)



    fullstack_client = fullstack_service.get_optly_object()

    content_length = int(self.response.headers.get('Content-Length', 0))



    if content_length > 1000 * 1000:

      response_size = '>1000kB'

    elif content_length > 500 * 1000:

      response_size = '500-1000kB'

    else:

      response_size = '0-500kB'


    route = request_context.get_route_string()


    attributes = {

      'route': route,  # Example: GET /api/v1/categories

      'response_size': response_size, # Example: 515kB

    }
    

    tags = {

      'value': render_time_ms,

    }


    fullstack_client.track(

      'api_response_time_ms',

      self.request.registry.get('request_id'),

      attributes=attributes,

      event_tags=tags

    )

The Results

After the experiment ran for about a month we saw on average 40 milliseconds improvement in the average case for response time latency from our API. In the report below you can see how I was able to filter the report by specific segments. In the segment below the API response time improved by 19.1%.

When working on improving performance in complex applications, there is rarely one big change that you can make that will make everything immediately faster. Many small incremental changes built up over time are usually required to achieve your goals. Those changes can have hidden trade-offs and it’s not always clear how much a given change will improve things. Profiling and better instrumentation can help, but profilers don’t tend to get run in production and thus don’t always tell the whole story. Instrumentation generally just tells you how a metric has changed over time, but not necessarily what caused that change and small changes can sometimes get lost in the noise. Experimentation can tell you with certainty that you are making things better.

Thanks to the experiment results, we know with certainty that the change we made was worthwhile. The change has now been fully rolled out to benefit all of our customers and we can easily reuse the instrumentation work we did here to measure the latency impact on any backend change that we want to make in the future. On to the next improvement!

Lay the foundation for backend experimentation with Optimizely Rollouts, our free feature flagging solution.

Using Experimentation to Measure and Validate Backend Performance Improvements

Read next

Ecommerce evolution: Blurring the lines between B2B and B2C