
Our tools are used by 15k+ real estate agents, and our web apps serve millions of users every day. With over 300 people in our product and engineering organization, we need to be careful and efficient with each feature release.
In order to release features that satisfy our users’ needs, we need to be able to turn features on and off, specify audiences, test different variations, and integrate easily with our existing apps.

The Problem

Currently, Compass uses an internal service called “Experiments API” to roll out new features. It’s a simple service: given a user ID and an experiment name, it returns either true or false (a sketch of the general idea follows the list below). Today, it makes decisions by:

  • Rolling out to a specified percentage of users, with random bucketing
  • Force-enabling or disabling logged-in users by email
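To illustrate, here’s a minimal sketch of that general technique, deterministic percentage bucketing plus per-user overrides, in JavaScript. The names and hashing details are ours for illustration, not the actual internal implementation:

import crypto from 'crypto';

// Sketch only: percentage rollout with deterministic bucketing
// and per-user force overrides (not the real Experiments API).
function isExperimentEnabled(userId, experimentName, percentage, forced = {}) {
  // Force-enabled/disabled users win over random bucketing.
  if (userId in forced) return forced[userId];

  // Hash the (experiment, user) pair into one of 100 stable buckets,
  // so the same user always gets the same answer for an experiment.
  const hash = crypto
    .createHash('md5')
    .update(`${experimentName}:${userId}`)
    .digest();
  return hash.readUInt32BE(0) % 100 < percentage;
}

// ~20% of users get `true`, consistently across calls:
// isExperimentEnabled('some_user_id', 'new_search_page', 20);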

As our business grows, we are starting to see the need for the following:

  • More intricate rules when rolling out features: perhaps by region or by user role
  • A more stable service: the current experiments service is not reliable due to its underlying infrastructure, and its run-time incidents are happening at an unacceptable rate
  • The ability to test variations of features: we need to be able to “learn from reality” by validating our hypotheses with our real users. With the Experiments API, all data analysis is done on demand; we need a better way to analyze features.

Thankfully, today we can buy third-party services that handle feature tests and rollouts for us, so our engineers can focus on what’s most important: the features themselves! 😃

After careful consideration, we decided to go with Optimizely, more specifically the Optimizely Full Stack product, due to its robust feature rollout and testing capabilities. It will also demand less maintenance work than our current in-house experiment service.

The Project

OK, now the question was: “How do we start using it?”

We assembled a small team to investigate the needs of Compass teams and how the Optimizely Full Stack product could help with their use cases. We soon learned that we would have to build our own in-house integrations, with the following “broad” objectives:

  1. Main mission: every team that wants to use Optimizely will be able to do so easily
  2. Secondary mission: retire the homemade Experiments API

And, importantly, we should achieve all of that while generating as little future maintenance work as possible.

The Strategy

In order to achieve our objectives, we laid out a set of principles to base our decisions on:

Principle 1: We will make as many decisions as we can together.

We saw that we would need to learn a lot early on, and to make many decisions that would affect almost every team in our company for years; this means we simply could not rely on our first ideas! Early on, nearly every decision was brainstormed in small whiteboard sessions, and we had as many planning sessions as we thought were needed. When we started producing code, we did everything together, on the same computer, in what we later learned is also known as Mob Programming. This practice proved extremely valuable, and we were confident in the quality of our code from the beginning!

Interestingly enough, we believe this did not delay our deliveries; on the contrary, we kept our ideas in sync and moved ahead at a very consistent pace.

Principle 2: Do not become a bottleneck by introducing too many abstractions

Everything we produce needs to be easy to understand and easy to use. We don’t want to impose a big cognitive load on other teams whenever they need to use Optimizely.

To achieve this goal, we would introduce as little extra code as possible. More code means more places for bugs to appear, more abstractions, and more maintenance work in the future whenever those abstractions need to change. We planned to spend a lot of time with Optimizely’s future users to understand their use cases, so we could be sure which shared modules were really needed for the integrations.

In other words, we want to create abstractions through extraction, not by trying to predict future needs. More on this below.

The Work

In this section, we will explore the journey of integrating Optimizely Full Stack into the Compass architecture. Each step below defines an immediate challenge, our solution, and any new challenges resulting from that solution.

Challenge: every project should be able to use Optimizely

Solution: in Compass, there are three types of clients that can call Optimizely: backend microservices, web clients, and mobile clients. For services and web clients, Optimizely provides SDK clients in a number of languages. The clients work pretty well: they continuously poll your Optimizely configuration (called the datafile), and no method depends on external calls; everything is local and fast. Optimizely also provides very good mobile clients, which we decided to use directly in both our iOS and Android apps.

[Diagram: each microservice and application runs its own Optimizely client.]
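To give a feel for what running a client looks like, here is a minimal sketch using Optimizely’s JavaScript SDK (the SDK key, feature key, and attributes are placeholders):

import optimizelySdk from '@optimizely/optimizely-sdk';

// The client fetches the datafile and keeps polling it in the background.
const client = optimizelySdk.createInstance({
  sdkKey: 'YOUR_SDK_KEY',
  datafileOptions: { autoUpdate: true, updateInterval: 60000 }, // poll every minute
});

client.onReady().then(() => {
  // Decisions are computed locally from the cached datafile:
  // no network round trip happens on this call.
  const enabled = client.isFeatureEnabled('price_filter', 'some_user_id', {
    some_attribute: 'some_value',
  });
  console.log('price_filter enabled?', enabled);
});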

New problem: most of our backend services are written in Go, and there was no Go client! [Go SDK made publicly available 01/10/2020.]

Challenge: Implement Optimizely as a service

Solution: following Optimizely’s recommendation, we decided to implement an internal service that runs the Optimizely client. We went with Java, since the Optimizely team told us it’s the most stable and mature of their clients. It’s also the client they themselves use for their own features, and we understand the impact on a product’s quality when the makers are also the users! [Optimizely Agent can be used to deploy Full Stack as a microservice – made publicly available 03/30/2020.]

At Compass, all our backend services communicate with each other via gRPC, and we have one service called APIv3 that exposes all our gRPC services as HTTP endpoints (Cameron Waeland talked about our architecture at QCon 2018, check it out). In this architecture, the service can be very efficient and fast, and is reachable from both the backend and the frontend. At this point, we started believing every project would use this service, and only a few would need to implement the client themselves, mostly for performance reasons.

[Diagram: the “Optimizely as a service” strategy.]

New problem(s):

  1. Defining a good interface: every project that interacts with the service will do so through its interface. A bad interface can make every other project worse, and introducing breaking changes becomes very costly once the service is used in so many different places.
    (Check out A Philosophy of Software Design by John Ousterhout, which explains really well why good interfaces are key to good software design.)
  2. Getting the service to production: we wanted continuous integration running early, and we also wanted to deal with DevOps issues early. These issues usually involve different teams with conflicting priorities, so asking for help and communicating early is key.
    (The benefits of getting anything to production early are well outlined in this blog post.)

Challenge: defining the service’s interface

Solution: following our principles, we decided the service would be a “thin wrapper” over Optimizely’s client. For each client method, we would expose a single RPC/endpoint with the same name and the same arguments. In the HTTP interface, we even created a convention: all requests are POSTs, with the required arguments in the URL and the optional ones in the body.
For example, this client call to check whether a feature is enabled:

optimizelyClientInstance.isFeatureEnabled(
  'price_filter',                    // feature key
  'some_user_id',                    // id
  { some_attribute: 'some_value' },  // attributes
);

would translate to:

POST /opty/is_feature_enabled/price_filter/some_user_id
{ "attributes": { "some_attribute": "some_value" } }

This implementation would only need a few lines of code, as our principles demanded. We would not need to create any abstractions, which means we would impose only a small cognitive load on our users. A big bonus: users can consult the Optimizely docs directly if they have questions.
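To make the “thin wrapper” idea concrete, here is a rough sketch of what such an endpoint could look like. Our real service is written in Java; this JavaScript version, with assumed names like optyClient, is only meant to show the shape:

import Koa from 'koa';
import Router from '@koa/router';
import bodyParser from 'koa-bodyparser';
import optimizelySdk from '@optimizely/optimizely-sdk';

const app = new Koa();
const router = new Router();

// The service simply hosts an Optimizely client instance...
const optyClient = optimizelySdk.createInstance({ sdkKey: 'YOUR_SDK_KEY' });

// ...and exposes one endpoint per client method: same name, same
// arguments; required ones in the URL, optional ones in the body.
router.post('/opty/is_feature_enabled/:featureKey/:userId', (ctx) => {
  const { featureKey, userId } = ctx.params;
  const { attributes = {} } = ctx.request.body || {};
  ctx.body = {
    enabled: optyClient.isFeatureEnabled(featureKey, userId, attributes),
  };
});

app.use(bodyParser()).use(router.routes()).listen(3000);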

The only “con” of that decision is that we are not hiding the fact that we’re using Optimizely at all, which could lead to a vendor lock-in problem. We decided to take that risk, for the following reasons:

  1. People would be interacting with the Optimizely dashboard anyway, so it was already explicit that we were using it. We did not plan on abstracting the dashboard!
  2. People will want to use Optimizely to its fullest potential, and any abstraction we created on top of it would need to take each and every feature into consideration. That means our abstraction would be at least as complex as Optimizely itself, which is definitely not in line with our principles.
  3. If we ever decide to change vendors, we will most probably incur a big breaking-change migration either way.

So “thin wrapper” abstraction it is!

Now that we had a service on the way, future users started asking us, “What should we use as the userId for our feature rollouts and experiments?” It turned out that was not a simple question to answer!

New problem: which user ID should we send to Optimizely?

Challenge: identifying our users

Solution: most of our users will use Optimizely to roll out features, which means the isFeatureEnabled method is going to be used a lot. Looking at its arguments, it needs a featureKey and a userId.

The feature key is easy: it is just whatever we defined in the dashboard. What about the user ID? We went through some possibilities:

  1. User ID from the database: this works if we are on a page or service that is guaranteed to have one (i.e., an authenticated route). It also guarantees that the same user sees the same variant on different devices. But we also want to roll out features to logged-out users, so we needed a different ID.
  2. Segment’s Anonymous ID: we use Segment for event tracking, and it generates and sets an Anonymous ID for every user who visits our site. This ID does not change when the user logs in, which is great, since we don’t want a feature to toggle after the user authenticates. The problem is that Segment wipes the Anonymous ID once the user manually logs out of the website/app, so we end up with more than one Anonymous ID per user.
  3. A new ID that’s unique per device: creating a new ID is problematic because we would have to update lots of apps and pages, but this option has all the properties we wanted, and some data teams also showed interest in having it.

[Table: summary of the three ID possibilities.]

The unique device ID was the winner, and we decided to create a new ID. After an internal poll, it was named Highlander ID (there can be only one, it lives forever, and we love a fun nerdy reference 😃). We generated this ID on page view if it did not already exist, and passed it along with our API requests as a custom header.
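A minimal sketch of how such an ID can be generated and attached, assuming localStorage for persistence and a hypothetical X-Highlander-Id header name:

// Generate the ID on page view if it does not already exist.
// (Sketch only: assumes a browser with crypto.randomUUID support.)
function getHighlanderId() {
  let id = window.localStorage.getItem('highlander_id');
  if (!id) {
    id = window.crypto.randomUUID();
    window.localStorage.setItem('highlander_id', id);
  }
  return id;
}

// Pass it along with API requests as a custom header:
fetch('/api/v3/some_endpoint', {
  headers: { 'X-Highlander-Id': getHighlanderId() },
});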

Feature rollouts were looking pretty good! The next step was supporting Feature Tests, which is when we actually compare two different variants of a feature with our users. Feature Tests in Optimizely are implemented similarly to Feature Rollouts, but we also need to send “success” events to Optimizely via the track method. Optimizely then tracks how many impressions each variant has and statistically compares how many of those impressions generated these “success” events.
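For reference, a manual call to the client’s track method looks something like this (the event key is a made-up example):

// Record a conversion (“success” event) for this user; Optimizely
// attributes it to whichever variant the user was bucketed into.
optimizelyClientInstance.track(
  'lead_submitted',                  // event key (hypothetical)
  'some_user_id',                    // user id
  { some_attribute: 'some_value' },  // attributes
);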

One reason we chose Optimizely was that it offered a Segment integration. Since we already use Segment, we had the option to set up Optimizely as an event destination, so we wouldn’t need to call Optimizely’s track method manually. This is great; it means even less friction for our users! But nothing is that easy, right? It turns out Segment’s integration only works if we use Segment’s user ID or anonymous ID.

New problem: integration with Segment.

Challenge: Segment and Optimizely integration

Solution: everything in software development is a trade-off, so after careful consideration we decided that using Segment’s Anonymous ID was an acceptable price to pay for Segment’s integration. We needed to, ironically, “kill” the Highlander ID. It’s best to make breaking changes while you still don’t have a lot of users (but it’s still a pain!).
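The payoff is that teams keep emitting their usual Segment events, and Segment forwards them to Optimizely behind the scenes. Something like (event name hypothetical):

// A regular Segment tracking call. With Optimizely enabled as a
// Segment destination, this event is forwarded to Optimizely
// automatically; no direct optimizelyClientInstance.track call needed.
window.analytics.track('Lead Submitted', { some_property: 'some_value' });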

Next, we needed to integrate Segment with Optimizely on the browser side, and it turned out that was also not simple: Segment in the browser expects the page to be running a version of the Optimizely client, and its integration simply calls the track method on that client. None of our pages or JS apps had Optimizely’s client running on them, and we did not want to impose it on the teams.

After going through Segment’s source code and talking to Segment’s developer support, we came up with an interesting strategy to mitigate this: create a stub of the Optimizely client in the frontend whose track method is implemented as a fetch call to our service. It works like a charm, is easy to set up, and only adds a couple of kilobytes to our bundles!

// all our packages are prefixed @uc,
// and "opty" is Optimizely's nickname in our codebase
import { initOptyTrackInBrowser } from '@uc/opty';

// call this when running in the browser:
initOptyTrackInBrowser(window);

// this will create a "fake" version of the Optimizely client
// that Segment will call for every event triggered!
//
// window.optimizelyClientInstance = {
//   track: (...args) => {
//     fetchCallToOptyService(transformArgs(args));
//   }
// };

At this point, we were already working closely with some projects. One common concern was, obviously, QA and testing: users want to run their apps and quickly see the different variants being rendered. The strategy that worked best for the teams was using query parameters to toggle features and override variable values, and after implementing it for the third time we decided to extract it into a shared Node middleware module.

New problem: how should we test or QA our apps that are using Optimizely?

Challenge: enable testing and QA

Solution: at Compass we use Koa for our web applications, and we share Koa middlewares among our apps. Our Optimizely middleware ended up doing two things: fetching feature and variable data from our Optimizely service, and enabling QA and testing via query parameters. Here’s an example of the middleware in use:

import { loadOpty, VariableTypes, FEATURE_ONLY } from '@uc/opty';

// Initialize your Koa app...

// Setup the middleware:
// Pass a `features` dictionary. As values, pass either a dictionary of the
// desired Variables, or the `FEATURE_ONLY` value if you don't want
// to fetch any variables.
app.use(loadOpty({
  features: {
    'my_feature': {
      'my_integer_variable': VariableTypes.INTEGER,
      'my_string_variable': VariableTypes.STRING,
    },
    'my_other_feature': FEATURE_ONLY,
  }
}))

// Now ctx.state.opty is populated:
app.use((ctx, next) => {
  console.log(ctx.state.opty);
  return next();
})

// Output:
// {
//   features: {
//     'my_feature': true,
//     'my_other_feature': false
//   },
//   variables: {
//     'my_feature': {
//       'my_integer_variable': 123,
//       'my_string_variable': 'Hello World!'
//     }
//   }
// }

And the most important part: the middleware lets us override any feature or variable by adding a query parameter to the requested URL! For instance, we could override the my_integer_variable value by pointing our browser to www.mywebsite.com/some_route?opty_my_integer_variable=456.
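The override logic itself can be quite small. Here is a rough sketch of the idea with assumed names, since the real @uc/opty module is internal; opty_-prefixed query parameters simply win over whatever the Optimizely service returned:

// Sketch: apply `opty_`-prefixed query parameters on top of the
// features/variables fetched from the Optimizely service.
function applyQueryOverrides(query, opty) {
  for (const [param, raw] of Object.entries(query)) {
    if (!param.startsWith('opty_')) continue;
    const name = param.slice('opty_'.length);

    if (name in opty.features) {
      // e.g. ?opty_my_feature=true toggles the feature itself.
      opty.features[name] = raw === 'true';
      continue;
    }
    for (const variables of Object.values(opty.variables)) {
      if (name in variables) {
        // Coerce the raw string to the declared variable's type.
        variables[name] =
          typeof variables[name] === 'number' ? Number(raw) : raw;
      }
    }
  }
  return opty;
}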

What Went Well

  1. Time to delivery. After four months, all our projects were able to roll out and test features using Optimizely, and by the time this post was written it was already in use in production in three different projects!
  2. Learning together. We are proud of the quality of the code we produced, and of the benefits it is bringing to different teams. We believe this is a direct result of all of us planning, working, learning together most of the time, and reaching out to other experts around us whenever needed.
  3. Helping other teams. Remember we talked about trying to get a service to production as early as possible? Well, we ended up bumping into a lot of obstacles and had to talk to lots of different teams. To share what we learned about service infrastructure and Optimizely, we did the following to improve developer experience:
    a. Published a “boilerplate” service that’s deployed to production with CI and runtime monitoring set up, which other teams can refer to when they go through the same challenges.
    b. Hosted weekly office hours and training sessions to answer questions, conduct code reviews, and pair-program with developers on their Optimizely integrations.
    c. Created a Slack channel to make it easier for developers to reach us with Optimizely questions.
  4. Iterate on interfaces early. Even though we thought really hard about interfaces, things rarely go well on the first try. Breaking changes to interfaces are costly, so we were glad to find these issues early, when iterating was still easy. We attribute this to the fact that we chose to work really closely with a couple of projects and find the problems before having many users.
  5. No tech debt. None of the code produced had us saying “hmm, we should probably refactor this when we have time” 😃

What Could Have Been Better

  1. More planning and studying of third-party integrations. Judging by the docs, we thought it would be easy to integrate Optimizely and Segment. It turned out it wasn’t, and the issues around this probably caused most of the delays and early breaking changes in our projects.
  2. More integration with the backend services. Since ours was one of the first Java services to run on our new Kubernetes cluster, there were many unknowns, which made it harder for backend services to integrate with ours. Although we released packages and middleware for frontend apps, backend callers did not get any helpful modules to work with.

Conclusion

We consider the project a success: lots of unexpected issues happened, but we were able to deal with them early and in a timely manner. Our main objective was achieved, and we have a clear deprecation strategy for the old Experiments API, which was our secondary mission. We are also looking forward to applying these principles to other projects at Compass.

If you’re interested in joining Compass and solving new challenges, we’re hiring!