Over the past years, I’ve worked in multiple teams adopting very different strategies when it comes to feature flags. I’ve seen the pros and cons of both, and over time, I found myself disagreeing with any fundamentalist position on their use. There is a lot of nuance to this topic, and I think it is worth considering more carefully the various scenarios where feature flags do and do not make sense.

The Reasons For

There are a few major scenarios where feature flags make a lot of sense. The first is when it’s used for A/B testing, where you absolutely do want different behaviors for different users, based on their randomly assigned treatment. I’ve seen this strategy employed extremely well at Amazon where new features are gated by a “feature flag” that is actually controlled by an internal A/B testing framework. The framework randomly exposes some Amazon customers to the new feature, and then monitors their subsequent behavior in order to estimate the business impact of launching the feature.

I was initially skeptical, but was soon won over by how easy the framework was to use, and the valuable insights it provided on the benefits (or drawbacks) of certain features. “Flavor of the month” decisions were replaced with real data. And none of this is possible without the use of “feature flags” to dynamically toggle new features.

Another great use case for feature flags, is when you’re working on a very complex epic that require many different sub-tasks to be completed in different parts of the system. Sub-tasks that are too numerous and invasive to be done in a single pull-request.

In such cases, trying to keep all these disparate changes in side-branches and coordinating a simultaneous merge and deployment, is a recipe for disaster. It’s far more manageable to gate any disruptive changes behind a master flag, merge and deploy all the sub-commits incrementally, and do a flag-flip once all the pieces are in place.

One last use case for feature flags, is when you do not have control over your deployments. For example, consider the Facebook Android app, which contains code contributed by hundreds of different teams, all combined and deployed as a single binary. In such scenarios, performing rollbacks can be infeasible. For practical, political, bureaucratic or even marketing reasons. In such cases, feature flags allow your team to toggle new functionality or mitigate risky changes, without having to rollback or deploy any new binaries.

Someone on Reddit pointed out a similar use-case for feature flags: targeting a very specific launch date for marketing reasons, while still deploying your code much earlier, in order to ensure stability. You can then have a “dynamic” feature flag that automatically enables itself at a specific time. This is also a great use-case for similar reasons – changing functionality in situations where deploying a new binary is impractical.

Risk Aversion

The above are all fantastic use cases for feature flags, but I’ve also seen teams get bogged down by policies that overreach in their use. For example, mandating that every single code change should be behind a feature flag, “just in case we made a mistake”.

Risk management should indeed be a priority for all teams. But there are better ways of doing this than relying on feature flags, especially if your team has control over its own deployments. The vast majority of your bugs should be caught by your automated test suite and/or QA process. And the last few stragglers should be handled using incremental deployments, production alarms and rollbacks.

Besides, as soon as any problem is detected, the recommendation at places like Google is to rollback first and investigate the problem later:

We have seen this at Google any number of times, where a hastily deployed roll-forward fix either fails to fix the original problem, or indeed makes things worse. Even if it fixes the problem it may then uncover other latent bugs in the system; you’re taking yourself further from a known-good state, into the wilds of a release that hasn’t been subject to the regular strenuous QA testing. At Google, our philosophy is that “rollbacks are normal.” When an error is found or reasonably suspected in a new release, the releasing team rolls back first and investigates the problem second

When things are on fire, the last thing you want to do is root-cause the bug and figure out which flag-flip will safely fix the problem. And that may not even fix things – there’s no guarantee that even if your teammate tried to put his changes behind a feature flag, he didn’t inadvertently introduce a bug that cannot be solved by a flag-flip.

Feature flags are a poor man’s alternative to binary rollbacks, and they definitely aren’t a substitute for having a great automated test suite and a robust QA process. If you’re relying on feature flags to remedy production bugs, you should stop and evaluate your team’s practices. Risk aversion is often a smell of your team entering into a doom loop which will only get worse and worse with time.

Death By Feature Flags

You may be wondering at this point why we shouldn’t use feature flags anyway. After all, “defense in depth” … and it never hurts to have more fine-grain flexibility right?

While feature flags are great in some cases, we should also keep in mind their costs. Software engineering is primarily an exercise in managing complexity. And each feature flag immediately doubles the universe of corner cases that your programmers have to understand, and your code is required to handle. “But what would happen if Foo is enabled, Bar is disabled, and we do independent A/B tests on Baz and Kaz on the same day?” In my experience, this combinatorial explosion in complexity can and will lead to bugs. Not to mention slowing down the speed at which your team can make any changes.

To quote an amusing anecdote shared online: “A flag that hasn’t been set to off in a year can be masking a major regression. At my last job we had two major outages in as many years from defunct flags defaulting to ‘off’ when the feature flag system failed to return flag states.”

“But these feature flags are only temporary. You should be removing them as soon as possible!”

Sure, and we should also not allow our tech debt to accumulate and we should follow every single best-practice religiously. Unfortunately, this never happens in any corporate environment. Even in great teams, tech debt often gets de-prioritized in the face of new requests. Newcomers to the team or those on their way out, aren’t always disciplined enough to clean up their flags after a successful rollout. And sometimes, these tasks simply slip through the cracks and get forgotten.

Someone on HackerNews has pointed out that as of Sept-2020, the release version of Windows “has roughly 2500 feature flags. Some are permanently jammed into the on position, some off position, and the rest are configurable by its experimentation frameworks and hackers”.

There is no better illustration of this than the KCG debacle where a financial firm lost half a billion dollars and almost went bankrupt in 30 minutes, partly due to dead code that was behind a feature flag.

The cause of the failure was due to multiple factors. However, one of the most important was that a flag which had previously been used to enable Power Peg… Power Peg had been obsolete since 2003, yet still remained in the codebase some eight years later. In 2005, an alteration was made to the Power Peg code which inadvertently disabled safety-checks which would have prevented such a scenario. However, this update was deployed to a production system at the time, despite no effort having been made to verify that the Power Peg functionality still worked

Feature flags are a powerful tool that can help you experiment with new features, manage the rollout of complex epics, and mitigate the problems associated with not controlling your team’s deployments.

But they come at a significant cost, in the form of code complexity, tech debt, slower development speeds, and inevitably, bugs.

As tempting as it may be, there is no silver bullet here. Weigh the pros against the cons, and use this tool judiciously when it makes sense to do so.

Discussion thread on /r/programming

Discussion thread on Hacker News