Scaling Feature Flags With Zookeeper

This is a blog about the development of Yeller, the Exception Tracker with Answers. Read more about Yeller here.

I love feature flags. They’re one of those tools that make running complicated systems much more manageable. For those not in the know, feature flags are a way of adding a conditional to your code that lets a configurable number of users or requests through; they were originally designed for restricting new features to internal users before rolling them out to the rest of the userbase. Yeller uses its own Clojure library for feature flags: shoutout, which is mostly just a Clojure port of James Golick’s rollout.
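As a concrete sketch of what such a conditional looks like (the names here are hypothetical, but hashing the user id against the flag name so each user gets a stable answer as the percentage ramps up is how rollout-style libraries typically decide):

```java
import java.util.zip.CRC32;

public class FlagCheck {
    // Hypothetical percentage-based flag check in the style of
    // rollout/shoutout: hash the flag name and user id together so
    // each user gets a stable yes/no answer for a given percentage.
    static boolean isActive(String flag, String userId, int percentage) {
        CRC32 crc = new CRC32();
        crc.update((flag + ":" + userId).getBytes());
        return (crc.getValue() % 100) < percentage;
    }

    public static void main(String[] args) {
        // At 0% nobody is let through; at 100% everybody is.
        System.out.println(isActive("new-ingest-path", "user-42", 0));   // prints false
        System.out.println(isActive("new-ingest-path", "user-42", 100)); // prints true
    }
}
```

Because each user’s hash bucket is fixed, raising the percentage only flips more users from false to true; nobody flips back, which is what makes gradual rollouts (and quick rollbacks) predictable.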

For high-throughput systems (such as Yeller’s exception ingest), feature flags offer a way to control the performance of the system at runtime. This is critical - Yeller is either very quiet, or operating at very high throughput because one of its customers broke something really badly. During those high-throughput events, the ability to dynamically control features without redeploying is crucial.

Feature flags also give you a way to ramp up writes to new services or storage backends, so that migrations can be done incrementally. Yeller hasn’t needed this yet, but doubtless it will at some point. Twitter uses feature flags extensively for rolling out new infrastructure, as do a few other well-known companies (Etsy and Instagram, to name the examples I can think of off the top of my head).

Feature flags are very often implemented using redis - James Golick’s rollout is a great example of this. However, using redis for feature flags means you’re necessarily bottlenecking your system on a single core in your cluster (or you’re operating clustered redis, which can work, but uh, I don’t want to do it). One of the things I truly believe in for Yeller is this tweet by @littleidea.

If a customer breaks something really badly, they really don’t want to also break their exception tracker and be unable to diagnose what they broke. Plus, I’ve operated redis (as a single instance) for the past few years, and I’ve hit enough problems with it that I didn’t want to introduce it here as well.

Zookeeper

So, Yeller uses zookeeper for its feature flags. This might seem counter-intuitive: reads and writes in zookeeper are much slower than in redis (Yeller sees a 99th percentile latency of around 10ms to zookeeper, which isn’t good enough for the needs of the feature flag implementation). However, zookeeper ships with an incredibly useful feature, watches, which means that feature flag reads scale incredibly well.

Zookeeper watches let a client machine get notified of changes to an individual key (or a set of keys under a folder). So, to scale feature flags, you can store the settings for each flag in an in-memory hash map, and use a zookeeper watch to keep that map up to date. Checking the state of a feature flag is then a pure in-memory lookup in the local process, and you never need a network round trip at read time.
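A minimal sketch of that shape (the wiring here is hypothetical - in a real deployment onFlagChanged would be invoked from the zookeeper watch callback rather than called by hand):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FlagStore {
    // The in-memory view of each flag's settings; reads never
    // touch the network.
    private final Map<String, Integer> percentages = new ConcurrentHashMap<>();

    // Stand-in for the zookeeper watch callback: when a flag's node
    // changes, the watch fires and the new value lands here.
    void onFlagChanged(String flag, int percentage) {
        percentages.put(flag, percentage);
    }

    // The flag check itself is a plain hash map lookup in the local
    // process. userBucket stands in for the hashed user id mod 100.
    boolean isEnabled(String flag, int userBucket) {
        return userBucket < percentages.getOrDefault(flag, 0);
    }
}
```

Every read goes through the map; the only network traffic is the watch notification when a flag actually changes.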

I have, in the past, scaled feature flags by putting an in-memory cache in front of redis. However, that approach makes a significant tradeoff in how fast your services respond to feature flag changes - typically the cache is set to expire after a certain number of seconds, and that expiry limits how fast updates propagate. With a watch, updates propagate as fast as zookeeper can replicate them, and reads still stay in process. Furthermore, zookeeper (vs cached redis) wins significantly when you start caring about higher latency percentiles - the zookeeper watch approach never hits the network for reads, whereas the redis one does whenever the cache expires.
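For contrast, a hypothetical sketch of the cache-in-front-of-redis approach, which makes that staleness window explicit (the clock is passed in so the behaviour is easy to see):

```java
public class TtlCachedFlag {
    // Hypothetical read-through cache in front of redis: reads inside
    // the TTL return the cached value, so a flag change can take up to
    // ttlMillis to be observed by this process.
    private final long ttlMillis;
    private int cachedPercentage;
    private long fetchedAt = -1;

    // Stand-in for the redis round trip.
    int backendValue = 0;
    int fetchFromBackend() { return backendValue; }

    TtlCachedFlag(long ttlMillis) { this.ttlMillis = ttlMillis; }

    int percentage(long nowMillis) {
        if (fetchedAt < 0 || nowMillis - fetchedAt >= ttlMillis) {
            cachedPercentage = fetchFromBackend(); // network hit on expiry
            fetchedAt = nowMillis;
        }
        return cachedPercentage;
    }
}
```

Until the TTL elapses, a flag change sits unseen in the backend - exactly the propagation delay (and the periodic network hit) that the watch approach avoids.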

Tradeoffs

As with all software decisions, there are distinct sets of tradeoffs. For this particular case (using feature flags backed by redis vs the zookeeper watch approach), there are a few drawbacks:

zookeeper is significantly more complex to set up than redis (for Yeller this is a moot point - various other parts of the infrastructure already used zookeeper)

Likewise, zookeeper is more complex to operate. Again, for Yeller this is a moot point, adding redis would actually mean more moving parts in the infrastructure, not fewer.

It adds a modicum of load on your zookeeper servers (one new watch for each client JVM). This won’t scale to super-large systems, but within the tens-to-hundreds-of-nodes limit I expect Yeller to stay inside, it’ll be fine.

Apart from those points, using zookeeper for feature flags is a big win - throughput and latency are dramatically improved over redis, at no cost that matters all that much.

Lastly, if you’re using zookeeper from the JVM, I’d highly recommend using Apache Curator as your client - Curator already builds in code for watching a node and getting notified when it changes (the googleable term is NodeCache). This made the implementation extremely easy: just 50 lines of code. If you’re in ruby, there’s already a gem for backing rollout with this: papertrail/rollout-zk.

Nothing in this is new at all - I learned of this technique from Eric Lindvall (who wrote the above ruby gem) via https://gist.github.com/eric/5522399, and thought I’d write it up in full, with a list of tradeoffs.