On Friday January 18, 2019, at roughly 10:08 UTC Elastic Cloud customers with deployments in the AWS eu-west-1 (Ireland) region experienced severely degraded access to their clusters for roughly 3 hours. During this same timeframe, there was an approximately 20 minute period during which all deployments in this region were completely unavailable.

Given the duration of this service interruption, we want to take the time to explain what happened. Such incidents are not acceptable to us, and we’re sincerely sorry if your business was impacted.

After any incident, we perform a detailed postmortem to understand what went wrong, find its root causes and, most importantly, discover what we can do to ensure that it doesn’t happen again.

We’d like to share below the root cause analysis we’ve performed for this incident, the actions we have already taken, and those we intend to take in the future to prevent such class of incidents from happening again.

If you have any additional concerns or questions, don’t hesitate to reach out to us through support@elastic.co.

Sincerely,

The Elastic Cloud team

Summary of Incident

Background

Generally speaking, the Elastic Cloud backend in each supported region consists of three main layers:

Allocators layer: Hosts for Elasticsearch and Kibana instances.

Coordination layer: This layer holds the system state, connected allocators, and maintains the actual location of each cluster’s nodes and Kibana instances. This is implemented by a 3-node Apache ZooKeeper ensemble.

Proxy/routing layer: Routes traffic to cluster nodes and Kibana instances based on that cluster’s virtual host name, fronted by a load-balancing layer which load balances between multiple proxy instances.

Chain of Events

All times stated in UTC:

2019-01-18 09:09: Heavy CPU load was exhibited on the leader server of our coordination layer in the Elasticsearch Service AWS eu-west-1 region.

2019-01-18 09:35: Alerts were received by our on-call SREs related to service degradation in this layer. Following that, a production incident was initiated following our internal protocol.

2019-01-18 10:08: Following the issues in the coordination layer, our proxy/routing layer went into a degraded state. The proxy/routing layer depends on the coordination layer to maintain its routing table consistency, and is designed to work while disconnected from ZooKeeper. However, the layer was facing connection flakiness to Zookeeper that our client-side library did not handle well. Despite mitigating actions, the ZooKeeper ensemble powering our coordination layer failed to achieve stability.

2019-01-18 10:30: The proxy layer deteriorated further, which led to complete cluster availability disruption in the eu-west-1 region between 2019-01-18 12:42 and 2019-01-18 13:02. SREs used that window as an opportunity to scale up and out the proxy layer to accommodate the surging client traffic hitting the load balancing layer, as well as triggering a leader election, and stabilizing the ZooKeeper ensemble powering the coordination layer.

2019-01-18 13:03: Cluster availability was restored, and after a monitoring period elapsed, the incident was marked as resolved.

Root Cause Analysis

The proxy layer acts as a smart routing service behind our load balancers and in front of our allocator layer. It is responsible (amongst other things) for routing an HTTP request and TCP connections to the appropriate destination within a certain deployment’s container nodes. Our current proxy layer implementation is a direct client to the Apache ZooKeeper ensemble, using the Apache Curator framework. ZooKeeper stores data in the form of nodes in a tree (called znodes), and the Curator framework uses TreeCache to keep all data of all child nodes of a ZooKeeper path locally cached. Then, Watches are used to keep the local TreeCache up to date.

Our proxies watch the ZooKeeper subtree that contains all znodes, which represent each deployment, its individual nodes (running in Docker containers), and their respective properties.

What do we use TreeCache for?

As elaborated in its documentation, TreeCache is actually a full mirror and not a cache. It mirrors a ZooKeeper hierarchy and the entire watched subtree, keeping it up to date and registering watches on all nodes encountered. Using TreeCache, we are able to register a listener that will get notified when changes occur in the znodes of the targeted ZooKeeper hierarchy.

Our proxies almost never fetch data directly from ZooKeeper. We set up TreeCaches at startup and use them to maintain a consistent mirror of its data. Furthermore, we don’t use the caching feature of TreeCache, but only the event listening part. These events are used to maintain application-level mirrors (rather than mirroring raw binary data), or just to react to events.

What’s wrong with TreeCache?

TreeCache has some logic to make sure the mirror is kept up to date, even in case of disconnection from the ZooKeeper server. When the connection is lost and then re-established, TreeCache crawls the entire mirror tree and sends a request per node to ZooKeeper to reload the data and register a watcher, all in one go, in one big batch. Some TreeCaches may have tens of thousands of znodes. If the connection is lost and re-established during a tree refresh, the same big batch of requests is sent again.

This leads to major issues in case of connection loss, and more so in the case of connection instability. Flakiness has to be understood here from the client’s (in this case the proxy) perspective: if the client fails to receive heartbeat messages within a certain time frame, it considers the connection lost. This means that this flakiness can be caused by the ZooKeeper server itself being overloaded and experiencing long GC pauses.

This leads to major issues:

Proxies hammer the ZooKeeper servers with big batches of data refresh requests.

In case of connection flakiness (from the proxy’s point of view), we send these big batches several times, with some bad consequences:

It can hammer the ZooKeeper servers badly, fill their incoming request queue, and worsen the GC situation.



As the ZooKeeper servers struggle to send responses, requests pile up in the client’s unbounded outgoing queue and lead to out of memory errors in these clients.

So, if the ZooKeeper server is loaded and causes heartbeat timeouts because of GC pauses, TreeCache will start flooding ZooKeeper with requests, making the situation worse and leading to a chain reaction that prevents the ZooKeeper servers from recovering, and can also kill client services (such as the proxy).

Relating to this specific incident, at 2019-01-18 09:09 ZooKeeper ensemble leader hit a high CPU threshold mark, which caused client disconnections, signaling alarms to our on-call SREs. Due to the issues described above, mitigating actions like stopping clients from reconnecting and re-establishing an initialisation phase were not helpful enough for the ZooKeeper quorum to stabilize.

A turning point was 2019-01-18 09:35 where proxies acting as ZooKeeper clients using TreeCache started experiencing resource starvation due to refresh requests piling up and leading to out of memory conditions in that layer, resulting in a death spiral that lead to a complete outage at 2019-01-18 12:42. Recovery from that point took 20 minutes, cluster availability was restored at 2019-01-18 13:03.

* All times specified in UTC

Mitigation

In order to fix this issue, we’ve been working on a replacement for TreeCache called TreeWatcher, which is a complete rewrite of TreeCache, and is API-compatible so that it can act as a drop-in replacement (plus some simplified data structures as we only use the event listener part of TreeCache and not its mirroring features). The major difference between TreeCache and TreeWatcher is the way in which the tree is refreshed.

Distinguishing resumed session from new sessions

When the connection is lost with a client, the ZooKeeper server keeps the session state for that client for a certain time, and resumes the session if the client reconnects within that delay. Session resumption includes restoring the watchers, which will trigger if changes happened while the session was suspended.

This leads to a first optimization: TreeWatcher will not refresh the tree if the session is resumed when the connection is re-established, and will just let watchers be triggered. As flakiness induced by GC pauses is likely to lead to session resumptions, this eliminates a large part of the tree refreshes.

Letting the ZooKeeper server set the pace for tree refreshes

When a tree refresh is needed, TreeWatcher will crawl the tree and send refresh requests, but it doesn’t send one big batch for the entire tree. Tree traversal is driven by server responses: a node’s children are refreshed only after TreeWatcher has received the response to that parent node’s refresh request.

This means that there’s now a back-pressure from the ZooKeeper server, as progression in the tree crawl is driven by the server responses. We only send requests to the ZooKeeper server when it has answered the previous ones.

Not piling up refresh requests

If the client session is lost (and not resumed) while a tree refresh is in progress, the current refresh cycle is canceled and a new one is started. This means that connection flakiness will never lead to clients going out of memory as we won’t add additional big batches of outgoing requests on top of those that are waiting for an answer (as is the case with TreeCache).

Rolling out the fix to production

TreeWatcher had been under development for a few months prior to the incident on January 18, and our team has thoroughly tested it in our QA and staging environments. Following a canary phase, a gradual rollout to production commenced in December 2018, targeting our management and control plane followed by our data plane as a final step. The proxy layer cutover from TreeCache to TreeWatcher was scheduled to hit production environments the week of January 21, 2019.

As of January 24, 2019 we have replaced TreeCache in favor of TreeWatcher across all our Elasticsearch Service providers and regional points of presence, and are confident that this is a big milestone for improving our platforms resiliency.

Now that TreeWatcher has been production-tested, we also plan to contribute it upstream to the Apache Curator project.

Future Improvements

In addition to the above fix, our team has been actively working on the next version of our proxy layer, which among other things will have a less direct coupling to ZooKeeper to maintain the routing table. Instead, it will cache the routing table in a versioned local file. So even if the connection to ZooKeeper is lost or exhibits connection flakiness, this stronger decoupling will allow the proxy to route traffic successfully using the locally cached routing table. Over the coming months, we will gradually roll out this new proxy implementation to all of our supported regions, and expect this to further increase our platform stability.