At Plaid, we make heavy use of Amazon-hosted ElasticSearch for real time log analysis — everything from finding the root cause of production errors to analyzing the lifecycle of API requests.

The ElasticSearch cluster is one of the most widely used systems internally. If it is unavailable, many teams can’t do their work effectively. As such, ElasticSearch availability is one of the top SLAs that our team — the Data Science and Infrastructure (DSI) team — is responsible for.

So, you can imagine the urgency and seriousness when we experienced repeated ElasticSearch outages over a two-week span in March of 2019. During that time, the cluster would go down multiple times a week as a result of data nodes dying, and all we could see from our monitoring was JVM Memory Pressure spikes on the crashing data nodes.

This blog post is the story of how we investigated this issue and ultimately addressed the root cause. We hope that by sharing this, we can help other engineers who might be experiencing similar issues and save them a few weeks of stress.

What did we see?

During the outages, we would see something that looked like this:

Graph of ElasticSearch node count during one of these outages on 03/04 at 16:43.

In essence, over the span of 10–15 minutes, a significant % of our data nodes would crash and the cluster would go into a red state.

The cluster health graphs in the AWS console indicated that these crashes were immediately preceded by JVMMemoryPressure spikes on the data nodes.

JVMMemoryPressure spiked around 16:40, immediately before nodes started crashing at 16:43.

Breaking down by percentiles, we could infer that only the nodes that crashed experienced high JVMMemoryPressure (the threshold seemed to be 75%). Note this graph is unfortunately in UTC time instead of local SF time, but it’s referring to the same outage on 03/04.

In fact, zooming out across multiple outages in March (the spikes in this graph), you can see every incident we encountered then had a corresponding spike in JVMMemoryPressure.

After these data nodes crashed, the AWS ElasticSearch auto recovery mechanism would kick in to create and initialize new data nodes in the cluster. Initializing all these data nodes could take up to an hour. During this time, ElasticSearch was completely unqueryable.

After data nodes were initialized, ElasticSearch began the process of copying shards to these nodes, then slowly churned through the ingestion backlog that was built up. This process could take several more hours, during which the cluster was able to serve queries, albeit with incomplete and outdated logs due to the backlog.

What did we try?

We considered several possible scenarios that could lead to this issue:

Were there shard relocation events happening around the same time? (The answer was no.)

Could fielddata be what was taking up too much memory? (The answer was no.)

Did the ingestion rate go up significantly? (The answer was also no.)

Could this have to do with data skew — specifically, data skew from having too many actively indexing shards on a given data node? We tested this hypothesis by increasing the # of shards per index so shards are more likely to be evenly distributed. (The answer was still no.)

At this point, we suspected, correctly, the node failures are likely due to resource intensive search queries running on the cluster, causing nodes to run out of memory. However, two key questions remained:

How could we identify the offending queries? How could we prevent these troublesome queries from bringing down the cluster?

As we continued experiencing ElasticSearch outages, we tried a few things to answer these questions, to no avail:

Enabling the slow search log to find the offending query . We were unable to pinpoint it for two reasons. First of all, if the cluster was already overwhelmed by one specific query, the performance of other queries during that time would also degrade significantly. Secondly, queries that did not complete successfully wouldn’t show up in the slow search log — and those turned out to be what brought down the system.

to find the offending query We were unable to pinpoint it for two reasons. First of all, if the cluster was already overwhelmed by one specific query, the performance of other queries during that time would also degrade significantly. Secondly, queries that did not complete successfully wouldn’t show up in the slow search log — and those turned out to be what brought down the system. Changing the Kibana default search index from * (all indices) to our most popular index, so that when people ran an ad-hoc query on Kibana that only really needed to run on a specific index, they wouldn’t unnecessarily hit all the indices at once.

from * (all indices) to our most popular index, so that when people ran an ad-hoc query on Kibana that only really needed to run on a specific index, they wouldn’t unnecessarily hit all the indices at once. Increasing memory per node. We did a major upgrade from r4.2xlarge instances to r4.4xlarge. We hypothesized that by increasing the available memory per instance, we could increase the heap size available to the ElasticSearch Java processes. However, it turned out that Amazon ElasticSearch limits Java processes to a heap size of 32 GB based on ElasticSearch recommendations, so our r4.2xlarge instances with 61GB of memory were more than enough and increasing instance size would have no impact on the heap.

So… how did we actually diagnose the problem?

As it turned out, finding the root cause required accessing logs that weren’t available in the AWS console.

After filing multiple AWS support tickets and getting templated responses from the AWS support team, we (1) started looking into other hosted log analysis solutions outside of AWS, (2) escalated the issue to our AWS technical account manager, and (3) let them know that we were exploring other solutions. To their credit, our account manager was able to connect us to an AWS ElasticSearch operations engineer with the technical expertise to help us investigate the issue at hand (thanks Srinivas!).

Several phone calls and long email conversations later, we identified the root cause: user-written queries that were aggregating over a large number of buckets. When these queries were sent to the ElasticSearch, the cluster tried to keep an individual counter for every unique key it saw. When there were millions of unique keys, even if each counter only took up a small amount of memory, they quickly added up.

Srinivas on the AWS team came to this conclusion by looking at logs that are only internally available to the AWS support staff. Even though we had enabled error logs, search slow logs, and index slow logs on our ElasticSearch domain, we nonetheless did not (and do not) have access to these warning logs that were printed shortly before the nodes crashed. But if we had access to these logs, we would have seen:

[2019–03–19T19:48:11,414][WARN][o.e.d.s.a.MultiBucketConsumerService] This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.

The query that generated this log was able to bring down the cluster because:

We did not have a limit on the # of buckets an aggregation query was allowed to create. Since each bucket took up some amount of memory on the heap, when there were a lot of buckets, it caused the ElasticSearch Java process to OOM.

We did not configure ElasticSearch circuit breakers to correctly prevent per-request data structures (in this case, data structures for calculating aggregations during a request) from exceeding a memory threshold.

How did we fix it?

To address the two problems above, we needed to:

Configure the request memory circuit breakers so individual queries have capped memory usages, by setting indices.breaker.request.limit to 40% and indices.breaker.request.overhead to 2 . The reason we want to set the indices.breaker.request.limit to 40% is that the parent circuit breaker indices.breaker.total.limit defaults to 70% , and we want to make sure the request circuit breaker trips before the total circuit breaker. Tripping the request limit before the total limit means ElasticSearch would log the request stack trace and the problematic query. Even though this stack trace is viewable by AWS support, it’s still helpful to for them to debug. Note that by configuring the circuit breakers this way, it means aggregate queries that take up more memory than 12.8GB (40% * 32GB) would fail, but we are willing to take Kibana error messages over silently crashing the entire cluster any day. Limit the number of buckets ElasticSearch will use for aggregations, by setting search.max_buckets to 10000 . It’s unlikely having more than 10K buckets will provide us useful information anyway.

Unfortunately, AWS ElasticSearch does not allow clients to change these settings directly by making PUT requests to the _cluster/settings ElasticSearch endpoint, so you have to file a support ticket in order to update them.

Once the settings are updated, you can double check by curling _cluster/settings . Side note: if you look at _cluster/settings , you’ll see both persistent and transparent settings. Since AWS ElasticSearch does not allow cluster level reboots, these two are basically equivalent.

Once we configured the circuit breaker and max buckets limitations, the same queries that used to bring down the cluster would simply error out instead of crashing the cluster.

One more note on logs

From reading about the above investigation and fixes, you can see how much the lack of log observability limited our abilities to get to the bottom of the outages. For the developers out there considering using AWS ElasticSearch, know that by choosing this instead of hosting ElasticSearch yourself, you are giving up access to raw logs and the ability to tune some settings yourself. This will significantly limit your ability to troubleshoot issues, but it also comes with the benefits of not needing to worry about the underlying hardware, and being able to take advantage of AWS’s built-in recovery mechanisms.

If you are already on AWS ElasticSearch, turn on all the logs immediately — namely, error logs , search slow logs , and index slow logs . Even though these logs are still incomplete (for example, AWS only publishes 5 types of debug logs), it’s still better than nothing. Just a few weeks ago, we tracked down a mapping explosion that caused the master node CPU to spike using the error log and CloudWatch Log Insights.

Thank you to Michael Lai, Austin Gibbons, Jeeyoung Kim, and Adam McBride for proactively jumping in and driving this investigation. Giving credit where credit is due, this blog post is really just a summary of the amazing work that they’ve done.

Want to work with these amazing engineers? We are hiring!