I write this following a particularly frustrating day of thumb-twiddling and waiting on Slack messages from the AWS support team. Our Elasticsearch cluster was down for the better part of a day, and we were engaged with AWS support the whole time.

At my previous job working for Loggly, my team and I maintained a massive, multi-cluster Elasticsearch deployment. I learned many lessons and have a lot of tricks up my sleeves for dealing with Elasticsearch’s temperaments. I feel equipped to deal with most Elasticsearch problems, given access to administrative Elasticsearch APIs, metrics and logging.
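On a self-managed cluster, that "access" boils down to a handful of stock diagnostic calls. The endpoints below are standard Elasticsearch APIs; the localhost address is just a placeholder for one of your own nodes:

```
# Standard Elasticsearch diagnostic endpoints (host and port are placeholders)
curl -s localhost:9200/_cluster/health?pretty          # overall cluster state
curl -s localhost:9200/_cluster/pending_tasks?pretty   # backlog of master-node tasks
curl -s 'localhost:9200/_cat/shards?v'                 # per-shard allocation and size
curl -s localhost:9200/_nodes/hot_threads              # what each node is busy doing
```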

AWS’s Elasticsearch offers access to none of that. Not even APIs that are read-only, such as the /_cluster/pending_tasks API, which would have been really handy, given that the number of tasks in our pending task queue had steadily been climbing into the 60K+ region.

This accursed message has plagued me ever since AWS’s hosted Elasticsearch was foisted on me a few months ago:

{
  "Message": "Your request: '/_cluster/pending_tasks' is not allowed."
}

Thanks, AWS. Thanks…

Without access to logs, without access to admin APIs, without node-level metrics (all you get is cluster-level aggregate metrics) or even the goddamn query logs, it’s basically impossible to troubleshoot your own Elasticsearch cluster. This leaves you with one option whenever anything starts to go wrong: get in touch with AWS’s support team.

Nine times out of ten, AWS will simply complain that you have too many shards.

It’s bitterly funny that they chide you for this, because by default any index you create will contain 5 shards and 1 replica (10 shards in total). Any ES veteran will say to themselves: heck, I’ll just update the cluster settings and lower the default to 1 shard! Nope.
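To see how fast those defaults add up, here is the arithmetic for a hypothetical logging setup that creates one index per day and leaves the 5-primary / 1-replica default untouched (the retention window is my own invented number):

```shell
# Hypothetical numbers: one daily index, Elasticsearch defaults untouched
primaries=5          # default number_of_shards
replicas=1           # default number_of_replicas
days_retained=90     # hypothetical retention window

shards_per_index=$(( primaries * (1 + replicas) ))   # primaries plus their replica copies
total_shards=$(( shards_per_index * days_retained ))
echo "$total_shards shards"   # 900 shards from the defaults alone
```

No wonder they think everyone has too many shards.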

{
  "Message": "Your request: '/_cluster/settings' is not allowed for verb: GET"
}

Well, fuck (although you can work around this by using index templates).
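A sketch of that workaround, in the 5.x-era template syntax (the template name and the match-everything pattern are my own choices): every index created after this gets 1 primary shard instead of 5.

```
PUT _template/one_shard_default
{
  "template": "*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```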

Eventually, AWS support suggested that we increase the instance size of our master nodes, since they were not able to keep up with the growing pending task queue. But they advised us to be cautious, because making any change at all will double the size of the cluster and copy every shard.

That’s right. Increasing the instance size of just the master nodes will actually cause AWS’s middleware to double the size of the entire cluster and relocate every shard in the cluster to new nodes, after which the old nodes are removed from the cluster. Why this is necessary is utterly beyond me.

Adding an entry to the list of IP addresses that have access to the cluster will cause the cluster to double in size and migrate every stinking shard.

In fact, even adding a single data node to the cluster causes it to double in size and all the data will move.

Don’t believe me? Here is the actual graph of our node count as we were dealing with yesterday’s issue:

[Graph: the node count increased by 10x for a period of time]

Back at Loggly, we would never have considered doing this. Relocating every shard in any respectably sized cluster all at once obliterates the master nodes and causes both indexing and search to come to a screeching halt. That is precisely what happens whenever we make any change to our Elasticsearch cluster in AWS.

This is probably why AWS is always complaining about the number of shards we have… Like, I know Elasticsearch has a simple way to add a single node to a cluster. Given how Elasticsearch actually works, there is no reason for this madness.
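For contrast, here is roughly what "adding a node" involves on a self-managed 5.x-era cluster: a sketch of the new node's elasticsearch.yml, with hypothetical cluster and host names.

```
# elasticsearch.yml on the new data node (cluster/host names are hypothetical)
cluster.name: production-logs
node.master: false
node.data: true
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
```

Start the process, the node joins, and Elasticsearch rebalances a few shards at a time onto it, throttled by the cluster.routing.allocation settings. Nothing else moves.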

I often wonder how much gratuitous complexity lurks in AWS’s Elasticsearch middleware. My theory is that their ES clusters are multi-tenant. Why else would the pending tasks endpoint be locked down? Why else would they not give you access to the ES logs? Why else would they gate so many useful administrative APIs behind the “not allowed” Cerberus?

I must admit, though, it is awfully nice to be able to add and remove nodes from a cluster with the click of a button. You can change the instance sizes of your nodes from a drop-down; you get a semi-useful dashboard of metrics; when nodes go down, they are automatically brought back up; you get automatic snapshots; and authentication works seamlessly within AWS’s ecosystem (though it makes your ES cluster obnoxiously difficult to integrate with non-AWS libraries and tools, which I could spend a whole ’nother blog post ranting about). And when things go wrong, all you have to do is twiddle your thumbs and wait on Slack, because you don’t have the power to do anything else.

Elasticsearch is a powerful but fragile piece of infrastructure. Its problems are nuanced. There are tons of things that can cause it to become unstable, most of which are related to query patterns, the documents being indexed, the number of dynamic fields being created, imbalances in the sizes of shards, the ratio of documents to heap space, etc. Diagnosing these problems is a bit of an art, and one needs a lot of metrics, log files and administrative APIs to drill down and find the root cause of an issue.

AWS’s Elasticsearch doesn’t provide access to any of those things, leaving you no other option but to contact AWS’s support team. But AWS’s support team doesn’t have the time, skills or context to diagnose non-trivial issues, so they will just scold you for the number of shards you have and tell you to throw more hardware at the problem. Although hosting Elasticsearch on AWS saves you the trouble of needing a competent devops engineer on your team, it absolutely does not mean your cluster will be more stable.

So: if your data set is small, if you can tolerate endless hours of downtime, if your budget is too tight, or if your infrastructure is too locked into AWS’s ecosystem to buy something better, AWS’s hosted Elasticsearch is for you. But consider yourself warned…

Update — 6/19/2017: Since publishing this, the engineers on the AWS Elasticsearch team have personally reached out to us to better understand our use cases. They’re planning on improving the experience for “power-users”, and gathered a lot of feedback from us. I sincerely appreciate AWS’s willingness to face these issues head on and I’m impressed how quickly they addressed this. So, hats off to them!

Thanks for reading! If you like what you read, hold the clap button below so that others may find this. You can follow me on Twitter.