I have used Elasticsearch in various projects: to add rich search functionality to applications as well as to collect and analyze logs with the help of Kibana. In both cases, either your users or your operators rely on the Elasticsearch infrastructure. In one of my past projects, the team used Elasticsearch to store the logs of EC2 instances. Over time, more and more applications were moved to AWS, so the volume of logs shipped to Elasticsearch increased as well. One day (it was a Sunday), the Elasticsearch cluster suddenly became unavailable, and the log shippers started throwing errors. Luckily, the log shippers were monitored, and someone was paged to look at the issue. It took some time to find out that the Elasticsearch cluster had run out of disk space. Situations like this are avoidable: monitor available disk space, and you can react before the disks are full.

Amazon Elasticsearch provides Elasticsearch as a Service. The fully managed service covers many of the challenges of operating a search engine (e.g., cluster management, patching the operating system and the search engine, …). But you are still responsible for some operational aspects, such as sizing and performance optimization. Therefore, you need to monitor every Elasticsearch domain that serves production workloads.

Monitoring your whole cloud infrastructure is a complex task, as Andreas pointed out in his AWS Monitoring Primer. In this blog post, I will focus on the relevant parts for monitoring your Elasticsearch domain:

- I guide you to the relevant monitoring services and features offered by AWS.
- I present best practices based on real-world client projects.
- I provide a CloudFormation template that implements all ideas from this post. You can use the template to monitor any Elasticsearch domain in a minute.

Let’s get started!

Identifying important CloudWatch metrics

Each Elasticsearch domain sends metrics to CloudWatch.

The most important metrics are:

| Area | Metric | Description | Relevance |
| --- | --- | --- | --- |
| Storage | FreeStorageSpace | The free space, in megabytes, for all data nodes in the cluster. | ES throws a ClusterBlockException when this metric reaches 0. |
| CPU | CPUUtilization | The maximum percentage of CPU resources used for data nodes in the cluster. | 100% CPU utilization isn't uncommon, but sustained high averages are problematic. |
| CPU | CPUCreditBalance | The remaining CPU credits available for data nodes in the cluster (only applies to the t2 family). | If you run out of burst credits, performance will drop significantly. |
| CPU | MasterCPUUtilization | The maximum percentage of CPU resources used by the dedicated master nodes. | Because of their role in cluster stability, dedicated master nodes should have lower average usage than data nodes. |
| CPU | MasterCPUCreditBalance | The remaining CPU credits available for dedicated master nodes in the cluster (only applies to the t2 family). | If you run out of burst credits, performance will drop significantly. |
| Memory | JVMMemoryPressure | The maximum percentage of the Java heap used for all data nodes in the cluster. | The cluster could encounter out-of-memory errors if usage increases. |
| Memory | MasterJVMMemoryPressure | The maximum percentage of the Java heap used for all dedicated master nodes in the cluster. | Because of their role in cluster stability, dedicated master nodes should have lower average usage than data nodes. |
| Cluster | ClusterStatus.yellow | At least one replica shard is not allocated to a node. | Your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation. |
| Cluster | ClusterStatus.red | At least one primary shard is not allocated to a node. | You are missing data: searches will return partial results, and indexing into that shard will return an exception. |
| Cluster | ClusterIndexWritesBlocked | Indicates whether your cluster is accepting or blocking incoming write requests. | A value of 1 means that the cluster is blocking write requests. |
| Cluster | AutomatedSnapshotFailure | The number of failed automated snapshots for the cluster. | A value of 1 indicates that no automated snapshot was taken for the domain in the previous 36 hours. |
| Cluster | KibanaHealthyNodes | A health check for Kibana. | A value of 0 indicates that Kibana is inaccessible. |
| Cluster | KMSKeyError | Indicates whether your cluster can use the configured KMS key. | A value of 1 indicates that the KMS customer master key used to encrypt data at rest has been disabled. |
| Cluster | KMSKeyInaccessible | Indicates whether your cluster can use the configured KMS key. | A value of 1 indicates that the KMS customer master key used to encrypt data at rest has been deleted, or its grants to Amazon ES have been revoked. |
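To turn one of these metrics into an alert, you can create a CloudWatch alarm on it. Here is a minimal sketch in Python that builds the parameters for an alarm on FreeStorageSpace; the domain name, account id, and the 10 GB threshold are placeholders, and the actual `put_metric_alarm` call is left commented out so the sketch runs without AWS credentials.

```python
# Sketch: parameters for a CloudWatch alarm on FreeStorageSpace of an
# Amazon Elasticsearch domain. "my-domain", the account id, and the
# threshold are placeholder values, not taken from a real setup.

def free_storage_alarm(domain_name, client_id, threshold_mb):
    """Build the keyword arguments for cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"{domain_name}-free-storage-space",
        "Namespace": "AWS/ES",
        "MetricName": "FreeStorageSpace",
        # Amazon ES metrics are dimensioned by domain name and account id.
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": threshold_mb,  # FreeStorageSpace is reported in MB
    }

params = free_storage_alarm("my-domain", "123456789012", 10_000)

# To create the alarm for real (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

The CloudFormation template mentioned above expresses the same alarms declaratively; the sketch just makes the moving parts (namespace, dimensions, comparison operator) explicit.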

Once important metrics are identified, you can use them to understand how a healthy system differs from an impacted system.

Defining thresholds

One of the hardest parts of monitoring is defining what healthy means. For each metric, you have to define a threshold that separates healthy from impacted. For example, you might regard CPU utilization under 80% as healthy because the application has never been impaired while utilization stayed below that level. Thresholds are defined based on observations from the past and might need adjustment in the future.
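The idea of a threshold separating healthy from impacted can be sketched as a small helper. The function name, the 80% CPU example, and the storage numbers are illustrative only; note that for some metrics (CPUUtilization, JVMMemoryPressure) lower is better, while for others (FreeStorageSpace, CPUCreditBalance) higher is better.

```python
# Sketch: classify a metric observation against a threshold. Whether a
# high or low value is healthy depends on the metric, so the direction
# is an explicit parameter. All numbers below are illustrative.

def is_healthy(value, threshold, higher_is_better=False):
    """Return True if the observed value is on the healthy side of the threshold."""
    if higher_is_better:       # e.g., FreeStorageSpace, CPUCreditBalance
        return value > threshold
    return value < threshold   # e.g., CPUUtilization, JVMMemoryPressure

print(is_healthy(65, 80))                               # prints True
print(is_healthy(500, 10_000, higher_is_better=True))   # prints False
```

A CloudWatch alarm encodes exactly this comparison through its `ComparisonOperator` and `Threshold` settings, evaluated repeatedly over time.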