Since its early versions, Elasticsearch (ES) has risen in popularity not only as a search engine but as a document store and analytics platform as well.

Easy to implement, performant and developer friendly, ES is naturally one of the top choices for indexing and querying large structured and unstructured data sets.

However, popularity comes with expectations regarding responsiveness, reliability and scale. ES is built to scale dynamically, which means that adding resources to an existing cluster is as easy as possible. It is very convenient to let ES clusters evolve “naturally” during the life of an application, and therefore quite common to see clusters grow to hundreds or thousands of nodes.

When running such a cluster, which will certainly be part of an application, the goal is to achieve the best balance of responsiveness, availability and cost efficiency for your requirements.

User experience will degrade if the indexed data sets keep growing without any addition of processing power. The same applies when the cluster is massively over-allocating shards per node, or does not have sufficient replica data to compensate for outages of some of the nodes. But the decision to scale horizontally might not have the desired effect when the problem lies in the application design and how it queries, indexes, and updates the cluster.

It doesn’t make sense to scale a cluster too early, adding unnecessary power without thinking about distribution of data sets. Also it’s unsuitable to scale the cluster too late, resulting in loss of user acceptance.

The goal is to scale out at the right time while adding the right amount of capacity to the cluster. Therefore one would need to get real time insight into the different types of components that affect the ES cluster and the workload patterns from the various services making requests to the cluster.

The Architecture of an Elasticsearch Cluster

To get a good understanding of when you should consider changing an ES cluster and adjust it to the new workload, it’s important to know what exactly makes up a cluster.

Shards: the Units of Scale

When thinking about the layout and systematics of storing and distributing data, one should start with the smallest unit of scale: the shard.



The above figure shows an ES index with three shards, distributed across a cluster with three nodes. A shard is a Lucene index. Lucene is a well-known Apache open source project which is used as a high-performance search engine in many products. The ES index is the sum of all its shards. When an application queries an ES index, Elasticsearch routes the request to each of those shards, so one doesn’t need to worry about which shard the data is stored in.

However, what one should be concerned about is how many shards to allocate for an index. It is important to know that each shard is processed by an individual thread taken from a node-local thread pool. Having shards distributed across nodes is fine in terms of parallelism, but having too many shards on a single node can have a negative effect on the cluster’s performance, because under higher load the thread pool can become a serious bottleneck.

Furthermore, when starting a new index it is important to know that its number of primary shards cannot be changed without reindexing all the documents into a new index. So a common scenario is to start out with a slight over-allocation of shards, so that adding new nodes to a cluster has a direct impact on performance and data distribution. Elasticsearch will automatically try to find the optimal distribution of shards, but the maximum number of nodes you can scale an index to is the same as its number of shards.

The image above displays an index consisting of primary shards only. Elasticsearch differentiates between primary and replica shards. A replica shard is basically a copy of any given primary shard. Documents are first indexed to one of the primary shards and then forwarded to the corresponding replica. The main reason for having replica shards is resilience. When a node holding a primary shard goes offline, one of its replicas will be promoted to primary.



The above image displays the concept of the distribution of replica shards. When one of the three nodes fails, the data it holds is still available on one of the other nodes. This way, queries can still be served as if there was no problem. However, resilience by replica has its limits. When more copies of a shard become unavailable than there are replicas to recover from, ES cannot restore the data and the index will stop responding because at least one primary shard is missing.

The second reason for replica shards is to serve read requests. When using a search-heavy index, replica shards can boost performance, but this is only true when the replica is on extra hardware.

From a monitoring perspective, the availability and usage of shards are important elements to observe. Unassigned primary shards in your cluster put your data at risk, as ES routes documents to shards by a hash. When the shard which a document should be routed to is not available, the document won’t be indexed.
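The hash-based routing described above can be illustrated with a small sketch. It is not Elasticsearch's implementation: ES uses its own hash function internally, and `zlib.crc32` stands in here only because it is deterministic and part of the standard library. The function name and the document IDs are made up for illustration.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard from a hash of the document's routing key."""
    # Stand-in hash (assumption): crc32 replaces the hash ES really uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Route the same documents once with three and once with four primary shards.
with_three = {d: route_to_shard(d, 3) for d in doc_ids}
with_four = {d: route_to_shard(d, 4) for d in doc_ids}

moved = sum(1 for d in doc_ids if with_three[d] != with_four[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard number depends on the number of primary shards, changing that number sends many documents to a different shard. This is one way to see why the shard count is fixed once an index has been created, and why a document cannot be indexed while its target shard is unavailable.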

What should also be considered is the balance between the number of shards assigned to a node and the load the node has to handle (in terms of actions to be executed and threads being used). If a node becomes unresponsive, the reason could simply be that it is overloaded and can’t handle the number of requests per shard.
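As a rough sketch of such a check, the following function counts assigned shards per node and lists unassigned primaries. It assumes the input is a list of rows shaped like the JSON output of the `_cat/shards` API (fields `index`, `shard`, `prirep`, `state`, `node`). The sample rows and the function name are invented for illustration.

```python
from collections import Counter


def summarize_shards(shards):
    """Return (shards per node, list of unassigned primary shards)."""
    per_node = Counter()
    unassigned_primaries = []
    for row in shards:
        if row["state"] == "UNASSIGNED":
            # Only unassigned primaries put data at risk; replicas are skipped.
            if row["prirep"] == "p":
                unassigned_primaries.append((row["index"], row["shard"]))
            continue
        per_node[row["node"]] += 1
    return per_node, unassigned_primaries


# Hypothetical sample rows, not real cluster output.
sample = [
    {"index": "products", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
    {"index": "products", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-2"},
    {"index": "products", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-1"},
    {"index": "products", "shard": "2", "prirep": "p", "state": "UNASSIGNED", "node": None},
]

per_node, missing = summarize_shards(sample)
print(dict(per_node), missing)
```

A strongly uneven count per node, or any entry in the second list, would be a reason to look closer at the affected node or index.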

Indices: the Units of Work

All operations in Elasticsearch are issued against indices which are distributed by means of sharding in a given cluster. When working with indices, there are some things to keep in mind.

Elasticsearch offers the capability to create an index by just indexing a document. This is very convenient. On the other hand, it applies the same basic behaviour to every use case and workload, which is not suitable in every situation. When creating an index based on a document, ES generates the mapping automatically. Every field of the document will be indexed with the default behaviour, but sometimes the default isn’t the desired way of mapping a field. In that case the mapping should be defined explicitly, either by putting the mapping before indexing documents or by defining templates. Mapping lets you control which fields get indexed and, even more important, which fields will not get indexed.

Indexing fields which will never be part of searches doesn’t make sense and is just a waste of space and processing time. When the structure of documents varies but one still wants to store documents with the same type in the same index, the auto-generated mapping will explode. It would be better to disable such fields in the index’s mapping beforehand.
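To make this more concrete, here is a sketch of an explicit mapping, written as a Python dictionary that could be sent as the body of an index-creation request. The field names are hypothetical, and the exact mapping syntax depends on the ES version in use (older versions expect a document type name inside `mappings`), so treat it as an illustration rather than a drop-in definition.

```python
import json

# Hypothetical index body: all field names are made up for illustration.
index_body = {
    "mappings": {
        # Do not add fields to the mapping just because a document contains them.
        "dynamic": False,
        "properties": {
            "title": {"type": "text"},
            "created_at": {"type": "date"},
            # Stored with the document, but not searchable.
            "internal_ref": {"type": "keyword", "index": False},
            # Arbitrary nested content that is kept but never parsed or indexed.
            "raw_payload": {"type": "object", "enabled": False},
        },
    }
}

print(json.dumps(index_body, indent=2))
```

With a mapping along these lines, documents whose structure varies no longer cause the mapping to grow, because unknown fields and the disabled object are simply not indexed.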

Another very important thing to consider is the layout of the index. Often one needs to add fresh data and delete obsolete data. But it should be clear that deleting single documents is a very expensive operation. This is true for updates as well, as an update is nothing more than deleting and reindexing a given document. When working with such time-based data, another option is to create a new index for each time window (e.g. per day) and to delete obsolete indices.

Deleting single documents will mark them in the corresponding Lucene segments and they will get purged when the segments get merged or optimized. As long as this doesn’t happen, the data set keeps growing even when deleting documents. Search performance will decrease and more I/O will happen at a later point in time.

Deleting a whole index is cheap. It’s deleting the Lucene files associated with the shards of the index. When you leverage aliases for indexes, you can still query a bunch of indices under the same name.
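The time-window approach can be sketched in a few lines. The naming scheme (`prefix-YYYY.MM.DD`), the retention period and both function names are assumptions chosen for this example; the code only computes names and does not talk to a cluster.

```python
from datetime import date, timedelta


def index_name(prefix: str, day: date) -> str:
    """Name of the index that holds the documents for one day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_to_delete(prefix, existing, today, keep_days):
    """Return the daily indices that fall outside the retention window."""
    keep = {index_name(prefix, today - timedelta(days=n)) for n in range(keep_days)}
    return sorted(
        name for name in existing
        if name.startswith(prefix + "-") and name not in keep
    )


existing = ["logs-2024.03.01", "logs-2024.03.02", "logs-2024.03.03"]
print(index_name("logs", date(2024, 3, 3)))
print(indices_to_delete("logs", existing, today=date(2024, 3, 3), keep_days=2))
```

Each name returned by `indices_to_delete` can then be dropped as a whole index, which avoids the per-document deletes described above. An alias covering the remaining indices keeps queries unchanged.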

Elasticsearch Operations and Thread Pools

ES uses a bunch of configurable thread pools for different operations to be applied to indices. The types of threads for indexing, bulking, merging, refreshing and flushing are tightly coupled to how ES works and how it processes the data. ES defines very sensible values for the thread pools, meaning that each pool gets assigned the number of available cores on a given node. The search threads get a little more – actually it’s (Cores*3)/2+1 – but they still stay in a strong relation to the number of cores.
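The sizing rule quoted above is easy to work through for a few machine sizes. The sketch assumes that the division in (Cores*3)/2+1 is an integer division; the function names are made up for illustration.

```python
def default_pool_size(cores: int) -> int:
    """Most pools get one thread per available core."""
    return cores


def search_pool_size(cores: int) -> int:
    """Search pool size from the formula (cores * 3) / 2 + 1."""
    # Integer division is an assumption about how the formula is rounded.
    return (cores * 3) // 2 + 1


for cores in (4, 8, 16):
    print(cores, default_pool_size(cores), search_pool_size(cores))
```

For an eight-core node this gives 8 threads for most pools and 13 for the search pool, so both values stay closely tied to the number of cores.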

And there is a good reason for that. With eight cores, a machine cannot execute more than eight threads simultaneously so why try it? The result of setting higher numbers here is context switching. This does not allow for higher throughput – it will lead to performance loss. So the best thing to do with ES’ thread pools is to keep them as they are and watch for blocked and rejected threads. If there are rejected threads, increasing the queue sizes is a valid option. But if one is concerned about processing time and latency has to be kept at a low level, it’s better to add more nodes.



While increasing pool sizes is never a really good idea, watching the usage of thread pools can give valuable hints on what to change in a cluster.

Memory Usage

As ES runs on a JVM, a common mistake is to assume that it runs better when one assigns more heap to it. This is not the case. Although heap is vital, off-heap is important too. Over-allocating heap can have undesired side effects. Lucene – the basis for ES – runs off-heap and uses the underlying OS’ structures and mechanisms for caching hot data segments in memory. Giving all the available heap to ES will decrease the off-heap capacity for Lucene and thus break the smart caching mechanics and other optimizations which make Lucene so strong.

The basic guideline is to reserve approximately half of the RAM for the heap and leave the rest for Lucene. But it is not a “one size fits all” kind of rule. By watching the JVM’s memory usage patterns and those of the host, one can find better suiting configurations to make the cluster more efficient.
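As a small worked example of this guideline, the sketch below splits a host's RAM between the JVM heap and the memory left to the OS for Lucene. The `heap_fraction` parameter and the function name are assumptions for illustration; `-Xms` and `-Xmx` are the standard JVM options for the initial and maximum heap size.

```python
def split_memory(total_ram_gb: int, heap_fraction: float = 0.5):
    """Return (heap_gb, off_heap_gb) for a host with total_ram_gb of RAM."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_gb = int(total_ram_gb * heap_fraction)
    return heap_gb, total_ram_gb - heap_gb


def jvm_heap_options(heap_gb: int):
    """Set initial and maximum heap to the same value."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


heap_gb, off_heap_gb = split_memory(32)
print(heap_gb, off_heap_gb, jvm_heap_options(heap_gb))
```

Lowering `heap_fraction` models a node that does few aggregations and leaves more memory to Lucene.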

When doing heavy aggregations or using field data frequently, there is a need for heap space. If this is not the workload to be expected, a good choice could be to give less heap to the JVM and more to Lucene.

Even when heap is more important than off-heap space, there are limits for sane values. ES uses the CMS garbage collector by default, as recommended. However, CMS requires two stop-the-world phases and gets less efficient with larger heaps.

So the bottom line is that one can get better performance from ES and Lucene by watching and optimizing memory and GC settings.

Monitoring an Elasticsearch Cluster

So with this complexity in mind, how can a large cluster be monitored in a way that doesn’t flood you with false positives, a common problem when monitoring modern distributed systems? How can one keep a watchful eye on the real health of the cluster and react before any health degradation negatively impacts the whole system?

The first thing the operator of an ES cluster is definitely concerned about is its overall performance in terms of response times and throughput of queries.

In most cases search performance is of central interest, but there are of course times when write performance is the focus. When it comes to tweaking the write performance, the first step is to look at which threads are running in the cluster. Some threads might be queued up, or even rejected.



While a cluster queueing up threads for a short time is not a problem in itself, rejections point to a problem which deserves further inspection.
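One way to tell short-lived queueing from rejections is to compare two consecutive samples of the thread pool statistics. The sketch below assumes each sample has the shape of the node stats API response for thread pools (`nodes` → node id → `thread_pool` → pool name → counters such as `queue` and `rejected`). The sample values and the function name are invented.

```python
def new_rejections(previous, current):
    """List (node, pool, newly rejected, current queue) for pools that rejected work."""
    findings = []
    for node_id, node in current["nodes"].items():
        before = previous["nodes"].get(node_id, {}).get("thread_pool", {})
        for pool, stats in node["thread_pool"].items():
            # The rejected counter is cumulative, so only the increase matters.
            delta = stats["rejected"] - before.get(pool, {}).get("rejected", 0)
            if delta > 0:
                findings.append((node.get("name", node_id), pool, delta, stats["queue"]))
    return findings


# Hypothetical samples taken a few seconds apart.
earlier = {"nodes": {"n1": {"name": "node-1", "thread_pool": {
    "search": {"queue": 0, "rejected": 5},
    "write": {"queue": 3, "rejected": 0},
}}}}
later = {"nodes": {"n1": {"name": "node-1", "thread_pool": {
    "search": {"queue": 40, "rejected": 12},
    "write": {"queue": 1, "rejected": 0},
}}}}

print(new_rejections(earlier, later))
```

In this made-up sample the write pool only queued briefly, while the search pool on node-1 rejected seven more requests between the two samples and would deserve a closer look.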

Threads and latency are not only interesting for the cluster as a whole, but also for each individual node. This is because there might be a problem with shard-routing and therefore issues may only be visible on specific nodes.

More metrics to thoroughly look at are on the OS and JVM level. How does the heap consumption evolve over time for the JVM? How often is the GC running, and how long does it take (keeping in mind that CMS stops the world!)? Is it freeing up memory?
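These questions can be answered from two consecutive samples of a node's JVM statistics. The sketch assumes each sample is the `jvm` section of the node stats API response, with `mem.heap_used_percent` and cumulative `collection_count` and `collection_time_in_millis` counters per collector. The numbers and the function name are made up.

```python
def gc_summary(prev_jvm, curr_jvm, collector="old"):
    """Summarize GC activity and heap movement between two JVM samples."""
    before = prev_jvm["gc"]["collectors"][collector]
    after = curr_jvm["gc"]["collectors"][collector]
    runs = after["collection_count"] - before["collection_count"]
    millis = after["collection_time_in_millis"] - before["collection_time_in_millis"]
    return {
        "runs": runs,
        "avg_pause_ms": millis / runs if runs else 0.0,
        # A heap that does not shrink despite collections is a warning sign.
        "heap_used_percent_change": (
            curr_jvm["mem"]["heap_used_percent"] - prev_jvm["mem"]["heap_used_percent"]
        ),
    }


# Hypothetical samples for one node.
prev_sample = {"mem": {"heap_used_percent": 70},
               "gc": {"collectors": {"old": {"collection_count": 10,
                                             "collection_time_in_millis": 500}}}}
curr_sample = {"mem": {"heap_used_percent": 72},
               "gc": {"collectors": {"old": {"collection_count": 12,
                                             "collection_time_in_millis": 900}}}}

print(gc_summary(prev_sample, curr_sample))
```

Here two old-generation collections took 200 ms each on average and heap usage still went up, which suggests that the collections are not freeing much memory.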



What does the swapping activity of the host look like? In other words, is the OS swapping out memory?

What about the network? Are there peaks of throughput, maybe pointing to replication issues or relocating shards due to outages? Are there any retransmissions or other problems which might indicate a problem with the interconnects?

As you can see from the images above, the metrics we’ve been talking about so far are readily available on Instana’s dashboards for both nodes and the cluster. But monitoring a big cluster by looking at even well-organized and aggregated dashboards is time consuming, and there are certain things that will not catch the eye immediately. Deriving issues and action points from a flood of metrics requires a great amount of expertise and is still error prone.

Adding Intelligence to the Game

At Instana we believe that you should be able to spend your time on being productive, not constantly looking at metrics and graphs, and sorting out alerts and letting the day go by playing “catch up”. Solving the wrong problem – one that does not really impact your services or one that already did the damage – is neither fun nor efficient.

The basic need when monitoring ES is to get a quick overview of what is happening in the cluster. A first approach is Instana’s physical view, where the user can see and interact with every component. These components are being monitored all the way from the host, to a runtime environment, to a specific piece of software or technology.

Behind the scenes, Instana is calculating health scores for each of these components by inspecting, watching, and correlating the metrics talked about above.

When Instana sees a problematic situation with an ES node or cluster, this will be immediately visible on the map.

Let’s have a look at some rules Instana currently applies to the collected one second metrics for Elasticsearch:

ES Cluster Split-Brain

This rule will hit when there is more than one node claiming to be the cluster’s master, possibly caused by network glitches/outages or severe misconfiguration. A split-brain can lead to inconsistent indices and even data loss.
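The idea behind such a check can be sketched as follows. This is not Instana's implementation, only an illustration of the condition described above: each node is asked which node it currently considers to be the master, and the answers are compared. The input format and the function name are assumptions.

```python
def detect_split_brain(master_by_node):
    """Return (is_split, masters) given each node's view of the elected master."""
    # Nodes that currently report no master at all are ignored here.
    masters = {m for m in master_by_node.values() if m is not None}
    return len(masters) > 1, masters


# Hypothetical views, e.g. collected by querying every node individually.
healthy = {"node-1": "node-1", "node-2": "node-1", "node-3": "node-1"}
split = {"node-1": "node-1", "node-2": "node-1", "node-3": "node-3"}

print(detect_split_brain(healthy))
print(detect_split_brain(split))
```

It is important to ask every node separately, because during a split each partition answers consistently on its own and the disagreement only shows up when the views are put side by side.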

Cluster Rebalancing

An ES Cluster can dynamically rebalance the data (i.e. relocate some of the shards) at any moment. Normally this happens after a node leaves or rejoins the cluster after being unreachable for whatever reason, or when adding nodes. However, depending on the workload the relocation could overload any of the nodes, resulting in poor performance. This rule takes several metrics into account, not just the fact that shards are relocating. Stan also looks at the memory consumption, processor load and network traffic to determine whether there is a dangerous situation or not.

High Heap Usage

ES itself uses the heap for its work. But ES’ search is based on Lucene, which operates off-heap. So using too much heap for the ES-JVM reduces the capacity left for Lucene operations (which is the core of ES). To determine whether there might be a problem or not, Stan looks at the available memory, the workload of the node (for instance, is it more indexing or search heavy?) and response times. There are several levels at which this rule might issue an alert.

Cluster Health Status

Stan simply looks at the cluster’s health reported by the ES API itself. A cluster being in a yellow state (meaning that some of the replica shards are not assigned) might be normal in some scenarios, but should then only be visible for a couple of seconds. If this situation doesn’t resolve after a longer period, this might point to an infrastructural problem (e.g. overload or overcommitment/allocation).

A bigger problem is a red status, as this means some of the primary shards failed to be assigned or allocated and thus the consistency of the cluster is in danger.
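The distinction between a short-lived yellow state and a lasting one can be sketched like this. Again, this is an illustration and not Instana's rule: the input is assumed to be a time-ordered list of `(unix_seconds, status)` pairs, where the status is the `status` field reported by the cluster health API, and the 60 second grace period is an arbitrary example value.

```python
def assess_health(samples, yellow_grace_seconds=60):
    """Classify the latest cluster status as 'ok', 'warning', 'critical' or 'unknown'."""
    if not samples:
        return "unknown"
    now, status = samples[-1]
    if status == "red":
        return "critical"
    if status == "green":
        return "ok"
    if status != "yellow":
        return "unknown"
    # Walk backwards to find when the current run of yellow samples started.
    since = now
    for timestamp, past_status in reversed(samples):
        if past_status != "yellow":
            break
        since = timestamp
    return "warning" if now - since >= yellow_grace_seconds else "ok"


print(assess_health([(0, "green"), (10, "yellow"), (20, "yellow")]))
print(assess_health([(0, "yellow"), (60, "yellow"), (120, "yellow")]))
print(assess_health([(0, "green"), (5, "red")]))
```

A yellow state is tolerated until it has lasted longer than the grace period, while a red state is reported immediately.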

So as you can see from the above examples, Instana doesn’t just display metrics on a dashboard, it also takes the metrics, correlates them over time and gives sensible feedback and propositions to solve an issue arising in an ES cluster.

You can read more about notifications, events and incidents here.

With the hierarchical organisation of components which depend on one another, Instana can point the user to where the problem that degrades the ES cluster’s performance arises.

Understanding ES as a standalone component and enabling its management is crucial. However, ES is usually not deployed just to run by itself. Normally it is a piece of a broader business system, so it should be monitored and understood within its context.

From Physical to Semantical Monitoring

As stated above, ES is not an isolated piece, but typically acts as a service to your application and is thus part of the application’s health.

For example, imagine adding a new node to an existing cluster. The cluster’s state will turn yellow because the cluster will start to relocate shards to the new node.

You will see this with Instana as well. But does it affect the health of your application from a user perspective? In most cases the answer is no.

But if you started out with a single node, over-allocated shards and had no replicas, then the answer would be yes. But in that case, the cluster would turn red.

But this is just an assumption. How can Stan actually know for sure that your application’s performance isn’t impacted and that the application is still healthy?

Enter the Dynamic Graph and Distributed Tracing

Instana organizes components in a Dynamic Graph, which grows and evolves together with your systems. From your hardware, over OS and processes, to the JVM and finally to ES node and ES cluster, it knows about everything that is needed to let ES perform. And by organizing the entities in a graph, dependencies become clear and impacts of failures in single entities can be derived for the whole system.

Distributed traces of application requests will be sent to the back end and – surprise! – correlated with metrics collected from all monitored components through Instana’s graph model. What does that mean to you?



It means that you get information about every call your applications issue against ES. Therefore you now know how long operations such as indexing and searching take and what their impact on transactions is. Along with that, you get the actual query which has been issued, so that you can further investigate and work out improved ways of interaction and validate them once they are in place, again through tracing.

Even more, Instana lets you look at your applications from a logical perspective. In this perspective a host does not communicate with another one. Instead, the service ‘product search’ communicates with the ES index ‘products’. One can still drill down to the actual physical entities, but the logical dependencies are a lot clearer in this view. As Instana extracts all this information from the traces it gets from the agents, the logical view will immediately adapt and update when the application’s behaviour changes. And as Instana processes every single trace, it can calculate KPIs for these relations.

These KPIs let Instana know whether replica shards in an ES cluster that are unavailable on one of the hosts impact the performance of a system, and if so in what way and to which degree. Instana sees and understands what happens in an application and can immediately react to subtle changes.

But again, Instana will not only tell the user whether the cluster is in a healthy state or not. It will inform on health of the cluster and the health of the applications and services using it. And if the ES cluster is what is degrading some services’ performance, Instana will tell you which service is impacted and how. Furthermore, whenever you are notified about a problem with one of your applications or services, you will also be told which underlying components are likely to be the cause and why.

This brings you a lot closer to finding the actual reason for issues in your system, while spending less time on looking at metrics.