If you are here, I do not need to tell you that Elasticsearch is awesome, fast and mostly just works.

If you are here, I also do not need to tell you that Elasticsearch can be opaque, confusing, and seems to break randomly for no reason. In this post I want to share my experiences and tips on how to set up Elasticsearch correctly and avoid common pitfalls.

I am not here to make money so I will mostly just jam everything into one post instead of doing a series. Feel free to skip sections.

The basics: Clusters, Nodes, Indices and Shards

If you are really new to Elasticsearch (ES) I want to explain some basic concepts first. This section will not explain best practices at all, and focuses mainly on explaining the nomenclature. Most people can probably skip this.

Elasticsearch is a management framework for running distributed installations of Apache Lucene, a Java-based search engine. Lucene is what actually holds the data and does all the indexing and searching. ES sits on top of this and allows you to run potentially many thousands of lucene instances in parallel.

The highest level unit of ES is the cluster. A cluster is a collection of ES nodes and indices.

Nodes are instances of ES. These can be individual servers or just ES processes running on a server. Servers and nodes are not the same. A VM or physical server can hold many ES processes, each of which will be a node. Nodes can join exactly one cluster. There are different Types of node. The two most interesting of which are the Data Node and the Master-eligible Node. A single node can be of multiple types at the same time. Data nodes run all data operations. That is storing, indexing and searching of data. Master -eligible nodes vote for a master that runs the cluster and index management.

Indices are the high-level abstraction of your data. Indices do not hold data themselves. They are just another abstraction for the thing that actually holds data. Any action you do on data such as INSERTS, DELETES, indexing and searching run against an Index. Indices can belong to exactly one cluster and are comprised of Shards.

Shards are instances of Apache Lucene. A shard can hold many Documents. Shards are what does the actual data storage, indexing and searching. A shard belongs to exactly one node and index. There are two types of shards: primary and replica. These are mostly the exact same. They hold the same data, and searches run against all shards in parallel. Of all the shards that hold the same data, one is the primary shard. This is the only shard that can accept indexing requests. Should the node that the primary shard resides on die, a replica shard will take over and become the primary. Then, ES will create a new replica shard and copy the data over.

At the end of the day, we end up with something like this:

A more in-depth look at Elasticsearch

If you want to run a system, it is my belief that you need to understand the system. In this section I will explain the parts of Elasticsearch I belief you should understand if you want to manage it in production. This will not have any recommendations in it those come later. Instead it aims purely at explaining necessary background.

Quorum

It is very important to understand that Elasticsearch is a (flawed) democracy. Nodes vote on who should lead them, the master. The master runs a lot of cluster-management processes and has the last say in many matters. ES is a flawed democracy because only a subclass of citizens, the master-eligible nodes, are allowed to vote. Master-eligible are all nodes that have this in their configuration:

node.master: true

On cluster start or when the master leaves the cluster, all master-eligible nodes start an election for the new master. For this to work, you need to have 2n+1 master-eligible nodes. Otherwise it is possible to have a split-brain scenario, with two nodes receiving 50% of the votes. This is a split brain scenario and will lead to the loss of all data in one of the two partitions. So don’t have this happen. You need 2n+1 master-eligible nodes.

How nodes join the cluster

When an ES process starts, it is alone in the big, wide world. How does it know what cluster it belongs to? There are different ways this can be done. However, these days the way it should be done is using what is called Seed Hosts.

Basically, Elasticsearch nodes talk with each other constantly about all the other nodes they have seen. Because of this, a node only needs to know a couple other nodes initially to learn about the whole cluster.

//EDIT, 25th Feb 2020:

Dave Turner mentioned in the comments:

This isn’t really a constant process: nodes only share information about other nodes they have discovered when they’re not part of a cluster. Once they’ve joined a cluster they stop this and rely on the cluster’s elected master to share any changes as they occur, which saves a bunch of unnecessary network chatter. Also in 7.x they only really talk about the master-eligible nodes they have seen — the discovery process ignores all the master-ineligible nodes. davecturner

Lets look at this example of a three node cluster:

Initial state.

In the beginning, Node A and C just know B. B is the seed host. Seed hosts are either given to ES in the form of a config file or they are put directly into elasticsearch.yml.



Node A connects and exchanges information with B

As soon as node A connects to B, B now knows of the existence of A. For A, nothing changes.

Node C connects and shares information with B

Now, C connects. As soon as this happens, B tells C about the existence of A. C and B now know all nodes in the cluster. As soon as A connects to B again, it will also learn of the existence of C.

Segments and segment merging

Above I said that shards store data. This is only partially true. At the end of the day, your data is stored on a file system in the form of.. files. In Lucene, and with that also Elasticsearch, these files are called Segments. A shard will have between one and multiple thousand segments.

Again, a segment is an actual, real file you can look at in the data directory of your Elasticsearch installation. This means that using a segment is overhead. If you want to look into one, you have to find and open it. That means if you have to open many files, there will be a lot of overhead. The problem is that segments in Lucene are immutable. That is fancy language for saying they are only written once and cannot be changed. This in turn means that every document you put into ES will create a segment with only that single document in it. So clearly, a cluster that has a billion documents has a billion segments which means there are a literal billion files on the file system, right? Well, no.

In the background, Lucene does constant segment merging. It cannot change segments, but it can create new ones with the data of two smaller segments.

This way, lucene constantly tries to keep the number of segments, which means the number of files, which means the overhead, small. It is possible to force this process by using a force merge.

Message routing

In Elasticsearch, you can run any command against any node in a cluster and the result will be the same. That is interesting because at the end of the day a document will live in only one primary shard and its replicas, and ES does not know where. There is no mapping saying a specific document lives in a specific shard.

If you are searching, then the ES node that gets the request will broadcast it to all shards in the index. This means primary and replica. These shards then look into all their segments for that document.

If your are inserting, then the ES node will randomly select a primary shard and put the document in there. It is then written to that primary shard and all of its replicas.

//EDIT, 25th Feb 2020:

Dave Turner mentioned in the comments:

The word “shard” is ambiguous here and although I think you understand this correctly it’d be easy for a reader to misinterpret how this is written. Each search only fans out to one copy of each shard in the index (either a primary or a replica, but not both). If that fails then the search will of course try again on another copy, but that’s normally very rare davecturner

So how do I run Elasticsearch in production?

Finally, the practical part. I should mention that I managed ES mostly for logging. I will try to keep this bias out of this section, but will ultimately fail.

Sizing

The first question you need to ask and subsequently answer yourself, is about sizing. What size of ES cluster do you actually need?

RAM

I am talking about RAM first, because your RAM will limit all other resources.

Heap

ES is written in Java. Java uses a heap. You can think of this as java-reserved memory. There is all kind of stuff that is important about heap which would triple this document in size so I will get down to the most important part which is heap size.

Use as much as possible, but no more than 30G of heap size.

Here is a dirty secret many people don’t know about heap: every object in the heap needs a unique address, an object pointer. This address is of fixed length, which means that the amount of objects you can address is limited. The short version of why this matters is that at a certain point, Java will start using compressed object pointers instead of uncompressed ones. That means that every memory access will have additional steps involved and be much slower. You 100% do not want to get over this threshold, which is somewhere around 32G.

I once spend an entire week locked into a dark room doing nothing else but using esrally to benchmark different file systems, heap sizes, FS and BIOS settting combinations of Elasticsearch. Long story short here is what it had to say about heap size:

Index append latency, lower is better

The naming convention is fs_heapsize_biosflags. As you can see, starting at 32G of heap size performance suddenly starts getting worse. Same with throughput:

Index append median throughput. Higher is better.

Long story short: use 29G of RAM or 30 if you are feeling lucky, use XFS, and use hardwareprefetch and llc-prefetch if possible.

FS cache

Most people run Elasticsearch on Linux, and Linux uses RAM as file system cache. A common recommendation is to use 64G for your ES servers, with the idea that it will be half cache, half heap. I have not tested FS cache. However, it is not hard to see that large ES clusters, like for logging, can benefit greatly from having a big FS cache. If all your indices fit in heap, not so much.

//EDIT, 25th Feb 2020:

Dave Turner mentioned in the comments:

Elasticsearch 7.x uses a reasonable amount of direct memory on top of its heap and there are other overheads too which is why the recommendation is a heap size no more than 50% of your physical RAM. This is an upper bound rather than a target: a 32GB heap on a 64GB host may not leave very much space for the filesystem cache. Filesystem cache is pretty key to Elasticsearch/Lucene performance, and smaller heaps can sometimes yield better performance (they leave more space for the filesystem cache and can be cheaper to GC too). davecturner

CPU

This depends on what you are doing with your cluster. If you do a lot of indexing, you need more and faster CPUs than if you just do logging. For logging, I found 8 cores to be more than sufficient, but you will find people out there using way more since their use case can benefit from it.

Disk

Not as straightforward as you might think. First of all, if your indices fit into RAM, your disk only matters when the node is cold. Secondly, the amount of data you can actually store depends on your index layout. Every shard is a Lucene instance and they all have memory requirement. That means there is a maximum number of shards you can fit into your heap. I will talk more about this in the index layout section.

Generally, you can put all your data disks into a RAID 0. You should replicate on Elasticsearch level, so losing a node should not matter. Do not use LVM with multiple disks as that will write only to one disk at a time, not giving you the benefit of multiple disks at all.

Regarding file system and RAID settings, I have found the following things:

Scheduler: cfq and deadline outperform noop. Kyber might be good if you have nvme but I have not tested it

QueueDepth: as high as possible

Readahead: yes, please

Raid chunk size: no impact

FS block size: no impact

FS type: XFS > ext4

Index layout

This highly depends on your use case. I can only talk from a logging background, specifically using Graylog.

Shards

Short version:

for write heavy workloads, primary shards = number of nodes

for read heavy workloads, primary shards * replication = number of nodes

more replicas = higher search performance

Here is the thing. If you write stuff, the maximum write performance you can get is given by this equation:

node_throughput*number_of_primary_shards

The reason is very simple: if you have only one primary shard, then you can write data only as quickly as one node can write it, because a shard only ever lives on one node. If you really wanted to optimize write performance, you should make sure that every node only has exactly one shard on it, primary or replica, since replicas obviously get the same writes as the primary, and writes are largely dependent on disk IO. Note: if you have a lot of indexing this might not be true and the bottleneck could be something else.

If you want to optimize search performance, search performance is given by this equation:

node_throughput*(number_of_primary_shards + number_of_replicas)

For searching, primary and replica shards are basically identical. So if you want to increase search performance, you can just increase the number of replicas, which can be done on the fly.

Size

Much has been written about index size. Here is what I found:

30G of heap = 140 shards maximum per node

Using more than 140 shards, I had Elasticsearch processes crash with out-of-memory errors. This is because every shard is a Lucene instance, and every instance requires a certain amount of memory. That means there is a limit for how many shards you can have per node.

If you have the amount of nodes, shards and index size, here is how many indices you can fit:

number_of_indices = (140 * number_of_nodes) / (number_of_primary_shards * replication_factor)

From that and your disk size you can easily calculate how big the indices have to be

index_size = (number_of_nodes * disk_size) / number_of_indices

However, keep in mind that bigger indices are also slower. For logging it is fine to a degree but for really search heavy applications, you should size more towards the amount of RAM you have.

Segment merging

Remember that every segment is an actual file on the file system. More segments = more overhead in reading. Basically for every search query, it goes to all the shards in the index, and from there to all the segments in the shards. Having many segments drastically increases read-IOPS of your cluster up to the point of it becoming unusable. Because of this it’s a good idea to keep the number of segments as low as possible.

There is a force_merge API that allows you to merge segments down to a certain number, like 1. If you do index rotation, for example because you use Elasticsearch for logging, it is a good idea to do regular force merges when the cluster is not in use. Force merging takes a lot of resources, and will slow your cluster down significantly. Because of this it is a good idea to not let for example Graylog do it for you, but do it yourself when the cluster is used less. You definitely want to do this if you have many indices though. Otherwise, your cluster will slowly crawl to a halt.

Cluster layout

For everything but the smallest setups it is a good idea to use dedicated master-eligible nodes. The main reasons is that you should always have 2n+1 master-eligible nodes to ensure quorum. But for data nodes you just want to be able to add a new one at any time, without having to worry about this requirement. Also, you don’t want high load on the data nodes to impact your master nodes.

Finally, master nodes are ideal candidates for seed nodes. Remember that seed nodes are the easiest way you can do node discovery in Elasticsearch. Since your master nodes will seldomly change, they are the best choice for this, as they most likely already know all other nodes in the cluster.

//EDIT, 25th Feb 2020:

Dave Turner mentioned in the comments:

Master-eligible nodes are the only possible candidates for seed nodes in 7.x, because master-ineligible nodes are ignored during discovery. davecturner

Master nodes can be pretty small, one core and maybe 4G of RAM is enough for most clusters. As always, keep an eye on actual usage and adjust accordingly.

Monitoring

I love monitoring, and I love monitoring Elasticsearch. ES gives you an absolute ton of metrics and it gives you all of them in the form of JSON, which makes it very easy to pass into monitoring tools. Here are some helpful things to monitor:

number of segments

heap usage

heap GC time

avg. search, index, merge time

IOPS

disk utilization

Conclusion

After around 5 hours of writing this, I think I dumped everything important about ES that is in my brain into this post. I hope it saves you many of the headaches I had to endure.

Resources

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html

https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-discovery-quorums.html

https://github.com/elastic/rally

https://tech.ebayinc.com/engineering/elasticsearch-performance-tuning-practice-at-ebay/