I have been using Elasticsearch for over a year now. Quite often I hear about people running into bottlenecks with their ES clusters: slow response times, nodes going down, the heap running out of memory, and so on.

Configuring the right settings for the indices and the server is a never-ending process of experimenting and monitoring. Here I want to present key methods to debug and detect bottlenecks in any Elasticsearch cluster. This cheat sheet answers most of the questions I get asked and gives a gist of everything you should know while designing an Elasticsearch cluster. But before you read any further, let me stop you right there.

There is no universal way to design a perfect ES cluster; anyone who claims otherwise is making a fool out of you. Period.

Before we get into the crux of cluster design, one thing you need to know is that you can’t design a cluster without knowing its use-case. There are different ways of configuring a cluster to cater to different use-cases: does your site need full-text search only? Is it heavily dependent on aggregations? Do you need just autocomplete? One more thing to keep in mind: you will never get it right the first time.

Memory

Memory allocation or management issues are by far the worst bottlenecks a DevOps engineer may face. Many times the JVM runs out of memory, and you may see too many young- and old-generation garbage collections (GCs) occurring.

Rule of thumb 1:

Allocate no more than half of the RAM as heap to Elasticsearch. The rest of the RAM is used for system-level processes and the file-system cache. Don’t exceed 32GB of heap size; beyond roughly 32GB the JVM can no longer use compressed object pointers, and you effectively waste memory.
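
For example, on a machine with 64GB of RAM (a hypothetical setup — adjust to your own hardware), you would give Elasticsearch a 31GB heap by editing the jvm.options file. Keep the minimum and maximum equal so the heap never has to be resized at runtime:

-Xms31g

-Xmx31g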

Rule of thumb 2:

The more heap you allocate to Elasticsearch, the more time the JVM spends garbage collecting.

It’s important to strike a balance between the JVM heap required by Elasticsearch and the RAM required for the various caches and other system-level processes. Allocating too much heap can result in long “stop-the-world” GC pauses, during which the node appears unresponsive to the rest of the cluster and may even be dropped from it.
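
To keep an eye on heap usage and GC activity, you can query the node stats API (assuming a node listening on localhost:9200):

curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

Watch heap_used_percent and the gc collection counts and times; steadily growing old-generation collection times are an early warning sign.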

Lucene, the core search engine behind every Elasticsearch shard, is designed to leverage the underlying OS for caching in-memory data structures. Lucene’s segment files are immutable, which makes them very cache-friendly, and the underlying OS will happily keep hot “segments” resident in memory for faster access. These segments include both the inverted index (for full-text search) and doc values (for aggregations).

Lucene’s performance relies heavily on this interaction with the OS. If you allocate all the memory to Elasticsearch’s heap, there won’t be any left for the OS, which in turn hurts Lucene’s performance. Ideally, leave 50% of the RAM to the OS; Lucene will gobble up whatever is left on offer.

Swap

Swapping is a performance killer for the JVM heap. You should disable swapping on the machine entirely (sometimes risky if other processes share it) or at least prevent the Elasticsearch process from being swapped out.
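
To turn off swap at the OS level, something like this works on most Linux distributions (swapoff lasts only until reboot, and lowering vm.swappiness merely discourages swapping rather than disabling it):

sudo swapoff -a

sudo sysctl vm.swappiness=1

To stop just the Elasticsearch process from being swapped out, set this property: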

bootstrap.memory_lock: true

in your elasticsearch.yml. This makes the JVM lock its allocated memory and prevents it from being swapped out by the OS. Sometimes the above setting alone won’t be enough: if memory locking fails, Elasticsearch will log a meaningful error on startup. In that case, lift the memlock limits by adding the following lines:

elasticsearch soft memlock unlimited

elasticsearch hard memlock unlimited

into your ‘/etc/security/limits.conf’ file (on CentOS). While you are in there, also raise the open-file-descriptor limit, since Elasticsearch keeps a large number of segment files and network sockets open:

elasticsearch hard nofile 65536

elasticsearch soft nofile 65536
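
After restarting Elasticsearch, you can verify that memory locking actually took effect (again assuming the node listens on localhost:9200):

curl -s 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'

Every node should report “mlockall”: true; if any reports false, revisit the limits above.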

I will cover CPU, caches, thread pool sizes, network considerations, and more in upcoming posts.