Many of us have worked with Elasticsearch, but are we aware of its internal workings?! How does Elasticsearch actually store data to give us real-time analytics? Which data structures and algorithms power it?! So, take a deep breath.. time to open up this black box.

What does this blog contain?

Since the content is large and complex, I have divided this topic into two blogs. This blog mainly covers the primary data structures used by Elasticsearch. The next blog will go through the major algorithms that make Elasticsearch what it is today.

Let’s get started…

Let’s start with the basics. As you might already know, the inverted index is the primary data structure of Elasticsearch. In layman’s terms, whenever a document is saved, a mapping is created from each term to the documents that contain it. This data structure is the core of the search engine and is what makes fast lookups possible. In case you aren’t familiar with it, you can Google it; there are thousands of articles on the topic. Now that the basics are done, let’s deep dive into the much more interesting parts 😀

A document is (mostly) a JSON structure containing many fields. Each JSON field gets its own inverted index. So the more fields you have, the more inverted indexes are present in your system. And yes, you guessed it right: that's why there is a default limit of 1000 mapped fields per index.
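That field cap is controlled by the real Elasticsearch index setting `index.mapping.total_fields.limit`. A minimal sketch of the settings body you would send to raise it (the index name and the value 2000 here are made-up examples):

```python
# Settings body for the (real) Elasticsearch setting that caps mapped
# fields per index; the default is 1000. Sent as the JSON body of
# e.g. PUT /my-index/_settings — index name and new limit are examples.
settings = {
    "index.mapping.total_fields.limit": 2000
}
print(settings["index.mapping.total_fields.limit"])  # → 2000
```

Raising the limit is usually a smell, though: every extra mapped field means another inverted index to maintain.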

Inverted Index is the primary data structure which gets the job done!

Some people have the misconception that the inverted index is just a mapping from words to document IDs. But it also contains much more information, like the number of times a term occurs in a document, the length of the document, etc., which ultimately helps define the relevancy of the documents and thus their score.
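The idea above can be sketched in a few lines. This is a toy model, not Lucene's actual on-disk format: a real posting list also stores positions, offsets and norms, but the shape (term → documents, with per-document term frequency) is the same.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: term -> {doc_id: term_frequency}.

    Real Lucene postings carry far more (positions, offsets, norms),
    but the core mapping looks like this.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            # count how many times the term occurs in this document
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}
index = build_inverted_index(docs)
print(index["quick"])  # → {1: 1, 3: 1}
print(index["dog"])    # → {2: 1, 3: 1}
```

Looking up "which documents contain this term" is now a single dictionary access, which is exactly why search over an inverted index is fast.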

Now the next concept is by far the most important one. The inverted index is immutable. Yes, you heard it right! But why is that so?! Because no updates means no cache invalidation. Also, no mutation means no locking to avoid race conditions, which means faster reads. But if the inverted index is immutable, how do we add more data 🤔? The answer is that a new inverted index is created every time a document is saved. But that would be very costly, right?!

Have you ever heard about Segments?!

So, Elasticsearch buffers the documents for some time, and then creates one inverted index for all those documents. This “inverted index” is called a Segment, and this “some time” is called the Refresh Interval. The refresh interval is 1 second by default. Since every document can take up to the refresh interval to become searchable, Elasticsearch is called near-real-time. The above optimisation might save us some cost, but again, will it scale?!
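The buffer-then-segment behaviour can be modelled in a tiny simulation. Everything here is a made-up sketch of the refresh cycle, not Elasticsearch code: documents sit in an in-memory buffer (not yet searchable), and each refresh freezes the buffer into a new immutable segment.

```python
import time

class ToyShard:
    """Toy model of the refresh cycle: indexed docs buffer in memory,
    and every `refresh_interval` seconds the buffer becomes one new
    immutable segment. Only segments are visible to search."""

    def __init__(self, refresh_interval=1.0):
        self.refresh_interval = refresh_interval
        self.buffer = []        # in-memory, not yet searchable
        self.segments = []      # each segment is a frozen batch of docs
        self.last_refresh = time.monotonic()

    def index(self, doc):
        self.buffer.append(doc)

    def maybe_refresh(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_refresh >= self.refresh_interval and self.buffer:
            # freeze the buffer into a new immutable segment
            self.segments.append(tuple(self.buffer))
            self.buffer = []
            self.last_refresh = now

    def search(self):
        # only refreshed segments are visible: hence "near real time"
        return [d for seg in self.segments for d in seg]

shard = ToyShard(refresh_interval=1.0)
shard.index("doc1")
print(shard.search())  # → [] (buffered, not searchable yet)
shard.maybe_refresh(now=shard.last_refresh + 2)  # simulate time passing
print(shard.search())  # → ['doc1']
```

The gap between `index()` and the document showing up in `search()` is exactly the "near" in near-real-time.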

Each inverted index disk sync takes considerable resources. Because of this bottleneck, Elasticsearch couldn't have scaled to support billions of documents. Umm.. so how did they fix it?! Yes, caching, the saviour. Instead of syncing to disk directly, the segment is written to the filesystem cache first, and the cache is synced to disk later in one go. This later sync is called a Flush. But what if my Elasticsearch node fails in between? Where will my cached data go?! Will I lose it forever?! Probably yes.. but Elasticsearch has a solution for that as well..!! Wait and read..

The solution is.. the Translog.. Whenever a segment is written to the file cache, all the operations are also recorded in a lightweight append-only file. So, in case the system stops in between, the data can be recreated from this translog file.
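This is the classic write-ahead-log pattern, and a minimal sketch of it (made-up class, not Elasticsearch's actual translog format) looks like this: record every operation in an append-only file before it is considered done, so a crash can be recovered by replaying the file.

```python
import json
import os
import tempfile

class ToyTranslog:
    """Toy write-ahead log: every operation is appended to a file
    before it counts as durable, so state lost from the in-memory
    cache can be rebuilt by replaying the log after a crash."""

    def __init__(self, path):
        self.path = path

    def record(self, op):
        # appending a small line is much cheaper than syncing a segment
        with open(self.path, "a") as f:
            f.write(json.dumps(op) + "\n")

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "translog.log")
log = ToyTranslog(path)
log.record({"op": "index", "id": 1, "doc": "hello"})
log.record({"op": "index", "id": 2, "doc": "world"})
# simulate a crash: the in-memory state is gone, the log file is not
recovered = ToyTranslog(path).replay()
print(len(recovered))  # → 2
```

Once the flush actually persists the segments to disk, the translog entries covering them can be safely discarded.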

That’s how Elasticsearch manages our inverted index for us.. but there is one more data structure without which Elasticsearch couldn’t have worked..

Think.. Think..!!

And the answer is… fielddata.. What if we want to know which terms are present in a given document?! Via the inverted index, that would be very time-consuming, right?! So a reverse mapping, from document to values, is also kept; the in-memory version is called fielddata. Again, will keeping it all in memory scale?! So Elasticsearch also keeps it on disk, as doc values. This data structure is the engine behind aggregations, sorting, etc.

Doc values are an on-disk data structure, built at document index time, which makes aggregation and sorting possible.
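Conceptually, doc values are a column of per-document values for one field, addressed by document ID, which is the reverse of the inverted index. A toy sketch (made-up helper, not the real columnar encoding, which also compresses the column):

```python
def build_doc_values(docs, field):
    """Toy doc values: one column of values for a single field,
    indexed by doc id — doc -> value, the reverse of term -> docs."""
    return [doc.get(field) for doc in docs]

docs = [
    {"user": "alice", "age": 30},
    {"user": "bob",   "age": 25},
    {"user": "carol", "age": 35},
]
ages = build_doc_values(docs, "age")   # the "age" column: [30, 25, 35]

# sorting reads the column directly instead of un-inverting an index
sorted_ids = sorted(range(len(docs)), key=lambda i: ages[i])
print(sorted_ids)             # → [1, 0, 2]

# an avg aggregation is just a scan over the same column
print(sum(ages) / len(ages))  # → 30.0
```

This is why sorting and aggregating on a field only works efficiently when that field has doc values (or fielddata) available: the query-time work is a scan over one column, not a walk over the whole inverted index.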

That’s it for the data structures blog. I will cover the algorithms part in the next blog. I have tried to keep it concise and have added only the hard-to-find knowledge that I have gathered from various sources. In case something has changed, or you feel otherwise, or you want to add more, feel free to contact me. I’m always open to learning new things.

If you have an inclination towards learning new things and want to keep up with the trends, subscribe to my newsletter, Trillion Dollar Tech. I promise you won't be disappointed.