This post is part of a series covering the underlying architecture and prototyping examples with a popular distributed search engine, Elasticsearch. In this post, we’ll be discussing the underlying storage model and how CRUD (create, read, update and delete) operations work in Elasticsearch.

Elasticsearch is a very popular distributed search engine used at many companies like GitHub, SalesforceIQ, and Netflix for full-text search and analytical applications. In the Insight Data Engineering Fellows Program, Fellows have used Elasticsearch for features like:

Full-text search

E.g., Finding the most relevant Wikipedia article for a search term.

Aggregations

E.g., Visualizing a histogram of bids for a search term on an ad network.

Geospatial API

E.g., A ridesharing platform to match the closest driver and rider.

Since Elasticsearch is so popular in the industry and among our Fellows, I decided to take a closer look. In this post, I want to share what I learned about its storage model and how CRUD operations work.

Now, when I think of how a distributed system works, I think of the picture below:

The part above the surface is the API and the part below is the actual engine where all the magic happens. In this post, we will be focusing on the part below the surface. Mainly, we will be looking at:

Is it a Master/Slave architecture or Master-less architecture?

What is the storage model?

How does a write work?

How does a read work?

How are the search results relevant?

Before we dive deep into these concepts, let’s get familiar with some terminology.

The confusion between Elasticsearch Index and Lucene Index + other common terms…

An Elasticsearch index is a logical namespace to organize your data (like a database). An Elasticsearch index has one or more shards (default is 5). A shard is a Lucene index which actually stores the data and is a search engine in itself. Each shard can have zero or more replicas (default is 1). An Elasticsearch index also has “types” (like tables in a database) which allow you to logically partition your data in an index. All documents in a given “type” in an Elasticsearch index have the same properties (like schema for a table).
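As a concrete example, the shard and replica counts are fixed at index-creation time via the index settings API (shown here with the default values; the index name my_index is just a placeholder):

```json
PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```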

Figure a shows an Elasticsearch cluster consisting of three primary shards with one replica each. All these shards together form an Elasticsearch index and each shard is a Lucene index itself. Figure b demonstrates the logical relationship between Elasticsearch index, shards, Lucene index and documents.

Analogy to relational database terms

Elasticsearch Index ~ Database

Types ~ Tables

Mapping ~ Schema

NOTE: The analogies above are approximations, not exact equivalences. I would recommend reading this blog to help decide when to choose an index or a type to store your data.

Now that we are familiar with the terms in the Elasticsearch world, let’s see the different roles nodes can have.

Types of nodes

An instance of Elasticsearch is a node and a group of nodes form a cluster. Nodes in an Elasticsearch cluster can be configured in three different ways:

Master Node

It controls the Elasticsearch cluster and is responsible for all cluster-wide operations like creating/deleting an index, keeping track of which nodes are part of the cluster, and assigning shards to nodes. The master node processes one cluster state at a time and broadcasts the state to all the other nodes, which respond with a confirmation to the master node.

A node can be configured to be eligible to become a master node by setting the node.master property to true (the default) in elasticsearch.yml. For large production clusters, it’s recommended to have a dedicated master node that just controls the cluster and does not serve any user requests.

Data Node

It holds the data and the inverted index. By default, every node is configured to be a data node and the property node.data is set to true in elasticsearch.yml. If you would like to have a dedicated master node, then change the node.data property to false.

Client Node

If you set both node.master and node.data to false, then the node gets configured as a client node and acts as a load balancer routing incoming requests to different nodes in the cluster.
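Putting the three roles together, the relevant settings live in elasticsearch.yml. A sketch of the two dedicated configurations described above:

```yaml
# elasticsearch.yml — dedicated master node (controls the cluster, holds no data)
node.master: true
node.data: false

# elasticsearch.yml — client node (load balancer; neither master-eligible nor a data node)
# node.master: false
# node.data: false
```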

The node in the Elasticsearch cluster that you connect with as a client is called the coordinating node. The coordinating node routes the client requests to the appropriate shard in the cluster. For read requests, the coordinating node selects a different shard every time to serve the request in order to balance the load.

Before we start reviewing how a CRUD request sent to the coordinating node propagates through the cluster and is executed by the engine, let’s see how Elasticsearch stores data internally to serve full-text search results at low latency.

Storage Model

Internally, Elasticsearch uses Apache Lucene, a full-text search library written in Java and developed by Doug Cutting (creator of Apache Hadoop). Lucene uses a data structure called an inverted index, which is designed to serve low-latency search results. A document is the unit of data in Elasticsearch, and an inverted index is created by tokenizing the terms in the documents, creating a sorted list of all unique terms, and associating each term with the list of documents in which it appears.

It is very similar to the index at the back of a book, which contains all the unique words in the book and the list of pages where each word can be found. When we say a document is indexed, we are referring to the inverted index. Let’s see what the inverted index looks like for the following two documents:

Doc 1: Insight Data Engineering Fellows Program

Doc 2: Insight Data Science Fellows Program

If we want to find documents which contain the term “insight”, we can scan the inverted index (where words are sorted), find the word “insight” and return the document IDs which contain this word, which in this case would be Doc 1 and Doc 2.
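The lookup described above can be sketched in a few lines of Python. This is a deliberately simplified model — a real Lucene index also stores term positions, frequencies, and more:

```python
from collections import defaultdict

docs = {
    1: "Insight Data Engineering Fellows Program",
    2: "Insight Data Science Fellows Program",
}

# Map each unique term to the set of document IDs that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():  # naive tokenize + lowercase
        inverted_index[term].add(doc_id)

# A term lookup is now a single dictionary access.
print(sorted(inverted_index["insight"]))  # [1, 2]
print(sorted(inverted_index["science"]))  # [2]
```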

To improve searchability (e.g., serving same results for both lowercase and uppercase words), the documents are first analyzed and then indexed. Analyzing consists of two parts:

Tokenizing sentences into individual words

Normalizing words to a standard form

By default, Elasticsearch uses the Standard Analyzer, which uses:

Standard tokenizer to split words on word boundaries

Lowercase token filter to convert words to lowercase
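A rough approximation of this two-step analysis in Python — note that the real Standard tokenizer follows Unicode text segmentation rules, so the regex below is only an illustration:

```python
import re

def standard_analyze(text):
    """Approximate the Standard Analyzer: split on word
    boundaries, then lowercase each token."""
    tokens = re.findall(r"\w+", text)      # tokenizer
    return [t.lower() for t in tokens]     # lowercase token filter

print(standard_analyze("The QUICK Brown-Fox!"))  # ['the', 'quick', 'brown', 'fox']
```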

There are many other analyzers available and you can read about them in the docs.

In order to serve relevant search results, every query made on the documents is also analyzed using the same analyzer used for indexing.

NOTE: The standard analyzer also includes a stop token filter, but it is disabled by default.

Now that the concept of an inverted index is clear, let’s review CRUD operations. We’ll begin with writes.

Anatomy of a Write

(C)reate

When you send a request to the coordinating node to index a new document, the following set of operations take place:

All the nodes in the Elasticsearch cluster contain metadata about which shard lives on which node. The coordinating node routes the document to the appropriate shard based on the document ID (the default routing key). Elasticsearch hashes the document ID with murmur3 as the hash function and mods the hash by the number of primary shards in the index to determine which shard the document should be indexed in:

shard = hash(document_id) % (num_of_primary_shards)
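A sketch of this routing in Python. Elasticsearch uses murmur3; the md5 stand-in below is only so the sketch runs with the standard library alone. This formula is also why the number of primary shards cannot be changed after index creation — documents would route to different shards:

```python
import hashlib

NUM_PRIMARY_SHARDS = 5  # fixed at index creation

def route(document_id, num_shards=NUM_PRIMARY_SHARDS):
    # Stand-in for murmur3: any deterministic hash illustrates the idea.
    h = int(hashlib.md5(document_id.encode()).hexdigest(), 16)
    return h % num_shards

shard = route("user-42")
assert 0 <= shard < NUM_PRIMARY_SHARDS   # always lands on a valid shard
assert shard == route("user-42")          # same ID always routes the same way
```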

When the node receives the request from the coordinating node, the request is written to the translog (we’ll cover the translog in a follow-up post) and the document is added to the in-memory buffer. If the request succeeds on the primary shard, it is sent in parallel to the replica shards. The client receives an acknowledgement that the request was successful only after the translog is fsync’ed on all primary and replica shards.

The in-memory buffer is refreshed at a regular interval (defaults to 1 second) and the contents are written to a new segment in the filesystem cache. This segment is not yet fsync’ed; however, the segment is open and its contents are available for search.

The translog is emptied and the filesystem cache is fsync’ed every 30 minutes or when the translog gets too big. This process is called a flush in Elasticsearch. During the flush process, the in-memory buffer is cleared and its contents are written to a new segment. A new commit point is created with the segments fsync’ed and flushed to disk. The old translog is deleted and a fresh one begins.
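The buffer/refresh/flush cycle can be modeled as a toy sketch (ignoring replication and crash recovery entirely — this only shows where a document lives at each stage):

```python
# Toy write path: in-memory buffer -> open segment on refresh
# (searchable, not yet durable) -> committed segment on flush.
buffer, open_segments, committed_segments, translog = [], [], [], []

def index_doc(doc):
    translog.append(doc)       # durability: replayed after a crash
    buffer.append(doc)

def refresh():                 # defaults to every 1 second
    if buffer:
        open_segments.append(list(buffer))  # searchable from here on
        buffer.clear()

def flush():                   # every 30 min, or when the translog gets too big
    refresh()
    committed_segments.extend(open_segments)  # fsync'ed to disk
    open_segments.clear()
    translog.clear()           # old translog deleted, fresh one begins

index_doc("doc1"); refresh()
index_doc("doc2"); flush()
print(len(committed_segments), len(translog))  # 2 0
```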

The figure below shows how the write request and data flows.

(U)pdate and (D)elete

Delete and update operations are also write operations. However, documents in Elasticsearch are immutable and hence cannot be deleted or modified in place. How, then, can a document be deleted or updated?

Every segment on disk has a .del file associated with it. When a delete request is sent, the document is not really deleted, but marked as deleted in the .del file. This document may still match a search query but is filtered out of the results. When segments are merged (we’ll cover segment merging in a follow up post), the documents marked as deleted in the .del file are not included in the new merged segment.

Now, let’s see how updates work. When a new document is created, Elasticsearch assigns a version number to that document. Every change to the document results in a new version number. When an update is performed, the old version is marked as deleted in the .del file and the new version is indexed in a new segment. The older version may still match a search query, however, it is filtered out from the results.
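A toy model of the tombstone mechanism described above — deletes and updates only ever mark old (document, version) pairs as deleted, and the read path filters them out:

```python
# Each entry is keyed by (doc_id, version); nothing is ever mutated in place.
segment_docs = {
    ("doc1", 1): "Insight Data Engineering",
    ("doc1", 2): "Insight Data Science",   # update: re-indexed as a new version
    ("doc2", 1): "Fellows Program",
}

# The ".del" set: the old version of doc1 was tombstoned by the update.
deleted = {("doc1", 1)}

def live_docs():
    # Tombstoned entries may still match a query, but are filtered out here.
    return {k: v for k, v in segment_docs.items() if k not in deleted}

print(sorted(live_docs()))  # [('doc1', 2), ('doc2', 1)]
```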

After the documents are indexed/updated, we would like to perform search requests. Let’s see how search requests are executed in Elasticsearch.

Anatomy of a (R)ead

Read operations consist of two parts:

Query Phase

Fetch Phase

Let’s see how each phase works.

Query Phase

In this phase, the coordinating node routes the search request to all the shards (primary or replica) in the index. The shards perform the search independently and each creates a priority queue of results sorted by relevance score (we’ll cover relevance scores later in the post). All the shards return the document IDs of the matching documents and their relevance scores to the coordinating node. There can be many documents that match the query; by default, each shard sends its top 10 results to the coordinating node, which builds a globally sorted priority queue from all the shards’ results and returns the top 10 hits.
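The scatter-gather merge in the query phase can be sketched with Python’s heapq.merge; the per-shard results below are made up for illustration:

```python
import heapq

# Hypothetical per-shard results: (doc_id, relevance_score) pairs,
# each shard's list already sorted by score, highest first.
shard_results = [
    [("d3", 9.1), ("d7", 4.2)],
    [("d1", 8.5), ("d9", 2.0)],
    [("d5", 6.7), ("d2", 5.5)],
]

def merge_top_k(shard_results, k=10):
    # The coordinating node only needs IDs and scores to sort globally;
    # the full documents are fetched later, in the fetch phase.
    merged = heapq.merge(*shard_results, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in list(merged)[:k]]

print(merge_top_k(shard_results, k=3))  # ['d3', 'd1', 'd5']
```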

Fetch Phase

After the coordinating node sorts all the results to generate a globally sorted list of documents, it then requests the original documents from all the shards. All the shards enrich the documents and return them to the coordinating node.

The figure below shows how the read request and the data flows.

As mentioned above, the search results are sorted by relevance. Let’s review how relevance is defined.

Search Relevance

The relevance is determined by a score that Elasticsearch gives to each document returned in the search result. The default algorithm used for scoring is tf/idf (term frequency/inverse document frequency). Term frequency measures how many times a term appears in a document (higher frequency == higher relevance), while inverse document frequency discounts terms that appear in a large fraction of the documents in the index (the more common the term across the index, the less relevant it is). The final score is a combination of the tf-idf score with other factors like term proximity (for phrase queries), term similarity (for fuzzy queries), etc.
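A minimal textbook tf-idf sketch (Lucene’s actual practical scoring function layers normalization and the other factors mentioned above on top of this idea):

```python
import math

docs = [
    "insight data engineering fellows program",
    "insight data science fellows program",
    "science of search relevance",
]

def tf_idf(term, doc, corpus):
    # tf: how often the term appears in this document.
    tf = doc.split().count(term)
    # idf: rarer terms across the corpus carry more weight.
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# "science" appears in 2 of 3 docs, "engineering" in only 1,
# so "engineering" scores higher in the doc that contains it.
print(tf_idf("engineering", docs[0], docs) > tf_idf("science", docs[1], docs))  # True
```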

What next?

These CRUD operations are supported by internal data structures and techniques that are very important for understanding how Elasticsearch works. In a follow-up post, I will go through some of these concepts, along with some gotchas of using Elasticsearch.