In this blog, we will familiarise ourselves with concepts such as terms, tokens and stems, and learn how text is handled and processed in Elasticsearch.

In order to make our searches more effective and accurate, it is necessary that we know some key aspects of Lucene, which is the foundation of Elasticsearch.

Remember that Elasticsearch is a text-based search engine built on top of Lucene. The wide range of operations available in Lucene is made easy to use in Elasticsearch by encapsulating them in simple APIs.

There are times when a beginner in Elasticsearch fires a seemingly perfect query but gets only partial results, no results, or totally unexpected results. Since the query itself is flawless, this makes us curious about what is happening behind the curtains.

Inverted Index

Elasticsearch employs Lucene’s index structure called the “inverted index” for its full-text searches. It is a very versatile, easy to use and agile structure which provides fast and efficient text search capabilities to Elasticsearch.

An inverted index consists of:

1. A list of all the unique words, called terms, that appear in any document

2. A list of the documents in which the words appear

3. A term frequency list, which shows how many times a word has occurred

In order to get a good grasp of how an inverted index is populated, we will consider two documents to be indexed, with the following contents:

Document 1: “elasticsearch is cool”

Document 2: “Elasticsearch is great”

Now, to create the inverted index for the above two documents, we split the contents of each document into separate words. Then we create the list of unique words, the list of document ids in which they occur, and the word frequency list. Here we generally refer to the unique words occurring in the documents as "terms". The inverted index generated for the above two documents would look like this:

Term          | Documents | Frequency
------------- | --------- | ---------
elasticsearch | 1         | 1
Elasticsearch | 2         | 1
is            | 1, 2      | 2
cool          | 1         | 1
great         | 2         | 1

In the above inverted index table, we can see the 5 terms, the documents in which each term occurs, and the frequency with which it occurs.

Let’s see how a simple basic search operation works. Suppose we need to search for the term “great”. When we fire the query, Elasticsearch will look into this inverted index table, find that the query term (in this case “great”) occurs in document 2, and then show us that document.

But this inverted index has a flaw, which results in an erroneous search result. Suppose we search for “Elasticsearch”. As per this table, the result would only return us document 2 and would also tell us that the term frequency across the entire document set is only 1. But we know that the terms “elasticsearch” and “Elasticsearch” are the same and we expect both to be treated as a single term. This is not happening here, and it significantly affects our search accuracy. So how can we rectify inaccuracies like this and provide more accurate and relevant search results? Let’s try to address these problems in the coming sections.
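The case-sensitivity problem above can be reproduced in a few lines of Python. This is a minimal sketch of an inverted index (not Elasticsearch's actual data structure): each term maps to the documents it occurs in, along with its frequency in each.

```python
from collections import defaultdict

# The two example documents from the text above.
docs = {1: "elasticsearch is cool", 2: "Elasticsearch is great"}

# Build the inverted index: term -> {doc_id: term frequency}.
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] += 1

# Case-sensitive lookups: "Elasticsearch" only matches document 2,
# even though document 1 contains "elasticsearch".
print(dict(index["Elasticsearch"]))  # {2: 1}
print(dict(index["elasticsearch"]))  # {1: 1}
```

Because the index stores the raw words, the two spellings of the product name end up as two unrelated terms.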

Analyzing in Elasticsearch

The problem mentioned above arises because, when we prepare the inverted index, the same words with different cases are processed as different words. The solution to this problem is to lowercase the entire text before indexing it, so we get the required results. The case mentioned above is one of the simplest issues to address, but there are many more issues, both simple and complex, that could affect our search results, such as HTML elements mixed with the text, custom separation of terms, and so on.

These issues can be resolved if the text to be indexed is processed according to our requirements before it is split into terms. This process is called analysis and is performed by analyzers.
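The lowercasing fix can be sketched by applying a single analysis step before the terms go into the index; with it, both spellings of "Elasticsearch" collapse into one term. This is an illustration of the idea, not Elasticsearch's implementation:

```python
from collections import defaultdict

docs = {1: "elasticsearch is cool", 2: "Elasticsearch is great"}

# Analyze (here: just lowercase) before splitting into terms.
analyzed_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        analyzed_index[term].add(doc_id)

# A search for "Elasticsearch", lowercased the same way, now finds both docs.
print(sorted(analyzed_index["elasticsearch".lower()]))  # [1, 2]
```

Note that the query term must be normalised with the same steps as the indexed text; we will return to this point when discussing full-text fields.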

Elasticsearch Analyzers

As mentioned above, the analyzing process is done by analyzers, and it involves the following steps:

1. Split the piece of text into individual terms or tokens

2. Standardize the individual terms so they become more searchable.

This process is an aggregation of the following three functions:

1. Character Filtering

We can apply default or custom character filters to the input text string to filter out or transform unwanted characters. For example, if we index a string with HTML tags in it, appropriate character filtering would strip the HTML tags from the string. Likewise, by using different character filters, we can do tasks like replacing each occurrence of a particular character with a word.

2. Tokenisation

After the character filtering, the string is passed to the tokenizer. The job of the tokenizer is to split the string into individual words. This process is called tokenization, and the individual words obtained after tokenization are called “terms”. Here too we have the option to use a default tokenizer or a custom tokenizer to suit our specific purpose. A simple tokenizer would split the input string whenever it encounters whitespace or punctuation. And by using custom tokenizers such as pattern tokenizers, we can look for patterns and split the input string accordingly.
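The difference between a simple tokenizer and a pattern tokenizer can be sketched with regular expressions. Both splits below are illustrative stand-ins, not Elasticsearch's tokenizer code:

```python
import re

text = "Paris,the city of fashion"

# Simple tokenizer: split on any run of whitespace or punctuation.
simple_tokens = [t for t in re.split(r"\W+", text) if t]
print(simple_tokens)  # ['Paris', 'the', 'city', 'of', 'fashion']

# A hypothetical pattern tokenizer configured to split only on commas.
comma_tokens = [t for t in text.split(",") if t]
print(comma_tokens)   # ['Paris', 'the city of fashion']
```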

3. Token Filters

The next process in line is the filtering of the terms generated after tokenisation. The terms are passed through a token filter, which transforms them to the standard the user wants. In some cases we just need to lowercase the terms after tokenisation. In other cases we might need to remove commonly used words (such as “a”, “the”, “for”, etc). So we apply the required token filter and get the required tokens.
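The three stages above can be chained into a single pipeline. The sketch below is a toy analyzer, with an illustrative choice of filters (HTML stripping, splitting on whitespace/punctuation, lowercasing, stopword removal); it mimics the flow, not Elasticsearch's actual implementation:

```python
import re

# A hypothetical, tiny stopword list for illustration.
STOPWORDS = {"a", "the", "for", "of"}

def char_filter(text):
    # 1. Character filtering: strip HTML tags from the raw input.
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    # 2. Tokenisation: split on whitespace and punctuation.
    return [t for t in re.split(r"\W+", text) if t]

def token_filters(tokens):
    # 3. Token filtering: lowercase, then drop common stopwords.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    return token_filters(tokenize(char_filter(text)))

print(analyze("<p>The City of Fashion</p>"))  # ['city', 'fashion']
```

Each stage only sees the output of the previous one, which is why the order (character filters, then tokenizer, then token filters) matters.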

The entire process happening in an analyzer can be explained using the flow diagram below:

The role of Elasticsearch analyzers

Elasticsearch analyzers are a great tool for improving search accuracy and relevance. To understand how analyzers work, it is good to know their role and how and when to use them.

Let us start by looking at the definitions of two kinds of fields: exact-value fields and full-text fields.

Exact values are values that don’t make sense if they are split. For example, there is no point in splitting and tokenising a user’s email id or a date field, because it is always better to search for an intact date or email id. Full-text values, on the other hand, are mainly human-generated textual content, like an article in a blog or a comment in a forum. From full-text values we expect results that make sense to humans.

So when we index a document, there will be two types of values for the fields in it as mentioned above. Now when we query these fields, the following cases happen:

1. When a query is fired against an exact-value field, the query string is not analyzed and the field is checked for an exact match. This is because an exact value loses its meaning if it is tokenised or otherwise processed.

2. But when we fire a query against a full-text field, the query string undergoes the same analysis procedure defined for the field. This generates the precise set of terms to search for.
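The two query paths above can be sketched as follows. The field names, the `analyze()` step and the matching logic are all hypothetical simplifications, for illustration only:

```python
def analyze(text):
    # A stand-in analyzer: lowercase and split on whitespace.
    return text.lower().split()

doc = {
    "email": "user@example.com",        # exact-value field
    "body": "Elasticsearch is great",   # full-text field
}

def query_exact(field_value, query):
    # Exact-value path: no analysis, whole-string comparison.
    return field_value == query

def query_full_text(field_value, query):
    # Full-text path: analyze the query, match on shared terms.
    return bool(set(analyze(field_value)) & set(analyze(query)))

print(query_exact(doc["email"], "user@example.com"))   # True
print(query_exact(doc["email"], "USER@example.com"))   # False (no analysis)
print(query_full_text(doc["body"], "GREAT search"))    # True ("great" matches)
```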

How to apply Elasticsearch analyzers

How to apply analyzers depends on our use case. There are multiple options for character filters, tokenizers and token filters available within Elasticsearch, and each of these components can also be customised. This will be covered extensively in the follow-up blogs of this series.

A simple example

Now let’s try to understand the flow of analyzers using a simple example. Let’s create an index and apply the following analyzer to it and see the flow of processing that happens within each component.

Let’s create an index named testindex and configure it with the following settings:

curl -X PUT "http://localhost:9200/testindex" -d '
{
  "index": {
    "analysis": {
      "analyzer": {
        "simpleAnalyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'

Here we have made a custom analyzer named "simpleAnalyzer" with the following components:

1. "html_strip" – character filter

This removes all HTML elements, if any are present in the input string.

2. "standard" – tokenizer

The standard tokenizer will split the text into words whenever it encounters whitespace or punctuation.

3. "lowercase" – filter

The lowercase filter will convert all the tokens passed to it to lowercase.

Now let us input a string to the testindex index we just created. We’ll use a text that contains some HTML elements, uppercase characters and a comma, so that we can check whether the filters are applied or not.

The sample text is: <h2>Paris</h2>,the city of fashion

Now in order to apply this filter and see the results in the terminal, type in the following command:

curl -XGET "localhost:9200/testindex/_analyze?analyzer=simpleAnalyzer&pretty=true" -d '<h2>Paris</h2>,the city of fashion'

After typing in the above command, you can see the respective tokens in the terminal and verify the results.

So let’s look at how the components are applied to our sample text. The html_strip character filter removes the <h2> tags, the standard tokenizer splits the remaining text on the comma and the whitespace, and the lowercase filter converts everything to lowercase, giving us the final tokens: paris, the, city, of and fashion.
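We can approximate the simpleAnalyzer pipeline in plain Python to sanity-check the expected output. The regex-based HTML stripping and word splitting below are rough stand-ins for html_strip and the standard tokenizer, not Elasticsearch's actual behaviour:

```python
import re

sample = "<h2>Paris</h2>,the city of fashion"

# Stage 1, char filter (html_strip-like): remove HTML tags.
stripped = re.sub(r"<[^>]+>", "", sample)          # "Paris,the city of fashion"

# Stage 2, tokenizer (standard-like): split on whitespace and punctuation.
tokens = [t for t in re.split(r"\W+", stripped) if t]

# Stage 3, token filter: lowercase every token.
tokens = [t.lower() for t in tokens]

print(tokens)  # ['paris', 'the', 'city', 'of', 'fashion']
```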

Conclusion

In this blog, we have seen a basic introduction to Elasticsearch analyzers and inverted index generation. In the coming blogs we will see the different types of tokenizers and filters used in analyzers, and when to apply them.

Stay tuned for the next posts!