The last two blogs in the analyzer series covered a range of topics, from the basics of analyzers to building a custom analyzer out of multiple elements. In this blog we are going to look at a few special tokenizers, like the email-link tokenizer, and token filters, like the edge-n-gram and phonetic token filters.

These tokenizers and filters provide functionality that is immensely useful in making our searches more precise.

Email-link tokenizer

In scenarios where we have URLs, emails, or links to be indexed, a problem comes up when we use the standard tokenizer. Let’s see what happens when we have such values for fields in our index.

Consider we index the following two documents in an index:

curl -XPOST 'http://localhost:9200/analyzers-blog-03-01/emails/1' -d '{
  "email": "stevenson@gmail.com"
}'

curl -XPOST 'http://localhost:9200/analyzers-blog-03-01/emails/2' -d '{
  "email": "jennifer@gmail.com"
}'

Here you can see that we only have email ids in each document. Now, run a query as below:

curl -XPOST 'http://localhost:9200/analyzers-blog-03-01/emails/_search?&pretty=true&size=5' -d '{
  "query": {
    "match": {
      "email": "stevenson@gmail.com"
    }
  }
}'

What we expect from this query is to return only the first document, but as you can see in the terminal, the response contains both documents. That was not the expected result.

Why did this happen?

The default tokenizer, the standard tokenizer, splits the value of the "email" field at the "@" character. This means "stevenson@gmail.com" is split into the tokens "stevenson" and "gmail.com." The same happens for the other document, which is why both documents share "gmail.com" as a common term.

The query undergoes the same tokenization, so it effectively searches for "stevenson" OR "gmail.com" across all the documents in the index. Since both documents contain the common term "gmail.com", Elasticsearch returns both of them.
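To make the failure concrete, here is a small Python sketch (not Elasticsearch code) that mimics how the standard tokenizer splits an email address at "@" and how an OR match over the resulting terms ends up hitting both documents. The regex is only a rough approximation of the standard tokenizer's behavior:

```python
import re

def standard_like_tokenize(text):
    # Rough approximation of the standard tokenizer: "@" acts as a
    # word boundary, while the "." inside "gmail.com" is kept.
    return re.findall(r"[A-Za-z0-9.]+", text.lower())

docs = {
    1: "stevenson@gmail.com",
    2: "jennifer@gmail.com",
}

# Build a tiny inverted index: term -> set of matching doc ids.
index = {}
for doc_id, email in docs.items():
    for term in standard_like_tokenize(email):
        index.setdefault(term, set()).add(doc_id)

# A match query is tokenized the same way, and its terms are ORed together.
query_terms = standard_like_tokenize("stevenson@gmail.com")
hits = set().union(*(index.get(t, set()) for t in query_terms))

print(query_terms)   # ['stevenson', 'gmail.com']
print(sorted(hits))  # [1, 2] -- both documents match via "gmail.com"
```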

How to Solve the Issue:

We can solve this issue by using the "uax_url_email" tokenizer instead of the default tokenizer. The uax_url_email tokenizer works like the standard tokenizer, but it recognizes URLs and email addresses and emits them as single tokens.
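Continuing the toy sketch from the standard-tokenizer case: if each email address is kept as one whole token, the way uax_url_email keeps it, an exact term lookup hits only the intended document:

```python
# Toy sketch: each email is indexed as a single, unsplit token.
docs = {1: "stevenson@gmail.com", 2: "jennifer@gmail.com"}

index = {}
for doc_id, email in docs.items():
    index.setdefault(email, set()).add(doc_id)  # one token per email

print(index["stevenson@gmail.com"])  # {1} -- only the first document matches
```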

Let's see how we can define an analyzer with this tokenizer for the index. But before we define the analyzer, we have to delete the existing index so that we can recreate it under the same name. You can use the following curl command:



curl -X DELETE "http://localhost:9200/analyzers-blog-03-01"

Here we define the analyzer with the custom name "urls-links-emails", which can be done as below:

curl -X PUT "http://localhost:9200/analyzers-blog-03-01" -d '{
  "analysis": {
    "analyzer": {
      "urls-links-emails": {
        "type": "custom",
        "tokenizer": "uax_url_email"
      }
    }
  }
}'

Now, we need to map this analyzer to the "email" field of the documents, which can be done by the mapping below:

curl -X PUT "http://localhost:9200/analyzers-blog-03-01/emails/_mapping" -d '{
  "emails": {
    "properties": {
      "email": {
        "type": "string",
        "analyzer": "urls-links-emails"
      }
    }
  }
}'

Now re-index the two documents like before and run the same query. This time the response in the terminal shows only the document that has "stevenson@gmail.com" as the value of the "email" field.

Edge-n-gram token filter

Most of our searches are single-word queries, but that is not always enough. For example, if we are implementing an autocomplete feature, we also want prefix matching: as the user types "pres" or "prest", the indexed word "prestige" should already match.

In Elasticsearch, this is possible with the edge-n-gram token filter.

So let's create an analyzer with the edge-n-gram filter as below:

curl -X PUT "http://localhost:9200/analyzers-blog-03-02" -d '{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "analysis": {
    "filter": {
      "ngram": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 50
      }
    },
    "analyzer": {
      "NGramAnalyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": "ngram"
      }
    }
  }
}'

The analyzer has been named "NGramAnalyzer". It creates prefixes of each token, starting from the beginning of the token, with a minimum length of two (min_gram) and a maximum length of 50 (max_gram). Let's check the analyzer on the word "prestige" by running the following command in the terminal:

curl -XPOST 'localhost:9200/analyzers-blog-03-02/_analyze?analyzer=NGramAnalyzer&pretty' -d 'prestige'

From the response, we can see that the generated tokens range from "pr" to "prestige". This kind of filter can be used to implement an autocomplete or instant-search feature in our application.
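The expansion itself is easy to reason about. Here is a minimal Python sketch of what the filter produces for "prestige" with the min_gram and max_gram values configured above (a toy illustration, not Elasticsearch's implementation):

```python
def edge_ngrams(token, min_gram=2, max_gram=50):
    # Emit prefixes of the token, from min_gram characters long
    # up to max_gram characters (capped at the token's length).
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams("prestige"))
# ['pr', 'pre', 'pres', 'prest', 'presti', 'prestig', 'prestige']
```

A query prefix like "pres" now matches one of the indexed tokens directly, which is exactly what autocomplete needs.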

Phonetic token filter

Sometimes we make small typos while searching; for example, we may type "grammer" instead of "grammar." The two words are phonetically the same, but in a dictionary-style search, a query for "grammer" would return no results.

But what if there were a feature that could use the phonetic properties of words to improve search results? That would greatly improve the user experience: we could search for "Kanada" and still see the results for "Canada." Elasticsearch uses the phonetic token filter to achieve this.

Let's have a look at how to set up and use the phonetic token filter.

The phonetic token filter comes as a plugin, which can be installed by running the following command from inside the Elasticsearch installation folder:

bin/plugin install elasticsearch/elasticsearch-analysis-phonetic/VERSION

After the installation, you might have to restart your local Elasticsearch instance, which can be done with the following command:

sudo service elasticsearch restart

Now the phonetic token filter can be set up with the following settings:

curl -X PUT "http://localhost:9200/analyzers-blog-03-03" -d '{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "analysis": {
    "filter": {
      "my_metaphone": {
        "type": "phonetic",
        "encoder": "metaphone",
        "replace": false
      }
    },
    "analyzer": {
      "metaphone": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": "my_metaphone"
      }
    }
  }
}'

Let's check the working of the defined token filter. We can check how it handles the words "grammar," "grammer," and "grammr" by running the following commands:

curl -XPOST 'localhost:9200/analyzers-blog-03-03/_analyze?analyzer=metaphone&pretty' -d 'grammar'

curl -XPOST 'localhost:9200/analyzers-blog-03-03/_analyze?analyzer=metaphone&pretty' -d 'grammer'

curl -XPOST 'localhost:9200/analyzers-blog-03-03/_analyze?analyzer=metaphone&pretty' -d 'grammr'

In each of the responses to the above commands, you can see the words were mapped to the token "KRMR" on the basis of their similar pronunciation.
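The real metaphone algorithm has many rules, but the core idea, collapsing spelling variants onto one pronunciation key, can be sketched in a few lines of Python. This toy encoder is not metaphone itself; it only keeps consonants, collapses doubled letters, and maps G to K the way metaphone does, which happens to be enough to give "KRMR" for all three spellings:

```python
def toy_phonetic_key(word):
    # Drop vowels, collapse repeated letters (MM -> M), map G -> K.
    # A drastic simplification of metaphone, for illustration only.
    consonants = [c for c in word.upper() if c not in "AEIOU"]
    key = []
    for c in consonants:
        if not key or key[-1] != c:
            key.append("K" if c == "G" else c)
    return "".join(key)

for w in ("grammar", "grammer", "grammr"):
    print(w, "->", toy_phonetic_key(w))
# grammar -> KRMR
# grammer -> KRMR
# grammr -> KRMR
```

Because all three spellings reduce to the same key, a search for any one of them matches documents containing the others.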

In the filter setup, we used the "metaphone" encoder for this purpose. There are several other encoders, some serving purposes beyond phonetic word matching. For example, the double metaphone encoder extends phonetic matching to other languages, and encoders like "caverphone" and "beider_morse" are designed for matching names.

Conclusion

In this blog, we have seen the working of some very specific tokenizers and token filters in detail. They can be used to make searches against our data more effective. In the next blog in this analyzer series, we will deal with a very commonly used token filter called the synonym token filter, which enables us to perform synonym searches within our data.



