In this tutorial, we’ll show you how to use the phrase suggester to correct spellings in phrases, which offers “did you mean” search functionality in Elasticsearch.

The phrase suggester builds on the term suggester. Instead of correcting individual words in isolation, it selects entire corrected phrases. It does this with n-gram language modeling, which lets it choose tokens based on both their frequency and their co-occurrence with neighboring tokens.
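To make the idea of scoring by co-occurrence concrete, here is a toy sketch in Python. It is not Elasticsearch's scoring code: the corpus, the add-one smoothing, and the function name phrase_log_score are all assumptions chosen for illustration. It counts single words and adjacent word pairs, then prefers the candidate phrase whose word pairs were seen most often.

```python
from collections import Counter
import math

# Toy corpus; the real suggester reads its n-gram statistics from the index.
corpus = [
    "the windshield got misty",
    "the misty windshield and the dash",
    "days of misty windshield",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)


def phrase_log_score(phrase):
    """Sum of add-one-smoothed bigram log-probabilities (higher is better)."""
    tokens = phrase.split()
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        probability = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        score += math.log(probability)
    return score


# Every word below exists in the corpus, so frequency alone cannot separate
# the candidates; the adjacent-pair counts decide.
candidates = ["windshield got misty", "windshield got dash", "windshield and misty"]
best = max(candidates, key=phrase_log_score)
print(best)
```

The first candidate wins because both of its word pairs occur in the corpus, while each of the other two contains a pair that was never seen.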

Sample Data Indexing

To help demonstrate phrase suggestion, let’s start this tutorial by indexing some sample data. Below are the four documents we are going to index:

Document 1

curl -XPOST localhost:9200/phrase-suggester/my-type/1 -H "Content-Type: application/json" -d '{"tagline": "The windshield got misty"}'

Document 2

curl -XPOST localhost:9200/phrase-suggester/my-type/2 -H "Content-Type: application/json" -d '{"tagline": "The misty windshield and the dash"}'

Document 3

curl -XPOST localhost:9200/phrase-suggester/my-type/3 -H "Content-Type: application/json" -d '{"tagline": "windhshield was broken and moist"}'

Document 4

curl -XPOST localhost:9200/phrase-suggester/my-type/4 -H "Content-Type: application/json" -d '{"tagline": "days of misty windshield"}'
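If you would rather load the four documents in one request, the sketch below builds a newline-delimited body in the format expected by the Elasticsearch _bulk endpoint. The helper name build_bulk_body is ours, and we assume you will POST the output to localhost:9200/phrase-suggester/my-type/_bulk with the header Content-Type: application/x-ndjson; adjust the path to your Elasticsearch version.

```python
import json

# The same four taglines indexed one by one above
# (including the deliberate misspelling in document 3).
taglines = [
    "The windshield got misty",
    "The misty windshield and the dash",
    "windhshield was broken and moist",
    "days of misty windshield",
]


def build_bulk_body(docs):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc_id, tagline in enumerate(docs, start=1):
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        lines.append(json.dumps({"tagline": tagline}))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(taglines)
print(bulk_body, end="")
```

Save the printed output to a file and send it with curl using --data-binary, which preserves the newlines.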

How the Phrase Suggester Works

Let’s look at how the phrase suggester works. We will search for a three-word phrase that contains two typos: "windsheild got mitsy". The typos are in the first and third words. Query the index using the phrase suggester and see what it returns:

curl -XPOST 'localhost:9200/phrase-suggester/_search?pretty' -H "Content-Type: application/json" -d '{ "size":0, "suggest": { "text":"windsheild got mitsy", "my-phrase-suggestion": { "phrase": { "field": "tagline" }} } }'

As you can see, the query above is similar to the term suggester query from the previous article, except that the "term" parameter is replaced by the "phrase" parameter.

The response for the above query is as follows:

{
  "took" : 248,
  "timed_out" : false,
  "_shards" : {
    "total" : 4,
    "successful" : 4,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "my-phrase-suggestion" : [
      {
        "text" : "windsheild got mitsy",
        "offset" : 0,
        "length" : 20,
        "options" : [
          { "text" : "windshield got misty", "score" : 0.116809 },
          { "text" : "windhshield got misty", "score" : 0.09306483 },
          { "text" : "windshield got mitsy", "score" : 0.078104876 },
          { "text" : "windsheild got misty", "score" : 0.07421008 },
          { "text" : "windhshield got mitsy", "score" : 0.06222823 }
        ]
      }
    ]
  }
}

Much like in the term suggester, the suggestions are listed under the "options" array of the response. The first element of the options list is the correct phrase we were looking for, and it has the highest score. The third and fourth elements correct only one of the two typos. The second and last elements replace the first word with "windhshield", a misspelling that exists in the index because it appears in Document 3.
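In an application you would typically read the best option out of this structure and show it as the "did you mean" text. The following Python sketch does that for the response above. The helper name best_suggestion is hypothetical, and the response dictionary is a hand-copied excerpt of the JSON shown earlier rather than the result of a live request.

```python
# Excerpt of the response above, limited to the "suggest" section.
response = {
    "suggest": {
        "my-phrase-suggestion": [
            {
                "text": "windsheild got mitsy",
                "offset": 0,
                "length": 20,
                "options": [
                    {"text": "windshield got misty", "score": 0.116809},
                    {"text": "windhshield got misty", "score": 0.09306483},
                    {"text": "windshield got mitsy", "score": 0.078104876},
                    {"text": "windsheild got misty", "score": 0.07421008},
                    {"text": "windhshield got mitsy", "score": 0.06222823},
                ],
            }
        ]
    }
}


def best_suggestion(response, name):
    """Return the highest-scoring option text for the named suggestion, or None."""
    entries = response.get("suggest", {}).get(name, [])
    options = [opt for entry in entries for opt in entry.get("options", [])]
    if not options:
        return None
    return max(options, key=lambda opt: opt["score"])["text"]


print(best_suggestion(response, "my-phrase-suggestion"))
```

Returning None when there are no options lets the caller skip the "did you mean" prompt entirely.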

Phrase Suggester with Options

In the above example, we have seen the most basic usage of the phrase suggester query. However, the query can be configured with many settings, such as highlighting, confidence, and collate, designed to fine-tune our results. Let us explore a few of the most common options.

Consider this query for phrase suggestion:

curl -XPOST localhost:9200/phrase-suggester/_search -H "Content-Type: application/json" -d '{ "size": 0, "suggest": { "text": "windsheild got mitsy", "phrase-suggestion-demo-01": { "phrase": { "field": "tagline", "real_word_error_likelihood": 0.95, "max_errors": 0.5, "confidence": 0, "highlight": { "pre_tag": "<em>", "post_tag": "</em>" }, "collate": { "query": { "inline": { "match": { "{{field_name}}": "{{suggestion}}" } } }, "params": { "field_name": "tagline" }, "prune": true } } } } }'

In the above query, you can see new parameters included. Let us explore each:

real_word_error_likelihood – The likelihood that a term is misspelled even though it exists in the index. The default value is 0.95, which tells Elasticsearch that 5 percent of the terms in the index are misspelled. As the value of this parameter gets lower, Elasticsearch treats more and more of the terms that exist in the index as misspelled, even though they are correct.

max_errors – The maximum number of terms that may be treated as misspellings in order to form a correction. A value between 0 and 1 is read as a fraction of the query terms, and a value of 1 or more is read as an absolute number of terms. The default value is 1.

confidence – A factor applied to the score of the input phrase to produce a threshold for the other candidates. Only suggestions that score above this threshold are shown. The default value is 1.0, which returns only suggestions that score higher than the input phrase. A value of 0 returns the top candidates regardless of score.

highlight – One of the most helpful search features is highlighting, and it can also be enabled in the phrase suggester. The corrected words in each suggestion are wrapped in the tags given by "pre_tag" and "post_tag" (here we have used the <em> tag).

collate – Tells Elasticsearch to check each suggestion against the specified query and prune suggestions for which no matching documents exist in the index. In this case, it is a match query. Because the query is a template, the current suggestion is available as the {{suggestion}} variable, and further variables such as field_name can be supplied in the "params" object. When the "prune" parameter is set to true, all suggestions are returned, and each one carries an additional "collate_match" field in the response, indicating whether any document matched the collate query for that suggestion.
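The confidence threshold is easier to follow with numbers. The Python sketch below is a simplified model of the behavior described above, not Elasticsearch's implementation, and apply_confidence is a hypothetical helper name. The scores are copied from the response to this query, which is shown later in this section, where the unchanged input phrase scores 0.08415717.

```python
# (text, score) pairs copied from the response to the query above.
candidates = [
    ("windshield got misty", 0.13930021),
    ("windshield got mitsy", 0.11107826),
    ("windsheild got misty", 0.1055392),
    ("windsheild got mitsy", 0.08415717),  # the unchanged input phrase
    ("windhshield got mitsy", 0.058400385),
]


def apply_confidence(candidates, input_text, confidence):
    """Keep candidates that score above confidence * (score of the input phrase).

    Simplified model of the documented behavior, for illustration only.
    """
    input_score = dict(candidates)[input_text]
    threshold = confidence * input_score
    return [text for text, score in candidates if score > threshold]


# With the default of 1.0, only candidates that beat the input phrase survive.
print(apply_confidence(candidates, "windsheild got mitsy", 1.0))
# With 0, the threshold drops to zero and every candidate is kept.
print(apply_confidence(candidates, "windsheild got mitsy", 0.0))
```

In this model a confidence of 1.0 keeps the three candidates that outscore the input phrase, while 0 keeps all five, which is why the query above sets "confidence" to 0 to show the full list.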

The response for the above query can be found below:

{
  "suggest": {
    "phrase-suggestion-demo-01": [
      {
        "text": "windsheild got mitsy",
        "offset": 0,
        "length": 20,
        "options": [
          {
            "text": "windshield got misty",
            "highlighted": "<em>windshield</em> got <em>misty</em>",
            "score": 0.13930021,
            "collate_match": true
          },
          {
            "text": "windshield got mitsy",
            "highlighted": "<em>windshield</em> got mitsy",
            "score": 0.11107826,
            "collate_match": true
          },
          {
            "text": "windsheild got misty",
            "highlighted": "windsheild got <em>misty</em>",
            "score": 0.1055392,
            "collate_match": true
          },
          {
            "text": "windsheild got mitsy",
            "highlighted": "windsheild got mitsy",
            "score": 0.08415717,
            "collate_match": false
          },
          {
            "text": "windhshield got mitsy",
            "highlighted": "<em>windhshield</em> got mitsy",
            "score": 0.058400385,
            "collate_match": true
          }
        ]
      }
    ]
  }
}

In the above response, we can see the highlighted text for the suggested corrections. Each option also carries a collate_match field, which appears because the “prune” parameter was set to true in the query. You can see that the one result in which neither typo was corrected has the value false.
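On the client side, a common next step is to drop the options that matched no documents and display the highlighted text of the rest. The Python sketch below shows one way to do that, using the "options" array copied by hand from the response above. The helper name matching_suggestions is hypothetical.

```python
# The "options" array from the response above.
options = [
    {"text": "windshield got misty",
     "highlighted": "<em>windshield</em> got <em>misty</em>",
     "score": 0.13930021, "collate_match": True},
    {"text": "windshield got mitsy",
     "highlighted": "<em>windshield</em> got mitsy",
     "score": 0.11107826, "collate_match": True},
    {"text": "windsheild got misty",
     "highlighted": "windsheild got <em>misty</em>",
     "score": 0.1055392, "collate_match": True},
    {"text": "windsheild got mitsy",
     "highlighted": "windsheild got mitsy",
     "score": 0.08415717, "collate_match": False},
    {"text": "windhshield got mitsy",
     "highlighted": "<em>windhshield</em> got mitsy",
     "score": 0.058400385, "collate_match": True},
]


def matching_suggestions(options):
    """Return the highlighted text of every option that matched a document.

    Options without a collate_match field are kept, on the assumption that
    the query ran without "prune" and non-matching options were already removed.
    """
    return [opt["highlighted"] for opt in options if opt.get("collate_match", True)]


for highlighted in matching_suggestions(options):
    print(highlighted)
```

The order of the input is preserved, so the highest-scoring surviving suggestion is still listed first.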

You can change the values of the “confidence”, “real_word_error_likelihood”, and “max_errors” parameters and compare the results to get a better understanding of these parameters.

Conclusion

In this blog post, we have shown how to use the phrase suggester in Elasticsearch. Unlike the term suggester, the phrase suggester matches queries against entire phrases and returns the most relevant corrections. The Suggest API for phrase suggestion supports a wide variety of useful features that help you fine-tune your suggest queries and customize search results.
