1. Structured querying

Also called term-level queries, structured queries check whether a document matches exact criteria. In many cases there is no real need for a relevance score: a document either matches or it does not (especially with numeric fields).

Term-level queries are still queries, though, so they will still return a score.

Term query

Returns the documents where the value of a field exactly matches the criteria. The term query is a rough equivalent of SQL select * from table_name where column_name = ...

The term query goes directly to the inverted index, which makes it fast. When working with text data, it is preferable to use term only for keyword fields.

GET /_search
{
  "query": {
    "term": {
      "<field_name>": {
        "value": "<your_value>"
      }
    }
  }
}

The term query runs in the query context by default, therefore it will calculate the score. Even though the score will be identical for all returned documents, additional computing power is spent calculating it.

Term query with a filter

If we want to speed up the term query and have it cached, we should wrap it in a constant_score query's filter clause.

Remember the rule of thumb? Use this method if you do not care about the relevance score.

GET /_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "<field_name>": "<your_value>" }
      }
    }
  }
}

Now the query does not calculate any relevance score, therefore it is faster. Moreover, it is automatically cached.

Quick advice: use match instead of term for text fields.

Remember, the term query goes directly to the inverted index. It takes the value you provide and searches for it as-is, which is why it suits keyword fields that are stored without any transformations.

Terms query

As you could have guessed, the terms query allows you to return documents that match at least one of several exact terms.

The terms query is a rough equivalent of SQL select * from table_name where column_name in (...)

It is important to understand that a queried field in Elasticsearch might be a list, for example { "name" : ["Odin", "Woden", "Wodan"] } . If you perform a terms query that contains one of these names, then this record will be matched: it does not have to match all the values in the field, only one of them.

GET /_search
{
  "query": {
    "terms": {
      "name": ["Frigg", "Odin", "Baldr"]
    }
  }
}

Terms set query

Same as the terms query, but this time you can specify how many of the provided exact terms have to be present in the queried field.

You specify how many have to match: one, two, three, or all of them. However, this number is read from another numeric field, so each document should contain its own required number of matches (specific to that particular document).
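As a sketch (the name and required_matches fields here are hypothetical), the terms_set query reads the required number of matching terms from a per-document field via the minimum_should_match_field parameter:

GET /_search
{
  "query": {
    "terms_set": {
      "name": {
        "terms": ["Frigg", "Odin", "Baldr"],
        "minimum_should_match_field": "required_matches"
      }
    }
  }
}

A document whose required_matches value is 2 is returned only if at least two of the three names are present in its name field.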

Range query

Returns documents in which queried field’s value is within the defined range.

Equivalent of SQL select * from table_name where column_name is between...

Range query has its own syntax:

gt is greater than

gte is greater than or equal to

lt is less than

lte is less than or equal to

An example where the field’s value should be ≥ 4 and ≤ 17:

GET _search
{
  "query": {
    "range": {
      "<field_name>": {
        "gte": 4,
        "lte": 17
      }
    }
  }
}

The range query also works well with dates.
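For example, a minimal sketch of a date range using Elasticsearch's date math (the field name is a placeholder); now-1d/d means "one day ago, rounded down to the start of the day":

GET _search
{
  "query": {
    "range": {
      "<date_field>": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  }
}

This matches all documents whose <date_field> value falls within yesterday.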

Regexp, wildcard and prefix queries

The regexp query returns the documents whose fields match your regular expression.

If you have never used regular expressions, I highly advise you to get at least some understanding of what they are and when you could apply them.

Elasticsearch's regexp engine is Lucene's. It has standard reserved characters and operators. If you have already worked with Python's re package, it should not be a problem to use it here. The only difference is that Lucene's engine does not support anchor operators such as ^ and $ .

You may find the entire list for the regexp in the official documentation.

In addition to the regexp query, Elasticsearch has wildcard and prefix queries. Logically, those two are just special cases of regexp.

Unfortunately, I could not find any information regarding the performance of those three queries, therefore I decided to test it myself to see if I find any significant difference.

I could not find any difference in performance while comparing a wildcard pattern expressed as a regexp query against the dedicated wildcard query. In case you know what the difference is, please tweet me.
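For reference, minimal sketches of both queries (field name and values are placeholders). In a wildcard query, ? matches a single character and * matches any sequence of characters, while a prefix query simply matches terms starting with the given value:

GET /_search
{
  "query": {
    "wildcard": {
      "<field_name>": {
        "value": "Wod*n"
      }
    }
  }
}

GET /_search
{
  "query": {
    "prefix": {
      "<field_name>": {
        "value": "Wod"
      }
    }
  }
}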

Exists query

Because Elasticsearch is schemaless (or at least has no strict schema limitation), it is a fairly common situation that different documents have different fields. As a result, it is often useful to know whether a document has a certain field at all.

GET /_search
{
  "query": {
    "exists": {
      "field": "<your_field_name>"
    }
  }
}

2. Full-text querying

Full-text queries work well with unstructured text data. They take advantage of the analyzer, so I will briefly outline Elasticsearch's analyzer first so that we can better understand full-text querying.

Elasticsearch’s analyzer pipeline

Every time text type data is inserted into an Elasticsearch index, it is analyzed and then stored in the inverted index. How you configure the analyzer impacts your searching capabilities, because the same analyzer is also applied to full-text search queries.

The analyzer pipeline consists of three stages:

Character filter (0+) → Tokenizer (1) → Token filter (0+)

There is always one tokenizer and zero or more character & token filters.

1) Character filters receive the text data as-is and may preprocess it before it gets tokenized. Character filters are used to:

Replace characters matching a given regular expression

Replace characters matching given strings

Strip HTML markup from text

2) The tokenizer breaks the text received from the character filters (if any) into tokens. For example, the whitespace tokenizer simply breaks text on whitespace (it is not the default one). Therefore, Wednesday is called after Woden. will be split into [ Wednesday, is, called, after, Woden. ]. There are many built-in tokenizers that can be used to create custom analyzers.

The standard tokenizer breaks text on whitespace after removing punctuation. It is the most neutral option for the vast majority of languages.

In addition to tokenization, tokenizer does the following:

keeps track of the token order,

notes the start and end offsets of each word,

defines the type of each token

3) Token filters apply transformations to the tokens. There are many different token filters that you might choose to add to your analyzer. Some of the most popular are:

lowercase

stemmer (stemmers exist for many languages!)

remove duplicates

transformation to the ASCII equivalent

pattern-based transformations

limit on token count

stop words (removes tokens that appear in a given stop list)

Now that we know what the analyzer consists of, we can think about how we are going to work with our data and compose an analyzer that fits our case best by choosing the proper components. The analyzer can be specified on a per-field basis.

Enough theory, let’s see how the default analyzer works.

The standard analyzer is the default one. It has no character filters, the standard tokenizer, and the lowercase and stop token filters (the stop filter is disabled by default). You can compose your custom analyzer as you wish, but there are also a few built-in analyzers.
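You can inspect what an analyzer produces with the _analyze API. For example, running the standard analyzer on the sentence from above:

POST _analyze
{
  "analyzer": "standard",
  "text": "Wednesday is called after Woden."
}

returns the tokens [ wednesday, is, called, after, woden ]: the punctuation is stripped by the tokenizer and the lowercase filter is applied.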

Some of the most efficient out-of-the-box analyzers are the language analyzers, which take the specifics of each language into account to make more advanced transformations. Therefore, if you know the language of your data in advance, I would recommend switching from the standard analyzer to the one matching that language.

The full-text query uses the same analyzer that was used while indexing the data. More precisely, the text of your query goes through the same transformations as the text in the field being searched, so that both are on the same level.

Match query

The match query is the standard query for querying text fields.

We might call match query an equivalent of the term query but for the text type fields (while term should be used solely for the keyword type field when working with text data).

GET /_search
{
  "query": {
    "match": {
      "<text_field>": {
        "query": "<your_value>"
      }
    }
  }
}

The string passed into the query parameter (the required one) is, by default, processed by the same analyzer as the one applied to the searched field, unless you specify an analyzer yourself using the analyzer parameter.

When you specify a phrase to search for, it is analyzed, and the result is always a set of tokens. By default, Elasticsearch uses the OR operator between all of those tokens. That means at least one of them has to match; more matches will score higher, though. You can switch this to AND via the operator parameter, in which case all of the tokens have to be found in a document for it to be returned.

If you want something in between OR and AND, you can set the minimum_should_match parameter, which specifies the number of clauses that have to match. It can be specified as either a number or a percentage.

The fuzziness parameter (optional) allows you to tolerate typos. The Levenshtein edit distance is used for the calculation.
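Putting these parameters together, a sketch of a match query (field and value are placeholders) where at least 75% of the tokens have to match and small typos are tolerated:

GET /_search
{
  "query": {
    "match": {
      "<text_field>": {
        "query": "<your_value>",
        "operator": "or",
        "minimum_should_match": "75%",
        "fuzziness": "AUTO"
      }
    }
  }
}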

If you apply the match query to a keyword field, it will perform the same as the term query. More interestingly, if you pass the exact value of a token stored in the inverted index to the term query, it will return exactly the same result as the match query but faster, since it goes straight to the inverted index.

Match phrase query

Same as match, but the sequence order and proximity of the terms matter. The match query is not aware of sequence and proximity, therefore achieving a phrase match is only possible with a different type of query.

GET /_search
{
  "query": {
    "match_phrase": {
      "<text_field>": {
        "query": "<your_value>",
        "slop": 0
      }
    }
  }
}

The match_phrase query has a slop parameter (default value 0) which is responsible for allowing skipped terms. If you specify a slop equal to 1, one word of the phrase may be omitted.

Multi-match query

The multi-match query does the same job as match, the only difference being that it is applied to more than one field.

GET /_search
{
  "query": {
    "multi_match": {
      "query": "<your_value>",
      "fields": [ "<text_field1>", "<text_field2>" ]
    }
  }
}

field names can be specified using wildcards

each field is equally weighted by default

each field’s contribution to the score can be boosted

if no fields are specified in the fields parameter, then all eligible fields will be searched
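For instance, a sketch where a hypothetical title field is weighted three times more than body, and a wildcard covers every field whose name starts with tag:

GET /_search
{
  "query": {
    "multi_match": {
      "query": "<your_value>",
      "fields": [ "title^3", "body", "tag*" ]
    }
  }
}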

There are different types of multi_match . I am not going to describe them all in this post, but I will explain the most popular:

best_fields type (default) prefers results where tokens from searched value are found in one field to those results where searched tokens are split among different fields.

most_fields is somewhat the opposite of the best_fields type.

phrase type behaves as best_fields but searches for the entire phrase similar to match_phrase .

I highly recommend going through the official documentation to check how exactly the score is calculated for each of those types.

3. Compound queries

Compound queries wrap together other queries. Compound queries:

combine the scores

change the behavior of wrapped queries

switch the query context to filter context

any of the above combined

Boolean query

The boolean query combines other queries together. It is the most important compound query.

Boolean query allows you to combine searches in query context with filter context searches.

The boolean query has four occurrence types (clauses) that can be combined together:

must or “has to satisfy the clause”

should or “additional points to the relevance score if the clause is satisfied”

filter or “has to satisfy the clause, but the relevance score is not calculated”

must_not or “inverse of must, does not contribute to the relevance score”

must and should → query context

filter and must_not → filter context

For those familiar with SQL, must is the AND operator while should is the OR operator. Therefore, every query inside the must clause has to be satisfied.
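A sketch combining all four occurrence types (field names and values are placeholders):

GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "<text_field>": "<your_value>" } }
      ],
      "should": [
        { "match": { "<another_text_field>": "<nice_to_have_value>" } }
      ],
      "filter": [
        { "range": { "<numeric_field>": { "gte": 4 } } }
      ],
      "must_not": [
        { "term": { "<keyword_field>": "<excluded_value>" } }
      ]
    }
  }
}

Only must, filter, and must_not restrict the result set here; should merely raises the score of documents that happen to match it.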

Boosting query

The boosting query is similar to the boost parameter available in most queries, but it is not the same. The boosting query returns documents that match the positive clause and reduces the score of documents that match the negative clause.
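A minimal sketch (field name and values are placeholders); negative_boost is a multiplier between 0 and 1 applied to the score of documents that match the negative clause:

GET /_search
{
  "query": {
    "boosting": {
      "positive": { "term": { "<field_name>": "<wanted_value>" } },
      "negative": { "term": { "<field_name>": "<unwanted_value>" } },
      "negative_boost": 0.5
    }
  }
}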

Constant score query