Elasticsearch provides dictionary-based stemming via the Hunspell token filter. Hunspell is the spell checker used by OpenOffice, LibreOffice, Chrome, Firefox, Thunderbird, and many other open and closed source projects.

Algorithmic stemmers apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don’t have to know about individual words in order to stem them.

Dictionary stemmers work differently. Instead of applying a standard set of rules to each word, they simply look up the word in the dictionary. Theoretically, they could produce much better results than an algorithmic stemmer.

A dictionary stemmer should be able to return the correct root word for irregular forms such as feet and mice. It should also recognize the distinction between words that are similar but have different word senses, for example, organ and organization.
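As a rough illustration of the difference between the two approaches, the sketch below contrasts a naive suffix-stripping function with a plain dictionary lookup. The suffix list, the ROOTS table, and both function names are made up for this example; real stemmers use far richer rules and word lists.

```python
# Toy comparison of a rule-based stemmer and a dictionary stemmer.
# The suffix list and the word list are illustrative assumptions.

SUFFIXES = ["ization", "es", "s"]  # checked in this order

# Hypothetical dictionary mapping inflected forms to their root word.
ROOTS = {
    "feet": "foot",
    "mice": "mouse",
    "organs": "organ",
    "organizations": "organization",
}


def algorithmic_stem(word):
    """Strip the first matching suffix; knows nothing about individual words."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def dictionary_stem(word):
    """Look the word up; unknown words are returned unchanged."""
    return ROOTS.get(word, word)
```

With these toy rules, algorithmic_stem handles foxes but leaves feet untouched and collapses organization into organ, while dictionary_stem resolves feet to foot and keeps organ and organization apart.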

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

The main features of Hunspell Dictionary Stemmer are:

Extended support for language peculiarities like Unicode character encoding, compounding and complex morphology.

Improved suggestion using n-gram similarity, rule and dictionary based pronunciation data.

Morphological analysis, stemming and generation.

Hunspell is based on MySpell and works with MySpell dictionaries. MySpell was used as the default spell checker for OpenOffice prior to release 2.0.2.

Hunspell uses a special dictionary format that defines which characters, words and conjugations are valid in a given language. The examples below use the (default) “en_US” dictionary. Spell checking with Hunspell consists of the following steps:

Parse a document by extracting (tokenizing) the words that we want to check.

Analyze each word by breaking it down into its root (stemming) and conjugation affix.

Look up in a dictionary whether the word+affix combination is valid for the language.

For incorrect words, suggest corrections by finding similar (correct) words in the dictionary (optional).
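The four steps can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the DICTIONARY table (root words with the suffixes they accept) and all function names are invented for the example, only plain suffix concatenation is handled, and suggestions come from the standard library's difflib rather than from Hunspell's own n-gram logic.

```python
import difflib
import re

# Hypothetical mini-dictionary: root words plus the suffixes each one accepts.
DICTIONARY = {
    "the": {""},
    "quick": {""},
    "dog": {"", "s"},
    "jump": {"", "s", "ed", "ing"},
}


def tokenize(text):
    """Step 1: extract the words to check."""
    return re.findall(r"[a-z]+", text.lower())


def analyze(word):
    """Step 2: yield every (root, suffix) split of the word."""
    for i in range(len(word), 0, -1):
        yield word[:i], word[i:]


def is_valid(word):
    """Step 3: accept the word if some root+suffix split is in the dictionary."""
    return any(suffix in DICTIONARY.get(root, ()) for root, suffix in analyze(word))


def suggest(word):
    """Step 4 (optional): propose similar root words."""
    return difflib.get_close_matches(word, list(DICTIONARY), n=3)


def check(text):
    """Return a mapping of each invalid word to its suggested corrections."""
    return {word: suggest(word) for word in tokenize(text) if not is_valid(word)}
```

Calling check("the quik dogs jumped") flags only quik, because dogs and jumped split into a known root plus an allowed suffix.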

Hunspell is also backward-compatible with MySpell and Aspell dictionaries.

Hunspell dictionaries can be obtained from the following links:

extensions.openoffice.org: Download and unzip the .oxt extension file.

addons.mozilla.org: Download and unzip the .xpi addon file.

OpenOffice archive: Download and unzip the .zip file.

A Hunspell dictionary consists of two files:

The [lang].aff file specifies the affix syntax for the language.

The [lang].dic file contains a word list formatted using syntax from the .aff file.

The .dic file contains all the root words, in alphabetical order, each followed by a code representing all of its possible suffixes and prefixes (collectively known as affixes). The .aff file contains the actual prefix or suffix transformation for each code listed in the .dic file.
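To make the relationship between the two files concrete, here is a minimal sketch that reads a tiny suffix-only .aff sample and a three-word .dic sample, then builds a table from every generated form back to its root. The sample data is hand-written in the Hunspell file format and is not the real en_US dictionary; the parser ignores prefixes, compounding, and every other directive, and the function names are assumptions for this example.

```python
import re

# Hand-written samples in the Hunspell formats (not the real en_US files).
# Rule lines read: SFX <flag> <strip> <add> <condition>; "0" means strip nothing.
AFF_TEXT = """\
SFX D Y 3
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]
"""

# First line is the entry count; each entry is root/flags.
DIC_TEXT = """\
3
jump/DG
snore/DG
try/D
"""


def parse_aff(text):
    """Return {flag: [(strip, add, condition), ...]} for the SFX rule lines."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        # Rule lines have five fields; header lines such as "SFX D Y 3" have four.
        if len(parts) == 5 and parts[0] == "SFX":
            _, flag, strip, add, condition = parts
            strip = "" if strip == "0" else strip
            rules.setdefault(flag, []).append((strip, add, condition))
    return rules


def build_stem_table(aff_text, dic_text):
    """Map every generated surface form back to its root word."""
    rules = parse_aff(aff_text)
    table = {}
    for entry in dic_text.splitlines()[1:]:  # skip the entry count
        root, _, flags = entry.partition("/")
        table[root] = root
        for flag in flags:  # single-character flags
            for strip, add, condition in rules.get(flag, []):
                if re.search(condition + "$", root):
                    base = root[: len(root) - len(strip)]
                    table[base + add] = root
    return table
```

Looking up jumped, snoring, or tried in the resulting table returns jump, snore, and try, which is the essence of what a dictionary stemmer does at analysis time.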

Typically both files are located in the same directory and share the same filename, for example en_US.aff and en_US.dic. Standalone Hunspell tools usually search for these files in the current directory and in the standard system paths where dictionaries are installed; Elasticsearch instead reads them from a dedicated directory, described in the next section.

Installing a Hunspell Dictionary

The Hunspell token filter looks for dictionaries within a dedicated Hunspell directory, which defaults to ./config/hunspell/. The .dic and .aff files should be placed in a subdirectory whose name represents the language or locale of the dictionaries. In a multi-node cluster, the dictionaries must be installed on every node. For instance, we could create a Hunspell stemmer for American English with the following layout:

config/
  └ hunspell/
      └ en_US/
          ├ en_US.dic
          ├ en_US.aff
          └ settings.yml

The location of the Hunspell directory can be changed by setting indices.analysis.hunspell.dictionary.location in the config/elasticsearch.yml file. Here, en_US will be the name of the locale or language passed to the hunspell token filter. The settings.yml file contains settings that apply to all of the dictionaries within the language directory, such as these:

ignore_case: true
strict_affix_parsing: true

The meaning of these settings is as follows:

ignore_case – Hunspell dictionaries are case sensitive by default and thus can complicate match queries:

Capitalized words at the beginning of a sentence may appear to be proper nouns.

The input text may be all uppercase, in which case almost no words will match.

A search for names typed in lowercase will not return the capitalized words.

As a general rule, it is a good idea to set ignore_case to true.

strict_affix_parsing – Lucene, by default, will throw an exception if it can’t parse an affix rule. If we need to deal with a broken affix file, we can set strict_affix_parsing to false to tell Lucene to ignore the broken rules.
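The installation steps can also be scripted. The following is a minimal sketch, assuming the default config/hunspell/ location and a dictionary pair that has already been downloaded and unpacked; install_dictionary and its parameters are hypothetical names for this example, not part of Elasticsearch.

```python
import shutil
from pathlib import Path


def install_dictionary(config_dir, locale, aff_source, dic_source,
                       ignore_case=True, strict_affix_parsing=True):
    """Copy a dictionary pair into <config_dir>/hunspell/<locale>/ and write settings.yml."""
    target = Path(config_dir) / "hunspell" / locale
    target.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(aff_source, target / f"{locale}.aff")
    shutil.copyfile(dic_source, target / f"{locale}.dic")
    # YAML booleans are lowercase, so convert Python's True/False explicitly.
    settings = (
        f"ignore_case: {str(ignore_case).lower()}\n"
        f"strict_affix_parsing: {str(strict_affix_parsing).lower()}\n"
    )
    (target / "settings.yml").write_text(settings, encoding="utf-8")
    return target
```

A script like this would have to be run on every node, since each node loads the dictionaries from its own config directory.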

Let’s now create an index that defines a hunspell token filter and an analyzer that uses it:

curl -XPUT 'localhost:9200/hunspell_analyzer_index/' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "en_US": {
          "type": "hunspell",
          "language": "en_US"
        }
      },
      "analyzer": {
        "en_US": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "en_US" ]
        }
      }
    }
  }
}'

Let’s check the output from the analyze API:

curl -G 'localhost:9200/hunspell_analyzer_index/_analyze' \
  --data-urlencode 'analyzer=en_US' \
  --data-urlencode 'text="The quick fox jumped and the lazy dog kept snoring"'

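The response to the above request is shown below. The offsets start at 1 because the opening quotation mark is part of the analyzed text: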

{
  "tokens": [
    { "token": "the", "start_offset": 1, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 },
    { "token": "quick", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1 },
    { "token": "fox", "start_offset": 11, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 },
    { "token": "jump", "start_offset": 15, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 },
    { "token": "and", "start_offset": 22, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "the", "start_offset": 26, "end_offset": 29, "type": "<ALPHANUM>", "position": 5 },
    { "token": "lazy", "start_offset": 30, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 },
    { "token": "dog", "start_offset": 35, "end_offset": 38, "type": "<ALPHANUM>", "position": 7 },
    { "token": "kept", "start_offset": 39, "end_offset": 43, "type": "<ALPHANUM>", "position": 8 },
    { "token": "snore", "start_offset": 44, "end_offset": 51, "type": "<ALPHANUM>", "position": 9 }
  ]
}

The tokens emitted are:

the, quick, fox, jump, and, the, lazy, dog, kept, snore

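Let’s design a custom English analyzer using the hunspell token filter. Because hunspell_analyzer_index was already created above, delete it first (curl -XDELETE 'localhost:9200/hunspell_analyzer_index') or use a different index name: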

curl -XPUT 'localhost:9200/hunspell_analyzer_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "en_US": {
          "type": "hunspell",
          "language": "en_US"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "en_US"
          ]
        }
      }
    }
  }
}'

The analysis settings of hunspell_analyzer_index are composed of:

The english_stop filter configures the default stop words for the English language.

The english analyzer uses two stemmers: the possessive_english stemmer and the en_US hunspell stemmer. The possessive stemmer removes ’s from any word before passing it on to the lowercase, english_stop, and en_US filters.
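The order of the filters matters, and a toy emulation of the chain makes it easier to see. The sketch below is not how Lucene implements these filters: the stop word set is a tiny subset of the English list, the STEMS table stands in for the Hunspell dictionary lookup, and only the ASCII apostrophe is handled.

```python
import re

# Small stand-ins for the real components; both word lists are assumptions.
STOP_WORDS = {"the", "and"}                     # tiny subset of _english_
STEMS = {"jumped": "jump", "snoring": "snore"}  # what a dictionary lookup might return


def analyze(text):
    """Return (token, position) pairs after the four filters have run in order."""
    tokens = []
    for position, raw in enumerate(re.findall(r"[\w']+", text)):
        token = re.sub(r"'s$", "", raw)  # english_possessive_stemmer
        token = token.lower()            # lowercase
        if token in STOP_WORDS:          # english_stop
            continue                     # the position is skipped, not reused
        tokens.append((STEMS.get(token, token), position))  # en_US (hunspell)
    return tokens
```

Running analyze on "The quick fox jumped and the lazy dog kept snoring" drops the three stop words but keeps the original positions of the remaining tokens, so gaps appear where the stop words used to be.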

Let’s check the output from the analyze API:

curl -G 'localhost:9200/hunspell_analyzer_index/_analyze' \
  --data-urlencode 'analyzer=english' \
  --data-urlencode 'text="The quick fox jumped and the lazy dog kept snoring"'

The response to the above request is:

{
  "tokens": [
    { "token": "quick", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1 },
    { "token": "fox", "start_offset": 11, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 },
    { "token": "jump", "start_offset": 15, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 },
    { "token": "lazy", "start_offset": 30, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 },
    { "token": "dog", "start_offset": 35, "end_offset": 38, "type": "<ALPHANUM>", "position": 7 },
    { "token": "kept", "start_offset": 39, "end_offset": 43, "type": "<ALPHANUM>", "position": 8 },
    { "token": "snore", "start_offset": 44, "end_offset": 51, "type": "<ALPHANUM>", "position": 9 }
  ]
}

The tokens emitted are:

quick, fox, jump, lazy, dog, kept, snore

If we replaced the hunspell (en_US) token filter with the english stemmer token filter, the emitted tokens would be:

quick, fox, jump, lazi, dog, kept, snore

Note: Compare how the word lazy is stemmed by the Porter stemmer (lazi) and by the hunspell stemmer (lazy).

Customising Hunspell Dictionaries

If multiple dictionaries (.dic files) are placed in the same directory, they will be merged together at load time. This allows us to tailor the downloaded dictionaries with our own custom word lists:

config/
  └ hunspell/
      └ en_US/
          ├ en_US.dic
          ├ en_US.aff
          ├ custom.dic
          └ settings.yml

The Hunspell English stemmer above consists of an American English dictionary and a custom dictionary, which are merged at load time. Multiple .aff files are not allowed, as they may result in conflicting rules.
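Before restarting a node, it can help to check that a language directory follows these rules. The sketch below is one possible pre-flight check, not something Elasticsearch provides: the function names are hypothetical, the files are assumed to be UTF-8 encoded (many published dictionaries use other encodings), and the merge is modeled as a simple union of entries.

```python
from pathlib import Path


def check_dictionary_dir(directory):
    """Return the .dic files to merge, or raise if the layout is unusable."""
    directory = Path(directory)
    aff_files = sorted(directory.glob("*.aff"))
    dic_files = sorted(directory.glob("*.dic"))
    if len(aff_files) != 1:
        raise ValueError(f"expected exactly one .aff file, found {len(aff_files)}")
    if not dic_files:
        raise ValueError("expected at least one .dic file")
    return dic_files


def merged_entries(dic_files):
    """Union of all entries; the first line of each .dic file is its entry count."""
    entries = set()
    for path in dic_files:
        lines = path.read_text(encoding="utf-8").splitlines()
        entries.update(line.strip() for line in lines[1:] if line.strip())
    return entries
```

Note that a custom .dic file follows the same format as the downloaded one, so its first line should also hold the entry count.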

Choosing the Right Stemmer

Performance

Hunspell stemmers have to load all words, prefixes, and suffixes into memory, which can consume a few megabytes of RAM. Depending on the quality of the dictionary, the process of removing prefixes and suffixes may be more or less efficient. Less-efficient forms can slow the stemming process significantly. Algorithmic stemmers, on the other hand, are usually simple, small, and fast.

Algorithmic stemmers are typically four or five times faster than Hunspell stemmers. Handcrafted algorithmic stemmers are usually, but not always, faster than their Snowball equivalents. For instance, the porter_stem token filter is significantly faster than the Snowball implementation of the Porter stemmer.

Stemmer Quality

If a good algorithmic stemmer is available for our language, it makes sense to use it rather than Hunspell. It will be faster, will consume less memory, and will generally be as good as or better than the Hunspell equivalent. However, if accuracy and customizability are important and we have the resources to maintain a custom dictionary, then Hunspell gives greater flexibility than the algorithmic stemmers.

Hunspell requires an extensive, high-quality, up-to-date dictionary in order to produce good results and deal precisely with irregular words. Dictionaries of this caliber are few and far between. An algorithmic stemmer, on the other hand, will happily deal with new words that didn’t exist when the designer created the algorithm.

Stemmer Degree

Different stemmers overstem and understem to a different degree. If our search results are being consumed by a clustering algorithm, we may prefer to match more widely (and, thus, stem more aggressively). If our search results are intended for human consumption, lighter stemming usually produces better results. Stemming nouns and adjectives is more important for search than stemming verbs, but this also depends on the language.
