In a previous story, Norvig’s algorithm was introduced to correct spelling errors. It uses 4 operations (deletion, transposition, replacement and insertion). In this story, I would like to introduce another method which uses only the deletion operation to find potential correct words.

When working with text, we often have to handle misspelled input. We can still use character embeddings and word embeddings to compute similar vectors, which helps with unseen data and out-of-vocabulary (OOV) words. However, it is even better if we can correct the typo itself.

After reading this post, you will understand:

Symmetric Delete Spelling Correction (SymSpell)

Implementation

Take Away

Symmetric Delete Spelling Correction (SymSpell)

Garbe introduced Symmetric Delete Spelling Correction (SymSpell), a simple but effective approach to correcting spelling errors.

During offline training, a pre-calculation step builds the dictionary: for every word, variants within a given edit distance are generated (using the delete operation only) and linked back to the original word. In other words, extra storage and memory are traded for faster online prediction. The generated variants serve as lookup keys that return the original word when matched. During online prediction, the input word goes through the same delete calculation and is then searched against the pre-calculated result. You may check out this story for more detail.
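To make the idea concrete, below is a minimal sketch of the two phases over a toy dictionary. It is illustrative only and not the library’s API; the helper names (deletes, build_index, lookup) are hypothetical.

def deletes(word, max_distance=1):
    # Generate every variant reachable by deleting up to max_distance characters.
    variants = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

def build_index(vocabulary, max_distance=1):
    # Offline: map every delete variant back to the original dictionary words.
    index = {}
    for word in vocabulary:
        for variant in deletes(word, max_distance):
            index.setdefault(variant, set()).add(word)
    return index

def lookup(typo, index, max_distance=1):
    # Online: apply the same delete operation to the input and probe the hash table.
    # A real implementation would also verify candidates with a true edit-distance check.
    candidates = set()
    for variant in deletes(typo, max_distance):
        candidates |= index.get(variant, set())
    return candidates

index = build_index(['edward', 'edwards'], max_distance=1)
print(lookup('edwarda', index))  # {'edward', 'edwards'}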

Performance in terms of speed is very good because:

Pre-calculation is done offline.

Online prediction involves only delete operations and an index lookup (hash table).

Implementation

To facilitate the spell check, a corpus is needed. For ease of demonstration, I simply use a dataset from the sklearn library without pre-processing. You should use your own domain-specific dataset to build a better corpus for your data.

Build corpus

from collections import Counter
from sklearn.datasets import fetch_20newsgroups
import re

# Build a word-frequency dictionary from the 20 newsgroups dataset.
corpus = []
for line in fetch_20newsgroups().data:
    line = line.replace('\n', ' ').replace('\t', ' ').lower()
    line = re.sub('[^a-z ]', ' ', line)
    tokens = line.split(' ')
    tokens = [token for token in tokens if len(token) > 0]
    corpus.extend(tokens)

corpus = Counter(corpus)

corpus_dir = '../../data/'
corpus_file_name = 'spell_check_dictionary.txt'

# SymSpell here is the wrapper class used in this story (full code in the github repo linked below).
symspell = SymSpell(verbose=10)
symspell.build_vocab(
    dictionary=corpus, file_dir=corpus_dir, file_name=corpus_file_name)
symspell.load_vocab(corpus_file_path=corpus_dir + corpus_file_name)

Correction

results = symspell.correction(word='edwarda')

print(results)

The output shows that it identifies “edward” and “edwards”, each at distance 1 from “edwarda”, while “count” refers to the word’s frequency in the original corpus.

[{'distance': 1, 'word': 'edward', 'count': 154}, {'distance': 1, 'word': 'edwards', 'count': 50}]
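If you only want a single suggestion, one simple (illustrative) way to rank this output is smallest distance first, breaking ties by highest corpus count:

results = [{'distance': 1, 'word': 'edward', 'count': 154},
           {'distance': 1, 'word': 'edwards', 'count': 50}]

# Prefer the smallest edit distance; break ties with the higher corpus frequency.
best = min(results, key=lambda r: (r['distance'], -r['count']))
print(best['word'])  # edward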

Besides single-word correction, SymSpell offers compound (multi-word) correction as well.

results = symspell.corrections(sentence='Hello I am Edarda')

print(results)

Unlike single-word correction, the compound function supports splitting and decompounding operations. For details, you may check the SymSpell API.

The following output reports the total distance from the original sentence. Given “Hello I am Edarda”, it finds that the shortest distance is 3 when the sentence is corrected to “hello i am ed area”.

[{'distance': 3, 'word': 'hello i am ed area'}]

Take Away

To access all code, you can visit my github repo.

Same as the Spell Corrector, SymSpell does not consider context but only spelling.

Due to the simple approach, the search time complexity is O(1), i.e. constant time.

Allowing a larger edit distance introduces a larger vocabulary and consumes more disk space and memory, but this should be acceptable at today’s resource scales (see the rough calculation below).
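To see why a larger edit distance inflates the pre-calculated index, the number of delete variants per word is bounded by a sum of binomial coefficients. A quick back-of-the-envelope calculation (illustrative only):

from math import comb

# Upper bound on the number of delete variants for a word of length word_length
# when deleting up to max_distance characters.
def max_variants(word_length, max_distance):
    return sum(comb(word_length, k) for k in range(max_distance + 1))

for d in (1, 2, 3):
    print(d, max_variants(10, d))  # 11, 56, 176 variants for a 10-character word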

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me via my Medium Blog, LinkedIn or Github.

Reference

SymSpell in C# (original)

SymSpell in Python