Word Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce an inflected (or derived) word form to its root or base form. This reduction is essential in many NLP tasks such as information retrieval, text summarization, and topic extraction.

am, are, is => be

car, cars, car’s, cars’ => car

Even though the goal is similar, the process by which it’s done is different.

Stemming

Stemming is a heuristic process in which a word’s endings are chopped off in the hope of arriving at its base form. Stemming acts on words without knowing their context. As a result, it’s faster but doesn’t always yield the desired result.

Stemming isn’t as easy as it may seem. If it were, there would be only one implementation. Stemming is an imprecise process, which leads to issues such as understemming and overstemming.

Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumpiness may be reduced to jumpi. Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener.
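Both failure modes are easy to reproduce with a deliberately naive suffix stripper. The sketch below is only an illustration: the suffix list, the minimum stem length, and the helper name naive_stem are assumptions made for this example, not the rules of any real stemmer.

```python
# Toy suffix-stripping stemmer. The suffix list and the length guard are
# illustrative assumptions, not the rules of any real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order
MIN_STEM_LENGTH = 3


def naive_stem(word):
    """Chop the first matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Inflections of "jump" collapse to a single stem, as intended.
print([naive_stem(w) for w in ("jumped", "jumps", "jumping")])  # ['jump', 'jump', 'jump']

# Understemming: "jumpiness" keeps a stem that differs from "jump".
print(naive_stem("jumpiness"))  # jumpines

# Overstemming: "news" and "new" are unrelated but share a stem.
print(naive_stem("news"), naive_stem("new"))  # new new
```

Real stemmers use far more rules than this, but they trade off the same two kinds of error.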

NLTK provides several stemmers, the most prominent being PorterStemmer, which is based on the Porter Stemming Algorithm. It is the most prominent mainly because it tends to provide better results than the rest of the stemmers.

Other stemmers include SnowballStemmer and LancasterStemmer. It’s worth mentioning that SnowballStemmer supports other languages as well. The following code snippet compares the aforementioned stemmers.
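A minimal sketch of such a comparison is shown here. It assumes NLTK is installed; none of the three stemmers needs an extra corpus download. The word list and the compare_stemmers helper are assumptions chosen for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer mentioned above.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

# Illustrative word list (an assumption, reusing the examples from the text).
words = ["jumped", "jumps", "jumpiness", "general", "generate"]


def compare_stemmers(words):
    """Return {word: {stemmer_name: stem}} for every word."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


for word, stems in compare_stemmers(words).items():
    print(f"{word:<10} {stems}")
```

Running it side by side makes it easy to spot where the stemmers disagree on the same word.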

Lemmatization

Lemmatization is a process that uses a vocabulary and morphological analysis of words to remove inflectional endings and return the base (dictionary) form of a word, which is known as the lemma.

It’s a much more complicated and expensive process that requires an understanding of the context in which words appear in order to make decisions about what they mean. Hence, it uses a lexical vocabulary to derive the root form, is more time-consuming than stemming, and is more likely to yield accurate results.

Lemmatization can be done with NLTK using WordNetLemmatizer, which uses a lexical database called WordNet (a detailed explanation of the WordNet database will be in a later section).

NLTK provides an interface for the WordNet database. WordNetLemmatizer uses the interface to derive the lemma of a given word.

When using WordNetLemmatizer, we should specify which part of speech to use in order to derive the correct lemma. Words can be nouns (n), adjectives (a), verbs (v), or adverbs (r). The following code snippet shows lemmatization in action.

lemmatize is a function to demonstrate how the lemma changes with the part of speech given.
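To see why the part of speech matters without downloading the WordNet corpus, the toy sketch below mimics a dictionary-based lookup keyed on the word and its POS tag. The lookup table and the toy_lemmatize helper are hypothetical and exist only for this example. With NLTK, the equivalent call would be WordNetLemmatizer().lemmatize(word, pos=pos), which consults WordNet instead of a hand-written table.

```python
# Toy lemma table keyed on (word, POS tag). The entries and the helper name
# are hypothetical; a real lemmatizer consults WordNet instead.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for a (word, POS) pair; fall back to the word itself."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


# The same surface form maps to different lemmas depending on the POS tag.
for word in ("meeting", "better", "leaves"):
    for pos in ("n", "v", "a", "r"):
        print(f"{word:<8} pos={pos} -> {toy_lemmatize(word, pos)}")
```

The default of pos="n" mirrors the default of WordNetLemmatizer.lemmatize, which treats a word as a noun unless told otherwise.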

Stemming vs Lemmatization

Whether to use stemming or lemmatization mostly depends on the situation at hand. If speed is required, it’s better to resort to stemming. But if accuracy is required, it’s best to use lemmatization.

The following code snippet shows the comparison between stemming and lemmatization.

Part-Of-Speech (POS) Tagging

Part-Of-Speech tagging (or POS tagging) is also a very important component of NLP. The purpose of POS tagging is to label each token (a word in this case) with its grammatical category, such as noun, verb, adjective, or adverb. Most POS categories are divided into sub-classes.

POS tagging is usually treated as a supervised machine learning problem: a tagger is trained on labeled text and takes features like the previous word, next word, and capitalization of the first word into consideration when assigning a POS tag to a word.
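As a rough sketch of what such features might look like, the function below builds a feature dictionary for a single token. The feature names, the boundary markers, and the token_features helper are assumptions for illustration; this is not how NLTK's tagger is implemented.

```python
def token_features(tokens, index):
    """Build a feature dict for tokens[index] from its neighbours and casing."""
    word = tokens[index]
    return {
        "word": word.lower(),
        # Boundary markers are placeholders for "no neighbour on this side".
        "previous_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": tokens[index + 1].lower() if index < len(tokens) - 1 else "<END>",
        "is_capitalized": word[:1].isupper(),
        "is_first_token": index == 0,
    }


# Example sentence, already split into tokens (an assumption for illustration).
tokens = ["Mark", "works", "at", "Google"]
for i in range(len(tokens)):
    print(token_features(tokens, i))
```

A supervised tagger would pair each such dictionary with the correct tag from an annotated corpus and learn the mapping from features to tags.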

The most popular tag set for POS tagging is the Penn Treebank tag set. Most of the available POS taggers for English are trained on this tag set. The following link shows the available POS tags in the Penn Treebank tag set.

NLTK provides a function called pos_tag to perform POS tagging of sentences, but it requires the sentence to be tokenized first. The following code snippet shows how POS tagging can be performed with NLTK:

Chunking

Chunking, or shallow parsing, is a process that extracts phrases from a text sample. Here we extract chunks of a sentence that carry meaning rather than identifying the sentence’s full structure. This is different from, and more advanced than, tokenization because it extracts phrases instead of tokens.

As an example, “North America” can be extracted as a single phrase using chunking, rather than as the two separate words “North” and “America” that tokenization produces.

Chunking requires POS-tagged input and provides chunks of phrases as output. As with POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP), Verb Phrase (VP), etc.

As an example, let’s consider noun phrase chunking. To do this, we search for chunks corresponding to individual noun phrases according to a given rule. To create an NP chunk, we define the chunk grammar rule using POS tags. We will define this using a regular expression rule:

NP: {<DT>?<JJ>*<NN>} # NP

The rule states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), an NP chunk should be formed.

This way we can use grammar rules to extract NPs from POS tagged sentences:
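A minimal sketch of this rule in action, using NLTK's RegexpParser. The sentence is tagged by hand (an assumption for illustration), so no tagger model needs to be downloaded; in practice the tags would come from pos_tag.

```python
from nltk import RegexpParser

# Hand-tagged sentence (illustrative input; normally produced by pos_tag).
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# The NP rule from above: optional determiner, any adjectives, then a noun.
chunk_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunk_parser.parse(tagged_sentence)

# Collect the words of every NP subtree found by the chunker.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]

print(tree)
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```

Tokens that match no rule, such as the verb and the preposition here, stay outside any chunk as plain leaves of the tree.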

Stop Word Removal

Stop words are words that carry very little meaning and mostly serve as part of the grammatical structure of a sentence. Words like “the”, “a”, “an”, “in”, etc. are considered stop words.

Even though it doesn’t seem like much, stop word removal plays an important role in tasks such as sentiment analysis. Search engines also use this process when indexing content and handling search queries.

NLTK comes with the stopwords corpus, which contains stop word lists for 16 different languages. NLTK provides no direct function to remove stop words, but we can use these lists to remove them from sentences programmatically.

If we are dealing with many sentences, first the text must be split into sentences using sent_tokenize. Then, using word_tokenize, we can further break the sentences into words and remove the stop words using the list. The following code snippet depicts this process:
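The sketch below illustrates the filtering steps only. So that it runs without downloading any corpora, it uses stand-ins that are assumptions for this example: a tiny hand-written STOP_WORDS set in place of stopwords.words("english"), and two naive regex helpers in place of sent_tokenize and word_tokenize.

```python
import re

# Small stand-in list; with NLTK this would come from the stopwords corpus.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def split_sentences(text):
    """Naive stand-in for sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Naive stand-in for word_tokenize: runs of letters, digits, apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def remove_stop_words(text):
    """Return one list of non-stop-word tokens per sentence."""
    return [
        [word for word in split_words(sentence) if word.lower() not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]


text = "The cat sat on the mat. A dog barked in the garden."
print(remove_stop_words(text))
# [['cat', 'sat', 'mat'], ['dog', 'barked', 'garden']]
```

Lowercasing each token before the membership test matters, because stop word lists are typically stored in lowercase while sentences start with a capital letter.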

Named Entity Recognition

Named entity recognition (NER) is the process of identifying entities such as names, locations, dates, or organizations that exist in an unstructured text sample.

The purpose of NER is to be able to map the extracted entities against a knowledge base, or to extract relationships between different entities. For example: Who did what? Where did something take place? At what time did something occur?

It’s a very important task in information extraction. Other applications of NER include:

Classifying content (for example, in the news and legal domains)

Efficient search algorithms

Content recommendation algorithms

Chatbots, voice assistants, etc.

For domain-specific entities, in a field like medicine or law, we’ll need to train our own NER algorithm.

For casual use, NLTK provides a function called ne_chunk to perform NER on a given text. In order to use ne_chunk, the text first needs to be tokenized into words and then POS tagged. After NER, each tagged word carries its respective entity type. In this case, Mark and John are of type PERSON, Google and Yahoo are of type ORGANIZATION, and New York City is of type GPE (geopolitical entity, which indicates a location).
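The result of ne_chunk is an nltk Tree in which each recognized entity is a labeled subtree. The sketch below shows one way to read entities out of such a tree. To avoid downloading the tagger and chunker models, the tree is built by hand to mirror the entities described above; it is not actual ne_chunk output, and the sentence and the extract_entities helper are assumptions for illustration. In a real pipeline the tree would come from ne_chunk(pos_tag(word_tokenize(text))).

```python
from nltk.tree import Tree

# Hand-built tree shaped like an ne_chunk result (not real tagger output).
chunked = Tree("S", [
    Tree("PERSON", [("Mark", "NNP")]),
    ("works", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Yahoo", "NNP")]),
    ("and", "CC"),
    Tree("PERSON", [("John", "NNP")]),
    ("works", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Google", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP"), ("City", "NNP")]),
])


def extract_entities(tree):
    """Return (entity text, entity type) pairs from the top-level chunks."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities


print(extract_entities(chunked))
```

Multi-word entities such as New York City come back as a single entry because all their tokens sit under one labeled subtree.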

WordNet Interface

WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

A synset, or “synonym set”, is a collection of synonymous words.

NLTK provides an interface for the WordNet database, and it comes with the corpus module. WordNet is composed of approximately 155,200 words and 117,600 synonym sets that are logically related to each other.

As an example, in WordNet, a word like computer has two possible senses: one is a machine for performing computations, and the other is a calculator (in WordNet, an expert at calculation), which is associated with computer in a lexical sense. The first sense is identified by computer.n.01, which is known as the "lemma code name"; the letter n indicates that the word is a noun.

wordnet.synsets("computer")

OUTPUT: [Synset('computer.n.01'), Synset('calculator.n.01')]
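Names such as computer.n.01 follow a lemma.pos.number pattern. As a small sketch, the helper below splits such a name into its parts without touching the WordNet corpus; the SynsetName type, the POS_NAMES table, and parse_synset_name are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    lemma: str
    pos: str
    sense: int


# POS letters used in this article; the table is an illustrative assumption.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}


def parse_synset_name(name):
    """Split a name such as 'computer.n.01' into lemma, POS and sense number."""
    lemma, pos, sense = name.rsplit(".", 2)
    return SynsetName(lemma, pos, int(sense))


for name in ("computer.n.01", "calculator.n.01"):
    parsed = parse_synset_name(name)
    print(parsed.lemma, POS_NAMES[parsed.pos], parsed.sense)
```

Splitting from the right keeps the lemma intact even if it contains a period itself.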

We can further analyze the synset to find other words associated with it. As you can see, all the words that are closely associated (and in the same context) with the word computer are listed:

wordnet.synset('computer.n.01').lemma_names()

OUTPUT: ['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system']

Using WordNet, we’re able to find the definition of a particular word and also the usages of a word (the database may or may not contain usages for words):

wordnet.synset('computer.n.01').definition()

OUTPUT: 'a machine for performing calculations automatically'

wordnet.synset("car.n.01").examples()

OUTPUT: ['he needs a car to get to work']

Also, we can use it to find synonyms and antonyms of words. The following snippet contains all the code mentioned here and also shows how to retrieve synonyms and antonyms for a particular word:

Conclusion

In this introductory article, we discussed how to use NLTK in order to perform some basic but useful tasks in Natural Language Processing. We learned tasks such as tokenization, stemming, lemmatization, stop word removal, POS tagging, chunking, named entity recognition, and some basics surrounding the WordNet interface.

Hope you found the article useful!

The source code that created this post can be found below.

If you have any problems or questions regarding this article, please do not hesitate to leave a comment below or drop me an email.

Email address: lahiru.tjay@gmail.com
