Natural language processing (NLP) is one of the most interesting sub-fields of data science, and data scientists are increasingly expected to be able to whip up solutions that involve the exploitation of unstructured text data. Despite this, many applied data scientists (both from STEM and social science backgrounds) lack NLP experience.

In this post, I explore some fundamental NLP concepts and show how they can be implemented using the increasingly popular spaCy package in Python. This post is for the absolute NLP beginner, but knowledge of Python is assumed.

spaCy, You Say?

spaCy is a relatively new package for "Industrial strength NLP in Python" developed by Matt Honnibal at Explosion AI. It is designed with the applied data scientist in mind: it does not weigh the user down with decisions over which esoteric algorithm to use for common tasks, and it is fast, largely because its core is implemented in Cython. If you are familiar with the Python data science stack, spaCy is your numpy for NLP: reasonably low-level, but very intuitive and performant.

So, What Can It Do?

spaCy provides a one-stop-shop for tasks commonly used in any NLP project, including:

Tokenization

Lemmatization

Part-of-speech tagging

Entity recognition

Dependency parsing

Sentence recognition

Word-to-vector transformations

Many convenient methods for cleaning and normalizing text

I'll provide a high-level overview of some of these features and show how to access them using spaCy.

Let's Get Started!

First, we load spaCy's pipeline, which by convention is stored in a variable named nlp. Creating this variable takes a couple of seconds, because spaCy loads its models and data up front. This gets the heavy lifting out of the way early, so the cost is not incurred each time the nlp parser is applied to your data. Note that I am using the English language model here; there is also a fully featured German model, and tokenization (discussed below) is implemented for several other languages.

We call nlp on the sample text to create a Doc object. The Doc is now a vessel for NLP tasks on the text itself, on slices of the text (Span objects), and on individual elements of the text (Token objects). It is worth noting that Token and Span objects hold no data themselves. Instead, they contain pointers to data held in the Doc object and are evaluated lazily (i.e. upon request). Much of spaCy's core functionality is accessed through the methods and attributes of Doc (n=33), Span (n=29), and Token (n=78) objects.

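In[1]: import spacy
  ...: # On spaCy 3 and later the "en" shortcut no longer exists; load "en_core_web_sm" instead.
  ...: nlp = spacy.load("en")
  ...: # The two spaces before "sick" produce the whitespace token seen in the later output.
  ...: doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't  sick!")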
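To make the "Token and Span objects hold no data" idea more concrete, here is a minimal sketch in plain Python. The classes ToyDoc, ToyToken, and ToySpan are hypothetical names I made up for illustration; this is not how spaCy is implemented internally, only the general pattern of a parent object owning the data and lightweight views looking it up on request.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDoc:
    """Owns the actual token strings (hypothetical stand-in for a Doc)."""
    words: List[str]

    def __getitem__(self, index: int) -> "ToyToken":
        # Hand out a lightweight view rather than a copy of the data.
        return ToyToken(self, index)

    def span(self, start: int, end: int) -> "ToySpan":
        return ToySpan(self, start, end)


@dataclass
class ToyToken:
    """Stores only a reference to its parent and a position."""
    doc: ToyDoc
    i: int

    @property
    def text(self) -> str:
        # Looked up in the parent on request, not stored on the token.
        return self.doc.words[self.i]


@dataclass
class ToySpan:
    """Stores only a reference to its parent and two offsets."""
    doc: ToyDoc
    start: int
    end: int

    @property
    def text(self) -> str:
        return " ".join(self.doc.words[self.start:self.end])


toy_doc = ToyDoc(["The", "big", "grey", "dog"])
token = toy_doc[3]
span = toy_doc.span(1, 3)
print(token.text)  # dog
print(span.text)   # big grey
```

Because the views only point back at the parent, creating many of them is cheap, and any change to the parent's data is visible through every view.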

Tokenization

Tokenization is a foundational step in many NLP tasks. Tokenizing text is the process of splitting a piece of text into words, symbols, punctuation, spaces, and other elements, thereby creating tokens. A naive way to do this is to simply split the string on the whitespace:

In[2]: doc.text.split()
Out[2]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate,', 'but', 'fortunately', 'he', "wasn't", 'sick!']

On the surface, this looks fine. But note that it leaves punctuation attached to the neighboring words ('chocolate,' and 'sick!') and does not split the verb and adverb ("was", "n't"). Put differently, it is naive, and it fails to recognize elements of the text that help us (and a machine) understand its structure and meaning. Let's see how spaCy handles this:

In[3]: [token.orth_ for token in doc]
Out[3]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', ',', 'but', 'fortunately', 'he', 'was', "n't", ' ', 'sick', '!']

Here, we access each token's .orth_ attribute, which returns a string representation of the token rather than a spaCy Token object. This might not always be desirable, but it's worth noting. spaCy recognizes punctuation and is able to split punctuation tokens from word tokens. Many of spaCy's token attributes offer both string and integer representations of the processed text: attributes with an underscore suffix return strings, and those without an underscore suffix return integers. For example:

In[4]: [(token, token.orth_, token.orth) for token in doc]
Out[4]: [(The, 'The', 517), (big, 'big', 742), (grey, 'grey', 4623), (dog, 'dog', 1175),
        (ate, 'ate', 3469), (all, 'all', 516), (of, 'of', 471), (the, 'the', 466),
        (chocolate, 'chocolate', 3593), (,, ',', 416), (but, 'but', 494),
        (fortunately, 'fortunately', 15520), (he, 'he', 514), (was, 'was', 491),
        (n't, "n't", 479), ( , ' ', 483), (sick, 'sick', 1698), (!, '!', 495)]

In[5]: # Keep only the tokens that are neither punctuation nor whitespace.
  ...: [token.orth_ for token in doc if not (token.is_punct or token.is_space)]
Out[5]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', 'but', 'fortunately', 'he', 'was', "n't", 'sick']
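Why keep both a string and an integer for every token? One common reason is that integers are cheaper to store and compare than strings. The sketch below shows the general idea with a hypothetical ToyStringStore class of my own; spaCy's actual string store is implemented differently, so treat this purely as an illustration of the string-to-integer mapping.

```python
from typing import Dict, List


class ToyStringStore:
    """Maps each distinct string to a small integer and back (hypothetical)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def add(self, text: str) -> int:
        # Reuse the existing ID if the string has been seen before.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, string_id: int) -> str:
        return self._strings[string_id]


store = ToyStringStore()
words = ["the", "dog", "ate", "the", "chocolate"]
ids = [store.add(word) for word in words]
print(ids)                             # [0, 1, 2, 0, 3]
print([store.lookup(i) for i in ids])  # round-trips to the original words
```

Note that both occurrences of "the" share one ID, so the string itself only needs to be stored once.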

Cool, right?
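To get a feel for what a tokenizer has to do beyond splitting on whitespace, here is a rough sketch using only the standard library's re module. The pattern and the toy_tokenize function are my own assumptions for illustration; this is not spaCy's tokenization algorithm, which uses language-specific rules and exceptions, and unlike spaCy this sketch simply discards whitespace.

```python
import re

# Alternatives are tried left to right:
#   \w+(?=n't)  a word stem directly followed by the contraction "n't"
#   n't         the contraction itself
#   \w+         any other run of word characters
#   [^\w\s]     a single punctuation or symbol character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def toy_tokenize(text):
    """Return tokens as strings; whitespace is discarded, not tokenized."""
    return TOKEN_PATTERN.findall(text)


sentence = "The big grey dog ate all of the chocolate, but fortunately he wasn't sick!"
tokens = toy_tokenize(sentence)
print(tokens)
```

On the sample sentence this separates the comma and exclamation mark from their words and splits "wasn't" into "was" and "n't", but a single regular expression like this quickly runs out of steam on real text (abbreviations, URLs, hyphenated words), which is where a purpose-built tokenizer earns its keep.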

Lemmatization

A task related to tokenization is lemmatization. Lemmatization is the process of reducing a word to its base form, or its mother word, if you like. Different uses of a word often have the same root meaning. For example, practice, practiced, and practicing all essentially refer to the same thing. It is often desirable to standardise words with similar meaning to their base form. With spaCy, we can access each word's base form with a token's .lemma_ attribute:

In[6]: practice = "practice practiced practicing"
  ...: nlp_practice = nlp(practice)
  ...: [word.lemma_ for word in nlp_practice]
Out[6]: ['practice', 'practice', 'practice']

Why is this useful? An immediate use case is in machine learning, specifically text classification. Lemmatizing the text prior to, for example, creating a "bag-of-words" collapses the inflected variants of a word into a single feature instead of several near-duplicates. This allows the model to build a clearer picture of patterns of word usage across multiple documents.
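The effect on a bag-of-words is easy to see with a small sketch. To keep it runnable without spaCy, I use a hypothetical hand-written lemma table (TOY_LEMMAS) in place of a real lemmatizer; in practice the lemmas would come from something like the .lemma_ values shown above.

```python
from collections import Counter

# Hypothetical lemma table, standing in for a real lemmatizer.
TOY_LEMMAS = {"practiced": "practice", "practicing": "practice", "dogs": "dog"}


def bag_of_words(tokens, lemmatize=False):
    """Count token occurrences, optionally mapping each token to its lemma first."""
    if lemmatize:
        tokens = [TOY_LEMMAS.get(token, token) for token in tokens]
    return Counter(tokens)


tokens = "practice practiced practicing dog dogs".split()
raw = bag_of_words(tokens)
lemmatized = bag_of_words(tokens, lemmatize=True)

print(len(raw))    # 5 distinct features, each seen once
print(lemmatized)  # Counter({'practice': 3, 'dog': 2})
```

Five sparse features become two denser ones, which is the "clearer picture" a downstream classifier benefits from.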

POS Tagging

Part-of-speech tagging is the process of assigning grammatical properties (e.g. noun, verb, adverb, adjective, etc.) to words. Words that share the same POS tag tend to follow a similar syntactic structure, which makes the tags useful in rule-based processes.

For example, in a given description of an event, we may wish to determine who owns what. By exploiting possessives, we can do this (providing the text is grammatically sound!). spaCy's fine-grained tags follow the popular Penn Treebank POS tag set. With spaCy, you can access coarse and fine-grained POS tags with the .pos_ and .tag_ attributes, respectively. Here, I access the fine-grained POS tag:

In[7]: doc2 = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house")
  ...: pos_tags = [(i, i.tag_) for i in doc2]
  ...: pos_tags
Out[7]: [(Conor, 'NNP'), ('s, 'POS'), (dog, 'NN'), ('s, 'POS'), (toy, 'NN'), (was, 'VBD'),
        (hidden, 'VBN'), (under, 'IN'), (the, 'DT'), (man, 'NN'), ('s, 'POS'), (sofa, 'NN'),
        (in, 'IN'), (the, 'DT'), (woman, 'NN'), ('s, 'POS'), (house, 'NN')]

We can see that the 's tokens are labelled as POS. We can exploit this tag to extract each owner and the thing that they own:

In[8]: owners_possessions = []
  ...: for token, tag in pos_tags:
  ...:     if tag == "POS":
  ...:         owner = token.nbor(-1)      # token just before the 's
  ...:         possession = token.nbor(1)  # token just after the 's
  ...:         owners_possessions.append((owner, possession))
  ...:
  ...: owners_possessions
Out[8]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]

This returns a list of owner-possession tuples. If you want to be super Pythonic about it, you can do this in a list comprehension (which I think is preferable!):

In[9]: [(token.nbor(-1), token.nbor(1)) for token, tag in pos_tags if tag == "POS"]
Out[9]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]

Here, we are using each token's .nbor method, which returns the token at a given offset from the current one (-1 for the previous token, 1 for the next).
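The same rule can be sketched on plain (text, tag) pairs, which makes it easy to experiment with without loading a model. The pairs below are copied from the Out[7] listing above as ordinary strings; the owner_possession_pairs helper and the index check are my own additions, the latter so that a possessive marker at the very start or end of the list is skipped instead of reading past the edge.

```python
from collections import defaultdict

# (text, fine-grained tag) pairs copied from Out[7], stored as plain strings.
tagged = [
    ("Conor", "NNP"), ("'s", "POS"), ("dog", "NN"), ("'s", "POS"), ("toy", "NN"),
    ("was", "VBD"), ("hidden", "VBN"), ("under", "IN"), ("the", "DT"),
    ("man", "NN"), ("'s", "POS"), ("sofa", "NN"), ("in", "IN"), ("the", "DT"),
    ("woman", "NN"), ("'s", "POS"), ("house", "NN"),
]


def owner_possession_pairs(tagged_tokens):
    """Pair the word before each POS marker with the word after it."""
    pairs = []
    for idx, (_, tag) in enumerate(tagged_tokens):
        # Guard both ends so a marker with a missing neighbor is skipped.
        if tag == "POS" and 0 < idx < len(tagged_tokens) - 1:
            pairs.append((tagged_tokens[idx - 1][0], tagged_tokens[idx + 1][0]))
    return pairs


# Collect everything each owner possesses into one lookup table.
possessions = defaultdict(list)
for owner, thing in owner_possession_pairs(tagged):
    possessions[owner].append(thing)

print(dict(possessions))
```

Keep in mind that this is a purely positional rule: it assumes the possession is the single word right after the 's, so a phrase like "the man's old sofa" would pair "man" with "old".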

Entity Recognition

Entity recognition is the process of classifying named entities found in a text into predefined categories, such as persons, places, organizations, dates, etc. spaCy uses a statistical model to classify a broad range of entities, including persons, events, works of art, and nationalities/religions (see the documentation for the full list).

For example, let's take the first two sentences from Barack Obama's Wikipedia entry. We will parse this text, then access the identified entities using the Doc object's .ents property. Each entity it returns is a Span, which gives us two additional attributes, .label_ and .label:

In[10]: wiki_obama = """Barack Obama is an American politician who served as
   ...: the 44th President of the United States from 2009 to 2017. He is the first
   ...: African American to have served as president,
   ...: as well as the first born outside the contiguous United States."""
   ...:
   ...: nlp_obama = nlp(wiki_obama)
   ...: [(i, i.label_, i.label) for i in nlp_obama.ents]
Out[10]: [(Barack Obama, 'PERSON', 346), (American, 'NORP', 347), (the United States, 'GPE', 350),
         (2009 to 2017, 'DATE', 356), (first, 'ORDINAL', 361), (African, 'NORP', 347),
         (American, 'NORP', 347), (first, 'ORDINAL', 361), (United States, 'GPE', 350)]

You can see the entities that the model has identified and that, in this instance, they are largely accurate (although "African American" is split into two separate entities). PERSON is self-explanatory, NORP is nationalities or religious or political groups, GPE covers geopolitical entities (countries, cities, states), DATE recognizes a specific date or date range, and ORDINAL identifies a word or number representing some type of order.
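Once you have (entity, label) pairs, a common next step is to group them by label. Here is a small sketch that does so on the results from Out[10], copied in as plain strings so that it runs without a model; the variable names are my own.

```python
from collections import defaultdict

# (entity text, label) pairs copied from Out[10], stored as plain strings.
entities = [
    ("Barack Obama", "PERSON"), ("American", "NORP"), ("the United States", "GPE"),
    ("2009 to 2017", "DATE"), ("first", "ORDINAL"), ("African", "NORP"),
    ("American", "NORP"), ("first", "ORDINAL"), ("United States", "GPE"),
]

# Group the distinct entity strings under their label, keeping first-seen order.
by_label = defaultdict(list)
for text, label in entities:
    if text not in by_label[label]:
        by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```

Grouping like this also makes near-duplicates easy to spot: "the United States" and "United States" end up side by side under GPE, which is a hint that some normalization may be needed before counting.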

While we are on the topic of Doc methods, it is worth mentioning spaCy's sentence identifier. NLP tasks commonly need to split a document into sentences. It is simple to do this with spaCy by accessing a Doc's .sents property:

In[11]: for ix, sent in enumerate(nlp_obama.sents, 1):
   ...:     print("Sentence number {}: {}".format(ix, sent))
   ...:
Sentence number 1: Barack Obama is an American politician who served as the 44th President of the United States from 2009 to 2017.
Sentence number 2: He is the first African American to have served as president, as well as the first born outside the contiguous United States.
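As with tokenization, it is tempting to do this with a one-line rule. The sketch below splits after sentence-ending punctuation using the standard library; the naive_sentences function is my own illustration, not anything spaCy does. It happens to work on the Obama text, but the second call shows how quickly a rule like this breaks.

```python
import re

text = (
    "Barack Obama is an American politician who served as the 44th "
    "President of the United States from 2009 to 2017. He is the first "
    "African American to have served as president, as well as the first "
    "born outside the contiguous United States."
)


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text) if part]


print(len(naive_sentences(text)))  # 2
print(naive_sentences("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The abbreviation "Dr." is wrongly treated as the end of a sentence, which is the kind of case a dedicated sentence recognizer is meant to handle.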

That's it for now. In later posts, I'll show how spaCy can be used in complex data mining and ML tasks.