A few weeks ago I started working on a text summarisation project and needed a Natural Language Processing library with comprehensive features. The project had several potentially computationally expensive components where I wanted to try out different things, so I needed something flexible that was also fast enough to experiment with. The search brought me to spaCy. spaCy proved very helpful, and I decided to summarise its features in this two-part guide. Full code for the post can be found on GitHub. If you are new to NLP, you might consider reading this intro first, as it explains many of the concepts referred to in this post.

Why I like spaCy:

It is fast: spaCy is implemented in Cython, with efficient data structures and parallelism, so I can experiment with different approaches quickly.

It is accurate: there are three pretrained English models to choose from (small, medium and large). Accuracy and speed scale with model size, which gives the flexibility to balance them for a given task.

It is user-friendly: the documentation is good for most of the package and there is a built-in explain function for quick access.

There is also a free interactive four-hour course to get one started.

It has a mild learning curve: after taking the course it was possible to get started and learn as needed on the go.

Features:

Preprocessing: tokenisation, sentence segmentation, lemmatisation, stopwords

Linguistic features: part-of-speech tags, dependency parsing, named entity recognition

Visualisers for dependency trees and named entities

Pre-trained word vectors and models

Flexibility: can augment or replace any pipeline component or add new components such as TextCategorizer.

Transfer learning with BERT-style pretraining

Downsides:

The documentation becomes rather hard to navigate once one gets to training and pretraining models.

Some common NLP functionality is missing, such as scikit-learn-style vectorisers for term-document or TF-IDF matrices. Even though these are not necessary if you are training your models with spaCy, they are still handy if you want to combine spaCy with other tools.

Getting started

First we need to install spaCy and download a pretrained model. There are three English language models (small, medium and large) as well as a model containing only GloVe word vectors; the medium and large models also come with GloVe word vectors. All en_core_web_* models come with tokeniser, tagger, parser and entity recogniser components, and accuracy improves with model size. I will use the large model here. To get started, run the following commands in a terminal.

pip install spacy

python -m spacy download en_core_web_lg

Next we create an nlp object by loading a model into it. The nlp object now has a tokeniser, tagger, parser and entity recogniser in its pipeline, and we can use it to process a text and get all of those features.
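A minimal sketch of this step (the fallback to a blank pipeline is my addition so the snippet runs even where the model is not downloaded; a blank pipeline only tokenises):

```python
import spacy

# Load the large English model downloaded above. If it is not available,
# fall back to a blank English pipeline so the snippet still runs.
try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    nlp = spacy.blank("en")  # tokeniser only: no tagger, parser or NER

# Processing a text runs every component in the pipeline and returns a Doc.
doc = nlp("spaCy is a free open-source library for Natural Language Processing.")

print(nlp.pipe_names)       # the components that ran, in order
print(len(doc), "tokens")
```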

Preprocessing

At this point some of the usual text preprocessing tasks are a breeze. The doc can be sliced with token indices to get single tokens or sequences of tokens (spans), and various token attributes such as text, lemma, index, part-of-speech tag and dependency label can be accessed. Some attributes extend to spans as well. Sentence segmentation is also available.
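For example (again with a blank-pipeline fallback of my own so the snippet runs without the model; lemmas, tags and sentence boundaries are only filled in by a pretrained pipeline):

```python
import spacy

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    nlp = spacy.blank("en")  # tokeniser only

doc = nlp("The quick brown fox jumped over the lazy dog. It was never seen again.")

# Slicing the Doc gives single tokens or Spans.
token = doc[3]      # a Token: 'fox'
span = doc[1:4]     # a Span:  'quick brown fox'
print(token.text, "|", span.text)

# Common token attributes; is_stop works even with a blank tokeniser.
for t in doc[:4]:
    print(t.text, t.i, t.is_stop, t.lemma_, t.pos_)

# Sentence segmentation comes from the dependency parser,
# so it needs one of the pretrained models.
try:
    print([sent.text for sent in doc.sents])
except ValueError:
    print("no parser in the pipeline, so no sentence boundaries")
```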

The medium and large English models also come with GloVe vectors, which can be accessed through the .vector attribute of a token, span or doc. The vector of a span or doc is calculated by averaging the vectors of all its tokens. spaCy also has a built-in similarity function.

The similarity function is the cosine similarity, or the cosine of the angle between two vectors. Cosine similarity ignores vector lengths and, in the two extreme cases, vectors pointing in the same direction have similarity 1, while vectors pointing in opposite directions have similarity -1. So the larger the value, the higher the degree of similarity.
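A minimal sketch of what the similarity method computes under the hood (plain Python rather than spaCy, so it runs anywhere):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction -> ~1, opposite direction -> ~-1, orthogonal -> 0.
print(cosine_similarity([1, 2], [2, 4]))
print(cosine_similarity([1, 2], [-1, -2]))
print(cosine_similarity([1, 0], [0, 3]))
```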

Linguistic features

We can also retrieve linguistic features such as noun chunks, part-of-speech tags and dependency relations between tokens in each sentence. To understand what tags such as token.pos_ , token.tag_ or token.dep_ mean, we can use spacy.explain(), which looks up the annotation specifications.

Notice that token.tag_ carries extra morphological information compared with token.pos_ . These attributes, as well as some others, come in pairs with and without an underscore, corresponding to unicode (string) and integer values respectively. The .dep_ attribute is for dependency relations between words and is best understood via spaCy's built-in customisable visualiser.
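These attributes can be inspected like this (the blank-pipeline fallback is mine; with it, pos_/tag_/dep_ simply stay empty because there is no tagger or parser):

```python
import spacy

# spacy.explain looks up short tag names in spaCy's built-in glossary;
# it works without any model installed.
for label in ("NOUN", "VBD", "nsubj"):
    print(label, "->", spacy.explain(label))

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    nlp = spacy.blank("en")

doc = nlp("She gave him the book.")
for token in doc:
    # pos_ is the coarse tag, tag_ the fine-grained one, dep_ the relation
    # of the token to its syntactic head.
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)
```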

Dependency tree

Retrieving and visualising named entities is done very conveniently in spaCy.

You can also pass custom options to the visualiser.
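A sketch of both steps with displaCy (the manual-mode fallback is my addition: when no model is installed, displaCy can render hand-made annotations instead, which keeps the snippet runnable; the option keys are from the displaCy API):

```python
import spacy
from spacy import displacy

try:
    nlp = spacy.load("en_core_web_lg")
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
    for ent in doc.ents:          # named entities found by the model
        print(ent.text, ent.label_)
    to_render, manual = doc, False
except OSError:
    # No model installed: render a hand-made annotation dict instead.
    to_render = {"text": "Apple is looking at buying U.K. startup.",
                 "ents": [{"start": 0, "end": 5, "label": "ORG"}]}
    manual = True

# Custom options: restrict the entity types shown and recolour one of them.
options = {"ents": ["ORG", "GPE", "MONEY"], "colors": {"ORG": "#ffd966"}}
html = displacy.render(to_render, style="ent", options=options, manual=manual)
print(html[:60])
```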

spaCy universe

spaCy is very flexible. It is possible to add new pipeline components or replace existing ones, and people have been building on top of spaCy: there is a myriad of packages in the spaCy universe. I will only mention two of the pipeline extensions, spacy_langdetect and neuralcoref, but there are many other packages worth spending time to play with.

Since a lot of text data comes from mostly uncurated sources such as the web, language detection comes in especially handy. For example, by combining spacy_langdetect with spaCy's .lang_ attribute (which points to the language of the model of the nlp object used to process the text) one can ensure that the correct language model is used. Notice that ._. is used to access extensions.
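A sketch of the idea using only the extension mechanism itself (the hand-set "language" value is a stand-in of mine: spacy_langdetect registers doc._.language for you as a pipeline component, whereas here we fill it in manually for illustration):

```python
import spacy
from spacy.tokens import Doc

# Register a custom extension attribute, accessed via the ._. namespace,
# just as spacy_langdetect does internally.
Doc.set_extension("language", default=None)

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    nlp = spacy.blank("en")

doc = nlp("This text is in English.")

# Stand-in for a real detector's output on this doc.
doc._.language = {"language": "en", "score": 1.0}

# nlp.lang is the language the loaded model was trained for.
if doc._.language["language"] != nlp.lang:
    print("warning: text language does not match the model language")
else:
    print("language check passed")
```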

Another common feature often necessary in text processing is Coreference Resolution, or linking together all phrases that mention the same thing. For example, in "I like ice cream because it is tasty" both "ice cream" and "it" refer to the same thing. Coreference Resolution is a key part of language understanding, and while humans are pretty good at it, machines are not. The neuralcoref extension integrates this functionality into spaCy as another pipeline component. Note that doc objects created before neuralcoref was added to the pipeline need to be re-processed for its annotations to be available.

We see that 'it' is incorrectly resolved while the resolution for 'he' is correct, and the corresponding scores show that the model was less confident in its decision for 'it' than for 'he'. We can tweak the greediness of the model to make it stricter about resolving coreferences. The default greediness is 0.5, and if we decrease it to 0.45 we see that the coreferences are now resolved correctly.

The neuralcoref package also comes with a visualisation client.

Conclusion