Version 2.1 of the spaCy Natural Language Processing library includes a huge number of features, improvements and bug fixes. In this post, we highlight some of the things we’re especially pleased with, and explain some of the most challenging parts of preparing this big release.

Prodigy, our annotation tool, is fully scriptable and complements spaCy extremely well. Most NLP projects are easier if you have a way to train models on exactly your data. This lets you improve accuracy, and customize the label set. Prodigy’s community has been growing quickly, allowing us to keep spaCy fully funded without any external investment.

spaCy is an open-source library for industrial-strength natural language processing in Python. It’s widely used in production and research systems for extracting information from text, developing smarter user-facing features, and preprocessing text for deep learning. We’ve been publishing alpha releases to spacy-nightly for months now, and encouraging users to try out the new version. Today we’re excited to finally publish spaCy v2.1.0. We’ve fixed almost every outstanding bug on the tracker, given the docs a huge makeover, improved both speed and accuracy, made installation significantly easier and faster, and developed some exciting new features. Check out the release notes for a full overview.

By far the biggest news in NLP research over 2018 was the success of language model pretraining. The basic intuition behind this has been obvious for a very long time. There’s never been much doubt that NLP models need to somehow import knowledge from raw text, as labelled training corpora tend to be too small to represent long-tailed knowledge about word meanings and usage. In 2011, deep learning methods were proving successful for NLP, and techniques for pretraining word representations were already in use. A range of techniques for pretraining further layers of the network were proposed over the years, as the deep learning hype took hold. However, no one objective for the pretraining seemed to be a knockout success on a wide range of tasks.

In 2018, a number of papers showed that a simple language modelling objective worked well for LSTM models. Devlin et al. then presented a neat modification that allowed bidirectional models to be pretrained as well. One of the major themes throughout these results was that pretraining allowed extremely large models to be used, even when the labelled data is fairly small. A team from OpenAI took this one step further, training an even larger version of Devlin et al.’s model, and showing it performs well on long-form text generation.

While these large models provide convincing demonstrations, they’re not suitable for spaCy’s main use-cases. The performance target we’ve set for ourselves is 10,000 words per second per CPU core. The v2.1 models currently run at around 8,000 words per second, so we’re already slightly behind. Clearly, we couldn’t use a model such as BERT or GPT-2 directly. But the same principle of pretraining should still apply, so long as we could find a way to scale it down.


Scaling down these language models to the sizes we use in spaCy posed an interesting research challenge. Language models typically use a large output layer, with one neuron per word in the vocabulary. If you’re predicting over a 10,000 word vocabulary, this means you’re predicting a vector with 10,000 elements. spaCy v2.1’s token vectors are 96 elements wide, so a naive softmax approach would be unlikely to work: we’d be trying to predict 100 elements of output for every 1 element of input. We could make the vocabulary somewhat smaller, but every word that’s out of vocabulary is a word the pretraining process will be unable to learn. Stepping back a little, the problem of so-called “one hot” representations posing representational issues for neural networks is actually quite familiar. This is exactly what algorithms like word2vec, GloVe and FastText set out to solve. Instead of a binary vector with one dimension per entry in the vocabulary, we can have a much denser real-valued representation of the same information.

The spacy pretrain command requires a word vectors model as part of the input, which it uses as the target output for each token. Instead of predicting a token’s ID as a classification problem, we learn to predict the token’s word vector. Inspired by names such as ELMo and BERT, we’ve termed this trick Language Modelling with Approximate Outputs (LMAO). Our first implementation is probably a good way to get acquainted with the idea – it’s extremely short.
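As a rough illustration of the idea (a plain-Python sketch, not spaCy’s implementation), the output layer below predicts a vector with one element per word-vector dimension, and the loss is the squared L2 distance to the token’s pretrained vector. The function names, the toy dimensions and all the numbers are hypothetical and chosen only to keep the example small.

```python
def predict_vector(hidden, weights):
    """Linear output layer: one output per word-vector dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def l2_loss_and_gradient(predicted, target):
    """Squared L2 distance to the target vector, plus its gradient
    with respect to the prediction."""
    diffs = [p - t for p, t in zip(predicted, target)]
    loss = sum(d * d for d in diffs)
    gradient = [2.0 * d for d in diffs]
    return loss, gradient


# Toy sizes (assumption): a 4-element hidden state and 3-element word vectors.
hidden = [0.5, -1.0, 0.25, 2.0]
weights = [
    [0.1, 0.0, 0.0, 0.2],
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.1],
]
# Stands in for the pretrained word vector of the token being predicted.
target_vector = [0.5, -0.5, 0.0]

predicted = predict_vector(hidden, weights)
loss, gradient = l2_loss_and_gradient(predicted, target_vector)
print(len(predicted), loss, gradient)
```

The number of outputs depends only on the width of the word vectors, not on the size of the vocabulary, which is what makes the objective practical for a small network.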

As is often the case in research, it seems that LMAO is an idea whose time had come. Several other researchers have been working on related ideas independently. So far we’ve been using L2 loss in our experiments, but Kumar and Tsvetkov (2018), who were simultaneously working on a similar idea for machine translation, have developed a novel probabilistic loss using the von Mises-Fisher distribution, which they show performs significantly better than L2 in their experiments. Even more recently, Li et al. (2019) report experiments using an LMAO objective in place of the softmax layer in the ELMo pretraining system, with promising results. In our own preliminary experiments, we’ve found pretraining especially effective when limited training data is available. It helps most for text categorization and parsing, but is less effective for named entity recognition. We expect the pretraining to be increasingly important as we add more abstract semantic prediction models to spaCy, for tasks such as semantic role labelling, coreference resolution and named entity linking.

As a small example, we ran spacy pretrain for the English sm and lg models using 100,000 comments from the Reddit comments corpus:

Pretraining examples

    python -m spacy pretrain /input/reddit-100k.jsonl en_vectors_web_lg /output
    python -m spacy pretrain /input/reddit-100k.jsonl en_vectors_web_lg /output --use-vectors

We ran both pretraining jobs simultaneously on a Tesla V100, with each task training at around 50,000 tokens per second. We pretrained for 3 billion words (making several passes over the 100k comments), which took around 17 hours. The total cost of both jobs came out to about $40.00 on Google Compute Engine. We haven’t implemented resume logic yet, which will help decrease the cost of large scale jobs further, as it would allow the use of pre-emptible instances. This would take pretraining costs down to around $4 per billion words of training. The spacy pretrain command saves out a weights file after each pass over the data. To use the pretrained weights, we can simply pass them as an argument to spacy train:

    python -m spacy train en /models/ \
        /corpora/PTB_SD_3_3_0/train.gold.json \
        /corpora/PTB_SD_3_3_0/dev.gold.json \
        --n-examples 100 --pipeline parser \
        --init-tok2vec pretrain-nv-model999.bin
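The throughput and duration quoted for the pretraining run can be checked with a little arithmetic. The sketch below treats “tokens” and “words” as interchangeable and assumes the quoted cost covers both jobs at 3 billion words each; both are assumptions made for the calculation, not statements from the run’s logs.

```python
tokens_per_second = 50_000      # per job, as quoted above
words_trained = 3_000_000_000   # 3 billion words per job

seconds = words_trained / tokens_per_second
hours = seconds / 3600

# Assumption: the $40 total covers both jobs, i.e. 6 billion words overall.
total_cost_usd = 40.00
billions_of_words = 2 * words_trained / 1e9
cost_per_billion = total_cost_usd / billions_of_words

print(f"{hours:.1f} hours per job")
print(f"${cost_per_billion:.2f} per billion words")
```

Under those assumptions each job needs a little under 17 hours, which agrees with the figure reported above.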

We’re also pleased to report our first independent positive result for the spacy pretrain command. Jari Bakken and Ole Henrik Skogstrøm have been working on Norwegian Bokmål support for spaCy, using NER annotations produced by the University of Oslo. Even with a small amount of pretraining using default settings, the spacy pretrain command resulted in much better performance for all three components: the tagger, the parser and the entity recognizer.

Pretraining   POS     UAS     LAS     NER P   NER R   NER F
❌ no         94.60   88.59   86.10   71.96   70.54   71.24
✅ yes        95.07   90.14   87.82   78.92   78.69   78.81

Model details: Preliminary results for Norwegian Bokmål. Pretraining was performed with default settings on a 700 million word corpus. Pretraining was allowed to run for 15 hours, during which 7 epochs were completed. The final trained model was 15 MB. See this thread for more details.

Over the years, the rule-based Matcher has become one of spaCy’s most popular features. Statistical models are great at generalizing based on the context and beyond specific examples – but they can’t always beat large terminology lists and application-specific rules. Rule-based systems are especially powerful when they can leverage statistical predictions, like part-of-speech tags, syntactic dependencies or entity labels.

spaCy v2.1 ships with a new matcher engine, rewritten from scratch. It resolves various issues around the use of operators and quantifiers like "OP": "?" to make a token optional. The API also introduces new predicates to express set membership or rich comparison. The following pattern matches a sequence of two tokens: a pronoun whose lowercase form isn’t “i” or “it”, followed by a verb with the base form “like” or “love”:

    pattern = [
        {"POS": "PRON", "LOWER": {"NOT_IN": ["i", "it"]}},
        {"POS": "VERB", "LEMMA": {"IN": ["like", "love"]}},
    ]

What are extension attributes? As of v2.0, spaCy allows registering custom attributes on the Doc, Token and Span classes that become available via the ._ property. Attributes can be overwritten manually, or computed via a getter function.

The new match pattern API now also supports a "_" key, allowing patterns to specify custom extension attribute values to match on. In this case, the pattern matches a token if token._.number is greater than or equal to 20:

    pattern = [{"_": {"number": {">=": 20}}}]
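To make the semantics of these predicates concrete, here is a toy evaluator written in plain Python. It is not spaCy’s matcher: tokens are plain dicts, extension attributes live under a "_" key, there is no support for operators or quantifiers, and the set of comparison operators in the table is an assumption beyond the ">=" shown above. All function names are hypothetical.

```python
import operator

# Toy operator table (assumption): comparison predicates map onto
# Python's rich comparison functions.
COMPARISONS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def value_matches(value, spec):
    """Check one attribute value against a plain value or a predicate dict."""
    if value is None:  # a missing attribute never matches
        return False
    if not isinstance(spec, dict):
        return value == spec
    for predicate, argument in spec.items():
        if predicate == "IN":
            ok = value in argument
        elif predicate == "NOT_IN":
            ok = value not in argument
        else:
            ok = COMPARISONS[predicate](value, argument)
        if not ok:
            return False
    return True


def token_matches(token, token_pattern):
    """Match one token dict against one pattern dict."""
    for attr, spec in token_pattern.items():
        if attr == "_":
            extensions = token.get("_", {})
            if not all(value_matches(extensions.get(name), s)
                       for name, s in spec.items()):
                return False
        elif not value_matches(token.get(attr), spec):
            return False
    return True


def sequence_matches(tokens, pattern):
    """True if the tokens match the pattern one-to-one, in order."""
    return len(tokens) == len(pattern) and all(
        token_matches(t, p) for t, p in zip(tokens, pattern)
    )


pronoun_verb = [
    {"POS": "PRON", "LOWER": {"NOT_IN": ["i", "it"]}},
    {"POS": "VERB", "LEMMA": {"IN": ["like", "love"]}},
]
we_love = [
    {"POS": "PRON", "LOWER": "we", "LEMMA": "we"},
    {"POS": "VERB", "LOWER": "love", "LEMMA": "love"},
]
it_likes = [
    {"POS": "PRON", "LOWER": "it", "LEMMA": "it"},
    {"POS": "VERB", "LOWER": "likes", "LEMMA": "like"},
]
number_pattern = [{"_": {"number": {">=": 20}}}]

print(sequence_matches(we_love, pronoun_verb))
print(sequence_matches(it_likes, pronoun_verb))
print(sequence_matches([{"_": {"number": 25}}], number_pattern))
```

The second sequence is rejected because its pronoun is in the NOT_IN list, even though the verb’s lemma is accepted.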

When we introduced custom pipeline components in v2.0, many users took advantage of them to build their own rule-based entity recognizers powered by the Matcher. Whether it’s cities, gene names or units for oil drilling, many entity types can be expressed pretty unambiguously with terminology lists and token-based rules.

The EntityRuler is a useful new component that can do all of this out-of-the-box. If it’s added before the entity recognizer in the pipeline, the entities it sets directly influence the model’s predictions. The statistical entity recognizer will respect pre-defined entity spans and take them into account when predicting the entity tags for the remaining tokens, which can potentially give you a nice boost in accuracy. If the entity ruler is added after the statistical entity recognizer, it can “fill in the blanks” and catch entities that the model missed, or optionally overwrite existing predictions.

Using the entity ruler

    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.load("en_core_web_sm")
    # A number-like token followed by a unit of weight
    weights_pattern = [
        {"LIKE_NUM": True},
        {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}},
    ]
    patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
    ruler = EntityRuler(nlp, patterns=patterns)
    nlp.add_pipe(ruler, before="ner")
    doc = nlp("U.S. average was 2 lbs.")
    print([(ent.text, ent.label_) for ent in doc.ents])

A pattern can either be a list of dictionaries describing the individual tokens, or an exact string match. If you’ve been using our annotation tool Prodigy, you might recognize this format from the pattern files you can load in to bootstrap new entity types and text categories. The formats are fully compatible, so you’ll be able to use your Prodigy patterns with the entity ruler, and vice versa.
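The two behaviours described for a ruler placed after the statistical entity recognizer (“fill in the blanks” versus overwriting) can be pictured with a small plain-Python sketch. This is a toy model, not the EntityRuler’s code: spans are (start, end, label) tuples over token indices, the `overwrite` flag and function names are hypothetical, the rule spans are assumed not to overlap each other, and the model prediction in the example is made up.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def add_rule_entities(model_ents, rule_ents, overwrite=False):
    """Combine model predictions with rule matches.

    overwrite=False: keep every model entity, add only rule spans that
    do not collide with one ("fill in the blanks").
    overwrite=True: rule spans win, colliding model entities are dropped.
    """
    if overwrite:
        kept = [m for m in model_ents
                if not any(overlaps(m, r) for r in rule_ents)]
        return sorted(kept + list(rule_ents))
    added = [r for r in rule_ents
             if not any(overlaps(r, m) for m in model_ents)]
    return sorted(list(model_ents) + added)


# Tokens: U.S.(0) average(1) was(2) 2(3) lbs(4) .(5)
model_ents = [(3, 4, "CARDINAL")]                   # made-up model output
rule_ents = [(0, 1, "GPE"), (3, 5, "QUANTITY")]     # made-up rule matches

filled = add_rule_entities(model_ents, rule_ents)
overwritten = add_rule_entities(model_ents, rule_ents, overwrite=True)
print(filled)
print(overwritten)
```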

The EntityRuler is also fully serializable, making it easy to package entity rules with your spaCy models. Patterns will be saved out to the model directory as a .jsonl file (newline-delimited JSON) and loaded back in when you load the model. We’re hoping that this component can be used to power models that rely on large domain-specific terminology lists.

Serializing the entity ruler

    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.load("en_core_web_sm")
    # lots_of_patterns is a placeholder for your own list of pattern dicts
    ruler = EntityRuler(nlp, patterns=lots_of_patterns)
    nlp.add_pipe(ruler, before="ner")
    nlp.to_disk("/path/to/model-with-rules")
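Newline-delimited JSON simply stores one JSON object per line, so a pattern file can be written and read back with the standard library alone. The sketch below shows that round trip; the file name patterns.jsonl, the helper names and the example patterns are assumptions for illustration, and spaCy handles this step for you when you call nlp.to_disk.

```python
import json
import os
import tempfile

# Example patterns (hypothetical): one token-based, one exact string match.
patterns = [
    {"label": "QUANTITY",
     "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": ["kg", "lbs"]}}]},
    {"label": "GPE", "pattern": "San Francisco"},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as file_:
        for record in records:
            file_.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a newline-delimited JSON file back into a list of objects."""
    with open(path, encoding="utf8") as file_:
        return [json.loads(line) for line in file_ if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "patterns.jsonl")
    write_jsonl(path, patterns)
    loaded = read_jsonl(path)

print(loaded == patterns)
```

Because each line is independent, large terminology lists can be appended to or streamed without parsing the whole file at once.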

spaCy has always supported merging spans of several tokens into single tokens – for example, to merge a noun phrase into one word. However, the existing Doc.merge and Span.merge implementations were inefficient when merging in bulk, because the array had to be resized each time. On top of that, it was difficult to keep track of changing token indices, and easy to end up with incorrectly merged spans.

The new Doc.retokenize context manager is specifically optimized for bulk processing. Merges are collected and performed when the context manager exits.

Retokenization with merging

    doc = nlp("I moved from New York to Los Angeles")
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[3:5], attrs={"LEMMA": "New York"})
        retokenizer.merge(doc[6:8], attrs={"LEMMA": "Los Angeles"})
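The benefit of collecting merges and applying them together can be seen in a toy version that works on a plain list of strings. This is a sketch of the general idea rather than spaCy’s implementation: every span refers to indices in the original token list, so no span is invalidated by an earlier merge, and the output is built in a single pass. The function name is hypothetical.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of `tokens` into one token in one pass.

    Spans index into the ORIGINAL token list and must not overlap.
    """
    merged = []
    position = 0
    for start, end in sorted(spans):
        if start < position:
            raise ValueError("overlapping spans")
        merged.extend(tokens[position:start])       # untouched tokens
        merged.append(" ".join(tokens[start:end]))  # the merged token
        position = end
    merged.extend(tokens[position:])
    return merged


tokens = "I moved from New York to Los Angeles".split()
result = merge_spans(tokens, [(3, 5), (6, 8)])
print(result)
```

If the merges were applied one at a time instead, the second span would have to be shifted to account for the token removed by the first.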

In addition to merging, Doc.retokenize can also split one token into several. The process requires more settings, because you need to specify the text of the individual tokens, optional per-token attributes and how the new tokens should be attached to the existing syntax tree. To prevent mismatches, the heads can be provided as tokens, or (token, subtoken) tuples if the newly split token should be attached to another subtoken.

Retokenization with splitting

    doc = nlp("I live in NewYork")
    with doc.retokenize() as retokenizer:
        # "New" attaches to subtoken 1 ("York"); "York" attaches to "in"
        heads = [(doc[3], 1), doc[2]]
        # Per-subtoken attributes, in order: "New" is a compound modifier,
        # "York" is the object of the preposition
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["compound", "pobj"]}
        retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)

With better splitting and merging, we’re also well set up for better support for statistical tokenization. Tokenizing languages like English and German works fine with a rule-based approach, but for languages like Chinese, Vietnamese and Japanese, statistical models are definitely required. The v2.1 release also has some quiet improvements that will help set the stage for better support for these languages. The GoldParse class is now able to calculate many-to-many alignments between the tokenization in a Doc object and the gold-standard. When the Doc object over-segments, the parser can now learn to predict a special label that can mark the tokens for merging by a later component. With this approach, spaCy’s parser is now capable of jointly predicting tokenization, sentence segmentation and parsing, which should be very helpful for languages or genres with high mutual information between these problems.
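One simple way to picture a many-to-many alignment between two tokenizations is to compare character offsets: two tokens are aligned if they cover at least one character in common. The sketch below does exactly that in plain Python. It is an illustration under the assumption that both tokenizations cover the same characters with whitespace ignored; it is not the algorithm used by GoldParse, and the function names are hypothetical.

```python
def char_spans(tokens):
    """Character (start, end) offsets of each token, ignoring whitespace."""
    spans, offset = [], 0
    for token in tokens:
        spans.append((offset, offset + len(token)))
        offset += len(token)
    return spans


def align(tokens_a, tokens_b):
    """For each token in tokens_a, list the indices of the overlapping
    tokens in tokens_b."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("tokenizations must cover the same characters")
    spans_b = char_spans(tokens_b)
    alignment = []
    for a_start, a_end in char_spans(tokens_a):
        alignment.append([
            j for j, (b_start, b_end) in enumerate(spans_b)
            if a_start < b_end and b_start < a_end
        ])
    return alignment


predicted = ["can", "'t", "wait"]
gold = ["ca", "n't", "wait"]
print(align(predicted, gold))
print(align(gold, predicted))
```

Here the predicted token "can" overlaps two gold tokens, while the gold token "n't" overlaps two predicted tokens, so neither side can be described as a simple split or merge of the other.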

The biggest performance issue in the tokenizer was variable-width lookbehinds, caused by character classes which actually consist of multiple-character tokens (like ’’). This meant that they needed to be grouped into disjunctive expressions using |, rather than character classes using [] which only require a set lookup. Variable-width lookbehinds introduce a serious performance problem, because they can’t be recognized by a finite-state automaton. Essentially, you’re no longer dealing with “regular expressions” once you have these. Russ Cox (who wrote re2) has a very comprehensive overview of these issues.

The variable-width expressions crept in over time, once we had switched over to the regex library in order to make use of its better Unicode support, especially for Python 2. Performance got worse bit by bit, as many of the regular expressions were adapted across a number of contributions that widened support to new languages and fixed specific problems. By the time we noticed the efficiency problems, refactoring the tokenization rules to remove the variable-width lookbehinds had become a significant project, requiring focussed attention.

The problem was finally solved by Sofie Van Landeghem, in the first of what we hope will be many consulting projects for spaCy. The changes improve the efficiency of the tokenizer by two to three times, with equivalent accuracy when evaluated against the Universal Dependencies corpora. This also allows us to avoid depending on the regex module, and instead switch back to Python’s built-in re library.
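Python’s built-in re module makes the constraint easy to see, because it refuses to compile a lookbehind whose alternatives have different lengths. The patterns below are made-up examples, not spaCy’s tokenizer rules; they show a variable-width lookbehind being rejected and an equivalent formulation built from fixed-width lookbehinds being accepted.

```python
import re


def compiles(pattern):
    """True if Python's built-in `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# One lookbehind whose alternatives are 1 and 2 characters wide.
variable_width = r"(?<=a|bc)x"
# The same condition as a disjunction of fixed-width lookbehinds.
fixed_width = r"(?:(?<=a)|(?<=bc))x"

print(compiles(variable_width))
print(compiles(fixed_width))
print(re.findall(fixed_width, "ax bcx dx"))
```

The fixed-width version matches the "x" after "a" and after "bc", but not the one after "d".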

Graphs, analysis and implementation by Sofie Van Landeghem

The first versions of spaCy used models trained with the Averaged Perceptron algorithm, one of the simplest machine learning models. This meant that prior to v2.0, we had no real need for a maths library – the performance bottlenecks were in the hash table and feature extraction.

All that changed in v2.0, when we switched over to neural network models. In a neural network model, the performance bottleneck is matrix multiplication. Not all matrix multiplication solutions are created equal. Using an implementation that’s well optimized for your hardware can deliver an order of magnitude better performance than a generic implementation. Worse still, different implementations have different bugs.

The upshot of all this is that if you have two numpy arrays and you write A @ B , you might find that your code runs 20 times slower on your server than your desktop, performance with pip is dramatically different than performance with conda, and your colleagues report intermittent crashes when running the code on their laptops, but only in some modules, and not in others.

We were pretty dissatisfied with that, so we set out to fix it. Our humble goal was to make sure that when spaCy multiplied its matrices together, it always called into the same library – regardless of your choice of operating system, and regardless of whether you installed spaCy using pip or conda.


Our other humble aim was to make sure that spaCy jobs don’t launch a bunch of unwanted threads. The task of processing a bunch of text with spaCy is embarrassingly parallel, so you want to make sure you’re parallelizing the outermost loop possible. Nested parallelism is inefficient, which means the matrix multiplication library must not launch threads. This is something Accelerate, OpenBLAS and MKL all get terribly wrong.
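Parallelizing the outermost loop means handing whole documents to single-threaded workers, rather than letting each matrix multiplication spawn its own threads. A minimal sketch with the standard library follows; `count_tokens` is a hypothetical stand-in for per-document work such as calling a pipeline on a text, and the `executor_cls` parameter is an illustrative convenience, not part of any spaCy API.

```python
from concurrent.futures import ProcessPoolExecutor


def count_tokens(text):
    """Stand-in (assumption) for single-threaded work on one document."""
    return len(text.split())


def process_texts(texts, n_workers=2, executor_cls=ProcessPoolExecutor):
    """Distribute whole documents across workers: the outermost loop.

    Each worker handles one document at a time and is expected not to
    start threads of its own, so parallelism is never nested.
    """
    with executor_cls(max_workers=n_workers) as executor:
        return list(executor.map(count_tokens, texts))
```

When using a process pool, call `process_texts(texts)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on every platform.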

Achieving these two humble aims turned out to be an enormous year-long struggle. First of all, what happens if we just do nothing, and use numpy? Well, numpy will delegate its matrix multiplications to a system library. The choice of system library depends on the state of your system during installation, and whether you installed numpy using pip or conda. On conda, numpy will usually be linked against Intel’s MKL library. On pip, your mileage may vary, but you’ll usually find yourself with a kernel from OpenBLAS if you’re using Windows or Linux, and the native Accelerate library on OSX. On my machines, the vendored OpenBLAS kernel often performs poorly, while the Accelerate kernel can crash when used in combination with Python’s multiprocessing module. People are working hard on all these problems, so the specifics may change within a month or two… But the basic unreliability of this approach will remain. If you can’t easily make your desktop, your colleague’s laptop and your server all run the same code, you’re going to have a bad time.

Another solution would be to pick a library, require it to be installed into the system, and link against it. On conda, this would work okay, as conda allows you to specify non-Python dependencies. With pip, the user experience from this approach is pretty bad, especially for Windows users. In order to install the system dependency, a Windows user would have to install and configure the correct compiler, and compile the library from source. This is likely to be at least a whole day of yak shaving misery. The specifics will probably also change over time, so the guides we provide will be constantly going out of date.

The only way to make sure that pip install spacy works correctly is to provide a self-contained package which includes the necessary matrix multiplication routines. This also solves the threading problems: we can make sure that no threads are launched unless we want them, without requiring unintuitive environment variables to be set.

Preparing this stand-alone package was one of the most joyless programming tasks imaginable. Many extremely unfun days were spent ensuring the solution worked successfully on Windows, OSX and Linux. Getting manylinux wheels built using the various CI solutions was another extremely frustrating saga.

At the end of it all, we’re relieved to now depend on our new package cython-blis. We’re very grateful to Field Van Zee and the rest of the Blis community for their work on these linear algebra routines, which we’ve found to offer a great blend of stability, performance and usability. We’re still waiting for our package to be merged on conda-forge, but we’ve been using cython-blis for months now on the spacy-nightly branch, and have had no problems.

spaCy is mostly written in Cython, and it relies on several other packages that make use of C extensions. In order for pip install spacy to work, you would need to make sure a compiler was installed and the Python development headers were available. If everything was working correctly, installation would then take a few minutes to complete. Last year, we managed to improve installation times significantly, by providing wheel installation files for spaCy and our other packages. To make this happen, we teamed up with Nathaniel Smith to build Wheelwright, a more user-friendly interface into Matthew Brett’s multibuild, an awesome contraption that uses layers of scripts and Docker containers to convince Travis and Appveyor to build wheel installation files that work on a wide variety of platforms.

For the v2.1 release, we’ve managed to consolidate or eliminate several of spaCy’s dependencies, allowing us to offer a fully wheeled installation. Here’s how spaCy’s dependencies look now:

requirements.txt

    cymem>=2.0.2,<2.1.0
    preshed>=2.0.1,<2.1.0
    thinc>=7.0.2,<7.1.0
    blis>=0.2.2,<0.3.0
    murmurhash>=0.28.0,<1.1.0
    wasabi>=0.1.3,<1.1.0
    srsly>=0.0.5,<1.1.0
    plac<1.0.0,>=0.9.6
    tqdm>=4.10.0,<5.0.0
    numpy>=1.15.0
    requests>=2.13.0,<3.0.0
    jsonschema>=2.6.0,<3.0.0

Both numpy and requests are so widely used they’re almost part of the Python standard library, and plac, tqdm and jsonschema are very small pure Python packages. All of the other requirements are in-house, allowing us to make sure wheel installation files are available.

Minimizing our third-party dependencies also greatly increases the library’s stability. Due to Python’s import semantics, only one version of a given package can be installed in an environment at a time. This means that every third-party dependency we add increases the chance that our users will wake up to broken builds.