The last few months have been quite intense at Hugging Face 🤗, with crazy usage growth 🚀 and everybody hard at work to keep up with it 🏇, but we finally managed to free up some time and update our open-source library ✨Neuralcoref, publishing the training code along with it.

Since we launched v1 last summer, more than ten million 💯 coreferences have been resolved on Hugging Face. Also, we are stoked that our library is now used in production by a few other companies and some really smart researchers, and our work was featured in the latest session of Stanford’s NLP course! 💪

The training code has been updated to work with the latest releases of both PyTorch (v0.3) and spaCy (v2.0), while the pre-trained model only depends on Numpy and spaCy v2.0.

This release’s major milestone: you can now train ✨Neuralcoref on your own dataset, e.g. in a language other than English, provided you have an annotated corpus!

We have added a special section to the readme about training on another language, as well as detailed instructions on how to download and process the English OntoNotes 5.0 dataset and train the model on it.
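To give a rough idea of the workflow (a sketch only: the module paths and flags below are those of the current repo and may differ between versions, so check the readme for the exact commands), you first pre-process each split of the annotated data and then launch training:

```
# Pre-process each split of the dataset into the model's input format
python -m neuralcoref.train.conllparser --path ./data/train/
python -m neuralcoref.train.conllparser --path ./data/dev/
python -m neuralcoref.train.conllparser --path ./data/test/

# Train the model, evaluating on the dev split
python -m neuralcoref.train.learn --train ./data/train/ --eval ./data/dev/
```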

As before, ✨Neuralcoref is designed to strike a good balance between accuracy and speed/simplicity, using a rule-based mention-detection module, a small set of features, and a simple feed-forward neural network that can be implemented easily in Numpy.
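To make that last point concrete, here is a minimal Numpy sketch of such a feed-forward scorer. It is purely illustrative, assuming hypothetical layer sizes and randomly initialized weights rather than the library's actual architecture and pre-trained parameters: a mention-pair feature vector goes through a couple of ReLU layers and comes out as a single coreference score.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class PairScorer:
    """Minimal feed-forward scorer over mention-pair features (illustrative only)."""

    def __init__(self, layer_sizes, seed=0):
        # In the real model the weights come from training / the pre-trained files;
        # here we just create random matrices of the right shapes.
        rng = np.random.RandomState(seed)
        dims = list(zip(layer_sizes[:-1], layer_sizes[1:]))
        self.weights = [rng.normal(0.0, 0.1, size=(o, i)) for i, o in dims]
        self.biases = [np.zeros(o) for _, o in dims]

    def score(self, features):
        h = np.asarray(features, dtype=float)
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            h = relu(W @ h + b)  # hidden ReLU layers
        # final affine layer produces a single scalar coreference score
        return (self.weights[-1] @ h + self.biases[-1]).item()

# Hypothetical sizes: a 500-d pair-feature vector, two hidden layers, one score
scorer = PairScorer([500, 128, 64, 1])
print(scorer.score(np.zeros(500)))
```

Since inference is just a handful of matrix multiplications like these, the pre-trained model can run with Numpy alone, no deep-learning framework needed.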

In the rest of this blog post, I will describe how the coreference resolution system works and how to train it. Coreference resolution is a rather complicated NLP task 🐉, so bear with me, you won’t regret it!

Let’s have a quick look at a (public) dataset 📚

A good-quality public dataset you can use to train the model on English is the CoNLL 2012 dataset. It is one of the largest freely available datasets with coreference annotations, with over 1.5 million tokens spanning many genres such as newswire, broadcast and telephone conversations, as well as web data (blogs, newsgroups …).

In the repo we explain how to download and prepare this dataset if you want to use it. Once you are done with that, a typical CoNLL file will look like this:
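To give you an idea of the format, here is a heavily simplified sketch of a file (most of the middle columns, which carry parse bits, predicate arguments, speaker and named-entity information, are elided with “…”). Each line holds one token, and the last column encodes the coreference chains:

```
#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001   0   0   My       PRP$   ...   (0
bc/cctv/00/cctv_0001   0   1   sister   NN     ...   0)
bc/cctv/00/cctv_0001   0   2   has      VBZ    ...   -
bc/cctv/00/cctv_0001   0   3   a        DT     ...   (1
bc/cctv/00/cctv_0001   0   4   dog      NN     ...   1)
bc/cctv/00/cctv_0001   0   5   .        .      ...   -
#end document
```

In the last column, `(0` opens a mention belonging to chain 0 (“My sister”), `0)` closes it, and `-` marks tokens outside any mention.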