This guide shows how to perform NER tagging for English and non-English languages with NLTK and the Stanford NER tagger (Python). You can also use it to improve the Stanford NER tagger's English model.

A short introduction to Named-Entity Recognition

First and foremost, a few definitions: Natural Language Processing (NLP) is a field of machine learning that seeks to understand human languages. It’s one of the most difficult challenges Artificial Intelligence has to face. NLP covers several problems, from speech recognition and language generation to information extraction.

NLP provides specific tools to help programmers extract pieces of information from a given corpus. Here is a short list of the most common tasks: tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.

NLTK (Natural Language Toolkit) is a wonderful Python package that provides a set of natural language corpora and APIs for an impressive diversity of NLP algorithms. It’s easy to use, complete, and well documented. Of course, it’s free, open-source, and community-driven.

Let’s dive into Named Entity Recognition (NER). NER is about locating and classifying named entities in text in order to recognize places, people, dates, values, and organizations. As an example:

Twenty miles east of Reno, Nev., where packs of wild mustangs roam free through the parched landscape, Tesla Gigafactory 1 sprawls near Interstate 80. […] The Gigafactory, whose construction began in June 2014, is not only outrageously large but also on its way to becoming the biggest manufacturing plant on earth. Now 30 percent complete, its square footage already equals about 35 Costco stores. […] (NY Times, November 2017)

This guide will show you how to implement NER tagging for non-English languages using NLTK. Enjoy reading!

A step-by-step guide to non-English NER with NLTK

At Sicara, I recently had to build algorithms to extract names and organizations from a French corpus. Since NLTK comes with the efficient Stanford Named Entity tagger, I thought it would do the work for me, out of the box.

But I was wrong: I had forgotten my corpus was French, and the Stanford NER tagger is designed for English only.

The only way to get it done is to train your own NER model. Use cases:

you are working with a non-English corpus (French, German, Dutch…);

you want to improve the Stanford English model.

I hope this step-by-step guide will help you.

Step 1: Implementing NER with Stanford NER / NLTK

Let’s start!

Because the Stanford NER tagger is written in Java, you need a working Java Virtual Machine installed on your computer.

To do so, install Java JRE 8 or higher. You can install the Java JDK (developer kit) instead if you want, since it contains the JRE. Linux users will find all the needed information in the guide How To Install Java with Apt-Get on Ubuntu 16.04. Other users should have a look at the official Java documentation.

Once installed, make sure your $JAVA_HOME environment variable is set:

echo $JAVA_HOME

Mine is /usr/lib/jvm/java-8-oracle. That’s it for Java!

If you haven’t done it yet, create a virtual environment to work on:

mkvirtualenv .venv-ner --python=/usr/bin/python3
workon .venv-ner

Install NLTK:

pip install nltk

Get the Stanford NER tagger: download the zip file stanford-ner-xxxx-xx-xx.zip from the ‘Download’ section of the Stanford NLP website.

Unzip it, then move the tagger jar (renamed ner-tagger.jar) and the gzipped English model english.all.3class.distsim.crf.ser.gz into your application folder:

cd /home/charles/Downloads/
unzip stanford-ner-2017-06-09.zip
mv stanford-ner-2017-06-09/stanford-ner.jar {yourAppFolder}/stanford-ner-tagger/ner-tagger.jar
mv stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz {yourAppFolder}/stanford-ner-tagger/ner-model-english.ser.gz

We now have two files in our stanford-ner-tagger folder:

ner-tagger.jar: the NER tagger engine itself;

ner-model-english.ser.gz: an NER model trained on an English corpus.

Copy the following ner_english.py script to perform English Named Entity Recognition:
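A minimal version, using NLTK’s StanfordNERTagger wrapper; the paths assume the stanford-ner-tagger folder built above sits next to the script:

```python
# ner_english.py
# Minimal sketch: wraps the Stanford NER jar and the English model moved above.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    'stanford-ner-tagger/ner-model-english.ser.gz',  # English CRF model
    'stanford-ner-tagger/ner-tagger.jar',            # tagger engine
    encoding='utf-8')

text = ("Twenty miles east of Reno, Nev., where packs of wild mustangs roam free "
        "through the parched landscape, Tesla Gigafactory 1 sprawls near Interstate 80.")

# word_tokenize needs the NLTK 'punkt' resource: nltk.download('punkt')
print(tagger.tag(word_tokenize(text)))
```

Note that this needs the Java VM from the previous step: NLTK shells out to Java under the hood.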

Run it:

python ner_english.py

Output should be:

[('Twenty', 'O'), ('miles', 'O'), ('east', 'O'), ('of', 'O'), ('Reno', 'ORGANIZATION'), (',', 'O'), ('Nev.', 'LOCATION'), (',', 'O'), ('where', 'O'), ('packs', 'O'), ('of', 'O'), ('wild', 'O'), ('mustangs', 'O'), ('roam', 'O'), ('free', 'O'), ('through', 'O'), ('the', 'O'), ('parched', 'O'), ('landscape', 'O'), (',', 'O'), ('Tesla', 'ORGANIZATION'), ('Gigafactory', 'ORGANIZATION'), ('1', 'ORGANIZATION'), ('sprawls', 'O'), ('near', 'O'), ('Interstate', 'LOCATION'), ('80', 'LOCATION'), ('.', 'O'), ('The', 'O'), ('Gigafactory', 'O'), (',', 'O'), ('whose', 'O'), ('construction', 'O'), ('began', 'O'), ('in', 'O'), ('June', 'DATE'), ('2014', 'DATE'), (',', 'O'), ('is', 'O'), ('not', 'O'), ('only', 'O'), ('outrageously', 'O'), ('large', 'O'), ('but', 'O'), ('also', 'O'), ('on', 'O'), ('its', 'O'), ('way', 'O'), ('to', 'O'), ('becoming', 'O'), ('the', 'O'), ('biggest', 'O'), ('manufacturing', 'O'), ('plant', 'O'), ('on', 'O'), ('earth', 'O'), ('.', 'O'), ('Now', 'O'), ('30', 'PERCENT'), ('percent', 'PERCENT'), ('complete', 'O'), (',', 'O'), ('its', 'O'), ('square', 'O'), ('footage', 'O'), ('already', 'O'), ('equals', 'O'), ('about', 'O'), ('35', 'O'), ('Costco', 'ORGANIZATION'), ('stores', 'O'), ('.', 'O')]

Not bad at all! However, it is not perfect:

it does not detect all values: but these can easily be extracted with regular expressions;

it does not detect all named entities: if you want to go further, you will have to train a more complete (or dataset-specific) model.
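As a post-processing sketch (a hypothetical helper, not part of NLTK), runs of consecutive tokens that share a non-'O' tag can be merged back into full entity names:

```python
from itertools import groupby

def merge_entities(tagged_tokens):
    """Collapse runs of tokens sharing a non-'O' tag into (entity, tag) pairs."""
    merged = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            merged.append((' '.join(token for token, _ in group), tag))
    return merged

tagged = [('Tesla', 'ORGANIZATION'), ('Gigafactory', 'ORGANIZATION'),
          ('1', 'ORGANIZATION'), ('sprawls', 'O'), ('near', 'O'),
          ('Interstate', 'LOCATION'), ('80', 'LOCATION'), ('.', 'O')]
print(merge_entities(tagged))
# [('Tesla Gigafactory 1', 'ORGANIZATION'), ('Interstate 80', 'LOCATION')]
```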

Step 2: Training our own (French) model

Now, you know how to run NER on an English corpus. What about other languages like French?

You need to train your own model. To do so, create a dummy-french-corpus.tsv file in {yourAppFolder}/stanford-ner-tagger/train with the following syntax (one token per line, a tab between token and label):

En	O
2017	DATE
,	O
Une	O
intelligence	O
artificielle	O
est	O
en	O
mesure	O
de	O
développer	O
par	O
elle-même	O
Super	PERSON
Mario	PERSON
Bros	PERSON
.	O
Sans	O
avoir	O
eu	O
accès	O
au	O
code	O
du	O
jeu	O
,	O
elle	O
a	O
récrée	O
ce	O
hit	O
des	O
consoles	O
Nintendo	ORGANIZATION
.	O
Des	O
chercheurs	O
de	O
l'Institut	ORGANIZATION
de	ORGANIZATION
Technologie	ORGANIZATION
de	O
Géorgie	LOCATION
,	O
aux	O
Etats-Unis	LOCATION
,	O
viennent	O
de	O
la	O
mettre	O
à	O
l'épreuve	O
.	O
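Writing this format by hand is tedious. A small helper (a sketch with a naive regex tokenizer, whose splits won’t always match Stanford’s) can pre-fill the file with 'O' labels for you to correct:

```python
import re

def tsv_skeleton(text):
    """Roughly tokenize text and pre-label every token 'O' (tab-separated)."""
    tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)
    return "\n".join(f"{token}\tO" for token in tokens)

print(tsv_skeleton("En 2017, une intelligence artificielle..."))
```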

Create a prop.txt file in the same folder too:

trainFile = train/dummy-french-corpus.tsv
serializeTo = dummy-ner-model-french.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

Train it, using:

cd stanford-ner-tagger/
java -cp ner-tagger.jar -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt

This should output a dummy-ner-model-french.ser.gz file. Create a new ner_french.py script to use it:
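It can be the same wrapper as the English script with the model path swapped (again a sketch, assuming the layout above):

```python
# ner_french.py
# Same NLTK wrapper, pointing at the freshly trained French model.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    'stanford-ner-tagger/dummy-ner-model-french.ser.gz',  # model trained above
    'stanford-ner-tagger/ner-tagger.jar',
    encoding='utf-8')

text = ("En 2017, une intelligence artificielle est en mesure de développer "
        "par elle-même Super Mario Bros.")

# word_tokenize defaults to English rules; passing language='french' uses
# the French punkt sentence model for slightly better splits.
print(tagger.tag(word_tokenize(text, language='french')))
```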

Run it:

python ner_french.py

The output seems to be right:

[('En', 'O'), ('2017', 'DATE'), (',', 'O'), ('une', 'O'), ('intelligence', 'O'), ('artificielle', 'O'), ('est', 'O'), ('en', 'O'), ('mesure', 'O'), ('de', 'O'), ('développer', 'O'), ('par', 'O'), ('elle-même', 'O'), ('Super', 'PERSON'), ('Mario', 'PERSON'), ('Bros.', 'O'), ('Sans', 'O'), ('avoir', 'O'), ('eu', 'O'), ('accès', 'O'), ('au', 'O'), ('code', 'O'), ('du', 'O'), ('jeu', 'O'), (',', 'O'), ('elle', 'O'), ('a', 'O'), ('récrée', 'O'), ('ce', 'O'), ('hit', 'O'), ('des', 'O'), ('consoles', 'O'), ('Nintendo', 'ORGANIZATION'), ('.', 'O'), ('Des', 'O'), ('chercheurs', 'O'), ('de', 'O'), ("l'Institut", 'ORGANIZATION'), ('de', 'ORGANIZATION'), ('Technologie', 'ORGANIZATION'), ('de', 'O'), ('Géorgie', 'LOCATION'), (',', 'O'), ('aux', 'O'), ('Etats-Unis', 'LOCATION'), (',', 'O'), ('viennent', 'O'), ('de', 'O'), ('la', 'O'), ('mettre', 'O'), ('à', 'O'), ("l'épreuve", 'O'), ('.', 'O')]

Congratulations, your model is trained! Of course, as the corpus we trained it on is ridiculously small, you won’t succeed on a different text.

As you can see, none of the named entities have been caught:

[('La', 'O'), ('première', 'O'), ('Falcon', 'O'), ('Heavy', 'O'), ('de', 'O'), ("l'entreprise", 'O'), ('SpaceX', 'O'), (',', 'O'), ('la', 'O'), ('plus', 'O'), ('puissante', 'O'), ('fusée', 'O'), ('américaine', 'O'), ('jamais', 'O'), ('lancée', 'O'), ('depuis', 'O'), ('plus', 'O'), ('de', 'O'), ('quarante', 'O'), ('ans', 'O'), (',', 'O'), ('devrait', 'O'), ('bien', 'O'), ('emporter', 'O'), ('le', 'O'), ('roadster', 'O'), ('de', 'O'), ("l'entrepreneur", 'O'), ('américain', 'O'), (',', 'O'), ('mais', 'O'), ('sur', 'O'), ('une', 'O'), ('orbite', 'O'), ('bien', 'O'), ('différente', 'O'), ('.', 'O'), ('Elon', 'O'), ('Musk', 'O'), ('a', 'O'), ('le', 'O'), ('sens', 'O'), ('du', 'O'), ('spectacle', 'O'), ('.', 'O')]

You will need a bigger dataset to train on.

Step 3: Performing NER on a French article

Two solutions:

You face a custom use case (specialized vocabulary or high accuracy requirements), and you write your own corpus.tsv file by labeling a big corpus yourself;

You want to perform regular NER and you use an existing labeled corpus.

I have found this nice dataset (FR, DE, NL) that you can use: https://github.com/EuropeanaNewspapers/ner-corpora

Download the enp_FR.bnf.bio file into your train folder. Adjust trainFile = train/enp_FR.bnf.bio and serializeTo = trained-ner-model-french.ser.gz in your prop.txt file, and train your model again (this may take 10 minutes or more):

cd stanford-ner-tagger/
java -cp ner-tagger.jar -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt

Run ner_french.py again, pointing it at the new trained-ner-model-french.ser.gz:

[('La', 'O'), ('première', 'O'), ('Falcon', 'I-PER'), ('Heavy', 'I-PER'), ('de', 'O'), ("l'entreprise", 'O'), ('SpaceX', 'O'), (',', 'O'), ('la', 'O'), ('plus', 'O'), ('puissante', 'O'), ('fusée', 'O'), ('des', 'O'), ('Etats-Unis', 'I-LOC'), ('jamais', 'O'), ('lancée', 'O'), ('depuis', 'O'), ('plus', 'O'), ('de', 'O'), ('quarante', 'O'), ('ans', 'O'), (',', 'O'), ('devrait', 'O'), ('bien', 'O'), ('emporter', 'O'), ('le', 'O'), ('roadster', 'O'), ('de', 'O'), ("l'entrepreneur", 'O'), ('américain', 'O'), (',', 'O'), ('mais', 'O'), ('sur', 'O'), ('une', 'O'), ('orbite', 'O'), ('bien', 'O'), ('différente', 'O'), ('.', 'O'), ('Elon', 'I-PER'), ('Musk', 'I-PER'), ('a', 'O'), ('le', 'O'), ('sens', 'O'), ('du', 'O'), ('spectacle', 'O'), ('.', 'O')]

Now it looks better, though still not perfect!

Note: the output shows ‘I-PER’ instead of ‘PERSON’. The tag set depends on how your training corpus is labeled.
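If you prefer the plain labels, stripping the BIO prefix is a one-liner (a hypothetical helper):

```python
def strip_bio(tag):
    """Turn BIO-style tags like 'I-PER' into bare labels like 'PER'; leave 'O' alone."""
    return tag.split('-', 1)[-1]

print([strip_bio(t) for t in ('I-PER', 'I-LOC', 'O')])
# ['PER', 'LOC', 'O']
```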

Conclusions

After a few hours on the Internet looking for tools or packages that could handle French NER tagging, I had to resign myself: the only software I found was FreeLing, which seems great but is written in C++ and rather hard to install.

Neither NLTK, spaCy, nor SciPy handles French NER tagging out of the box. Fortunately, you can train models for new languages, but the respective documentations are really light on that point.

Useful Links

FreeLing: an NLP tool written in C++ that works for many languages, including English, French, German, Spanish, Russian, Italian, and Norwegian;

spaCy: a really good NLP Python package with nice documentation. Here is a link to adding a new language in spaCy.

NLTK (Natural Language Toolkit): the Python package used throughout this tutorial, providing natural language corpora and APIs for a wide diversity of NLP algorithms;

Stanford NER tagger: the NER tagger used in this tutorial, open-sourced by Stanford engineers and usable from NLTK.

Thanks to Flavian Hautbois and Pierre-Henri Cumenge.