Whew, that’s a lot of steps!

Note: Before we continue, it’s worth mentioning that these are the steps in a typical NLP pipeline, but you may skip or re-order steps depending on what you want to do and how your NLP library is implemented. For example, some libraries like spaCy do sentence segmentation much later in the pipeline, using the results of the dependency parse.

So how do we code this pipeline? Thanks to amazing Python libraries like spaCy, it’s already done! The steps are all coded and ready for you to use.

First, assuming you have Python 3 installed already, you can install spaCy like this:
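```bash
# Install spaCy
pip3 install -U spacy

# Download the large English model that the examples below assume
# (the smaller en_core_web_sm also works if you swap the model name)
python3 -m spacy download en_core_web_lg

# Install textacy, which will come in handy later for data extraction
pip3 install -U textacy
```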

Then the code to run an NLP pipeline on a piece of text looks like this:
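```python
import spacy

# Load the large English NLP model
nlp = spacy.load("en_core_web_lg")

# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east of
the island of Great Britain, London has been a major settlement for
two millennia. It was founded by the Romans, who named it Londinium."""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of the text. We can use it to do
# anything we want! For example, this prints out all the named entities
# that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
```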

If you run that, you’ll get a list of named entities and entity types detected in our document:

```
London (GPE)
England (GPE)
the United Kingdom (GPE)
the River Thames (FAC)
Great Britain (GPE)
London (GPE)
two millennia (DATE)
Romans (NORP)
Londinium (PERSON)
```

You can look up what each of those entity codes means in spaCy’s documentation.

Notice that it makes a mistake on “Londinium” and thinks it is the name of a person instead of a place. This is probably because there was nothing similar in the training data set, so it made a best guess. Named entity detection often requires a little bit of model fine-tuning if you are parsing text that has unique or specialized terms like this.

Let’s take the idea of detecting entities and twist it around to build a data scrubber. Let’s say you are trying to comply with the new GDPR privacy regulations and you’ve discovered that you have thousands of documents with personally identifiable information in them like people’s names. You’ve been given the task of removing any and all names from your documents.

Going through thousands of documents and trying to redact all the names by hand could take years. But with NLP, it’s a breeze. Here’s a simple scrubber that removes all the names it detects (a sketch that splices out anything spaCy tags as a PERSON entity, using each entity’s character offsets):
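```python
import spacy

# Load the large English NLP model
nlp = spacy.load("en_core_web_lg")

def scrub(text):
    """Replace every span spaCy tags as a PERSON entity with [REDACTED]."""
    doc = nlp(text)
    scrubbed = text
    # Splice from the last entity backwards so that the character
    # offsets of earlier entities stay valid as we edit the string
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            scrubbed = scrubbed[:ent.start_char] + "[REDACTED]" + scrubbed[ent.end_char:]
    return scrubbed

s = """In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky's
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures."""

print(scrub(s))
```

Exact entity boundaries (for example, around possessives like “Chomsky’s”) can shift slightly between model versions.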

And if you run that, you’ll see that it works as expected:

```
In 1950, [REDACTED] published his famous article "Computing Machinery and Intelligence". In 1957, [REDACTED]
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
```

Extracting Facts

What you can do with spaCy right out of the box is pretty amazing. But you can also use the parsed output from spaCy as the input to more complex data extraction algorithms. There’s a Python library called textacy that implements several common data extraction algorithms on top of spaCy. It’s a great starting point.

One of the algorithms it implements is called Semi-structured Statement Extraction. We can use it to search the parse tree for simple statements where the subject is “London” and the verb is a form of “be”. That should help us find facts about London.

Here’s roughly how that looks in code (note that the exact signature of textacy’s semistructured_statements helper has changed across versions):
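```python
import spacy
import textacy.extract

# Load the large English NLP model
nlp = spacy.load("en_core_web_lg")

# The text we want to examine
text = """London is the capital and most populous city of England and
the United Kingdom. Standing on the River Thames in the south east of
the island of Great Britain, London has been a major settlement for
two millennia. It was founded by the Romans, who named it Londinium."""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements where the subject is "London"
# and the verb is a form of "be" (the default cue)
statements = textacy.extract.semistructured_statements(doc, "London")

# Print the results
print("Here are the things I know about London:")
for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")
```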

And here’s what it prints:

```
Here are the things I know about London:
 - the capital and most populous city of England and the United Kingdom.
 - a major settlement for two millennia.
```

Maybe that’s not too impressive. But if you run that same code on the entire London Wikipedia article text instead of just three sentences, you’ll get this more impressive result:

```
Here are the things I know about London:
 - the capital and most populous city of England and the United Kingdom
 - a major settlement for two millennia
 - the world's most populous city from around 1831 to 1925
 - beyond all comparison the largest town in England
 - still very compact
 - the world's largest city from about 1831 to 1925
 - the seat of the Government of the United Kingdom
 - vulnerable to flooding
 - "one of the World's Greenest Cities" with more than 40 percent green space or open water
 - the most populous city and metropolitan area of the European Union and the second most populous in Europe
 - the 19th largest city and the 18th largest metropolitan region in the world
 - Christian, and has a large number of churches, particularly in the City of London
 - also home to sizeable Muslim, Hindu, Sikh, and Jewish communities
 - also home to 42 Hindu temples
 - the world's most expensive office market for the last three years according to world property journal (2015) report
 - one of the pre-eminent financial centres of the world as the most important location for international finance
 - the world top city destination as ranked by TripAdvisor users
 - a major international air transport hub with the busiest city airspace in the world
 - the centre of the National Rail network, with 70 percent of rail journeys starting or ending in London
 - a major global centre of higher education teaching and research and has the largest concentration of higher education institutes in Europe
 - home to designers Vivienne Westwood, Galliano, Stella McCartney, Manolo Blahnik, and Jimmy Choo, among others
 - the setting for many works of literature
 - a major centre for television production, with studios including BBC Television Centre, The Fountain Studios and The London Studios
 - also a centre for urban music
 - the "greenest city" in Europe with 35,000 acres of public parks, woodlands and gardens
 - not the capital of England, as England does not have its own government
```

Now things are getting interesting! That’s a pretty impressive amount of information we’ve collected automatically.

For extra credit, try installing the neuralcoref library and adding Coreference Resolution to your pipeline. That will get you a few more facts since it will catch sentences that talk about “it” instead of mentioning “London” directly.
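With neuralcoref installed, wiring it in is only a couple of lines. Here’s a sketch based on neuralcoref’s documented API (note that the library is built against spaCy 2.x):

```python
import spacy
import neuralcoref  # pip install neuralcoref (built against spaCy 2.x)

nlp = spacy.load("en_core_web_lg")

# Add coreference resolution as the final step of the pipeline
neuralcoref.add_to_pipe(nlp)

doc = nlp("London is a major settlement. It was founded by the Romans.")

# '._.coref_resolved' is the text with each pronoun replaced by the
# entity it refers to, so the fact extractor can see "London" directly
print(doc._.coref_resolved)
```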

What else can we do?

By looking through the spaCy docs and textacy docs, you’ll see lots of examples of the ways you can work with parsed text. What we’ve seen so far is just a tiny sample.

Here’s another practical example: Imagine that you were building a website that lets the user view information for every city in the world using the information we extracted in the last example.

If you had a search feature on the website, it might be nice to autocomplete common search queries the way Google does. But to do that, we need a list of possible completions to suggest to users, and we can use NLP to generate it.
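Here’s one way to generate completion candidates (a sketch using plain spaCy’s noun_chunks; the file name is just a placeholder for wherever you saved the article text). It pulls out the multi-word noun chunks that show up repeatedly in the document:

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_lg")

# Read in the document to mine for suggestions
# (placeholder path - point this at your own copy of the article text)
text = open("london_wikipedia.txt").read()
doc = nlp(text)

# Count multi-word noun chunks in lowercase; single words are usually
# too generic to make useful suggestions
counts = Counter(
    chunk.text.lower()
    for chunk in doc.noun_chunks
    if len(chunk.text.split()) > 1
)

# Keep anything mentioned at least three times
for chunk, freq in counts.most_common():
    if freq >= 3:
        print(chunk)
```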