A Smattering of NLP in Python¶

by Charlie Greenbacker @greenbacker

Back in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

This presentation will cover a handful of the NLP building blocks provided by NLTK (and a few additional libraries), including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. Several of these components will then be assembled to build a very basic document summarization program.

Initial Setup¶

Obviously, you'll need Python installed on your system to run the code examples used in this presentation. We enthusiastically recommend using Anaconda, a Python distribution provided by Continuum Analytics. Anaconda is free to use, it includes nearly 200 of the most commonly used Python packages for data analysis (including NLTK), and it works on Mac, Linux, and yes, even Windows.

We'll make use of the following Python packages in the example code:

nltk (comes with Anaconda)

readability-lxml

BeautifulSoup4 (comes with Anaconda)

scikit-learn (comes with Anaconda)

Please note that the readability package is not distributed with Anaconda, so you'll need to download & install it separately using something like easy_install readability-lxml or pip install readability-lxml.

If you don't use Anaconda, you'll also need to download & install the other packages separately using similar methods. Refer to the homepage of each package for instructions.

You'll want to run nltk.download() one time to get all of the NLTK packages, corpora, etc. (see below). Select the "all" option. Depending on your network speed, this could take a while, but you'll only need to do it once.
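As a quick sketch of what that one-time setup looks like (the interactive call is shown commented out so the snippet can run non-interactively):

```python
import nltk

# nltk.download() with no arguments opens the interactive downloader,
# where you can select the "all" option to grab every corpus and model.
# Individual resources can also be fetched by name, e.g.
# nltk.download("punkt") for the sentence tokenizer models.
# nltk.download()

# Downloaded data lands in one of the directories on NLTK's data search
# path, which you can inspect directly:
print(nltk.data.path)
```

Targeted downloads by name are handy later if you'd rather not wait for the full "all" collection.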

Java libraries (optional)¶

One of the examples will use NLTK's interface to the Stanford Named Entity Recognizer, which is distributed as a Java library. You'll want the following files handy in order to run this example:

stanford-ner.jar

english.all.3class.distsim.crf.ser.gz
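To give a sense of how these two files get wired up, here's a minimal sketch using NLTK's StanfordNERTagger wrapper. The file paths are placeholders -- point them at wherever you saved the files, and make sure Java is available on your PATH:

```python
from nltk.tag import StanfordNERTagger

# Placeholder paths; adjust to wherever you saved the two files above.
model_path = "english.all.3class.distsim.crf.ser.gz"  # trained CRF model
jar_path = "stanford-ner.jar"                          # the NER library

try:
    st = StanfordNERTagger(model_path, jar_path)
    tokens = "Barack Obama was born in Hawaii".split()
    # Returns a list of (token, entity-label) pairs.
    print(st.tag(tokens))
except LookupError as err:
    # NLTK raises LookupError when it can't find the jar or model file.
    print("Stanford NER files not found:", err)
```

The tagger shells out to Java under the hood, so the first call can feel slow; that's normal.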

Getting Started¶