


This blog post is an introduction on how to make a key phrase extractor in Python, using the Natural Language Toolkit (NLTK).

But how will a search engine know what this post is about? How will this document be indexed correctly? A human can read it and tell that it is about programming, but no search engine company has the money to pay thousands of people to classify the entire Internet for them. Instead they must reasonably predict what a human may decide to be the key points of a document. And they must automate this.

Remember how proper sentences need to be structured with a subject and a predicate? A subject could be a noun, or an adjective followed by a noun, or a pronoun… A predicate may be or include a verb… We can take a similar approach by defining our key phrases in terms of what types of words (or parts-of-speech) they are, and the pattern in which they occur.

But how do we know what words are nouns or verbs in an automated fashion?

Throughout this post I will use an excerpt from Zen and the Art of Motorcycle Maintenance as an example:

The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital computer or the gears of a cycle transmission as he does at the top of a mountain or in the petals of a flower. To think otherwise is to demean the Buddha…which is to demean oneself.

Before proceeding, make a (mental) note of the key phrases here. What is the document about?

Tokenizing

In a program, text is represented as a string of characters. How can we go about moving one level of abstraction up, to the level of words, or tokens? To tokenize a sentence you may be tempted to use Python’s .split() method, but this means you will need to code additional rules to remove hyphens, newlines and punctuation when appropriate.
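
As a quick illustration of why `.split()` alone falls short, the sketch below splits the opening of the excerpt on whitespace. It uses only the standard library, and the variable names are mine rather than anything from NLTK.

```python
# Whitespace splitting keeps punctuation glued to the neighbouring word.
text = "The Buddha, the Godhead, resides quite as comfortably"

naive = text.split()
print(naive)
# 'Buddha,' and 'Godhead,' still carry their commas, so 'Buddha,' and
# 'Buddha' would later be treated as two different words.

# Stripping punctuation by hand works for this sentence, but it throws the
# commas away entirely and would need more rules for hyphens, "U.S.A.",
# "$12.40" and so on.
stripped = [tok.strip(",.;:!?") for tok in naive]
print(stripped)
```

Every extra rule of this kind is one more special case to maintain, which is the argument for describing tokens with a single regular expression instead.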

Thankfully the Natural Language Toolkit (NLTK) for Python provides a regular expression tokenizer. There is an example of it (including how it fares against Python's regular expression tokenization method) in Chapter 3 of the NLTK book. It also allows you to have comments:

    # Word tokenization regex adapted from the NLTK book.
    # (?x) sets the verbose flag, which allows whitespace and comments in the regex.
    # Groups are non-capturing, (?:...), so each match comes back as a whole token.
    sentence_re = r'''(?x)
          # abbreviations, e.g. U.S.A. (with optional last period)
          (?:[A-Z])(?:\.[A-Z])+\.?
          # words with optional internal hyphens
        | \w+(?:-\w+)*
          # currency and percentages, e.g. $12.40, 82%
        | \$?\d+(?:\.\d+)?%?
          # ellipsis
        | \.\.\.
          # these are separate tokens (the hyphen goes last so it is not read as a range)
        | [][.,;"'?():_`-]
    '''

Once we have constructed our regex for defining what sort of format our words should be in, we call it like so:

    import nltk

    # doc is a string containing our document
    toks = nltk.regexp_tokenize(doc, sentence_re)

    >>> toks
    ['The', 'Buddha', ',', 'the', 'Godhead', ',', 'resides', ...
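
One detail worth knowing if your token list comes back as tuples or fragments instead of words: as far as I can tell, recent NLTK releases build `regexp_tokenize` on `re.findall`, and `findall` returns the contents of capturing groups rather than the whole match. The sketch below shows the effect with the standard library only, on a cut-down pattern (chosen for illustration, not the full regex used in this post). Writing groups as `(?:...)` avoids the problem.

```python
import re

text = "That poster-print costs $12.40..."

# Capturing group: findall returns what the group captured, not the token.
capturing = r"\w+(-\w+)*"
captured = re.findall(capturing, text)
print(captured)

# Non-capturing group: findall returns each whole match.
non_capturing = r"\w+(?:-\w+)*"
whole = re.findall(non_capturing, text)
print(whole)
```

The first call yields mostly empty strings, because the group only participates in the hyphenated word. The second yields the words themselves.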

Tagging

The next step is tagging. This uses statistical data to apply a part-of-speech tag to each token, e.g. JJ (adjective), NN (noun), and so on. Since it is statistical, we need to either train our model or use a pre-trained model. NLTK comes with a pretty good one for general use, but if you are looking at a certain kind of document you may want to train your own tagger, since it may greatly affect the accuracy (think about very vocabulary-dense fields such as biology).

Note that to train your own tagger you will need either a pre-tagged corpus (NLTK comes with some) or a bootstrapping method (which can take a long time). Check out Streamhacker and Chapter 5 of the NLTK book for a good discussion on training your own (and how to test it empirically).
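
To make "statistical" a little more concrete, here is a deliberately tiny sketch of about the simplest trainable tagger there is: it remembers the most frequent tag for each word in a pre-tagged corpus. This is a unigram baseline, not the model behind NLTK's default tagger, and the three-sentence training corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged corpus (Penn Treebank style tags).
train = [
    [("the", "DT"), ("gears", "NNS"), ("turn", "VBP")],
    [("the", "DT"), ("turn", "NN"), ("was", "VBD"), ("sharp", "JJ")],
    [("gears", "NNS"), ("turn", "VBP"), ("slowly", "RB")],
]

def train_unigram_tagger(sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_tokens(tokens, model, default="NN"):
    """Tag tokens with the learned model; unseen words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_unigram_tagger(train)
tagged = tag_tokens(["The", "gears", "turn", "quietly"], model)
print(tagged)
```

"turn" is tagged as a verb because that is how it appears most often in the training data, and the unseen "quietly" falls back to the default tag. Both behaviours hint at why a tagger trained on text from your own domain can matter.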

For the sake of this introduction, we will use the default one. The result is a list of token-tag pairs:

    >>> postoks = nltk.tag.pos_tag(toks)
    >>> postoks
    [('The', 'DT'), ('Buddha', 'NNP'), (',', ','), ('the', 'DT'), ...

Chunking

Now we can use the part-of-speech tags to lift out noun phrases (NP) based on patterns of tags.

Note: All diagrams have been stolen from the NLTK book (which is available under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US License).

This is called chunking. We can define the form of our chunks using a regular expression, and build a chunker from that:

    # This grammar is described in the paper by S. N. Kim,
    # T. Baldwin, and M.-Y. Kan.
    # Evaluating n-gram based evaluation metrics for automatic
    # keyphrase extraction.
    # Technical report, University of Melbourne, Melbourne 2010.
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """

    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(postoks)
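
To see what the NBAR rule amounts to, here is a rough plain-Python sketch of the same idea: collect runs of nouns and adjectives, then trim each run so that it ends in a noun. It is my own stand-in for that one rule, not NLTK's parser, and the tagged tokens are written by hand rather than produced by a tagger.

```python
def nbar_chunks(tagged):
    """Group (word, tag) pairs into runs of NN*/JJ tags that end in a noun."""
    def is_noun(tag):
        return tag.startswith("NN")   # NN, NNS, NNP, NNPS

    def is_adjective(tag):
        return tag == "JJ"

    chunks, run = [], []
    # The sentinel pair at the end flushes a run that reaches the last token.
    for word, tag in list(tagged) + [("", "END")]:
        if is_noun(tag) or is_adjective(tag):
            run.append((word, tag))
            continue
        # The run is over: drop trailing adjectives so the chunk ends in a noun.
        while run and not is_noun(run[-1][1]):
            run.pop()
        if run:
            chunks.append([w for w, _ in run])
        run = []
    return chunks

# Hand-tagged fragment of the excerpt (tags are assumed, not tagger output).
fragment = [
    ("the", "DT"), ("circuits", "NNS"), ("of", "IN"), ("a", "DT"),
    ("digital", "JJ"), ("computer", "NN"), ("or", "CC"), ("the", "DT"),
    ("gears", "NNS"), ("of", "IN"), ("a", "DT"),
    ("cycle", "NN"), ("transmission", "NN"),
]
chunks = nbar_chunks(fragment)
print(chunks)
```

The regular-expression grammar expresses the same grouping declaratively, which is what makes it easy to add further rules such as the `<NBAR><IN><NBAR>` pattern.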

It is also possible to describe a Context Free Grammar (CFG) to do this, and help deal with ambiguity – information can be found in Chapter 8 of the NLTK book. Chunk regexes can be much more complicated if needed, and support chinking, which allows you to specify patterns in terms of what you don’t want – see Chapter 7 of the NLTK book.

The output of chunking is a tree, where the noun phrase nodes are located just one level before the leaves, which are the words that constitute the noun phrase:

To access the leaves, we can use this code:

    def leaves(tree):
        """Finds NP (noun phrase) leaf nodes of a chunk tree."""
        # NLTK 3 replaced the old `t.node` attribute with the `t.label()` method.
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            yield subtree.leaves()
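
If the tree walk feels opaque, the following sketch does the same thing on a hand-built tree made of nested tuples. The `(label, children)` representation and the helper names are my own stand-ins for illustration. The real chunker returns an `nltk.Tree`, which offers `subtrees()` and `leaves()` for this.

```python
# A chunk tree as nested (label, children) tuples; leaves are (word, tag) pairs.
tree_sketch = ("S", [
    ("the", "DT"),
    ("NP", [("NBAR", [("digital", "JJ"), ("computer", "NN")])]),
    ("or", "CC"),
    ("NP", [("NBAR", [("gears", "NNS")])]),
])

def is_subtree(node):
    """Subtrees carry a list of children; leaves carry a tag string."""
    return isinstance(node[1], list)

def leaves_of(node):
    """Collect every (word, tag) leaf below a node, left to right."""
    if not is_subtree(node):
        return [node]
    found = []
    for child in node[1]:
        found.extend(leaves_of(child))
    return found

def np_leaves(node):
    """Yield the leaves of each subtree labelled 'NP', depth first."""
    if not is_subtree(node):
        return
    label, children = node
    if label == "NP":
        yield leaves_of(node)
    for child in children:
        yield from np_leaves(child)

np_phrases = list(np_leaves(tree_sketch))
print(np_phrases)
```

Each yielded item is the list of word-tag pairs that make up one noun phrase, which is the shape the next step iterates over.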

Walking the tree and Normalisation

We can now walk the tree to get the terms, applying normalisation if we want to:

    def get_terms(tree):
        for leaf in leaves(tree):
            term = [normalise(word) for word, tag in leaf
                    if acceptable_word(word)]
            yield term

Normalisation may consist of lower-casing words, removing stop-words which appear in many documents (e.g. if, the, a…), stemming (e.g. cars $\rightarrow$ car), and lemmatizing (e.g. drove, drives, driven $\rightarrow$ drive). We normalise so that at later stages similar key phrases can be treated as the same; 'the man drove the truck' should be comparable to 'The man drives the truck'. This will allow us to better rank our key phrases :)
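
Here is a small sketch of that payoff, using the two example sentences. The stop-word set and the lemma lookup table are tiny hand-written stand-ins (assumptions for illustration), not NLTK's stop-word corpus or WordNet lemmatizer.

```python
from collections import Counter

STOPWORDS = {"the", "a", "if"}                   # tiny illustrative list
LEMMAS = {"drove": "drive", "drives": "drive"}   # hypothetical lookup table

def normalise_phrase(phrase):
    """Lower-case, drop stop-words, and map each word to its lemma if known."""
    words = [w.lower() for w in phrase.split()]
    words = [w for w in words if w not in STOPWORDS]
    return " ".join(LEMMAS.get(w, w) for w in words)

phrases = ["the man drove the truck", "The man drives the truck"]
counts = Counter(normalise_phrase(p) for p in phrases)
print(counts)
```

Both sentences collapse to the same normalised form, so they are counted together rather than as two unrelated phrases.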

Functions for normalising and checking for stop-words are described below:

    from nltk.corpus import stopwords

    lemmatizer = nltk.WordNetLemmatizer()
    stemmer = nltk.stem.porter.PorterStemmer()
    # Load the stop-word list once rather than on every call.
    stopword_set = set(stopwords.words('english'))

    def normalise(word):
        """Normalises a word: lowercases, stems, then lemmatizes it."""
        word = word.lower()
        # stem() is the stemmer's public method (older releases also had stem_word()).
        word = stemmer.stem(word)
        word = lemmatizer.lemmatize(word)
        return word

    def acceptable_word(word):
        """Checks conditions for acceptable word: length, stopword."""
        accepted = bool(2 <= len(word) <= 40
                        and word.lower() not in stopword_set)
        return accepted

And the result is:

    >>> terms = get_terms(tree)
    >>> for term in terms:
    ...     for word in term:
    ...         print(word, end=' ')
    ...     print()
    ...
    buddha
    godhead
    circuit
    digit comput
    gear
    cycl transmiss
    mountain
    petal
    flower
    buddha
    demean oneself

Are these similar to the key phrases you chose? There are lots of areas above that can be tweaked. Let me know what you come up with :) (the code can be found in this gist).

In future posts I will talk about how to rank key phrases. I will also discuss how to scale this to process many documents at once using MapReduce.

In the meantime, check out the demos on Streamhacker, solve the problems in the NLTK book, or read the NLTK Cookbook :)