In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.
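To make that concrete, a chunked sentence in NLTK is a shallow Tree whose subtrees are the chunks, with unchunked words hanging directly off the root. A quick way to see one (assuming the treebank_chunk corpus has been downloaded):

import nltk.corpus

# print the first chunked sentence: NP chunks appear as subtrees of S
print nltk.corpus.treebank_chunk.chunked_sents()[0]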

Training

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get complicated very fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger, except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is an IOB chunk tag (such as B-NP, I-NP, or O) as defined in the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank) and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    # flatten each chunk tree into (word, pos, iob) triples
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    # drop the words, keeping (pos, iob) pairs for tagger training
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
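If you want a quick sanity check on what that produces, you can print the first few training pairs (assuming the conll2000 corpus has been downloaded; the exact pairs depend on the corpus, so the output shown in the comment is only indicative):

import nltk.corpus

train_chunks = conll_tag_chunks(nltk.corpus.conll2000.chunked_sents('train.txt'))
print train_chunks[0][:4]
# something like: [('NN', 'B-NP'), ('IN', 'B-PP'), ('DT', 'B-NP'), ('NN', 'I-NP')]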

Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by the accuracy results. Note that I’m only using ngram taggers; you could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
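Before getting to the results, here’s a minimal sketch of how you might use one of these trained taggers to chunk new text: run the trained tagger over the POS tags, then rebuild a chunk tree with nltk.chunk.conlltags2tree. The tag_chunk function and the example sentence are my own illustration, not part of the original code:

import nltk.chunk, nltk.tag, nltk.corpus

def tag_chunk(chunker, tagged_sent):
    # split the (word, pos) pairs, chunk-tag the pos sequence,
    # then zip everything back into (word, pos, iob) triples
    (words, tags) = zip(*tagged_sent)
    (tags, iob_tags) = zip(*chunker.tag(tags))
    # note: a pos sequence the chunker has never seen can yield None
    # iob tags, so a production chunker should add a default backoff
    return nltk.chunk.conlltags2tree(zip(words, tags, iob_tags))

# train the unigram + bigram combination and chunk a new sentence
train_chunks = conll_tag_chunks(nltk.corpus.conll2000.chunked_sents('train.txt'))
u_chunker = nltk.tag.UnigramTagger(train_chunks)
ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
print tag_chunk(ub_chunker, [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')])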

The ub_chunker and utb_chunker come out as slight favorites, with essentially equal accuracy, so in practice I suggest using the ub_chunker, since it takes slightly less time to train.
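Since training on a full corpus takes a little while, it’s also worth training the ub_chunker once and serializing it instead of retraining every run; a minimal sketch using pickle (the filename is just an example):

import pickle

# save the trained chunker to disk
f = open('ub_chunker.pickle', 'wb')
pickle.dump(ub_chunker, f)
f.close()

# later, load it back instead of retraining
f = open('ub_chunker.pickle', 'rb')
ub_chunker = pickle.load(f)
f.close()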

Conclusion

Training a chunker this way is much easier than writing chunk expressions or rules by hand, it can get close to 100% accuracy, and the process is reusable across data sets. As with part-of-speech tagging, the training set really matters: it should be as similar as possible to the actual text that you want to tag and chunk.