A Simple Extractive Summarisation System

This post walks you through the code of a simple extractive summarisation system written in Python. There was a bit of interest in a summariser posted to reddit a few days ago, which inspired me to do some tinkering. I’ve never really engaged with the extensive summarisation literature, and I doubt I’ll be publishing in it anytime soon. I mostly built this for fun, and to teach myself a little about publishing an application with Django and Google App Engine.

The system can be found at http://toolongdr.appspot.com (at least while I’m within my free quota). The version on App Engine is slightly different from the one described below, as I’ve added a few optimisations to fit within the memory limit and make my CPU time stretch a little further.

You can find a version of the GAE-ready code here, under the FreeBSD license.

Overview

We can divide the system into three basic components:

1. The text pre-processor. This handles things like dividing the document into sentences, dividing the sentences into words, normalising away suffixes, eliminating common words, etc.
2. The sentence selector. This is the engine of the system. We will base our selection criteria on word frequencies, a common starting point for many NLP tasks.
3. The sentence sorter. This sorts the selected sentences to produce a pleasing summary. We will simply order the sentences by their position in the document.
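Before diving into the real classes, the flow between the three components can be sketched as hypothetical top-level glue (the function and parameter names here are invented for illustration, not the app’s actual code):

```python
# Hypothetical top-level glue showing the three-stage flow:
# pre-process -> select -> sort. This is a sketch only; the real
# components are the Summariser, Document and Sentence classes below.
def summarise(text, n, preprocess, select, sort_by_position):
    document = preprocess(text)        # sentences, tokens, frequencies
    selected = select(document, n)     # pick N sentences by frequency score
    return sort_by_position(selected)  # restore document order

# A trivial demonstration with stand-in components: sentences are
# (position, text) pairs, and "selection" just prefers longer strings.
doc = [(1, "b b"), (0, "a"), (2, "b")]
picked = summarise(doc, 2,
                   preprocess=lambda d: d,
                   select=lambda d, n: sorted(d, key=lambda s: len(s[1]))[-n:],
                   sort_by_position=lambda ss: sorted(ss))
print(picked)  # [(1, 'b b'), (2, 'b')]
```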

The Summariser Class

```python
class Summariser(object):
    def __call__(self, doc, n):
        """
        Return the N most pertinent sentences of a document,
        in the order in which they occur
        """
        # A 'bag-of-words' is a sparse vector
        # of token frequencies.
        docBow = doc.bagOfWords
        # Get N most similar.
        topN = sorted(doc, key=lambda s: self.similarity(docBow, s.bagOfWords))[-n:]
        # Sort back into document order.
        return sorted(topN, key=lambda s: s.position)

    def similarity(self, docBow, sentBow):
        # Sum the frequencies of each token in the sentence
        # in the rest of the document
        return sum((docBow[t]-f)*f for t, f in sentBow.items())
```

The Summariser.__call__ method takes the processed document and performs two sorts, trimming the document down to N sentences after the first sort by similarity. The similarity score for each sentence is defined as the sum of the frequencies of its words in the other sentences of the document. We exclude the sentence itself, so that we avoid selecting long sentences that contain terms not found in the rest of the document. The idea is to find sentences that are prototypical of the document: the ones most like the rest of the document, in terms of their word frequency distribution.
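To make the scoring concrete, here is the similarity computation worked through by hand on a toy pair of vectors (the words and counts are invented for illustration):

```python
# Toy example of the similarity score.
# docBow: token frequencies over the whole document.
# sentBow: token frequencies within one sentence.
docBow = {'dig': 5, 'artefact': 3, 'museum': 2}
sentBow = {'dig': 2, 'museum': 1}

# For each token in the sentence, multiply its in-sentence frequency f
# by its frequency in the *rest* of the document, docBow[t] - f.
score = sum((docBow[t] - f) * f for t, f in sentBow.items())
# dig: (5-2)*2 = 6; museum: (2-1)*1 = 1
print(score)  # 7
```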

Pre-processing with nltk

The summariser expects the document to have the text divided into sentences, the sentences divided into words, and word frequencies counted within both. To do this, we call some functions from nltk. The pre-processing stage takes the document from a raw string to a list of sentences, each of which contains a list of tokens. We will also index the tokens in the document into a “bag-of-words”: a sparse vector that stores each word’s frequency, implemented as a defaultdict. The Document and Sentence classes look like this:
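As a quick illustration of the bag-of-words representation (the tokens here are invented), a defaultdict gives us a sparse vector where unseen words simply have frequency zero:

```python
from collections import defaultdict

# A bag-of-words is just a sparse frequency vector over tokens.
tokens = ['the', 'dig', 'uncovered', 'the', 'artefact']
bow = defaultdict(int)
for token in tokens:
    bow[token] += 1

print(bow['the'])     # 2
print(bow['museum'])  # 0 -- missing keys default to zero
```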

```python
import cgi
from collections import defaultdict

from nltk.tokenize import sent_tokenize as sent_tokenise
from nltk.tokenize import word_tokenize as word_tokenise


class Document(list):
    def __init__(self, **kwargs):
        """
        Build a list of sentences and a bag of words
        """
        list.__init__(self)
        if 'text' in kwargs:
            text = kwargs.pop('text')
        elif 'url' in kwargs:
            text = self._urlToText(kwargs.pop('url'))
        else:
            raise StandardError, "Document requires text or url"
        assert not kwargs
        bow = defaultdict(int)
        for i, sentenceStr in enumerate(sent_tokenise(text)):
            sentence = Sentence(sentenceStr, i)
            self.append(sentence)
            for k, v in sentence.bagOfWords.items():
                bow[k] += v
        self.bagOfWords = bow


class Sentence(list):
    def __init__(self, sentenceStr, position):
        self.string = cgi.escape(sentenceStr)
        # Lowercase the first word
        if sentenceStr[0].isupper():
            letters = list(sentenceStr)
            letters[0] = letters[0].lower()
            sentenceStr = ''.join(letters)
        tokens = word_tokenise(sentenceStr)
        bow = defaultdict(int)
        for token in tokens:
            # Normalise for morphology and case,
            # and exclude common words
            term = standardise(token)
            if term:
                bow[term] += 1
                self.append(term)
        self.bagOfWords = bow
        self.position = position
```

All the real work here is done by the nltk Punkt sentence boundary detector, and the tokeniser. These tasks aren’t that interesting, but they’re surprisingly hard to get just right. Sentence boundary detection is a good example of this: good solutions are much more complicated than you might think. I would advise against rolling your own sentence boundary detector with a regular expression, for the same reason regular expressions are a poor choice for parsing html: it’s an error-prone strategy that’s fundamentally not powerful enough for the task. Use the good solutions that work right out of the box. Speaking of html parsing, the process that extracts text from web pages is, in my opinion, the weakest part of the system. I’d really appreciate feedback on a better way to do this. The function looks like this:

```python
import re
import urllib2

from BeautifulSoup import BeautifulSoup

# These live on the Document class, alongside __init__ above.
whiteRE = re.compile(r'\s+')
constraintsRE = re.compile(u'^[^\u2022]|[.!?]$')

def _urlToText(self, url):
    """
    (Terrible) text extraction using that ugly swamp BeautifulSoup
    """
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    body = soup.findAll('body')
    body = body[0] if body else soup
    parElements = [p for p in body('p') if not p.attrs]
    brPars = [p for p in body(True)
              if p('br', recursive=False)]
    for node in brPars:
        for par in node.fetchText(True, recursive=False):
            parElements.append(par)
    paragraphs = []
    for paragraph in parElements:
        if isinstance(paragraph, unicode):
            text = paragraph
        else:
            text = ' '.join(paragraph.fetchText(True))
        text = self.whiteRE.sub(' ', text.strip())
        if self.constraintsRE.search(text):
            paragraphs.append(text)
    title = soup.find('title').string
    return title, '\n\n'.join(paragraphs)
```

This method looks for text under p elements, and text that is the sibling of br elements. A few ad hoc constraints were added from manual testing. This is the part of the program I’m least happy with, and I’d love to hear about a more principled solution to finding the body text of an arbitrary page.

I’m also quite unhappy with the implementation. Initially I wrote this with pyquery instead of BeautifulSoup, but Google App Engine does not support lxml. Boo and hiss by starring this issue. I really don’t like BeautifulSoup. It’s slow, and I find the maze of aliases in the interface incredibly annoying.

Addressing Zipf’s Law

The pre-processing and summarisation code presented so far is enough to do the job. However, we can improve the accuracy of the summariser by adding just a touch more linguistic sensitivity. Most words in a language are rare, which can be problematic when your application works by counting word frequencies. To be more precise, word frequencies follow a power-law distribution, an observation known as Zipf’s law. This means that our frequency distributions will generally be quite sparse, making lexical frequencies less meaningful. There are a few standard tactics we can deploy to address this.
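A tiny, invented sample is enough to see the shape of the distribution: a single function word dominates the counts, while most distinct words occur only once:

```python
from collections import Counter

# A quick illustration of the skew Zipf's law describes.
text = ("the dig at the site uncovered the artefact and "
        "the team moved the artefact to the museum")
counts = Counter(text.split())

print(counts.most_common(2))  # [('the', 6), ('artefact', 2)]
singletons = [w for w, c in counts.items() if c == 1]
print(len(singletons), "of", len(counts), "distinct words occur once")
# 9 of 11 distinct words occur once
```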

Standardising Word Forms

The first set of tweaks are stemming, stopping and case normalisation, processes I collectively refer to as “standardisation”.

Stemming is the removal of inflexional and derivational suffixes. For instance, both archeologist and archeology would get stemmed to archeolog. This might look weird, but it means the terms will match, allowing us to count them together as one term. Stemming loses some precision, but on the whole it’s often beneficial when you’re trying to model the document’s topic. I used the nltk implementation of Porter’s algorithm. The (ugly) source for this can be viewed here.

Stopping addresses the other side of the problem caused by the unbalanced frequency distribution of words. Grammatical terms, such as “the” and “of”, are far more frequent than the interesting content words, so it is useful to discard them from our counts. I do this with the stop list provided with nltk.

Finally, another possibility is to normalise away case variation. I’ve chosen not to do this, because I don’t want to confuse proper nouns with common nouns. Instead I just lower-case the first word of each sentence. This will be wrong for sentences that start with proper nouns, but I didn’t want to use a more complicated process, such as part-of-speech tagging.

I implement the standardisation processes as a private class inside a module, with only an instance made public. This centralises them, so that I can be sure all tokens are being standardised in the same way.

```python
import _stops
import nltk.stem.porter as _porter


class _Standardise(object):
    def __init__(self):
        self.stemmer = lambda word: word
        self.stopWords = {}
        self.lower = False

    def __call__(self, string):
        if string.lower() in self.stopWords:
            return None
        if self.lower:
            string = string.lower()
        stemmed = self.stemmer(string)
        return stemmed

    def config(self, **kwargs):
        stemmer = kwargs.pop('stemming')
        if stemmer == 'porter':
            self.stemmer = _porter.PorterStemmer().stem_word
        else:
            raise StandardError, "Unknown stemmer: %s" % stemmer
        if kwargs.pop('stopping'):
            self.stopWords = _stops.nltkStops
        self.lower = bool(kwargs.pop('lower'))

standardise = _Standardise()
```
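For intuition about what a call to standardise does end to end, here is a toy stand-in with an invented stop list and crude suffix rules; the real system uses nltk’s Porter stemmer and stop list, so treat this purely as an illustration of the shape of the process:

```python
# Toy standardise: stopping plus naive suffix stripping.
# Both the stop list and the suffix rules are invented for this sketch.
STOP_WORDS = {'the', 'of', 'a', 'and', 'in', 'is'}

def toy_standardise(token):
    if token.lower() in STOP_WORDS:
        return None  # stopped words are discarded entirely
    for suffix in ('ists', 'ist', 'ies', 'ing', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

print(toy_standardise('the'))            # None
print(toy_standardise('archeologists'))  # 'archeolog'
print(toy_standardise('archeologist'))   # 'archeolog' -- counted together
```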

TF-IDF Weighting

There’s actually a way that we can use the imbalanced frequency distribution of word occurrences to our advantage, to discover what a document is really about more reliably. Imagine we have a document where two words occur 5 times each: professional and archeologist. It seems reasonable to assume that archeologist is a much stronger clue about the document’s topic, and we should weight sentences that contain this word higher so that hopefully they make it into the summary.

The reason archeologist is a better indication is that we are more surprised to see it. What we want, then, are statistics about how often a word appears in the language as a whole. Words which are rare overall will be weighted highly; words which are common will receive a low weight. This is called TF-IDF weighting, for term frequency × inverse document frequency. The weight is calculated by taking the log of the inverse proportion of documents in a sample that the word occurs in. I used a sample of 800,000 pages from the English Wikipedia to determine the document frequencies. The log is taken to avoid having the IDF part dominate the term frequency part: if we used a linear weight instead, a word that occurred in a single document would be weighted by 800,000.
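Worked through with Python’s math module, using the sample size from above (the individual document frequencies below are hypothetical):

```python
from math import log10

# IDF weighting with a sample of 800,000 documents.
N = 800000.0

def idf(doc_freq):
    return log10(N / doc_freq)

print(idf(8))       # rare word: log10(100000) = 5.0
print(idf(400000))  # common word: log10(2) ~= 0.301
print(N / 1)        # linear weight for a one-document word: 800000.0
```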

Apart from some boring code to read in the document frequencies and turn them into IDF weights, the TF-IDF weighted summariser simply overrides the similarity method of the Summariser:

```python
import collections
from math import log


class TFIDFSummariser(Summariser):
    """
    Scores sentences by TF-IDF weighted token frequencies
    """
    def __init__(self):
        Summariser.__init__(self)
        # The number of documents the frequencies were drawn from.
        n = 818741.0
        self.idfs = self._loadIDFs(n)

    def _loadIDFs(self, n):
        dfLoc = localPath('wiki_doc_freqs.txt')
        dfs = collections.defaultdict(int)
        # Convenience for codecs.open.
        lines = utf8open(dfLoc).read().strip().split('\n')
        # Read in the document freqs.
        # Have to do this first because we collapse some freqs
        # through standardisation.
        for line in lines:
            token, freq = line.split('\t')
            token = standardise(token)
            if token:
                dfs[token] += int(freq)
        # Turn the frequencies into IDF weights.
        idfs = collections.defaultdict(float)
        for token, freq in dfs.items():
            idf = log(n/freq, 10)
            idfs[token] = idf
        return idfs

    def similarity(self, docBow, sentBow):
        idfs = self.idfs
        # Apply the IDF weight to each term.
        return sum([f*(docBow[t]-f)*idfs[t] for t, f in sentBow.items()])
```

This is a little less efficient than applying the IDF weights to the document vector directly, since we may have to weight the same term multiple times for different sentences. However, I doubt this will make much difference to the overall running time, and it allows a design I think is a little nicer, as the variation is confined to the similarity method.
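For comparison, the more efficient alternative of pre-weighting the document vector might look something like this sketch (the names and numbers are illustrative, not the post’s code); it computes the same score, since f*(docBow[t]-f)*idf[t] = f*(docBow[t]*idf[t] - f*idf[t]):

```python
# Fold the IDF weights into the document vector once, so each
# per-sentence score only subtracts the sentence's own contribution.
def weighted_doc_bow(docBow, idfs):
    return {t: f * idfs.get(t, 0.0) for t, f in docBow.items()}

def similarity(weightedDoc, sentBow, idfs):
    # Equivalent to sum(f * (docBow[t] - f) * idfs[t], ...).
    return sum(f * (weightedDoc[t] - f * idfs.get(t, 0.0))
               for t, f in sentBow.items())

docBow = {'dig': 5, 'museum': 2}
idfs = {'dig': 2.0, 'museum': 1.0}
wd = weighted_doc_bow(docBow, idfs)
print(similarity(wd, {'dig': 2}, idfs))  # 2 * (5-2) * 2.0 = 12.0
```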

That’s it for now…

Well, that’s the simple extractive summariser, and a tweaked one with TF-IDF. If there’s sufficient interest, I’ll follow this up with a little evaluation experiment.

Summarisation is difficult to evaluate, because quality is subjective, but I have a plan for what seems to me to be a reasonable approximation.