A quick analysis of how to generate stopword lists that are good enough for natural language processing. Basically: how many documents do you need to analyse before the list is good enough for, say, a search engine, and what manual steps do you still need to take?

Or, in other words: how many documents do you need to crawl from a Wikipedia site before you have enough for a good stopword analysis?

Stopwords are words that bear little or no meaning, and for this reason you want to remove them before you use the text in some form of natural language processing, like search.
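To make the point concrete, filtering with a stopword list is just a set lookup. A minimal JavaScript sketch, where the tiny Norwegian list is invented for illustration and not the output of any real analysis:

```javascript
// Remove stopwords before indexing or searching. The three-word list here
// is illustrative only: "og" (and), "i" (in), "er" (is).
const stopwords = new Set(['og', 'i', 'er']);
const tokens = 'asteroiden er i hovedbeltet og går i bane'.split(' ');
const filtered = tokens.filter(t => !stopwords.has(t));
console.log(filtered.join(' ')); // "asteroiden hovedbeltet går bane"
```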

I’ve created a small program for generating lists of stopwords based on a document corpus. It’s called stopword-trainer, and it calculates “stopwordiness” based on how many times a word is used in total, combined with how many documents the word is found in.

stopWordiness = (termInCorpus / totDocs) * (1 / (Math.log(totDocs/(termInDocs - 1))))
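Wrapped in a function, the formula above can be sketched like this. The function name and the example counts are mine, not part of the stopword-trainer API; the intuition is that a word scores high when it is both frequent overall and spread across nearly all documents:

```javascript
// termInCorpus: total occurrences of the term in the corpus
// termInDocs:   number of documents containing the term
// totDocs:      number of documents in the corpus
// The "termInDocs - 1" keeps the log term from hitting zero for a word
// found in every single document.
function stopwordiness(termInCorpus, termInDocs, totDocs) {
  return (termInCorpus / totDocs) * (1 / Math.log(totDocs / (termInDocs - 1)));
}

// A frequent, widespread word (think "og"/"and") scores far higher than
// a rare one, even in a 100000-document corpus:
const common = stopwordiness(250000, 95000, 100000);
const rare = stopwordiness(120, 80, 100000);
console.log(common > rare); // true
```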

The test

For this test I’ve crawled 108,501 documents from the main Norwegian Wikipedia site and made five differently sized batches of the document corpus:

100 documents

1000 documents

10000 documents

40000 documents

108501 documents
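The batches are cumulative prefixes of the same ordered corpus (as noted below, the 10000 batch is the first 10000 of the 108501 documents), which can be sketched as:

```javascript
// Build cumulative batches from one ordered corpus: each batch is simply
// the first N documents, so every smaller batch is contained in the larger ones.
// The doc strings are placeholders standing in for crawled Wikipedia articles.
const corpus = Array.from({ length: 108501 }, (_, i) => `doc-${i}`);
const sizes = [100, 1000, 10000, 40000, 108501];
const batches = sizes.map(n => corpus.slice(0, n));

console.log(batches.map(b => b.length)); // [ 100, 1000, 10000, 40000, 108501 ]
```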

I’ve only extracted paragraphs of text, excluding titles, sub-titles, lists and links. The idea is that the paragraphs are the chunks of text with the lowest ratio of words that bear any meaning.
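A hypothetical sketch of that extraction step, pulling paragraph text out of a crawled page while dropping headings, lists and link markup. A real crawler would use a proper HTML parser; a regex is just enough to illustrate the idea:

```javascript
// Toy input standing in for a crawled Wikipedia page.
const html = `
<h1>Hovedbelteasteroide</h1>
<p>En <a href="/wiki/Asteroide">asteroide</a> i hovedbeltet.</p>
<ul><li>Se også</li></ul>
<p>Den går i bane rundt solen.</p>`;

// Keep only <p> bodies, then strip inline tags like <a>.
const paragraphs = [...html.matchAll(/<p>(.*?)<\/p>/gs)]
  .map(m => m[1].replace(/<[^>]+>/g, ''));

console.log(paragraphs);
// [ 'En asteroide i hovedbeltet.', 'Den går i bane rundt solen.' ]
```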

Peculiarities of Wikipedia to look out for

The batches with 100 and 1000 documents generated stopword lists with a lot of noise. They were kind of weak. Nothing strange there.

What was more surprising was that analysing the 10000 batch gave me the word “hovedbelteasteroide”, or “main-belt asteroid” in English, as the word with the 26th highest stopwordiness. The word is found 1802 times within the first 10000 documents of the 108501 documents in the total corpus. It is found only 7 more times in the batch of 40000 documents, and just 4 more times in the full batch of 108501 documents.
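Plugging the reported counts into the scoring formula shows why the word drops off the list. For this sketch I assume the occurrence counts roughly equal the document counts (the post only gives the former), which is enough to see the score collapse once the corpus grows past the asteroid articles:

```javascript
// Same formula as earlier; the function name is mine.
function stopwordiness(termInCorpus, termInDocs, totDocs) {
  return (termInCorpus / totDocs) * (1 / Math.log(totDocs / (termInDocs - 1)));
}

// "hovedbelteasteroide": 1802 occurrences in the first 10000 documents,
// only 11 more (7 + 4) in the remaining 98501. Treating occurrences as
// document counts is an assumption for illustration.
const at10000 = stopwordiness(1802, 1802, 10000);
const at108501 = stopwordiness(1813, 1813, 108501);
console.log(at10000 > at108501); // true: the score drops sharply with corpus size
```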

Wikipedia is an encyclopedia, and it has a certain structure to it that we need to understand. Since I’m crawling documents sorted alphabetically by title, I get all the special characters first, then all the numbers, and then titles starting with a letter.