DISCLAIMER: I personally do not have an opinion on the classification of English, and that is not what this article is about; I am not a linguist. This article only examines the statistics behind English vocabulary, as no such data is currently available.

The English language and its origins have long been a topic of fierce debate among linguists. English is classified as a (West) Germanic language, meaning that it is closely related to other Germanic languages such as Swedish, Dutch and German. The other dominant language family in Western Europe is the group of Romance languages: French, Italian, Spanish and the other languages that descended from Latin at some point in history.

Unlike other Germanic languages, English shares a large portion of its vocabulary with French and Latin, which is often attributed to the period of Norman French dominance in England after 1066. The size of this Romance influence on English, along with other technical aspects such as pronunciation and syntax, has led some radical linguists to believe that English should in fact not be seen as a Germanic language, but rather as a Romance-Germanic hybrid. However, the general consensus is that only about a third of the overall English vocabulary is of Old English origin (so, Germanic), but that the core vocabulary is entirely Old English. The keyword here is core, as most linguists claim that the French and Latin influence enters the language through only a handful of basic words but a vast majority of academic terms. For many, this seems to be the most important criterion for its classification as a Germanic language.

I personally don’t care much about these classifications, but I was very surprised to discover that no one has recently bothered to research the origins of English vocabulary, let alone of the core. The latest research dates from 1975, when Joseph M. Williams examined the 10,000 most frequently used words in English, based on a rather small sample of corporate letters. Here are my issues with his research:

the research carries a bias towards French and Latin, as companies are more likely to use academic language

proper names were not removed, possibly diluting the results of the etymological breakdown

he used the 10,000 most common words in that corpus of letters, which is not really “core vocabulary”

And core vocabulary is precisely what this whole debate is about, so I decided to do my own little research project using Python to see whether I could put some statistics behind these claims.

Joseph M. Williams’ research visualized (top 10,000 words). Image by Murraytheb via Wikimedia Commons

The quest for etymology

Gathering the data

The Oxford Dictionary claims that there are roughly 250,000 distinct words in the English vocabulary. But what share represents the core vocabulary? What does that even mean? The Oxford Dictionary uses the following table, which offers some insight into the relation between the most common words in English and the occurrence of words in English sources:

Using a dataset of 5,000 words would make up roughly 85% of all used words in the Oxford English Corpus. Image via Oxford Dictionaries

This table reveals a rather large problem: the actual occurrence of words in applied English does not reflect the (core) vocabulary, let alone the language as a whole. Half of any given English text consists of the same 100 linkers and pronouns, even though those 100 words represent only 0.04% of distinct English vocabulary. A word such as “the” alone makes up 6% of any given English source. This disproportionate use of extremely basic structural words could deceive the reader into thinking that English vocabulary has an entirely different etymological composition. That is why, to determine the composition of a language (or in my case just the core vocabulary), we have to use frequency lists rather than simply count words in sources. A frequency list, a list of the most commonly used words in a language, lets us accurately determine the core vocabulary of that language.
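The idea behind a frequency list is easy to sketch in code. This is a minimal example using Python’s collections.Counter; the sample text and the top_n cut-off are mine, purely for illustration:

```python
from collections import Counter
import re

def frequency_list(text, top_n=10):
    """Return the top_n most frequent words in text, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = "the cat sat on the mat and the dog sat on the cat"
print(frequency_list(sample, top_n=3))
# [('the', 4), ('cat', 2), ('sat', 2)]
```

Counting like this over a large, balanced corpus is what produces a useful frequency list; counting over a single source merely reproduces that source’s bias, with “the” and its friends dominating everything.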

Now that I’ve established that I need a frequency list, I need to know how large that set of words has to be in order to cover the core vocabulary. Core vocabulary is not strictly defined, but the estimates used by various dictionaries, English-learning sites and literature vary from roughly the 3,000 to the 5,000 most common words in English. Based on those estimates, I decided to use a sample of the 5,000 most frequently used words in English, representing the top 2% of distinct vocabulary and making up 85% of all words in any English source.

Acquiring the actual 5,000 words turned out to be a lot harder than initially anticipated; there are almost no clean, unbiased (and, uh, free) datasets. On one hand you have large datasets of all subtitles ever uploaded to OpenSubtitles.org, and on the other you have sources such as Project Gutenberg. Unfortunately, these sources are either severely outdated, such as Project Gutenberg, or severely biased, such as the subtitles; when words such as “kill” and “f*ck” are among the top entries, you know that you are not handling reliable data.

I ended up going with the frequency list made available by Wordfrequency, a dataset that its authors believe to be the “most accurate frequency data of English”. It is based on the Corpus of Contemporary American English, which contains 450 million words sourced from everything from colloquial speech to literature and magazines.
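For completeness, this is roughly how such a list can be read into Python once exported to CSV. The column names and the inline excerpt here are hypothetical; the actual file layout from Wordfrequency may differ:

```python
import csv
import io

# Hypothetical excerpt; the real file is rank-ordered, one word per row.
sample_csv = """rank,word,freq
1,the,22038615
2,be,12545825
3,and,10741073
"""

def load_top_words(fileobj, n=5000):
    """Read the first n words from a rank-ordered frequency CSV."""
    reader = csv.DictReader(fileobj)
    return [row["word"] for _, row in zip(range(n), reader)]

print(load_top_words(io.StringIO(sample_csv)))  # ['the', 'be', 'and']
```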

Processing the data

Next, I had to find a way to automate retrieving the original language (in other words, the etymology) of each word in my dataset. For this, I needed reliable online dictionaries. The most obvious source of etymological data is Etymonline, an extremely accurate dictionary that was compiled and verified manually; it is therefore used for 80% of this dataset. The interface looks like this:

Results for the query “origin” on etymonline.com

Because the language of origin is mentioned right away, I scrape the first 15 words of each entry using Python’s urllib and BeautifulSoup. The language that is mentioned first is then taken as the language of origin. If no language is present, the script tries the next method.

Note that sometimes a word of Latin origin will return “French” with my method. This is because Etymonline always mentions French before Latin if the word entered English through French and changed sufficiently from its root. A word such as “origin” (from “origo”) will therefore return French, whereas a word such as “provide” (from “providere — provideo”) will return Latin.
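A sketch of this first scraping step, under my own assumptions: the language set, the URL pattern and the tag selection are guesses that would need adjusting against the live Etymonline markup, and the network imports are kept inside the fetch function so the offline helper below runs on its own:

```python
# Languages to look for, roughly as Etymonline names them.
LANGUAGES = {"French", "Latin", "Greek", "Norse", "German", "Dutch", "English"}

def first_language(entry_text, n_words=15):
    """Return the first language named in the opening n_words of an entry."""
    for token in entry_text.split()[:n_words]:
        cleaned = token.strip(".,;:()\"'")
        if cleaned in LANGUAGES:
            return cleaned
    return None  # no language found: fall through to the next method

def etymonline_origin(word):
    """Fetch the Etymonline page for `word` and scan its first entry.
    The URL pattern and the <section> lookup are assumptions."""
    from urllib.request import urlopen          # imported here so the
    from bs4 import BeautifulSoup               # offline example runs alone
    html = urlopen(f"https://www.etymonline.com/word/{word}").read()
    soup = BeautifulSoup(html, "html.parser")
    entry = soup.find("section")  # first definition block (assumed markup)
    return first_language(entry.get_text()) if entry else None

# Offline example on a snippet resembling an Etymonline entry:
snippet = 'early 15c., "ancestry, race," from Old French origine'
print(first_language(snippet))  # French
```

This also makes the French-before-Latin quirk concrete: whichever language name appears first among the opening words wins, so “from Old French origine, from Latin originem” yields French.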

The second source is Memidex, an index of online dictionaries. Memidex returns the many possible origins of each word, based on the dictionaries it indexes.

Results for the query “origin” on memidex.com. Up to ten of the indexed dictionaries can return results.

I simply scrape the first 8 words after every mention of “Origin:” on the page and then return the language of origin that is mentioned most often. At every step of the process, the percentage that each language contributes to English is logged into a list for later visualization and interpretation.
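The majority vote over Memidex’s dictionaries, and the running log of percentages, can be sketched like this; the function and variable names are mine, not from the actual script:

```python
from collections import Counter

def majority_origin(origin_mentions):
    """Pick the language of origin mentioned most often across the
    indexed dictionaries (ties broken by first mention)."""
    counts = Counter(origin_mentions)
    return counts.most_common(1)[0][0] if counts else None

# e.g. languages scraped from the text after each "Origin:" label
mentions = ["French", "Latin", "French", "French", "Latin"]
print(majority_origin(mentions))  # French

def log_percentages(language_counts, total, history):
    """Append the running share of each language to `history`."""
    history.append({lang: n / total for lang, n in language_counts.items()})
```

Counter.most_common keeps insertion order on ties, so when two dictionaries disagree evenly, the origin encountered first wins, mirroring the “first mention” behaviour of the Etymonline step.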

Visualizing the data

The following area graph contains the results of my research: