I have been reading a book on the development of the English language recently and I’ve become fascinated with the idea of word etymology — the study of words and their origins. It’s no secret that English is a great borrower of foreign words but I’m not enough of an expert to really understand what that means for my day-to-day use of the language. Simply reading about word history didn’t help me, so I decided that I really needed to see some examples.

Using Douglas Harper’s online dictionary of etymology, I paired up words from various passages I found online with entries in the dictionary. For each word, I pulled out the first listed language of origin and then re-constructed the text with some additional HTML infrastructure. The HTML would allow me to associate each word (or word fragment) with a color, title, and hyperlink to a definition.
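For the curious, the markup step described above could be sketched roughly like this in Python. This is only an illustration under my own assumptions, not the code behind the examples in this post: the `ORIGINS` table is a tiny hand-filled sample, the CSS color values are guesses at the scheme described here, and `BASE_URL` is a placeholder rather than the dictionary's real address.

```python
from html import escape
from urllib.parse import quote

# Illustrative, hand-filled lookup: word -> first listed language of origin.
ORIGINS = {
    "the": "Old English",
    "house": "Old English",
    "is": "Old English",
    "very": "Anglo-French",
    "large": "Old French",
}

# Assumed color values; only the color names come from the post.
COLORS = {
    "Old English": "pink",
    "Anglo-French": "orange",
    "Old French": "#ffcc80",  # a guess at "light orange"
}

BASE_URL = "https://example.org/etymology/"  # placeholder, not a real endpoint


def annotate(text, origins=ORIGINS, colors=COLORS, base_url=BASE_URL):
    """Wrap each known word in a link carrying a color, a title, and an href."""
    parts = []
    for word in text.split():
        # Look the word up without surrounding punctuation or capitals.
        key = word.strip(".,;:!?\"'").lower()
        origin = origins.get(key)
        if origin is None:
            # Unknown words are left as plain (escaped) text.
            parts.append(escape(word))
            continue
        color = colors.get(origin, "gray")
        parts.append(
            f'<a href="{base_url}{quote(key)}" title="{escape(origin)}" '
            f'style="background-color: {color}">{escape(word)}</a>'
        )
    return " ".join(parts)


print(annotate("The house is very large."))
```

A sketch like this handles whole words only. Splitting off suffixes and resolving ambiguous entries is the part that needed manual work.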

The results look like this:

This simple sentence is constructed of eight distinct words and one word suffix. Six of the words are from Old English (colored in pink) while the others are from Gallo-Roman and Middle Low German (both colored in gray). Hovering over each word provides the exact source and clicking the word takes you to the full origin description.

A second example shows more variety:

This is a surprisingly complex Monty Python quote where the colors represent Old English (pink), Middle English (red), Anglo-French (orange), Old French (light orange), Middle French (pale orange), and Classical and Medieval Latin (both yellow). I suspect that both the complexity and the variety of word sources are intentional, standing in humorous contrast to the appearance of the speaker.

What follows are five excerpts taken from a spectrum of written sources. The intent was to investigate each passage and see if word origin varied significantly based on the intended purpose of the passage.

(This process was pretty involved and my initial dream of creating an app that would allow me to convert any paragraph to this format faded when I realized that much of the word matching process needed manual intervention. I definitely suggest digging into the etymology site to explore the full history of each word. I have probably made plenty of translation mistakes as I developed my paragraphs but I certainly had fun.)

Passage #1: American Literature

The first paragraph I looked at was an excerpt from Mark Twain’s The Adventures of Tom Sawyer. I chose this text because I thought it would have a good mix of English and American words.

The passage has a solid base of Old English words mixed with a variety of French, Latin and Old Norse terms. Middle English makes an appearance in the form of a few words and suffixes while American English is found solely in the list of items Tom Sawyer collects from his friends. Two of these American terms (“fire-crackers” and “door-knob”) are hyphenated words built from Old English and Scandinavian components. (Several of Twain’s other hyphenated words apparently didn’t make it over the hump into full-fledged Americanisms. However, it should be noted that Twain was often the first author to record usage of U.S. slang of the era.)

I found it interesting that Middle English had such a poor showing in this text but it may be due to the fact that the defining elements of Middle English have more to do with sentence structure and grammatical elements than specific words. I was also surprised at the frequent use of longer, Latin-based words in an adventure novel, but the average word length comes in at about 4.4 characters — still fairly short and simple.

Although 73% of the word fragments are Old English, Twain uses words from over a dozen different sources in this short passage alone. Overall, the wide variety of word sources adds a pleasing “flavor” to the passage. The mix seems well-balanced and interesting.
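The two numbers I quote for each passage (share of word fragments per source and average word length) are simple to compute once the tagging is done. Here is a minimal Python sketch, assuming the tagged passage is available as a list of (fragment, origin) pairs and a separate list of whole words. Both function names and the sample data are made up for illustration, and how punctuation and hyphens should count toward word length is an open choice that this sketch ignores.

```python
from collections import Counter


def origin_shares(fragments):
    """Return {origin: percentage of fragments} for (fragment, origin) pairs."""
    if not fragments:
        return {}
    counts = Counter(origin for _, origin in fragments)
    total = len(fragments)
    return {origin: 100 * n / total for origin, n in counts.items()}


def average_word_length(words):
    """Return the mean number of characters per word (0.0 for no words)."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


# Made-up sample, not one of the passages in this post.
sample = [
    ("The", "Old English"),
    ("house", "Old English"),
    ("is", "Old English"),
    ("very", "Anglo-French"),
    ("large", "Old French"),
]

print(origin_shares(sample))
print(average_word_length([text for text, _ in sample]))
```

Keeping fragments and words as separate inputs matters because a word with a tagged suffix counts as two fragments but only one word.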

Passage #2: British Literature

For my second test, I wanted to look at text from a non-American author. I chose a paragraph from Charles Dickens’ Great Expectations out of respect for my 7th-grade English teacher.

The relative simplicity of this passage surprised me a little. The average word length is about 4.2 characters and over 84% of the word fragments are basic Old English. No other source comes in over 5% and the variety of sources is half that of the Twain passage. Hebrew makes an appearance in the form of the name “Joe” but most of the other borrowed words are French in origin. Still, I found the text appealing in a way: basic words for a basic task.

Passage #3: Legal

The third paragraph comes from a United Nations document on maritime territories. I selected this passage because it seemed to contain more jargon and I suspected that much of this jargon was borrowed. This hunch proved to be correct.

This text had a much higher ratio of French and Latin word fragments (16.9% and 9.3%) and a longer average word length (nearly 4.8 characters) than either of the previous passages. With 64.4% of the word fragments, Old English still serves as a major binding agent in this text but there is less variety overall. Middle English makes its appearance only as a suffix and there is only one word outside of the English/French/Latin triumvirate. After the visual and poetic excitement of the two literature entries, this paragraph seems very bland.

Passage #4: Medicine

Note: This passage has been revised (see thread)

My dad suggested that I take a look at a healthcare-related passage to see if the use of specific medical terminology would tilt the word usage even farther away from “native” English words. Boy, was he right.

The medical paragraph has only 51.9% Old English word fragments and the average number of characters per word is 5.7, much higher than even the legal text. French, Latin, and Greek were used more frequently in this passage and, despite U.S. prowess in the healthcare field, there were no American English terms. This is a paragraph that is doing a lot of heavy lifting and it uses a lot of dense, muscular words to get the job done.

Passage #5: Sports

This last passage was an attempt to stack the deck in favor of some home grown words. It doesn’t get more American than baseball, but the only American word in this article about a spring training rainout between the Milwaukee Brewers and the Texas Rangers is the word “baseball” itself. Everything else is either Old English or borrowed. Still, I have to assume that phrases like “at-bats” and “suicide squeeze bunt” are not exactly common constructions and my guess is that the entire article would be a mystery to someone who didn’t know the game.

First of all, I absolutely LOVE the fact that Caleb Gindl uses two Old Norse words to describe the weather conditions during the game. It provides a certain primal, unhinged quality to the situation and adds a third element — nature — to the contest. I also like the use of the onomatopoeic terms “pop” and “crash” because they serve to underscore the action.

The passage itself is a little lighter on the French and Latin roots than some of the earlier paragraphs and many of the terms are fairly short: the average word length comes in at about 4.6 characters. Some of this may be due to the fact that it is an online article (and attention spans are short) but it may also be related to the simple concepts at the core of the game itself. Words like “bat” and “ball” are very similar to their Proto-Indo-European roots (*bhat- and *bhel- respectively), suggesting that any associated activities are pretty basic to the language. Also, the sheer number and variety of numeric references (e.g. “three”, “third”, and “triple”) bring in many simple terms.

Updates: