Do the English Use More Words than the French? An interactive exploration of the vocabulary size of some classic English and French writers

As you’ve probably guessed, this isn’t an exact science: I only take a few works of each author into account, I ignore homonyms and parts of speech, and I discard every word not in my dictionaries. I do all of this because it’s either easier or more effective than the alternative, and all of it no doubt adds to the inaccuracy of my results. But while the vocabulary sizes I report might be inaccurate, I take care to make sure they’re comparable. Why I think so, together with my methods, I’ll explain below. But first, the results.

What I’m interested in is the vocabulary of some of my favorite writers. Who has the largest? Who has the most esoteric? And do the English use more words than the French?

The English language probably has more words than French. I say “probably” because no one can agree on what counts as a word and what doesn’t. Is “all-inclusive” one word or two? What about “govern,” “government,” and “misgovern”? There’s been some serious fighting over this in linguistic circles, but the largest dictionaries in each language give a rough comparison: the Second Edition of the Oxford English Dictionary has 218,623 words and 615,100 definitions, while Le Grand Robert de la langue française has about 100,000 words and 350,000 definitions. Of course, this says nothing about usage.

The vertical axis represents vocabulary size, while the size of each circle represents the word count of the text from which the vocabulary is sampled. English works are in red, French in blue.

James Joyce is in first place, which isn’t much of a surprise. The puns, parodies, allusions, stream-of-consciousness, and prose experiments of Ulysses are only part of why it took me the better part of a year to finish it. (By this measure, I would also expect David Foster Wallace’s Infinite Jest to top this list.)

What’s more surprising is to see Flaubert so close behind. Unlike Ulysses, Flaubert’s works aren’t especially challenging; Madame Bovary, for example, is required reading in many high schools in France. Flaubert is reputed to have slowly worked out his sentences, looking for “le mot juste” that perfectly described what he wanted to say. I think his books are near the top not because he used obscure words, but because he used a wide variety of common ones—a sort of breadth over depth of vocabulary.

Rounding off the top three is Victor Hugo. It’s amazing to see someone for whom writing came quickly (reportedly a rate of 100 lines of verse or 20 pages of prose each morning) nevertheless have such a wide grasp of the language. His vocabulary is considerable, and can give even experienced French readers trouble.

I expected to see Trollope and Austen in the lower ranks; they’re superb stylists who rely on common words and an unadorned style to tell their stories. But I did expect some other English authors, particularly Dickens, Eliot, and Milton, to place higher than they did.

And Molière and Shakespeare too. Both are often called their nation’s finest, and both are near the bottom of the list. Of course, vocabulary size is no indication of quality, but it’s still interesting to see, especially in the case of Shakespeare, how reports of an exceptional vocabulary might be more myth than fact. I don’t examine the invention of words, but it’s reported that Shakespeare coined almost one-tenth of the words he used, so I’d be curious to know if those figures hold up as well.

Overall, there’s a lot of blue at the top. I had expected the English to be well ahead, but in usage, I’d have to give the advantage to the French. It’s remarkable just how close the results are; authors and books in both languages are mixed along the scale. Certainly my selection plays a big role here, but not enough to keep me from believing that vocabulary usage, at least among the classic writers, is about the same for both languages.

And what about the words themselves? Below I’ve mapped out each author’s and book’s “favorite” words:

Favorite Words

In the word-clouds above, the size of each word represents its frequency in the text relative to other texts of the same language. So in Flaubert’s Madame Bovary, for example, “pharmacienne” appears an unusual number of times when compared to the other French texts. Many of the words that appear describe the work to some extent; they can be thought of as the “keywords” for their particular text.
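To make the idea concrete, here is a minimal sketch of one way such a relative-frequency score could be computed. It is not necessarily the method used for the word-clouds: the function names `tokenize` and `keyword_scores`, the add-one smoothing, and the toy sentences are all assumptions made for illustration. Each word's share of the target text is divided by its share of the other texts, so a word scores high when it is unusually common in one text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (hyphenated words stay whole)."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())


def keyword_scores(target, others, smoothing=1.0):
    """Score each word in `target` by how over-represented it is compared
    with the combined `others` texts (hypothetical scoring, see lead-in)."""
    target_counts = Counter(tokenize(target))
    other_counts = Counter()
    for text in others:
        other_counts.update(tokenize(text))

    target_total = sum(target_counts.values())
    other_total = sum(other_counts.values())
    vocab_size = len(set(target_counts) | set(other_counts))

    scores = {}
    for word, count in target_counts.items():
        # Smoothed relative frequency in the target text.
        p_target = (count + smoothing) / (target_total + smoothing * vocab_size)
        # Smoothed relative frequency in the other texts; the smoothing
        # keeps words that never occur there from dividing by zero.
        p_other = (other_counts[word] + smoothing) / (
            other_total + smoothing * vocab_size
        )
        scores[word] = p_target / p_other
    return scores


# Toy example with made-up sentences, not real excerpts.
text_a = "the pharmacist spoke and the pharmacist smiled at the doctor"
text_b = "the doctor spoke to the captain and the captain smiled"
scores = keyword_scores(text_a, [text_b])
favorite = max(scores, key=scores.get)
print(favorite)
```

In a word-cloud, the score would then set the font size of each word. Common words such as “the” score close to 1 because they are about equally frequent everywhere, which is why they drop out of the picture.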

It’s interesting to see just how well unusual word frequency can describe some novels: “misunderstanding” and “inconsideration” are central to Austen’s Emma, as are “self-importance” and “self-sufficiency” to Pride and Prejudice. Dickens’ Bleak House, however, is less clearly represented by its word frequencies. Joyce’s novels aren’t at all. In Shakespeare’s King Lear, words like “bastard,” “bastardize,” “flatterer,” and “duteous” circle around the main themes of the play, as do “admonishment,” “hierarchy,” and the like in Milton’s Paradise Lost.

I wonder how recent literature would look through this lens. I imagine the advice given by many writing coaches (“don’t tell us, show us”) veils the subject of a work with objects and symbols. Naming something directly, as Austen and George Eliot do, is less popular today than it was in their time. Authors choose words based on what their readers expect, and so vocabulary isn’t just about style and the words the author knows, but also about the tradition in which the text is written. It’s this that the word-clouds illustrate best: that the vocabulary in a text might not measure an author’s lexicon, but only how many words he needed to tell his story.

Vocabulary over Time