English Letter Frequency Counts:

Mayzner Revisited

or

ETAOIN SRHLDCU

Introduction

On December 17, 2012, I got a nice letter from Mark Mayzner, a retired 85-year-old researcher who studied the frequency of letter combinations in English words in the early 1960s. His 1965 publication has been cited in hundreds of articles. Mayzner describes his work:

    I culled a corpus of 20,000 words from a variety of sources, e.g., newspapers, magazines, books, etc. For each source selected, a starting place was chosen at random. In proceeding forward from this point, all three, four, five, six, and seven-letter words were recorded until a total of 200 words had been selected. This procedure was duplicated 100 times, each time with a different source, thus yielding a grand total of 20,000 words. This sample broke down as follows: three-letter words, 6,807 tokens, 187 types; four-letter words, 5,456 tokens, 641 types; five-letter words, 3,422 tokens, 856 types; six-letter words, 2,264 tokens, 868 types; seven-letter words, 2,051 tokens, 924 types. I then proceeded to construct tables that showed the frequency counts for three, four, five, six, and seven-letter words, but most importantly, broken down by word length and letter position, which had never been done before to my knowledge.

and he wonders if:

    perhaps your group at Google might be interested in using the computing power that is now available to significantly expand and produce such tables as I constructed some 50 years ago, but now using the Google Corpus Data, not the tiny 20,000 word sample that I used.

The answer is: yes indeed, I am interested! And it will be a lot easier for me than it was for Mayzner. Working 60s-style, Mayzner had to gather his collection of text sources, then go through them and select individual words, punch them on Hollerith cards, and use a card-sorting machine.

Here's what we can do with today's computing power (using publicly available data and my own personal computer; I'm not relying on access to corporate computing resources):

1. I consulted the Google Books Ngram raw data set, which gives counts of the number of times each word is mentioned (broken down by year of publication) in the books that have been scanned by Google.

2. I downloaded the English Version 20120701 "1-grams" (that is, word counts) from that data set, given as the files "a" to "z" (that is, http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-a.gz to http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-z.gz).

3. I unzipped each file; the result is 23 GB of text (so don't try to download them on your phone).

4. I then condensed these entries, combining the counts for all years and for different capitalizations: "word", "Word" and "WORD" were all recorded under "WORD". I discarded any entry that used a character other than the 26 letters A-Z. I also discarded any word with fewer than 100,000 mentions. (If you want, you can download the word count file; note that it is 1.5 MB.)

5. I generated tables of counts, first for words, then for letters and letter sequences, keyed off of the positions and word lengths.
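The condense-and-filter step can be sketched in a few lines of Python. This is an illustration, not the code actually used: it assumes the raw 1-gram files are tab-separated lines of the form ngram, year, match_count, volume_count (the 20120701 format), and the function name `condense` is made up for the example.

```python
import gzip
import re
from collections import Counter

WORD_RE = re.compile(r'^[A-Z]+$')   # only the 26 letters A-Z

def condense(filenames, min_count=100_000):
    """Combine 1-gram counts across years and capitalizations,
    keeping only all-letter words with at least min_count mentions."""
    counts = Counter()
    for fname in filenames:
        with gzip.open(fname, 'rt', encoding='utf-8') as f:
            for line in f:
                word, _year, match_count, _volumes = line.rstrip('\n').split('\t')
                word = word.upper()              # "word", "Word", "WORD" -> "WORD"
                if WORD_RE.match(word):          # discard entries with other characters
                    counts[word] += int(match_count)
    return {w: n for w, n in counts.items() if n >= min_count}
```

Reading the .gz files directly with `gzip.open` avoids materializing the 23 GB of unzipped text on disk.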

Word Counts

My distillation of the Google Books data gives us 97,565 distinct words, which were mentioned 743,842,922,321 times (37 million times more than in Mayzner's 20,000-mention collection). Each distinct word is called a "type" and each mention is called a "token." Not surprisingly, the most common word is "the". Here are the top 50 words, with their counts (in billions of mentions) and their overall percentage (looking like a Zipf distribution):

WORD    COUNT    PERCENT
the     53.10 B  7.14%
of      30.97 B  4.16%
and     22.63 B  3.04%
to      19.35 B  2.60%
in      16.89 B  2.27%
a       15.31 B  2.06%
is       8.38 B  1.13%
that     8.00 B  1.08%
for      6.55 B  0.88%
it       5.74 B  0.77%
as       5.70 B  0.77%
was      5.50 B  0.74%
with     5.18 B  0.70%
be       4.82 B  0.65%
by       4.70 B  0.63%
on       4.59 B  0.62%
not      4.52 B  0.61%
he       4.11 B  0.55%
i        3.88 B  0.52%
this     3.83 B  0.51%
are      3.70 B  0.50%
or       3.67 B  0.49%
his      3.61 B  0.49%
from     3.47 B  0.47%
at       3.41 B  0.46%
which    3.14 B  0.42%
but      2.79 B  0.38%
have     2.78 B  0.37%
an       2.73 B  0.37%
had      2.62 B  0.35%
they     2.46 B  0.33%
you      2.34 B  0.31%
were     2.27 B  0.31%
their    2.15 B  0.29%
one      2.15 B  0.29%
all      2.06 B  0.28%
we       2.06 B  0.28%
can      1.67 B  0.22%
her      1.63 B  0.22%
has      1.63 B  0.22%
there    1.62 B  0.22%
been     1.62 B  0.22%
if       1.56 B  0.21%
more     1.55 B  0.21%
when     1.52 B  0.20%
will     1.49 B  0.20%
would    1.47 B  0.20%
who      1.46 B  0.20%
so       1.45 B  0.19%
no       1.40 B  0.19%
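Given a word-count dictionary like the one produced by the condensing step, the top-50 table and its percentages fall out directly. A minimal sketch (the function name `top_words` is illustrative):

```python
from collections import Counter

def top_words(word_counts, n=50):
    """Return the n most common words as (word, count, percent-of-all-tokens)."""
    total = sum(word_counts.values())
    return [(w, c, 100.0 * c / total)
            for w, c in Counter(word_counts).most_common(n)]
```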

Word Lengths

And here is the breakdown of mentions (in millions) by word length (looking like a Poisson distribution). The average is 4.79 letters per word, and 80% are between 2 and 7 letters long:

LEN   COUNT        PERCENT
 1     22301.22 M   2.998%
 2    131293.85 M  17.651%
 3    152568.38 M  20.511%
 4    109988.33 M  14.787%
 5     79589.32 M  10.700%
 6     62391.21 M   8.388%
 7     59052.66 M   7.939%
 8     44207.29 M   5.943%
 9     33006.93 M   4.437%
10     22883.84 M   3.076%
11     13098.06 M   1.761%
12      7124.15 M   0.958%
13      3850.58 M   0.518%
14      1653.08 M   0.222%
15       565.24 M   0.076%
16       151.22 M   0.020%
17        72.81 M   0.010%
18        28.62 M   0.004%
19         8.51 M   0.001%
20         6.35 M   0.001%
21         0.13 M   0.000%
22         0.81 M   0.000%
23         0.32 M   0.000%

Here is the distribution for distinct words (that is, counting each word only once, regardless of how many times it is mentioned). Now the average is 7.60 letters, and 80% are between 4 and 10 letters long:

LEN   COUNT    PERCENT
 1        26    0.027%
 2       662    0.679%
 3     4,615    4.730%
 4     6,977    7.151%
 5    10,541   10.804%
 6    13,341   13.674%
 7    14,392   14.751%
 8    13,284   13.616%
 9    11,079   11.356%
10     8,468    8.679%
11     5,769    5.913%
12     3,700    3.792%
13     2,272    2.329%
14     1,202    1.232%
15       668    0.685%
16       283    0.290%
17       158    0.162%
18        64    0.066%
19        40    0.041%
20        16    0.016%
21         1    0.001%
22         5    0.005%
23         2    0.002%

Here are the 24 words with a length of 20 or more (that are mentioned at least 100,000 times each in the book corpus):

electroencephalographic radiopharmaceuticals polytetrafluoroethylene electroencephalogram forschungsgemeinschaft keratoconjunctivitis deinstitutionalization counterrevolutionary counterrevolutionaries immunohistochemistry dehydroepiandrosterone internationalisation electroencephalography hypercholesterolemia immunoelectrophoresis phosphatidylinositol institutionalisation compartmentalization acetylcholinesterase electrophysiological internationalization electrocardiographic institutionalization uncharacteristically

Letter Counts

Enough of words; let's get back to Mayzner's request and look at letter counts. There were 3,563,505,777,820 letters mentioned. Here they are in frequency order:

LET   COUNT     PERCENT
E     445.2 B   12.49%
T     330.5 B    9.28%
A     286.5 B    8.04%
O     272.3 B    7.64%
I     269.7 B    7.57%
N     257.8 B    7.23%
S     232.1 B    6.51%
R     223.8 B    6.28%
H     180.1 B    5.05%
L     145.0 B    4.07%
D     136.0 B    3.82%
C     119.2 B    3.34%
U      97.3 B    2.73%
M      89.5 B    2.51%
F      85.6 B    2.40%
P      76.1 B    2.14%
G      66.6 B    1.87%
W      59.7 B    1.68%
Y      59.3 B    1.66%
B      52.9 B    1.48%
V      37.5 B    1.05%
K      19.3 B    0.54%
X       8.4 B    0.23%
J       5.7 B    0.16%
Q       4.3 B    0.12%
Z       3.2 B    0.09%

Note there is a standard order of frequency used by typesetters, ETAOIN SHRDLU, that is slightly violated here: L, R, and C have all moved up one rank, giving us the less mnemonic ETAOIN SRHLDCU.
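The letter tallies, and Mayzner-style tallies by word length and letter position, can be derived from the same word counts by weighting each word's letters by its number of mentions. A minimal sketch (function names are illustrative, not the code actually used):

```python
from collections import Counter

def letter_counts(word_counts):
    """A word mentioned n times contributes n mentions of each of its letters."""
    letters = Counter()
    for word, n in word_counts.items():
        for ch in word:
            letters[ch] += n
    return letters

def positional_counts(word_counts):
    """Mayzner-style counts keyed by (letter, word length, 1-based position)."""
    table = Counter()
    for word, n in word_counts.items():
        for i, ch in enumerate(word, 1):
            table[(ch, len(word), i)] += n
    return table
```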

In the colored-bar chart below (inspired by the Wikipedia article on Letter Frequency), the frequency of each letter is proportional to the length of the color bar. If you hover the mouse over each color bar, you can see the exact percentages and counts. (This is the same information as in the table above, presented in a different way.)
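A plain-text stand-in for such a chart: print each letter in frequency order with a bar whose length is proportional to its count (a sketch; the percentages that the interactive chart shows on hover are printed inline instead).

```python
def bar_chart(letter_counts, width=60):
    """Print letters in descending frequency order with proportional text bars."""
    total = sum(letter_counts.values())
    top = max(letter_counts.values())
    for letter, n in sorted(letter_counts.items(), key=lambda kv: -kv[1]):
        bar = '#' * round(width * n / top)
        print(f'{letter} {100.0 * n / total:6.2f}% {bar}')
```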