A whale of words

American linguist George Zipf popularized the concept that the frequency of any word in a language is inversely proportional to its rank in the frequency table.

Moby Dick or the absence of ladies and new things.

The same fascinating relation occurs in other rankings unrelated to language, such as the population of cities in a given country. According to Zipf’s law, the biggest city in a country has a population twice as large as the second city, three times as large as the third city, and so on.

More generally, Zipf’s law says that the frequency of a word in a language is $f(r) \propto r^{-\alpha}$, where $r$ is the rank of the word and $\alpha$ is the exponent that characterizes the power law.

I tested whether the relation holds true not only in the corpus of a language, but also in individual books. The plot on the left shows the frequency of words in the outstanding Moby Dick by Herman Melville. At first glance the relation seems to hold. Once we fit the data with the theoretical law (green line), however, we see increasing residuals for the high-ranked words (i.e. the least used words in the book). Words with rank higher than 100 are in fact better described by a Zipf’s law with a higher exponent (red line). Interestingly, a very similar pattern is seen if we fit the Brown Corpus, which contains 500 samples of English text.

After the first few hundred words (mainly conjunctions, articles, verbs, and pronouns), the drop in usage becomes even faster. We can maybe say that the first hundred words represent all the structures that are necessary to express any concept.
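To make the fitting step concrete, here is a minimal sketch of how a Zipf exponent could be estimated from a text. It assumes a plain least-squares line in log-log space, which is not necessarily the method used for the fits discussed in this post; the function names `rank_frequency` and `fit_exponent`, the simple regex tokenizer, and the rank cut at 100 in the demo are all illustrative choices.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    # Assumption: a crude tokenizer, lowercase letters and apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def fit_exponent(counts, first_rank=1, last_rank=None):
    """Estimate alpha in f(r) ~ r**(-alpha) over ranks first_rank..last_rank.

    Fits a straight line to log(count) versus log(rank) by ordinary
    least squares and returns minus its slope.
    """
    last_rank = last_rank or len(counts)
    ranks = range(first_rank, last_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(counts[r - 1]) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope


if __name__ == "__main__":
    # Synthetic counts that follow an exact power law, used only as a demo.
    demo_counts = [1000.0 / r for r in range(1, 501)]
    print("exponent, all ranks:", fit_exponent(demo_counts))
    print("exponent, ranks above 100:", fit_exponent(demo_counts, first_rank=101))
```

On a real book, comparing the exponent fitted on the first hundred ranks with the one fitted on the remaining ranks is one way to see the change of slope described above.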

We can compare the occurrences of words in Moby Dick with those of words we use in a more general setting (the Brown corpus).
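As a rough illustration of how such a comparison could be made quantitative, the sketch below scores each word by the difference between its count in the book and the count expected from a reference corpus, divided by a Poisson error. The function name `representation_scores` is hypothetical, and the exact form of sigma (Poisson errors on both counts, propagated to the difference) is an assumption rather than the definition used for the results in this post.

```python
import math
from collections import Counter


def representation_scores(book_words, reference_words):
    """Score every word as (observed - expected) / sigma.

    Positive scores mean the word is overrepresented in the book,
    negative scores mean it is underrepresented.
    """
    book = Counter(book_words)
    reference = Counter(reference_words)
    # Rescale reference counts to the length of the book.
    scale = len(book_words) / len(reference_words)

    scores = {}
    for word in set(book) | set(reference):
        observed = book[word]
        expected = reference[word] * scale
        # Assumption: Poisson variance of the observed count plus the
        # rescaled Poisson variance of the reference count.
        sigma = math.sqrt(observed + scale ** 2 * reference[word])
        scores[word] = (observed - expected) / sigma
    return scores


if __name__ == "__main__":
    # Tiny made-up token lists, for illustration only.
    book = ["whale"] * 9 + ["she"]
    reference = ["she"] * 5 + ["whale"] + ["the"] * 4
    for word, score in sorted(representation_scores(book, reference).items(),
                              key=lambda item: item[1]):
        print(word, round(score, 2))
```

Sorting the scores from lowest to highest puts the most underrepresented words first, which is the kind of ranking discussed in the rest of this section.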


The plot on the left shows the frequencies of words in Moby Dick and in the Brown corpus. To find which words deviate the most from the 1-1 relation, we can create a statistic telling us whether a word is over- or underrepresented in Moby Dick compared to the Brown corpus: (frequency(MobyDick) - frequency(Brown))/sigma, where sigma is the Poisson error of the counts.

And what’s the first word we find in this new ranking? “She”! This totally makes sense. In the Brown Corpus, ‘she’ is the word with rank 40. In Moby Dick we cannot find it before rank 180, and it appears only 116 times along the ~700 pages of the book, about one fifth of what the frequency in the Brown Corpus would have predicted. Among the 10 most underrepresented words in Moby Dick we also find, of course, ‘her’ and ‘mrs’, plus a cluster of words related to society (‘state’, ‘president’, ‘school’, ‘government’). Among the overrepresented words we find instead archaic forms such as ‘thou’ and ‘ye’, and maritime terms like ‘captain’ and ‘ship’.

Another interesting dichotomy is between ‘new’ and ‘old’, with the first word used very rarely in the book, and ‘old’ used 5 times more frequently than in the Brown corpus.

In conclusion, statistics confirm that