Longitudinal analysis of written language

Allometric scaling analysis21 is used to quantify the role of system size on general phenomena characterizing a system and has been applied to systems as diverse as the metabolic rate of mitochondria22 and city growth23,24,25,26,27,28,29. Indeed, city growth shares two common features with the growth of written text: (i) the Zipf law is able to describe the distribution of city sizes regardless of country or the time period of the data26 and (ii) city growth has inherent constraints due to geography, changing labor markets and their effects on opportunities for innovation and wealth creation27,28, just as vocabulary growth is constrained by human brain capacity and the varying utilities of new words across users14.

We construct a word-counting framework by first defining the quantity u_i(t) as the number of times word i is used in year t. Since the number of books and the number of distinct words grow dramatically over time, we define the relative word use, f_i(t), as the fraction of the total body of text occupied by word i in the same year
f_i(t) ≡ u_i(t) / N_u(t)    (1)

where the quantity N_u(t) ≡ Σ_i u_i(t) is the total number of indistinct word uses, while N_w(t) is the total number of distinct words digitized from books printed in year t. Both N_w (the "types," giving the vocabulary size) and N_u (the "tokens," giving the size of the body of text) generally increase over time.
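The per-year bookkeeping behind these definitions can be sketched as follows (a minimal illustration with hypothetical toy counts, not data from the n-gram corpus):

```python
from collections import Counter

def relative_word_use(counts):
    """Given a mapping word -> u_i(t) for one year, return f_i(t) = u_i(t) / N_u(t).

    sum(counts.values()) is N_u(t), the number of tokens;
    len(counts) is N_w(t), the number of types (distinct words).
    """
    n_u = sum(counts.values())  # total tokens N_u(t)
    return {word: u / n_u for word, u in counts.items()}

# Toy "year" of text (hypothetical, for illustration only)
counts = Counter("the cat sat on the mat the end".split())
f = relative_word_use(counts)
# f["the"] is 3/8, and the relative uses sum to 1 by construction
```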

The Zipf law and the two scaling regimes

Zipf investigated a number of bodies of literature and observed that the frequency of any given word is roughly inversely proportional to its rank11, with the frequency of the z-ranked word given by the relation
f(z) ~ z^−ζ    (2)

with a scaling exponent ζ ≈ 1. This empirical law has been confirmed for a broad range of data, ranging from income rankings, city populations and the varying sizes of avalanches, forest fires30 and firms31 to the linguistic features of noncoding DNA32. The Zipf law can be derived through the "principle of least effort," which minimizes the communication noise between speakers (writers) and listeners (readers)16. The Zipf law has been found to hold for a large dataset of English text14, but interesting deviations are observed in the lexicon of individuals diagnosed with schizophrenia15. Here, we also find statistical regularity in the distribution of relative word use for 11 different datasets, each comprising more than half a million distinct words taken from millions of books8.
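The rank-frequency relation f(z) ~ z^−ζ can be illustrated with a small numerical sketch: building an ideal Zipf table and recovering ζ from the log-log slope (synthetic data, not the corpus itself):

```python
import math

def zipf_frequencies(n_words, zeta=1.0):
    """Ideal Zipf rank-frequency table: f(z) proportional to z**(-zeta), normalized."""
    raw = [z ** (-zeta) for z in range(1, n_words + 1)]
    total = sum(raw)
    return [x / total for x in raw]

def loglog_slope(xs, ys):
    """OLS slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

f = zipf_frequencies(10_000, zeta=1.0)
slope = loglog_slope(range(1, len(f) + 1), f)  # close to -zeta = -1 for ideal data
```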

Figure 1 shows the probability density functions P(f) resulting from data aggregated over all years (A,B) as well as over 1-year periods, as demonstrated for the year t = 2000 (C,D). Regardless of the language and the considered time span, the probability density functions are characterized by a striking two-regime scaling, first noted by Ferrer i Cancho and Solé14, which can be quantified as
P(f) ~ f^−α, with α = α− for f < f× (unlimited lexicon) and α = α+ for f ≥ f× (kernel lexicon)    (3)

These two regimes, designated “kernel lexicon” and “unlimited lexicon,” are thought to reflect the cognitive constraints of the brain's finite vocabulary14. The specialized words found in the unlimited lexicon are not universally shared and are used significantly less frequently than the words in the kernel lexicon. This is reflected in the kink in the probability density functions and gives rise to the anomalous two-scaling distribution shown in Fig. 1.

Figure 1 Two-regime scaling distribution of word frequency. The kink in the probability density functions P(f) occurs around f× ≈ 10^−5 for each corpus analyzed (see legend). (A,B) Data from all years are aggregated into a single distribution. (C,D) P(f) comprising data from only the year t = 2000, providing evidence that the distribution is stable even over shorter time frames and likely emerges in corpora that are sufficiently large to be comprehensive of the language studied. For details concerning the scaling exponents we refer to Table I and the main text.

The exponent α+ and the corresponding rank-frequency scaling exponent ζ in Eq. (2) are related asymptotically by14
α+ = 1 + 1/ζ    (4)

with no analogous relationship for the unlimited-lexicon values α− and ζ−. Table I lists the average α+ and α− values calculated by aggregating the α± values for each year using a maximum likelihood estimator for the power-law distribution33. We characterize the two scaling regimes using a crossover region around f× ≈ 10^−5 to distinguish between α− and α+: (i) 10^−8 ≤ f ≤ 10^−6 corresponds to α− and (ii) 10^−4 ≤ f ≤ 10^−1 corresponds to α+. For the words satisfying f ≳ f× that comprise the kernel lexicon, we verify the Zipf scaling law ζ ≈ 1 (corresponding to α+ ≈ 2) for all corpora analyzed. For the unlimited-lexicon regime f ≲ f×, however, the Zipf law is not obeyed: we find α− ≈ 1.7. Note that α− is significantly smaller in the Hebrew, Chinese and Russian corpora, which suggests that a more generalized version of the Zipf law14 may be needed, one that is slightly language-dependent, especially when taking into account the usage of specialized words from the unlimited lexicon.
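A minimal sketch of such a maximum likelihood estimation is the standard continuous power-law MLE (the Hill estimator), here applied to synthetic data with a known exponent rather than to the corpus data; the sampling scheme and sample size are illustrative assumptions:

```python
import math, random

def mle_powerlaw_alpha(samples, f_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(f / f_min)),
    using only the samples with f >= f_min."""
    tail = [f for f in samples if f >= f_min]
    return 1.0 + len(tail) / sum(math.log(f / f_min) for f in tail)

# Draw from P(f) ~ f^(-alpha) with alpha = 2 via inverse-transform sampling
random.seed(1)
alpha_true, f_min = 2.0, 1e-4
samples = [f_min * (1.0 - random.random()) ** (-1.0 / (alpha_true - 1.0))
           for _ in range(50_000)]
alpha_hat = mle_powerlaw_alpha(samples, f_min)  # close to the true value 2
```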

Table 1 Summary of the scaling exponents characterizing the Zipf law and the Heaps law. To calculate σ_r(t|f_c) (see Figs. 6 and 7) we use only the relatively common words whose average word use ⟨f_i⟩ over the entire word history is larger than a threshold f_c = 10/Min[N_u(t)], listed in the first column for each corpus. The b values shown are calculated using all words (U_c = 0). The "unlimited lexicon" scaling exponent α−(t) is calculated for 10^−8 < f < 10^−6 and the "kernel lexicon" exponent α+(t) is calculated for 10^−4 < f < 10^−1 using the maximum likelihood estimator method for each year. The average and standard deviation listed are computed using the α+(t) and α−(t) values over the 209-year period 1800–2008 (except for Chinese, which is calculated from 1950–2008 data). We show the Zipf scaling exponent calculated as ζ = 1/(⟨α+⟩ − 1). The last column indicates the β scaling exponents from Fig. 7(A).

Figure 6 Non-stationarity in the characteristic growth fluctuation of word use. The standard deviation σ_r(t|f_c) of the logarithmic growth rate r_i(t) is presented for all examined corpora. There is an overall decreasing trend arising from the increasing size of the corpora, as depicted in Fig. 5(A). On the other hand, the steady production of new words, as depicted in Fig. 5(B), counteracts this effect. We calculate σ_r(t|f_c) using the relatively common words (those with lifetime T_i ≥ 10 years) whose average word use ⟨f_i⟩ over the entire word history is larger than a threshold f_c ≡ 1/Min[N_u(t)] (see Table I).

Figure 7 Growth fluctuations of word use scale with the size of the corpora. (A) Depicted is the quantitative relation in Eq. (8) between σ_r(t|f_c) and the corpus size N_u(t|f_c). We calculate σ_r(t|f_c) using the relatively common words (those with lifetime T_i ≥ 10 years) whose average word use ⟨f_i⟩ over the entire word history is larger than a threshold f_c ≡ 10/Min[N_u(t)] (see Table I). We show the language-dependent scaling value β ≈ 0.08–0.35 in each panel. For each language we show the ordinary least squares best-fit β value, with the standard error in parentheses. (B) Summary of the β(U_c) exponents calculated using a use threshold U_c instead of the frequency threshold f_c used in (A). Error bars indicate the standard error in the OLS regression. We perform this additional analysis in order to provide alternative insight into the role of extremely rare words. For increasing U_c the β(U_c) value for each corpus increases from β ≈ 0.05 up to β ≲ 0.25. This language-pruning method quantifies the role of new rare words (including OCR errors, spelling errors and other orthographic variants), which are significant components of language volatility.

The Heaps law and the increasing marginal returns of new words

Heaps observed that vocabulary size, i.e. the number of distinct words, exhibits sub-linear growth with document size18. This observation has important implications for the "return on investment" of a new word as it is established and becomes disseminated throughout the literature of a given language. As a proxy for this return, Heaps studied how often new words are invoked in lieu of preexisting competitors, examining the linguistic value of new words and ideas by analyzing the relation between the total number of words printed in a body of text, N_u, and the number of these which are distinct, N_w, i.e. the vocabulary size18. The marginal returns of new words, ∂N_u/∂N_w, quantify the impact of the addition of a single word to the vocabulary of a corpus on the aggregate output (corpus size).

For individual books, the empirically observed scaling relation between N_u and N_w obeys
N_w ~ (N_u)^b    (5)

with b < 1; Eq. (5) is referred to as "the Heaps law". It has subsequently been found that the Heaps law emerges naturally in systems that can be described as sampling from an underlying Zipf distribution. In an information-theoretic formulation of the abstract concept of word cost, B. Mandelbrot predicted the relation b = 1/ζ in 196134, where ζ is the scaling exponent corresponding to α+, as in Eqs. (3) and (4). This prediction is limited to relatively small texts, where the unlimited lexicon, which manifests in the α− regime, does not play a significant role. A mathematical extension of this result to general underlying rank distributions was provided by Karlin35 using an infinite urn scheme, and was recently extended to broader classes of heavy-tailed distributions by Gnedin et al.36. Recent research efforts using stochastic master-equation techniques to model the growth of a book have also predicted this intrinsic relation between the Zipf law and the Heaps law13,37,38.
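The claim that Heaps-type sub-linear vocabulary growth emerges from sampling an underlying Zipf distribution can be checked with a small simulation. This is a sketch under assumed parameters (a finite vocabulary of 50,000 ranks and a naive log-log fit), not the authors' procedure:

```python
import math, random

def heaps_curve(zeta, n_tokens, vocab=50_000, seed=0):
    """Sample tokens from a Zipf rank-frequency law and track vocabulary growth."""
    rng = random.Random(seed)
    weights = [z ** (-zeta) for z in range(1, vocab + 1)]
    seen, curve = set(), []
    for k, word in enumerate(rng.choices(range(vocab), weights=weights, k=n_tokens), 1):
        seen.add(word)
        curve.append((k, len(seen)))   # (N_u, N_w) after k tokens
    return curve

curve = heaps_curve(zeta=1.0, n_tokens=20_000)

# Fit b in N_w ~ N_u^b by OLS in log-log space
xs = [math.log(nu) for nu, nw in curve]
ys = [math.log(nw) for nu, nw in curve]
n = len(xs); mx = sum(xs) / n; my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
# b < 1: vocabulary grows sublinearly with corpus size
```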

Figure 2 confirms a sub-linear scaling (b < 1) between N_u and N_w for each corpus analyzed. These results show that the marginal returns of new words are given by
∂N_u/∂N_w ~ (N_w)^(1/b − 1)    (6)

which is an increasing function of N w for b < 1. Thus, the relative increase in the induced volume of written languages is larger for new words than for old words. This is likely due to the fact that new words are typically technical in nature, requiring additional explanations that put the word into context with pre-existing words. Specifically, a new word requires the additional use of preexisting words as a result of both (i) the explanation of the content of the new word using existing technical terms and (ii) the grammatical infrastructure necessary for that explanation. Hence, there are large spillovers in the size of the written corpus that follow from the intricate dependency structure of language stemming from the various grammatical roles39,40.
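A toy numerical check of this marginal-returns relation, obtained by inverting the Heaps law N_w ~ (N_u)^b (the prefactor c is an illustrative assumption):

```python
def marginal_return(n_w, b=0.5, c=1.0):
    """dN_u/dN_w for N_u = c * N_w**(1/b): equals (c/b) * N_w**(1/b - 1),
    which increases with N_w whenever b < 1."""
    return (c / b) * n_w ** (1.0 / b - 1.0)

# With b = 0.5 the marginal return grows linearly in N_w,
# so a word added to a larger vocabulary induces more text.
small = marginal_return(1_000)
large = marginal_return(2_000)   # larger than `small`
```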

Figure 2 Allometric scaling of language. Scatter plots of the output corpus size N_u given the empirical vocabulary size N_w using all data (U_c = 0) over the 209-year period 1800–2008. Shown are OLS estimates of the exponent b quantifying the Heaps law relation N_w ~ [N_u]^b.

In order to investigate the role of rare and new words, we calculate N_u and N_w using only words that have appeared at least U_c times. We select the absolute number of uses as a word-use threshold because a word in a given year cannot appear with a frequency less than 1/N_u, hence any criterion using relative frequency would necessarily introduce a bias for small corpus samples. This choice also eliminates words that can spuriously arise from Optical Character Recognition (OCR) errors in the digitization process and from intrinsic spelling errors and orthographic variants.
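The pruning step can be sketched as follows (toy counts; the singleton token stands in for an OCR artifact):

```python
def pruned_sizes(counts, u_c):
    """Corpus size N_u(t|U_c) and vocabulary size N_w(t|U_c), keeping only
    words used at least U_c times (an absolute-use threshold, not a frequency)."""
    kept = {w: u for w, u in counts.items() if u >= u_c}
    return sum(kept.values()), len(kept)

# Hypothetical yearly counts; "zzyxt" mimics a rare OCR error
counts = {"the": 120, "of": 90, "allometric": 3, "zzyxt": 1}
full = pruned_sizes(counts, 1)    # (214, 4): all words kept
pruned = pruned_sizes(counts, 2)  # (213, 3): the singleton token is dropped
```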

Figures 3 and 4 show the relational dependence of N_u and N_w on the exclusion of low-frequency words using a variable cutoff U_c = 2^n with n = 0, 1, …, 11. As U_c increases, the Heaps scaling exponent increases from b ≈ 0.5, approaching b ≈ 1, indicating that core words are structurally integrated into language as a proportional background. Interestingly, Altmann et al.41 recently showed that "word niche" can be an essential factor in modeling word use dynamics. New niche words, though they are marginal additions to a language's lexicon, are themselves anything but "marginal": they are core words within a subset of the language. This is particularly the case in online communities, in which individuals strive to distinguish themselves on short timescales by developing stylistic jargon, highlighting how language patterns can be context dependent.

Figure 3 Pruning reveals the variable marginal return of words. The Heaps scaling exponent b depends on the extent of the inclusion of the rarest words. For a given corpus and U_c value we make a scatter plot of N_w(t|U_c) versus N_u(t|U_c) using words with u_i(t) ≥ U_c. (Panel Inset) We use OLS estimation to estimate the scaling exponent b(U_c) for the model N_w(t|U_c) ~ [N_u(t|U_c)]^b, showing that b(U_c) increases from approximately 0.5 towards unity as we prune the corpora of extremely rare words. Our longitudinal language analysis provides insight into the structural importance of the most frequent words, which are used more times per appearance and which play a crucial role in the usage of new and rare words.

Figure 4 Pruning reveals the variable marginal return of words. The Heaps scaling exponent b depends on the extent of the inclusion of the rarest words. For a given corpus and U_c value we make a scatter plot of N_w(t|U_c) versus N_u(t|U_c) using words with u_i(t) ≥ U_c, with the same data color-U_c correspondence as in Fig. 3. (Panel Inset) We use OLS estimation to estimate the scaling exponent b(U_c) for the model N_w(t|U_c) ~ [N_u(t|U_c)]^b, showing that b(U_c) increases from approximately 0.5 towards unity as we prune the corpora of extremely rare words. Our longitudinal language analysis provides insight into the structural importance of the most frequent words, which are used more times per appearance and which play a crucial role in the usage of new and rare words.

We now return to the relation between the Heaps law and the Zipf law. Table I summarizes the b values calculated by means of ordinary least squares regression relating N_u(t) to N_w(t) with U_c = 0. For U_c = 1 we find that b ≈ 0.5 for all languages analyzed, as expected from the Heaps law, but for U_c ≳ 8 the b value deviates significantly from 0.5, and for U_c ≳ 1000 the b value begins to saturate, approaching unity. Considering that α+ ≈ 2 implies ζ ≈ 1 for all corpora, Figures 3 and 4 show that we can confirm the relation b(U_c) ≈ 1/ζ only for the more pruned corpora that require relatively large U_c. This hidden feature of the scaling relation highlights the underlying structure of language, which forms a dependency network between the common words of the kernel lexicon and their more esoteric counterparts in the unlimited lexicon. Moreover, the function ∂N_w/∂N_u ~ (N_u)^(b−1) is monotonically decreasing for b < 1, demonstrating the decreasing marginal need for additional words as a corpus grows. In other words, since we get more and more "mileage" out of new words in an already large language, additional words are needed less and less.

Corpora size and word-use fluctuations

Lastly, it is instructive to examine how the vocabulary size N_w and the overall corpus size N_u affect fluctuations in word use. Figure 5 shows how N_w(t) and N_u(t) have varied over the past two centuries. Note that, apart from the periods during the two World Wars, the number of words printed, which we will refer to as the "literary productivity," has been increasing over time. The number of distinct words (vocabulary size) has also increased, reflecting basic social and technological advancement8.

Figure 5 Literary productivity and vocabulary size in the Google Inc. 1-gram dataset over the past two centuries. (A) Total size of the different corpora N_u(t|U_c) over time, calculated using words that satisfy u_i(t) ≥ U_c ≡ 16 to eliminate extremely rare 1-grams. (B) Size of the written vocabulary N_w(t|U_c) over time, calculated under the same conditions as in (A).

To investigate the role of fluctuations, we focus on the logarithmic growth rate, commonly used in finance and economics,
r_i(t) ≡ log f_i(t + Δt) − log f_i(t)    (7)

to measure the relative growth of word use over 1-year periods, Δt ≡ 1 year. Recent quantitative analysis of the distribution P(r) of word-use growth rates r_i(t) indicates that annual fluctuations in word use deviate significantly from the predictions of null models for language evolution9.

We define an aggregate fluctuation scale, σ_r(t|f_c), using a frequency cutoff f_c ∝ 1/Min[N_u(t)] to eliminate infrequently used words. The quantity Min[N_u(t)] is the minimum corpus size over the period of analysis, so 1/Min[N_u(t)] is an upper bound for the minimum observed frequency of words in the corpora. Figure 6 shows σ_r(t|f_c), the standard deviation of r_i(t) calculated across all words that satisfy the condition ⟨f_i⟩ ≥ f_c and have lifetime T_i ≥ 10 years, using f_c = 1/Min[N_u(t)]. Visual inspection suggests a general decrease in σ_r(t|f_c) over time, marked by sudden increases during times of political conflict. Hence, the persistent increase in the volume of written language is correlated with a persistent downward trend in what could be thought of as the "system temperature" σ_r(t|f_c): as a language grows and matures, it also "cools off."
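A minimal sketch of this fluctuation measure, assuming hypothetical two-year frequency snapshots and a crude two-point average as a stand-in for the full-history mean ⟨f_i⟩:

```python
import math, statistics

def sigma_r(f_by_word, f_c):
    """Aggregate fluctuation scale: the standard deviation of
    r_i(t) = log f_i(t+1) - log f_i(t) across words whose (approximate)
    mean relative use is at least the frequency cutoff f_c."""
    rates = []
    for f_prev, f_next in f_by_word.values():
        if (f_prev + f_next) / 2 >= f_c:   # crude stand-in for <f_i> >= f_c
            rates.append(math.log(f_next) - math.log(f_prev))
    return statistics.pstdev(rates)

# Hypothetical snapshots f_i(t), f_i(t+1) for a handful of words
f_by_word = {
    "the":   (4.1e-2, 4.0e-2),
    "of":    (2.9e-2, 3.0e-2),
    "quark": (2.0e-7, 8.0e-7),   # rare word with a large relative jump
}
loose = sigma_r(f_by_word, f_c=1e-8)   # rare word included -> larger sigma_r
tight = sigma_r(f_by_word, f_c=1e-5)   # rare word pruned  -> smaller sigma_r
```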

Since this cooling pattern could arise as a simple artifact of independent identically distributed (i.i.d.) sampling from an increasingly large dataset, we test the scaling of σ_r(t|f_c) with corpus size. Figure 7(A) shows that for large N_u(t), each language is characterized by a scaling relation
σ_r(t|f_c) ~ [N_u(t)]^−β    (8)

with a language-dependent scaling exponent β ≈ 0.08–0.35. We use f_c = 10/Min[N_u(t)], which defines the frequency threshold for the inclusion of a given word in our analysis. There are two candidate null models which give insight into the limiting behavior of β: the Gibrat proportional growth model predicts β = 0, while the Yule–Simon urn model predicts β = 1/2 (ref. 42). We observe β < 1/2, which indicates that the fluctuation scale decreases more slowly with increasing corpus size than would be expected from the Yule–Simon urn model prediction, deducible via the "delta method" for determining the approximate scaling of a distribution and its standard deviation σ (ref. 43).
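The OLS estimation of β in log-log space can be sketched on synthetic corpora that obey the scaling relation exactly, with an assumed β = 0.2 lying between the two null-model predictions:

```python
import math

def fit_beta(n_u, sigma):
    """Estimate beta in sigma_r ~ N_u**(-beta) from the OLS slope of
    log(sigma_r) on log(N_u); beta is the negative of that slope."""
    xs = [math.log(n) for n in n_u]
    ys = [math.log(s) for s in sigma]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic corpora obeying sigma_r = N_u**(-0.2) exactly
n_u = [10 ** k for k in range(6, 12)]
sigma = [n ** -0.2 for n in n_u]
beta = fit_beta(n_u, sigma)  # recovers the assumed exponent 0.2
```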

To further compare the roles of the kernel lexicon versus the unlimited lexicon, we apply our pruning method to quantify the dependence of the scaling exponent β on the fluctuations arising from rare words. We omit words from our calculation of σ_r(t|U_c) if their use u_i(t) in year t falls below the word-use threshold U_c. Fig. 7(B) shows that β(U_c) increases from values close to 0 to values less than 1/2 as U_c increases exponentially. An increasing β(U_c) confirms our conjecture that rare words are largely responsible for the fluctuations in a language. However, because of the dependency structure between words, there are residual fluctuation spillovers into the kernel lexicon, likely accounting for the fact that β < 1/2 even when the fluctuations from the unlimited lexicon are removed.

A size-variance relation showing that larger entities have smaller characteristic fluctuations was also demonstrated at the scale of individual words using the same Google n-gram dataset9. Moreover, this size-variance relation is strikingly analogous to the decreasing growth rate volatility observed as complex economic entities (i.e. firms or countries) increase in size42,44,45,46,47,48, which strengthens the analogy of language as a complex ecosystem of words governed by competitive forces.