Quantifying the birth rate and the death rate of words

Just as a new species can be born into an environment, a word can emerge in a language. Evolutionary selection laws can apply pressure on the sustainability of new words since there are limited resources (topics, books, etc.) for the use of words. Along the same lines, old words can be driven to extinction when cultural and technological factors limit the use of a word, in analogy to the environmental factors that can change the survival capacity of a living species by altering its ability to survive and reproduce.

We define the birth year $y_{0,i}$ as the year $t$ corresponding to the first instance of $f_i(t) \geq 0.05 f_i^m$, where $f_i^m \equiv \mathrm{Med}\{f_i(t)\}$ is the median word use of a given word over its recorded lifetime in the Google database. Similarly, we define the death year $y_{f,i}$ as the last year $t$ during which the word use satisfies $f_i(t) \geq 0.05 f_i^m$. We use the relative word use threshold $0.05 f_i^m$ in order to avoid anomalies arising from extreme fluctuations in $f_i(t)$ over the lifetime of the word. The results obtained using the threshold $0.10 f_i^m$ did not show a significant qualitative difference.
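The birth-year and death-year definitions above can be sketched in a few lines. This is only an illustration, not the procedure used on the Google data: the function name `birth_death_years`, the dictionary input format and the parameter `frac` (the fraction of the median use taken as the threshold, with an assumed default of 0.05) are hypothetical.

```python
from statistics import median


def birth_death_years(use_by_year, frac=0.05):
    """Return (birth_year, death_year) for one word, or None if undefined.

    use_by_year: dict mapping year -> relative use f_i(t) (assumed input format).
    frac: fraction of the median use taken as the threshold (assumed parameter).
    """
    years = sorted(use_by_year)
    if not years:
        return None
    # Threshold relative to the median use over the recorded lifetime.
    threshold = frac * median(use_by_year[y] for y in years)
    above = [y for y in years if use_by_year[y] >= threshold]
    if not above:
        return None
    # First and last years in which the use meets the threshold.
    return above[0], above[-1]


example = {1900: 1e-9, 1901: 2e-7, 1902: 3e-7, 1903: 2.5e-7, 1904: 1e-9}
print(birth_death_years(example))
```

Using the median rather than the mean keeps the threshold insensitive to a few years of extreme use, which is the motivation given in the text.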

The significance of word births $\Delta_b(t)$ and word deaths $\Delta_d(t)$ for each year $t$ is related to the vocabulary size $N_w(t)$ of a given language. We define the birth rate $\gamma_b$ and death rate $\gamma_d$ by normalizing the number of births $\Delta_b(t)$ and deaths $\Delta_d(t)$ in a given year $t$ to the total number of distinct words $N_w(t)$ recorded in the same year $t$, so that

$$\gamma_b(t) \equiv \frac{\Delta_b(t)}{N_w(t)}, \qquad \gamma_d(t) \equiv \frac{\Delta_d(t)}{N_w(t)}.$$

This definition yields a proxy for the rate of emergence and disappearance of words. We restrict our analysis to words with birth-death duration $y_{f,i} - y_{0,i} + 1 \geq 2$ years and to words with first recorded use $t_{0,i} \geq 1700$, which selects for relatively new words in the history of a language.
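As a rough sketch of the normalization and the two selection criteria just described, the rates can be computed from a list of word lifespans. The input layout (tuples of first recorded use, birth year and death year) and the names `birth_death_rates`, `min_duration` and `min_first_use` are assumptions made for this example only.

```python
from collections import Counter


def birth_death_rates(lifespans, vocab_size, min_duration=2, min_first_use=1700):
    """Compute yearly birth and death rates of words.

    lifespans: list of (first_use, birth_year, death_year), one tuple per word
               (hypothetical input format).
    vocab_size: dict mapping year -> number of distinct words N_w(t).
    """
    # Keep words that live at least min_duration years and were first
    # recorded no earlier than min_first_use.
    kept = [
        (birth, death)
        for first_use, birth, death in lifespans
        if death - birth + 1 >= min_duration and first_use >= min_first_use
    ]
    births = Counter(birth for birth, _ in kept)
    deaths = Counter(death for _, death in kept)
    # Normalize the yearly counts by the vocabulary size of the same year.
    gamma_b = {t: births[t] / n for t, n in vocab_size.items() if n > 0}
    gamma_d = {t: deaths[t] / n for t, n in vocab_size.items() if n > 0}
    return gamma_b, gamma_d


words = [(1800, 1801, 1850), (1650, 1700, 1900), (1900, 1900, 1900), (1801, 1801, 1900)]
sizes = {1801: 100, 1850: 200, 1900: 400}
print(birth_death_rates(words, sizes))
```

In this toy input the second word is dropped because its first recorded use precedes 1700, and the third because its birth-death duration is a single year.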

The $\gamma_b(t)$ and $\gamma_d(t)$ time series plotted in Fig. 2 for the 200-year period 1800–2000 show trends that intensify after the 1950s. The modern era of publishing, which is characterized by stricter editing procedures at publishing houses, computerized word editing and automatic spell-checking technology, shows a drastic increase in the death rate of words. Using visual inspection we verify that most changes to the vocabulary in the last 10–20 years are due to the extinction of misspelled words and nonsensical print errors and to the decreased birth rate of new misspelled variations and genuinely new words. This phenomenon reflects the decreasing marginal need for new words, consistent with the sub-linear Heaps' law observed for all Google 1-gram corpora in ref. 24. Moreover, Fig. 3 shows that $\gamma_b(t)$ is largely comprised of words with relatively large $f$ while $\gamma_d(t)$ is almost entirely comprised of words with relatively small $f$ (see also Fig. S1 in the Supplementary Information (SI) text). Thus, the new words of tomorrow are likely to be core words that are widely used.

Figure 2 Dramatic shift in the birth rate and death rate of words. The word birth rate $\gamma_b(t)$ and the word death rate $\gamma_d(t)$ show marked underlying changes in word use competition which affect the entry rate and the sustainability of existing words. The modern print era shows a marked increase in the death rate of words which likely correspond to low fitness, misspelled and (technologically) outdated words. A simultaneous decrease in the birth rate of new words is consistent with the decreasing marginal need for new words indicated by the sub-linear allometric scaling between vocabulary size and total corpus size (Heaps' law) [24]. Interestingly, we quantitatively observe the impact of the Balfour Declaration in 1917, the circumstances surrounding which effectively rejuvenated Hebrew as a national language, resulting in a 5-fold increase in the birth rate of words in the Hebrew corpus.

Figure 3 Survival of the fittest in the entry process of words. Trends in the relative uses of words that either were born or died in a given year show that the entry-exit forces largely depend on the relative use of the word. For the English corpus, we calculate the average of the median lifetime relative use, $\langle \mathrm{Med}(f_i) \rangle$, for all words born in year $t$ (top panel) and for all words that died in year $t$ (bottom panel), and also show a 5-year moving average (dashed black line). There is a dramatic increase in the relative use (“utility”) of newborn words over the last 20–30 years, likely corresponding to new technical terms, which are necessary for the communication of core modern technology and ideas. Conversely, with higher editorial standards and the recent use of word processors which include spelling standardization technology, the words that are dying are those words with low relative use. We confirm by visual inspection that the lists of dying words contain mostly misspelled and nonsensical words.

We note that the main source of error in the calculation of birth and death rates is OCR (optical character recognition) errors in the digitization process, which could be responsible for a significant fraction of misspelled and nonsensical words existing in the data. An additional source of error is the variety of orthographic properties of language that can make very subtle variations of words, for example through the use of hyphens and capitalization, appear as distinct words when applying OCR. The digitization of many books in the computer era does not require OCR transfer, since the manuscripts are themselves digital, and so there may be a bias resulting from this recent paradigm shift. We confirm that the statistical patterns found using post-2000 data are consistent with the patterns that extend back several hundred years [24].

Complementary to the death of old words is the birth of new words, which are commonly associated with new social and technological trends. Topical words in media can display long-term persistence patterns analogous to earthquake shocks [25,26] and can result in a new word having larger fitness than related “out-of-date” words (e.g. blog vs. log, email vs. memo). Here we show that a comparison of the growth dynamics between different languages can also illustrate the local cultural factors that influence different regions of the world. Fig. 4 shows how international crises can lead to globalization of language through common media attention and increased lexical diffusion. Notably, as illustrated in Fig. 4(a), we find that international conflict only perturbed the participating languages, while minimally affecting the languages of the nonparticipating regions, e.g. the Spanish speaking countries during WWII.

Figure 4 The significance of historical events on the evolution of language. The standard deviation $\sigma(t)$ of growth rates demonstrates the sensitivity of language to international events (e.g. World War II). For all languages there is an overall decreasing trend in $\sigma(t)$ over the period 1850–2000. However, the increase in $\sigma(t)$ during WWII represents a “globalization” effect, whereby societies are brought together by a common event and a unified media. Such contact between relatively isolated systems necessarily leads to information flow, much as in the case of thermodynamic heat flow between two systems, initially at different temperatures, which are then brought into contact. (a) $\sigma(t)$ calculated for the relatively new words with $T_i \geq 100$ years. The Spanish corpus does not show an increase in $\sigma(t)$ during World War II, indicative of the relative isolation of South America and Spain from the European conflict. (b) $\sigma(t)$ for 4 sets of relatively new words that meet the criteria $T_i \geq T_c$ and $t_{0,i} \geq 1800$. The oldest “new” words ($T_c = 200$) demonstrate the most significant increase in $\sigma(t)$ during World War II, with a peak around 1945. (c) The standard deviation $\sigma(t)$ for the most common words is decreasing with time, suggesting that they have saturated and are being “crowded out” by new competitors. This set of words meets the criterion that the average relative use exceeds a threshold, $\langle f_i \rangle \geq f_c$, which we define for each corpus. (d) We compare the variation $\sigma(t)$ for relatively new English words, using $T_i \geq 100$, with the 20-year moving average over the time period 1820–1988. The deviations show that $\sigma(t)$ increases abruptly during times of conflict, such as the American Civil War (1861–1865), World War I (1914–1918) and World War II (1939–1945), and also during the 1980s and 1990s, possibly as a result of new digital media (e.g. the internet) which offer new environments for the evolutionary dynamics of word use. $D(t)$ is the difference between the moving average and $\sigma(t)$.

The lifetime trajectory of words

Between birth and death, one contends with the interesting question of how the use of words evolves when they are “alive.” We focus our efforts toward quantifying the relative change in word use over time, both over the word lifetime and throughout the course of history. In order to analyze these two time frames separately, we select two sets of words: (i) relatively new words with “birth year” $t_{0,i}$ later than 1800, so that the relative age $\tau \equiv t - t_{0,i}$ of word $i$ is the number of years after the word's first occurrence in the database and (ii) relatively common words, typically with $t_{0,i} < 1800$.

We analyze dataset (i) words (summary statistics in Table S1) so that we can control for properties of the growth dynamics that are related to the various stages of a word's life trajectory (e.g. an “infant” phase, an “adolescent” phase and a “mature” phase). For comparison with the young words, we also analyze the growth rates of dataset (ii) words in the next section (summary statistics in Table S2). These words are presumably old enough that they are in a stable mature phase. We select dataset (ii) words using the criterion $\langle f_i \rangle \geq f_c$, where $\langle f_i \rangle \equiv \frac{1}{T_i} \sum_{t} f_i(t)$ is the average relative use of the word $i$ over the word's lifetime $T_i = t_{0,f} - t_{0,i} + 1$ and $f_c$ is a cutoff threshold derived from the Zipf rank-frequency distribution [1] calculated for each corpus [24]. In Table S3 we summarize the entire data for the 209-year period 1800–2008 for each of the four Google language sets analyzed.

Modern words typically are born in relation to technological or cultural events, e.g. “Antibiotics.” We ask if there exists a characteristic time for a word's general acceptance. In order to search for patterns in the growth rates as a function of relative word age, for each new word $i$ at its age $\tau$, we analyze the “use trajectory” $f_i(\tau)$ and the “growth rate trajectory” $r_i(\tau)$. So that we may combine the individual trajectories of words of varying prevalence, we normalize each $f_i(\tau)$ by its average $\langle f_i \rangle$, obtaining a normalized use trajectory $f'_i(\tau) \equiv f_i(\tau) / \langle f_i \rangle$. We perform an analogous normalization procedure for each $r_i(\tau)$, normalizing instead by the growth rate standard deviation $\sigma[r_i]$, so that $r'_i(\tau) \equiv r_i(\tau) / \sigma[r_i]$ (see the Methods section for further detailed description).
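The two normalizations can be illustrated with a minimal sketch. The growth rate is not defined in this section, so the code assumes the logarithmic annual growth rate $r_i(\tau) = \ln f_i(\tau + 1) - \ln f_i(\tau)$; the function name `normalized_trajectories` and the list input are likewise assumptions for the example, and every use value is taken to be strictly positive.

```python
import math
from statistics import mean, pstdev


def normalized_trajectories(f):
    """Return (f_norm, r_norm) for one word.

    f: list of strictly positive relative-use values f_i(tau), tau = 0, 1, 2, ...
    """
    avg = mean(f)
    # Use trajectory in units of the word's own average use.
    f_norm = [x / avg for x in f]
    # Assumed definition: logarithmic annual growth rate.
    r = [math.log(b) - math.log(a) for a, b in zip(f, f[1:])]
    sigma = pstdev(r)
    if sigma == 0:
        raise ValueError("growth rates have zero variance; cannot normalize")
    # Growth trajectory in units of the word's own growth-rate standard deviation.
    r_norm = [x / sigma for x in r]
    return f_norm, r_norm


f_norm, r_norm = normalized_trajectories([1e-7, 2e-7, 1e-7, 2e-7])
print(f_norm, r_norm)
```

After this step every word has a use trajectory with unit mean and a growth trajectory with unit standard deviation, so that rare and common words can be pooled.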

Since some words will die and other words will increase in use as a result of the standardization of language, we hypothesize that the average growth rate trajectory will show large fluctuations around the time scale for the transition of a word into regular use. In order to quantify this transition time scale, we create a subset $\{i \,|\, T_c\}$ of word trajectories $i$ by combining words that meet an age criterion $T_i \geq T_c$. Thus, $T_c$ is a threshold to distinguish words that were born in different historical eras and which have varying longevity. For the values $T_c$ = 25, 50, 100 and 200 years, we select all words that have a lifetime longer than $T_c$ and calculate the average and standard deviation for each set of growth rate trajectories as a function of word age $\tau$.
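A minimal sketch of this selection and averaging step follows. It is a simplification under stated assumptions: the trajectory length stands in for the lifetime $T_i$, the average is unweighted, and the name `age_profile` and the list-of-lists input are hypothetical.

```python
from statistics import mean, pstdev


def age_profile(trajectories, t_c):
    """Average and standard deviation of growth rates at each age tau < t_c.

    trajectories: list of per-word lists of normalized growth rates, indexed
                  by age tau (hypothetical input format).
    t_c: age threshold; only words with at least t_c recorded ages are used
         (trajectory length is used here as a proxy for the lifetime T_i).
    """
    subset = [r for r in trajectories if len(r) >= t_c]
    if not subset:
        return []
    profile = []
    for tau in range(t_c):
        # Cross-section of all selected words at the same age.
        values = [r[tau] for r in subset]
        profile.append((mean(values), pstdev(values)))
    return profile


demo = [[1.0, -1.0, 0.5], [3.0, 1.0, 0.5], [9.0]]
print(age_profile(demo, 3))
```

Plotting the second element of each pair against `tau` gives the age-dependent fluctuation curve; a peak in that curve would mark the transition time scale discussed in the text.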

In Fig. 5 we plot $\sigma[r'(\tau|T_c)]$ for the English corpus, which shows a broad peak around $\tau_c \approx$ 30–50 years for each $T_c$ subset before the fluctuations saturate after the word enters a stable growth phase. A similar peak is observed for each corpus analyzed (Figs. S4–S7). This single-peak growth trajectory is consistent with theoretical models for logistic spreading and the fixation of words in a population of learners [27]. Also, since we weight the average according to $\langle f_i \rangle$, the time scale $\tau_c$ is likely associated with the characteristic time for a new word to reach sufficiently wide acceptance that the word is included in a typical dictionary. We note that this time scale is close to the generational time scale for humans, corroborating evidence that languages require only one generation to drastically evolve [27].

Figure 5 Quantifying the tipping point for word use. (a) The maximum in the standard deviation $\sigma$ of growth rates during the “adolescent” period $\tau \approx$ 30–50 indicates the characteristic time scale for words being incorporated into the standard lexicon, i.e. inclusion in popular dictionaries. In Fig. S4 we plot the average growth rate trajectory $\langle r'(\tau|T_c) \rangle$ which shows relatively large positive growth rates during approximately the same 20-year period. (b) The first passage time $\tau_1$ [53] is defined as the number of years for the relative use of a new word $i$ to exceed a given $f$-value for the first time, $f_i(\tau_1) \geq f$. For relatively new words with $T_i \geq 100$ years we calculate the average first-passage time $\langle \tau_1(f) \rangle$ for a large range of $f$. We estimate for each language the $f_c$ representing the threshold for a word belonging to the standard “kernel” lexicon [4]. This method demonstrates that the English corpus threshold $f_c \equiv 5 \times 10^{-8}$ maps to the first passage time corresponding to the peak period $\tau \approx$ 30–50 years in $\sigma(\tau)$ shown in panel (a).
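The first-passage time described in panel (b) can be computed as in the following sketch. The function names and the treatment of words that never reach the level (they are left out of the average) are assumptions of this example, not details taken from the analysis.

```python
from statistics import mean


def first_passage_time(f, level):
    """Smallest age tau at which the use trajectory f reaches `level`, else None."""
    for tau, value in enumerate(f):
        if value >= level:
            return tau
    return None


def mean_first_passage_time(trajectories, level):
    """Average first-passage time over the words that reach `level` at all."""
    times = [first_passage_time(f, level) for f in trajectories]
    reached = [t for t in times if t is not None]
    return mean(reached) if reached else None


print(first_passage_time([1e-9, 3e-8, 6e-8, 4e-8], 5e-8))
```

Evaluating `mean_first_passage_time` over a range of levels would trace out a curve of average first-passage time against use level, from which a threshold can be read off at a given age.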

Empirical laws quantifying the growth rate distribution

How much do the growth rates vary from word to word? The answer to this question can help distinguish between candidate models for the evolution of word utility. Hence, we calculate the probability density function (pdf) of $R \equiv r'_i(\tau) / \sigma[r'(\tau|T_c)]$. Using this quantity accounts for the fact that we are aggregating growth rates of words of varying ages. The empirical pdf $P(R)$ shown in Fig. 6 is leptokurtic and remarkably symmetric around $R \approx 0$. These empirical facts are also observed in studies of the growth rates of economic institutions [28,29,30,31]. Since the $R$ values are normalized and detrended according to the age-dependent standard deviation $\sigma[r'(\tau|T_c)]$, the standard deviation is $\sigma(R) = 1$ by construction.

Figure 6 Common leptokurtic growth distribution for new words and common words. (a) Independent of language, the growth rates of relatively new words are distributed according to the Laplace distribution centered around $R \approx 0$ defined in Eq. (4). The growth rate $R$ defined in Eq. (11) is measured in units of standard deviation and accounts for age-dependent and word-dependent factors. Yet, even with these normalizations, we still observe an excess number of $|R| \geq 3\sigma$ events. This fact is demonstrated by the leptokurtic form of each $P(R)$, which exhibits excess tail frequencies when compared with a unit-variance Gaussian distribution (dashed blue curve). The Gaussian distribution is the predicted distribution for the Gibrat proportional growth model, which is a candidate neutral null-model for the growth dynamics of word use [29]. The prevalence of large growth rates illustrates the possibility that words can have large variations in use even over the course of a year. The growth variations are intrinsically related to the dynamics of everyday life and reflect the cultural and technological shocks in society. We analyze word use data over the time period 1800–2008 for new words $i$ with lifetimes $T_i \geq T_c$, where we show data calculated for $T_c = 100$ years. (b) PDF $P(r')$ of the annual relative growth rate $r'$ for all words which satisfy $\langle f_i \rangle \geq f_c$ (dataset (ii), the relatively common words). In order to select relatively frequently used words, we use the following criteria: $T_i \geq 10$ years, $1800 \leq t \leq 2008$ and $\langle f_i \rangle \geq f_c$. The growth rate $r'$ does not account for age-dependent factors since the common words are likely in the mature phase of their lifetime trajectory. In each panel, we plot a Laplace distribution with unit variance (solid black lines) and the Gaussian distribution with unit variance (dashed blue curve) for reference.

A candidate model for the growth rates of word use is the Gibrat proportional growth process [29,30], which predicts a Gaussian distribution for $P(R)$. However, we observe the “tent-shaped” pdf $P(R)$ which is well-approximated by a Laplace (double-exponential) distribution, defined as

$$P(R) \equiv \frac{1}{\sqrt{2}\,\sigma(R)} \exp\left[ -\sqrt{2}\, |R - \langle R \rangle| / \sigma(R) \right]. \tag{4}$$

Here the average growth rate $\langle R \rangle$ has two properties: (a) $\langle R \rangle \approx 0$ and (b) $\langle R \rangle \ll \sigma(R)$. Property (a) arises from the fact that the growth rate of distinct words is quite small on the annual basis (the growth rate of books in the Google English database is $\gamma_w \approx 0.011$ [24]) and property (b) arises from the fact that $R$ is defined in units of standard deviation. Being leptokurtic, the Laplace distribution predicts an excess number of events $> 3\sigma$ as compared to the Gaussian distribution. For example, comparing the likelihood of events above the $3\sigma$ event threshold, the Laplace distribution displays a five-fold excess in the probability $P(|R - \langle R \rangle| > 3\sigma)$, where $P(|R - \langle R \rangle| > 3\sigma) = \exp(-3\sqrt{2}) \approx 0.014$ for the Laplace distribution, whereas $P(|R - \langle R \rangle| > 3\sigma) = \mathrm{Erfc}(3/\sqrt{2}) \approx 0.0027$ for the Gaussian distribution. The large $R$ values correspond to periods of rapid growth and decline in the use of words during the crucial “infant” and “adolescent” lifetime phases. In Fig. 6(b) we also show that the growth rate distribution $P(r')$ for the relatively common words comprising dataset (ii) is also well-described by the Laplace distribution.
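The five-fold excess can be checked directly from the two tail probabilities. The sketch below uses the standard closed forms for unit-variance distributions, $\exp(-\sqrt{2}\,k)$ for the Laplace tail and $\mathrm{erfc}(k/\sqrt{2})$ for the Gaussian tail; the function names are chosen for this example.

```python
import math


def laplace_tail(k):
    """P(|R - <R>| > k*sigma) for a Laplace distribution with standard deviation sigma."""
    # The Laplace scale parameter is sigma / sqrt(2), hence the sqrt(2) factor.
    return math.exp(-math.sqrt(2) * k)


def gaussian_tail(k):
    """P(|R - <R>| > k*sigma) for a Gaussian distribution with standard deviation sigma."""
    return math.erfc(k / math.sqrt(2))


k = 3
print(laplace_tail(k), gaussian_tail(k), laplace_tail(k) / gaussian_tail(k))
```

At `k = 3` the ratio of the two probabilities is a little above five, and it grows quickly for larger `k` because the Gaussian tail decays much faster than the exponential one.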

For hierarchical systems consisting of units each with complex internal structure [32] (e.g. a given country consists of industries, each of which consists of companies, each of which consists of internal subunits), a non-trivial scaling relation between the standard deviation of growth rates $\sigma(r|S)$ and the system size $S$ has the form

$$\sigma(r|S) \sim S^{-\beta}. \tag{5}$$

The theoretical prediction in refs. 32 and 33 that $\beta \in [0, 1/2]$ has been verified for several economic systems, with empirical $\beta$ values typically in the range $0.1 < \beta < 0.3$ [33].

Since different words have varying lifetime trajectories as well as varying relative utilities, we now quantify how the standard deviation $\sigma(r|S_i)$ of growth rates $r$ depends on the cumulative word frequency

$$S_i \equiv \sum_{t} f_i(t) = T_i \langle f_i \rangle$$

of each word, where the sum runs over the recorded lifetime of word $i$. We choose this definition as a proxy for “word size” since a writer can learn and recall a given word through any of its historical uses. Hence, $S_i$ is also proportional to the number of books in which word $i$ appears. This is significantly different from the assumptions of replication null models (e.g. the Moran process) which use the concurrent frequency $f_i(t)$ as the sole factor determining the likelihood of future replication [10,18].

We estimate Eq. (5) by grouping words according to $S_i$ and then calculating the growth rate standard deviation $\sigma(r|S_i)$ for each group. Fig. 7(b) shows scaling behavior consistent with Eq. (5) for large $S_i$, with $\beta \approx$ 0.10–0.21 depending on the corpus. A positive $\beta$ value means that words with larger cumulative word frequency have smaller annual growth rate fluctuations. We conjecture that this statistical pattern emerges from the hierarchical organization of written language [12,13,14,15,16] and the social properties of the speakers who use the words [8,17,34]. As such, we calculate $\beta$ values that are consistent with nontrivial correlations in word use, likely related to the basic fact that books are topical [3] and that book topics are correlated with cultural trends.
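One way to carry out such an estimate is sketched below: observations are grouped into logarithmic size bins, the growth rate standard deviation is computed in each bin, and $\beta$ is read off as minus the least-squares slope on log-log axes. The binning scheme, the ordinary least-squares fit over all bins (rather than only the large-$S$ regime) and the name `estimate_beta` are assumptions of this example rather than the procedure behind Fig. 7.

```python
import math
from statistics import mean, pstdev


def estimate_beta(sizes, growth_rates, n_bins=10):
    """Estimate beta in sigma(r|S) ~ S**(-beta) from paired (S, r) observations."""
    log_s = [math.log10(s) for s in sizes]
    lo, hi = min(log_s), max(log_s)
    if hi == lo:
        raise ValueError("all sizes are equal; no scaling can be estimated")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for ls, r in zip(log_s, growth_rates):
        k = min(int((ls - lo) / width), n_bins - 1)
        bins[k].append((ls, r))
    xs, ys = [], []
    for members in bins:
        if len(members) < 2:
            continue
        sd = pstdev(r for _, r in members)
        if sd > 0:
            xs.append(mean(ls for ls, _ in members))  # mean log-size in the bin
            ys.append(math.log10(sd))
    if len(xs) < 2:
        raise ValueError("need at least two populated bins")
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return -slope


# Synthetic check: fluctuations of size S**(-0.2) at five sizes.
sizes, rates = [], []
for k in range(5):
    s = 10.0 ** k
    sizes += [s, s]
    rates += [s ** -0.2, -(s ** -0.2)]
print(estimate_beta(sizes, rates, n_bins=5))
```

On real data the fit would normally be restricted to the large-$S$ bins, where the text reports that the scaling holds.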

Figure 7 Scaling in the growth rate fluctuations of words. We show the dependence of growth rates on the cumulative word frequency using words that satisfy the criterion $T_i \geq 10$ years. We verify similar results for threshold values $T_c$ = 50, 100 and 200 years. (a) Average growth rate $\langle r \rangle$ saturates at relatively constant values for large $S$. (b) Scaling in the standard deviation of growth rates $\sigma(r|S) \sim S^{-\beta}$ for words with large $S$. This scaling relation is also observed for the growth rates of large economic institutions, ranging in size from companies to entire countries [31,33]. Here this size-variance relation corresponds to scaling exponent values $0.10 < \beta < 0.21$, which are related to the non-trivial bursting patterns and non-trivial correlation patterns in literature topicality as indicated by the quantitative relation to the Hurst exponent, $H = 1 - \beta$, shown in ref. 35. We calculate $\beta_{\mathrm{Eng.}} \approx 0.16 \pm 0.01$, $\beta_{\mathrm{Eng.fict}} \approx 0.21 \pm 0.01$, $\beta_{\mathrm{Spa.}} \approx 0.10 \pm 0.01$ and $\beta_{\mathrm{Heb.}} \approx 0.17 \pm 0.01$.

Quantifying the long-term cultural memory

Recent theoretical work [35] shows that there is a fundamental relation between the size-variance exponent $\beta$ and the Hurst exponent $H$ quantifying the auto-correlations in a stochastic time series. The novel relation $H = 1 - \beta$ indicates that the temporal long-term persistence is intrinsically related to the capability of the underlying mechanism to absorb stochastic shocks. Hence, positive correlations ($H > 1/2$) are predicted for non-trivial $\beta$ values (i.e. $0 \leq \beta \leq 0.5$). Note that the Gibrat proportional growth model predicts $\beta = 0$ and that a Yule-Simon urn model predicts $\beta = 0.5$ [33]. Thus, $f_i(\tau)$ belonging to words with large $S_i$ are predicted to show significant positive correlations, $H_i > 1/2$.
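As a worked illustration of the relation $H = 1 - \beta$, the snippet below applies it to the central $\beta$ estimates quoted in the Fig. 7 caption. The resulting $H$ values are simple arithmetic consequences of that relation, not independently measured exponents, and the dictionary layout is chosen for this example.

```python
# Central beta estimates as quoted in the Fig. 7 caption.
beta = {"English": 0.16, "English fiction": 0.21, "Spanish": 0.10, "Hebrew": 0.17}

# Hurst exponent implied by H = 1 - beta, rounded to two decimals.
hurst = {corpus: round(1.0 - b, 2) for corpus, b in beta.items()}

for corpus, h in hurst.items():
    print(f"{corpus}: H = {h}")
```

All four implied values lie well above 1/2, which is the sense in which the measured $\beta$ values point to positive long-term correlations.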