Alexander M. Petersen, Joel Tenenbaum, Shlomo Havlin, and H. Eugene Stanley, "Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death" (appearing in Scientific Reports, 3/15/2012):

We analyze the dynamic properties of 10^7 words recorded in English, Spanish and Hebrew over the period 1800–2008 in order to gain insight into the coevolution of language and culture. We report language independent patterns useful as benchmarks for theoretical models of language evolution. A significantly decreasing (increasing) trend in the birth (death) rate of words indicates a recent shift in the selection laws governing word use. For new words, we observe a peak in the growth-rate fluctuations around 40 years after introduction, consistent with the typical entry time into standard dictionaries and the human generational timescale. Pronounced changes in the dynamics of language during periods of war shows that word correlations, occurring across time and between words, are largely influenced by coevolutionary social, technological, and political factors. We quantify cultural memory by analyzing the long-term correlations in the use of individual words using detrended fluctuation analysis.

Here's the most striking result:

My first thought was that the decreasing "birth rate" of words was a trivial consequence of the necessarily downward-accelerated shape of type-token curves (like those exhibited here and here). When you first start scanning a stream of words (or insects, or random space-separated character strings, or pretty much anything else where individuals are instances of types), every individual that you see is a member of a new type ("species" or "word" or whatever). As the number of word tokens that you've seen goes up, the probability that the next token will be an instance of a new type obviously goes down; and this pattern will continue until you've seen every possible type (if that ever happens).
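To make the type-token point concrete, here is a minimal sketch that draws tokens from a fixed vocabulary and counts how many previously unseen types turn up in each successive block. The Zipf-like weights, the vocabulary size, and the function name `new_type_rate` are my own illustrative assumptions, not anything taken from the paper.

```python
import random

def new_type_rate(num_types=5000, num_tokens=50000, block=5000, seed=0):
    """Count previously unseen types in each successive block of tokens."""
    rng = random.Random(seed)
    # Assumption: Zipf-like weights, where the type of rank r has weight 1/r.
    weights = [1.0 / r for r in range(1, num_types + 1)]
    tokens = rng.choices(range(num_types), weights=weights, k=num_tokens)
    seen = set()
    new_per_block = []
    for start in range(0, num_tokens, block):
        before = len(seen)
        seen.update(tokens[start:start + block])
        new_per_block.append(len(seen) - before)
    return new_per_block

rates = new_type_rate()
print(rates)  # new types found in each block of 5000 tokens
```

The vocabulary here never changes, yet the count of "new" types per block falls steadily, which is the trivial effect described above.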

But the authors of this paper are much too smart to make this elementary mistake. Their definition of "word birth" and "word death":



[The paper's definitions of "word birth" and "word death" were shown here as images.]

There may still be a bit of a problem here, in that during the early years of collection, when no words have yet been (seen to be) born, the rate of apparent word birth will presumably be artificially inflated. How big this effect is, and how long it lasts, will depend on the details of the collection and perhaps on other details of their data processing. There's potentially a similar problem at the end of the time-series.

Since the data is available from the Google ngrams site, it's possible to replicate their work and examine this sort of thing in detail. One could also look at the results of various simulated random processes, to help clarify which aspects of their time-series reflect the nature of the underlying process, and which might be artefacts of their processing method.
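As a rough illustration of the kind of simulation I have in mind, the sketch below generates a completely stationary vocabulary in which nothing is ever really born or dies, and then records the first and last year in which each word is observed. Treating first and last observation as "birth" and "death" is a deliberate simplification on my part and is not the authors' definition; the probability range and the function name are likewise hypothetical.

```python
import random

def apparent_births_and_deaths(num_words=2000, num_years=50, seed=1):
    """Tally first-seen and last-seen years for words in a stationary process."""
    rng = random.Random(seed)
    births = [0] * num_years
    deaths = [0] * num_years
    for _ in range(num_words):
        # Assumption: each word has a fixed chance of appearing in any year,
        # so rare words show up only occasionally.
        p = rng.uniform(0.02, 0.5)
        years_seen = [y for y in range(num_years) if rng.random() < p]
        if not years_seen:
            continue  # never observed at all, so never counted
        births[years_seen[0]] += 1   # apparent "birth": first year observed
        deaths[years_seen[-1]] += 1  # apparent "death": last year observed
    return births, deaths

births, deaths = apparent_births_and_deaths()
print("births, first 5 years:", births[:5], "last 5 years:", births[-5:])
print("deaths, first 5 years:", deaths[:5], "last 5 years:", deaths[-5:])
```

Apparent births pile up at the start of the window and apparent deaths at the end, purely because of where the observation window begins and ends. How much of this survives under the paper's actual definitions is exactly what a replication would need to check.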

One critical consideration, however, is that this paper is not really about words at all: it's about contiguous letter-strings in optical-character-reader output for scanned printed books. Different inflected forms of a word are different "words"; different word spellings are different "words"; word-fragments split typographically across lines are different "words"; typos are different "words"; OCR errors are different "words". For expositional clarity, let's call the strings in the Google 1-gram corpus "g-words".

Given this, their results are surely affected by the fact that the English "long s" lasts into the first few decades of their test period; the fact that English spelling was not fully codified until after the start of their test period; and the fact that printing technology, paper preservation, and other factors mean that the quality of Google's OCR will in general be better for more recent works. (And maybe editing and printing really are better now than they were in, say, 1900, though I'm not convinced in advance that this is true…)
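A toy example shows how quickly string-level counting multiplies "words". The sketch below uses a naive whitespace split, which is my own simplification and not Google's actual tokenization, on an invented sample containing a long-s OCR misreading ("Congrefs", "muft"), a line-break hyphenation ("Con- gress"), and a typo ("teh").

```python
from collections import Counter

def g_words(text):
    """Count each distinct whitespace-delimited string, with no normalization."""
    return Counter(text.split())

# Invented sample text: the same few words in several surface forms.
sample = (
    "The Congress must act. The Congrefs muft act. "
    "The Con- gress must act, and teh Congress acts."
)

counts = g_words(sample)
for string, n in sorted(counts.items()):
    print(string, n)
```

Every variant is tallied as its own string, so a form like "Congrefs" can be "born" and "die" without any change in the language itself.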

The authors recognize these issues to some extent in the paper:

The γ_b(t) and γ_d(t) time series plotted in Fig. 2 for the 200-year period 1800–2000 show trends that intensifies after the 1950s. The modern era of publishing, which is characterized by more strict editing procedures at publishing houses, computerized word editing and automatic spell-checking technology, shows a drastic increase in the death rate of words. Using visual inspection we verify most changes to the vocabulary in the last 10–20 years are due to the extinction of misspelled words and nonsensical print errors, and to the decreased birth rate of new misspelled variations and genuinely new words.

In other words, essentially all of the "drastic increase in the death rate of words" is in fact due to changes in the rate of mistakes at various stages of the data production process — spelling, editing, type-setting, and optical character recognition — that leads from the history of language to the lists of strings in the Google unigram corpus. And some portion — maybe most — of the "dramatic" decrease in the birth rate of g-words is also really a dramatic decrease in such book production and data processing errors, which have nothing to do with the life or death of "words" in the linguistic sense at all.

Chris Shea in the Wall Street Journal has a reasonable take on this, I think — duly impressed by the potential value of newly accessible linguistic data, and duly skeptical of the value and interpretation of specific observed patterns ("The New Science of the Birth and Death of Words: Have physicists discovered the evolutionary laws of language in Google's library?", WSJ 3/16/2012). He was most impressed by a "tipping point" observation:

The authors even identified a universal "tipping point" in the life cycle of new words: Roughly 30 to 50 years after their birth, they either enter the long-term lexicon or tumble off a cliff into disuse. The authors suggest that this may be because that stretch of decades marks the point when dictionary makers approve or disapprove new candidates for inclusion. Or perhaps it's generational turnover: Children accept or reject their parents' coinages.

Or perhaps the referents and sources of topical g-words — presidents, kings, battles, laws, authors, inventions, slogans — fade in and out of cultural focus. I'd be very surprised to learn that dictionaries have anything to do with this phenomenon; and slightly less surprised to learn that generational rebellion plays a significant role. Luckily, it's an empirical question — given a systematic survey of what kinds of g-words have tipped which way when, we can evaluate which processes are responsible for how much of the effect (and how strong the effect is to start with).

And even better, we've all got access to the data that allows this kind of follow-up to be done!
