
For the past century or so, the commonest word in English has gradually been getting less common. Depending on data source and counting method, the frequency of the definite article THE has fallen substantially — in some cases at a rate as high as 50% per 100 years.

At every stage, writing that's less formal has fewer THEs, and speech generally has fewer still, so to some extent the decline of THE is part of a more general long-term trend towards greater informality. But THE is apparently getting rarer even in speech, so the change is more than just the (normal) shift of writing style towards the norms of speech.

There appear to be weaker trends in the same direction, at overall lower rates, in German, Italian, Spanish, and French.

I'll lay out some of the evidence for this phenomenon, mostly collected from earlier LLOG posts. And then I'll ask a few questions about what's really going on, and why and how it's happening. [Warning: long and rather wonky.]

Data from the Google Books ngram corpus shows a decline in the frequency of THE, mostly in the last third of the 20th century.

Comparing the first decade of the century with the last decade, we get:

| SOURCE | 1900-1910 | 1990-2000 | Difference |
|---|---|---|---|
| English | 6.39% | 5.28% | -17.3% |
| American English | 5.98% | 4.99% | -16.5% |
| Fiction | 4.97% | 4.45% | -10.5% |
| British English | 5.86% | 5.32% | -9.3% |

(And the systematically lower frequency of THE in the "Fiction" dataset represents the influence of a generally less formal genre.)
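The percentages in these tables are token rates: occurrences of THE divided by the total number of tokens. Here is a minimal sketch of that measure, assuming a naive regular-expression tokenizer and case-folding. Both are illustrative assumptions; each corpus has its own tokenization, so this will not reproduce the published figures exactly.

```python
import re


def the_rate(text: str) -> float:
    """Return occurrences of 'the' as a fraction of all word tokens in text."""
    # Assumption: a token is a run of letters or apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count("the") / len(tokens)


sample = "The cat sat on the mat."
print(f"{the_rate(sample):.1%}")
```

Counting THE, The, and the together matches the practice described later in the post of summing forms with and without initial capitalization.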

The Corpus of Historical American English shows a similar effect, spread more evenly over the 20th century:

| SOURCE | 1900-1909 | 1990-1999 | Difference |
|---|---|---|---|
| COHA | 6.53% | 5.37% | -17.8% |

The Corpus of Contemporary American English shows a decline of nearly 8% over the 25 years from 1990 to 2015, which would be about 28% compounded for a century:

| SOURCE | 1990-1994 | 2010-2015 | Difference | Projection (100 years) |
|---|---|---|---|---|
| COCA | 5.62% | 5.18% | -7.9% | -28% |
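The century projection is a compounding calculation: if the rate is multiplied by a constant factor every 25 years, four such periods make a century. A minimal sketch, using the rounded COCA figures from the table. The function names are mine, and because the inputs are rounded the last digit can differ slightly from the table.

```python
def relative_change(start: float, end: float) -> float:
    """Fractional change from start to end, e.g. -0.08 for an 8% decline."""
    return (end - start) / start


def project(start: float, end: float, years: float, horizon: float = 100.0) -> float:
    """Compound the observed ratio end/start over `horizon` years."""
    ratio = end / start
    return ratio ** (horizon / years) - 1.0


# Rounded COCA rates (percent of tokens) for 1990-1994 and 2010-2015.
coca_start, coca_end = 5.62, 5.18

print(f"observed:  {relative_change(coca_start, coca_end):+.1%}")
print(f"projected: {project(coca_start, coca_end, years=25):+.1%}")
```

Compounding rather than simple multiplication matters here: four times 8% would suggest a 32% decline, while compounding gives about 28%.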

And COCA's rates by section (for the period 1990-2015) exhibit the genre/formality effect — the frequency of THE in the "Spoken" section is about 27% lower than the rate in the "Academic" section:

| Spoken | Fiction | Newspaper | Magazine | Academic |
|---|---|---|---|---|
| 4.62% | 5.29% | 5.35% | 5.36% | 6.30% |

The COCA "Spoken" segments are relatively formal interview transcripts. In the Fisher corpus of conversational telephone speech, THE's overall frequency is only 2.47%, a little more than half the rate of the "Spoken" segment of COCA.

And if we break things out by age and sex, we see the pattern typical of a language change in progress. Younger people use THE at lower rates than older people, and in each age group, women use THE at lower rates than men:

The same numbers in tabular form:

| SEX | Age <28 | Age 28-40 | Age >40 |
|---|---|---|---|
| MALE | 2.53% | 2.72% | 2.97% |
| FEMALE | 2.31% | 2.49% | 2.62% |

Data from a (more recent) collection of Facebook posts from 75,000 volunteers shows a similar (but even more advanced) pattern, with teen women's posts dipping below 2%:

It's conceivable that these are stable life-cycle and gender effects, but I doubt it — in every case where I've seen a pattern like this, independent evidence has shown that the pattern reflects a change in progress.

American presidents' State of the Union addresses show a decline of about 50% over the past 115 years in the frequency of THE:

| SOURCE | 1900-1910 | 2005-2015 | Difference |
|---|---|---|---|
| SOTU addresses | 9.21% | 4.67% | -49.3% |

If we compare the SOTU data with COHA and Google over a longer span of time, we can see that the trends are in the same direction, although the SOTU addresses show an effect of greater magnitude:

The biomedical abstracts in the MEDLINE dataset show a steady decline of 26% over 40 years, from 6.48% in 1975 to 4.82% in 2014 — which would project to a decline of over 50% in 100 years, matching the SOTU rate:

The Google Books datasets in other languages seem to show flatter profiles for definite determiners over the course of the 20th century. Let's start with data for English, created by the same method that I used for German, Italian, Spanish, and French below: asking the Google Books ngram interface for the sum of determiner forms with and without initial capitalization, with smoothing=3.

(For the results reported above, I downloaded the various English-language 1gram datasets, 2012 edition, pulled the counts out myself, including variants like THE, and plotted the results without any smoothing — which is why the numbers are slightly different.)
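A minimal sketch of that kind of count extraction follows. It assumes the 2012-edition 1-gram line layout as I understand it (ngram, year, match_count, volume_count, separated by tabs) and assumes per-year totals are available separately. The toy lines and totals are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def yearly_counts(lines: Iterable[str], word: str = "the") -> Dict[int, int]:
    """Sum match counts per year over all case variants of `word`."""
    counts: Dict[int, int] = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        # Case-folding merges the, The, THE; tagged entries such as
        # "the_DET" do not match and are left out.
        if ngram.lower() == word:
            counts[int(year)] += int(match_count)
    return dict(counts)


def yearly_rates(counts: Dict[int, int], totals: Dict[int, int]) -> Dict[int, float]:
    """Divide each year's count by that year's total token count."""
    return {y: counts[y] / totals[y] for y in counts if totals.get(y)}


# Made-up example data, not real counts.
toy_lines = [
    "the\t1900\t600\t10",
    "The\t1900\t60\t10",
    "THE\t1900\t5\t3",
    "then\t1900\t40\t9",
    "the_DET\t1900\t600\t10",
]
toy_totals = {1900: 10000}

rates = yearly_rates(yearly_counts(toy_lines), toy_totals)
print(rates)
```

Because nothing is smoothed here, yearly values computed this way will wobble more than the curves the ngram viewer draws.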

| SOURCE | 1900-1910 | 1990-2000 | Difference |
|---|---|---|---|
| ENGLISH | 6.36% | 5.27% | -17.1% |

Cherry-picking the maximum value (in 1916) and the minimum value (in 2000) doesn't change the numbers by a lot:

| SOURCE | MAX: 1916 | MIN: 2000 | Difference |
|---|---|---|---|
| ENGLISH | 6.42% | 5.23% | -18.5% |
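For readers who want to reproduce this kind of series from raw yearly values, here is a minimal sketch of the two steps involved: summing the case variants and applying smoothing=3. I take smoothing=3 to mean that each year is averaged with up to three years on either side. Truncating the window at the ends of the series is my assumption, and the yearly values below are made up.

```python
from typing import List, Sequence


def summed(*series: Sequence[float]) -> List[float]:
    """Add several equally long yearly series element by element."""
    return [sum(vals) for vals in zip(*series)]


def smooth(values: Sequence[float], k: int = 3) -> List[float]:
    """Centered moving average over up to k years on each side."""
    out = []
    for i in range(len(values)):
        # Assumption: the window is simply truncated at the series edges.
        window = values[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out


# Made-up yearly percentages for two case variants, for illustration only.
years = [1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917]
lower = [5.7, 5.8, 5.8, 5.9, 5.9, 6.0, 5.9, 5.8]
upper = [0.4, 0.4, 0.5, 0.4, 0.5, 0.4, 0.5, 0.4]

smoothed = smooth(summed(lower, upper), k=3)
peak_year = max(zip(smoothed, years))[1]  # year with the highest smoothed rate
print(peak_year)
```

Picking the maximum and minimum from the smoothed series rather than the raw one makes the cherry-picked comparison less sensitive to a single noisy year.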

German also shows a decline in the summed frequency of the various forms of the definite determiner (which unfortunately are homographs with pronominal forms):

Comparing the first and last decade gives us a decline of 7.2%, notably smaller than the English dataset's 17.1%:

| SOURCE | 1900-1910 | 1990-2000 | Difference |
|---|---|---|---|
| GERMAN | 9.61% | 8.91% | -7.2% |

And comparing the mid-century maximum to the end-of-century minimum increases the difference, though still not to the level of the English dataset:

| SOURCE | MAX: 1959 | MIN: 2000 | Difference |
|---|---|---|---|
| GERMAN | 10.05% | 8.78% | -12.7% |

Among the Romance languages available through the Google Books ngram viewer, Italian shows the greatest change in definite article frequency over the course of the 20th century. (Though note that, as in the other non-English languages considered here, the definite articles overlap with pronoun uses…)

Comparing the first and last decade gives us a decline of 8.1%:

| SOURCE | 1900-1910 | 1990-2000 | Difference |
|---|---|---|---|
| ITALIAN | 5.00% | 4.59% | -8.1% |

And comparing the 1923 maximum to the 1985 minimum increases the difference a bit, though still not to the level of the English dataset:

| SOURCE | MAX: 1923 | MIN: 1985 | Difference |
|---|---|---|---|
| ITALIAN | 5.05% | 4.55% | -10.0% |

Spanish shows even less change (though I should note again that there may be some confusion between el the definite article and él the pronoun, and the counts definitely conflate la, las, los the definite articles and la, las, los the object pronouns):

| SOURCE | 1900-1910 | 1990-2000 | Difference |
|---|---|---|---|
| SPANISH | 8.57% | 8.34% | -2.7% |

Cherry-picking the century's maximum and minimum values increases the difference only a little:

| SOURCE | MAX: 1900 | MIN: 1956 | Difference |
|---|---|---|---|
| SPANISH | 8.60% | 8.23% | -4.3% |

And French shows the least overall change (though again the counts conflate articles and object pronouns):

| SOURCE | 1900-1910 | 1990-2000 | Difference |
|---|---|---|---|
| FRENCH | 6.25% | 6.15% | -1.6% |

As with Spanish, the decline is mostly a feature of the first half of the century:

| SOURCE | MAX: 1918 | MIN: 1953 | Difference |
|---|---|---|---|
| FRENCH | 6.39% | 6.05% | -5.4% |

Putting all five languages on the same plot, and showing the changes as proportions relative to the century-wide mean, highlights the differences:
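The normalization is simple: divide each language's series by its own mean, so that 1.0 marks the century-wide average for every language. A minimal sketch follows. It uses only the first-decade and last-decade rates from the tables above as a two-point stand-in for the full yearly series, which is a simplification of mine.

```python
from statistics import mean
from typing import List, Sequence


def relative_to_mean(series: Sequence[float]) -> List[float]:
    """Express each value as a proportion of the series mean."""
    m = mean(series)
    return [v / m for v in series]


# First-decade and last-decade rates (percent) from the tables above.
decade_rates = {
    "English": [6.36, 5.27],
    "German": [9.61, 8.91],
    "Italian": [5.00, 4.59],
    "Spanish": [8.57, 8.34],
    "French": [6.25, 6.15],
}

for lang, series in decade_rates.items():
    first, last = relative_to_mean(series)
    print(f"{lang:8s} {first:.3f} {last:.3f}")
```

Putting everything on this common scale keeps languages with high baseline rates, such as German, from visually dominating those with lower ones.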



English and German seem to show parallel declines in definite-determiner rates, at least in the second half of the 20th century. Other evidence for English yields higher rates of change, and provides additional evidence for change in the first half of the century.

Italian also shows a reasonably convincing pattern of decline.

The evidence for Spanish and French is more equivocal. There does seem to be a modest trend, though mostly in the first half of the century rather than the second half.

For all of the languages other than English, the patterns are surely obscured to some extent by the fact that the determiners involved are homographs with pronouns, though the pronouns are generally much less frequent.

So is there a general decay of European definiteness? Or a specifically Germanic trend? Does German show the same formality, age, and gender effects as English? What about Dutch, Swedish, Norwegian, etc.? What about other languages, related and unrelated, with roughly comparable determiner systems?

Why might English, German, Italian, Spanish, and French have been moving in the same direction, even if at different rates and perhaps in different time periods? Is there some kind of general dynamical law here, a sort of Jespersen's Cycle for determiners?

And in the case of English, we're in a position to ask where all those THEs are going. Among the possibilities that occur to me:

Substitution of other determiners, such as this, that, these, those? Problem: many of these words are also declining in frequency, and any increases are too small to account for much of the change in THE.

More use of 's possessives rather than of possessives: "The X's Y" rather than "The Y of the X". Problem: this is happening, but the construction is way too rare to account for much of what's going on with THE.

Substitution of pronouns for definite descriptions?

Substitution of other constructions for abstract nouns (e.g. "that's why" instead of "that's the reason")?

Substitution of indefinites for definites?

Increased general verbiage (not involving THE) for a given amount of informational content?

None of these seem empirically very promising to me. But as a start, it should be possible to characterize the relevant differences between formal writing at 6.5% THE and conversational transcripts at 2.5% THE. And there's probably research on this topic that I don't know about.

Update: Bob Ladd points out that in Italian, we can look just at "the main masculine forms il and i, which are never pronominal". And the result looks proportionately just like the full set, which increases my confidence that there's a real effect:

Some relevant past LLOG posts:

"SOTU evolution", 1/26/2014

"Decreasing definiteness", 1/8/2015

"Why definiteness is decreasing, part 1", 1/9/2015

"Why definiteness is decreasing, part 2", 1/10/2015

"Why definiteness is decreasing, part 3", 1/18/2015

"Positivity", 12/21/2015

"Normalizing", 12/31/2015

See also:

"Dutch DE", 1/4/2016

"The determiner of the turtle is heard in our land", 1/7/2016

"Correlated lexicometrical decay", 1/9/2016

"Style or artefact or both?", 1/12/2016

"Geolexicography", 1/27/2016
