
Will Brockman of Google explains that

There was a problem with apostrophes in the Ngram viewer front end – my fault, and I corrected it yesterday (1/1/2011).

This is admirably quick work, especially on New Year's Day (!): I wrote about problems with apostrophes on Dec. 28 and 29.

Now we can see the growth of nor'easter relative to northeaster, at least in whatever selection of published books the "American English" corpus includes:

Uncertainty about exactly what's in the corpus is more problematic in considering the historical development of contractions. Consider the graphs for won't vs. will not in several sub-corpora, starting with English overall:

American English:

British English:

English Fiction:

"English One Million":

The differences kind of make sense, but it's not clear what kind of sense — other differences would have made sense as well, and it's hard to know what these particular differences mean. If we had access to a reasonable sample of the underlying corpus and metadata — even a fraction of the part up to 1922 — we would have a better idea of what any one of these graphs really means about the development of the language. And we could look at things like the differences between narration and dialogue in fiction, or the development of different forms in the spoken parts of plays.

For a point of comparison, here's the overall plot from the COHA corpus, whose shape is (in some ways) different from all of the above:

[One caveat: As I understand it, the "percentages" across the Google n-gram counts are not comparable for different orders of n-grams, because not all 2-grams are retained in the collection, an even smaller percentage of 3-grams is retained, etc. If this is correct, then since "won't" is a 3-gram while "will not" is a 2-gram, we can't conclude anything from the Google graphs about their frequencies relative to one another. However, the 2-gram [won '] and the 3-gram [won ' t] seem to yield identical results, so perhaps this is wrong.

Update — as Will Brockman explains in a comment below, I was indeed wrong about this. The percentages are calculated with the denominator of the proportion being the total number of n-grams in the book collection being used, not the total in the subset that is published (which for n>=2 is limited to n-grams that occur at least 40 times).]
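To make the denominator point concrete, here is a toy sketch in Python. The counts are made up, and the function name `ngram_shares` is hypothetical; this is not Google's code. It assumes only what the update says: n-grams occurring fewer than 40 times are dropped from the published data, but the percentages are computed against the total for the whole collection.

```python
from collections import Counter

def ngram_shares(counts, min_count=40):
    """For each published n-gram, return a pair of proportions:
    (share of ALL n-grams in the collection,
     share of only the n-grams that meet the publication threshold)."""
    total_all = sum(counts.values())
    published = {g: c for g, c in counts.items() if c >= min_count}
    total_published = sum(published.values())
    return {
        g: (c / total_all, c / total_published)
        for g, c in published.items()
    }

# Made-up 2-gram counts for an imaginary collection (illustration only).
toy_counts = Counter({
    "will not": 600,
    "do not": 300,
    "rare one": 39,    # below the threshold, so never published
    "rare two": 39,
    "rare three": 22,
})

shares = ngram_shares(toy_counts)
for gram, (of_all, of_published) in shares.items():
    print(f"{gram}: {of_all:.4f} of all, {of_published:.4f} of published")
```

With the first denominator, a 2-gram's share does not depend on how many rare 2-grams fell below the threshold, which is why percentages for different orders of n-grams can be set side by side. With the second, the share would be inflated by however much of the collection was discarded, and that amount differs for each n.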
