When Google's Ngram Viewer was launched in December 2010, it encouraged everyone to be an amateur computational linguist, an amateur historical lexicographer, or a little of both. Today, the public interface that allows users to plumb the Google Books megacorpus has been relaunched, and the new version makes it even more enticing to researchers, both scholarly and nonscholarly. You can read all about it in my online piece for The Atlantic, as well as in Jon Orwant's official introduction on the Google Research blog.

The big news for linguists and fellow travelers is the introduction of part-of-speech tagging. While Mark Davies at BYU had previously created his own POS-tagged version of Google Ngrams as part of his corpus collection, he only had access to the publicly available datasets of n-grams (up to 5-grams, with a threshold of 40 occurrences for inclusion) and thus wasn't able to parse the corpus in a systematic fashion. The Google team, on the other hand, was able to go back to the underlying data from the Google Books scanning project and do full-scale tagging and parsing, including identifying sentence boundaries. The specifics are laid out in the paper presented by Slav Petrov, Yuri Lin, et al. at the annual ACL meeting in July, "Syntactic Annotations for the Google Books Ngram Corpus."
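To make the limitation of the public datasets concrete, here is a minimal sketch of how thresholded n-gram counts might be built. It assumes plain whitespace tokenization and uses a hypothetical function name, `ngram_counts`; it is not Google's actual pipeline. The point is only that once sentences are reduced to counted fragments of at most five tokens, and anything under the threshold is dropped, the original sentences can no longer be reassembled for parsing.

```python
from collections import Counter


def ngram_counts(sentences, max_n=5, min_count=40):
    """Count all 1- to max_n-grams and keep those seen at least min_count times.

    Assumption for illustration: tokens are split on whitespace only.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Apply the inclusion threshold; rarer n-grams vanish from the dataset.
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus: one sentence repeated 40 times, another only 39 times.
toy_corpus = ["the united states is"] * 40 + ["a rare phrase"] * 39
kept = ngram_counts(toy_corpus)
```

In this toy corpus, every n-gram from the first sentence survives, while nothing from the second sentence does, even though it occurred 39 times.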

As I note in the Atlantic piece, the smaller corpora that Mark Davies has compiled, such as COCA and COHA, still offer more flexibility in the search interface, such as the ability to search for lemmas or high-frequency collocations from particular time periods. Furthermore, the universal tagset of twelve parts of speech used by Google may disappoint corpus linguists who are more accustomed to dealing with the intricacies of the CLAWS tagset (as used in the BYU corpora). But that coarser tagset (besides being more straightforward to a lay audience) also allows for cross-linguistic comparisons, encompassing the languages currently available via the Ngram Viewer: English, Spanish, French, German, Russian, Italian, Chinese, and Hebrew. I'll be interested to see how researchers take advantage of the POS tags and dependency relations for investigations in these different languages.

The other major advanced search feature to be introduced in the new version is what they're calling "Ngram Compositions," which allows the user to add, subtract, multiply, and divide n-gram counts. That's quite handy, and I give an example of its use in the Atlantic piece: you can construct such queries as (The United States is + The United States has)/The United States, (The United States are + The United States have)/The United States (graph here) to better answer the question of when The United States began to be construed as a grammatically singular entity. The ability to compare different subcorpora (e.g., British vs. American) is another welcome addition.
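For readers who want to see the arithmetic behind such a composition, the following is a rough sketch of the year-by-year calculation. The function name `compose_share` is hypothetical and the counts are invented purely for illustration; they are not real Ngram Viewer output, and the Viewer itself may compute its series differently (for instance, with smoothing).

```python
def compose_share(numerators, denominator):
    """Add the numerator series year by year, then divide by the denominator."""
    result = {}
    for year, total in denominator.items():
        if total == 0:
            continue  # no occurrences of the base phrase that year
        combined = sum(series.get(year, 0.0) for series in numerators)
        result[year] = combined / total
    return result


# Made-up counts per year (assumption: NOT actual corpus data).
united_states = {1850: 100.0, 1900: 200.0, 1950: 400.0}
us_is = {1850: 2.0, 1900: 10.0, 1950: 40.0}
us_has = {1850: 1.0, 1900: 6.0, 1950: 20.0}
us_are = {1850: 8.0, 1900: 6.0, 1950: 4.0}
us_have = {1850: 4.0, 1900: 2.0, 1950: 0.0}

# (The United States is + The United States has) / The United States
singular = compose_share([us_is, us_has], united_states)
# (The United States are + The United States have) / The United States
plural = compose_share([us_are, us_have], united_states)

for year in sorted(united_states):
    print(year, round(singular[year], 3), round(plural[year], 3))
```

Dividing by the count for the bare phrase normalizes for how often The United States appears at all, so the two resulting series can be compared directly as shares of singular versus plural agreement.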

I'm also pleased to see that metadata improvements have been made, as faulty metadata (particularly faulty dating of Google Books volumes) has been a long-standing concern. And the growing size of the Ngrams corpus continues to boggle the mind: for English alone, there are now nearly half a trillion words (468,491,999,592 tokens, to be precise). The previous corpus data remains available for searching (the older corpora have the "2009" identifier), so any research based on the original version will still be replicable. Let's see what the culturomicists come up with this time.

(Thanks to Jon Orwant, lead engineer on the project, for letting me play with the new Ngram Viewer before its public release.)