

In addition to the word frequency and collocates lists, you can also download large n-grams data files, which are based on data from the 450-million-word Corpus of Contemporary American English (COCA). The standard n-grams list provides the frequency of more than 150 million unique three-word sequences (3-grams) in the corpus. With this data, you can carry out powerful queries offline -- without needing to access the corpus via the web interface. A few examples might be:

all NOUN + NOUN sequences that occur more than 20 times in the corpus (more than 55,000 distinct strings)

nouns occurring before any one of 60-70 different adjectives

the 20,000 most common VERB + the + NOUN sequences

any one of 150-200 words followed by any one of 200-250 other words

very fast processing of narrower searches (collocates of any word you choose), all on your own machine

The n-grams are primarily for use in (computational) linguistics, for language modeling and processing. Among comparable n-grams datasets, we are not aware of any publicly accessible one from a corpus as large as the Corpus of Contemporary American English, other than the Google n-grams sets. And for most people, the COCA n-grams data is probably more usable than the Google data, since it is a size that can actually fit on and run on something besides a high-end workstation or a supercomputer. In addition, the COCA n-grams provide lemma and part-of-speech information, while the Google n-grams are just strings of words.

Feel free to take a look at a sample of the n-grams data. It contains nearly 200,000 3-grams for 400 different words, where the n-gram appears at least ten times in the corpus. Of course, this is just a tiny fraction of the full n-grams set that is available for purchase, which has all 3-grams (including those that occur just once) for all words. The full 150,000,000-n-gram dataset is $195 academic / $395 commercial.
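To give a feel for the kind of offline query listed above -- here, the NOUN + NOUN example -- the following Python sketch filters a tiny invented sample. It assumes one plausible file layout (a frequency column followed by tab-separated word/tag pairs) and CLAWS-style tags where noun tags begin with "nn"; the layout of the actual downloaded files may differ, so check the documentation that comes with the data.

```python
import csv
import io

# Invented sample in an assumed layout:
# frequency, word1, tag1, word2, tag2, word3, tag3 (tab-separated).
SAMPLE = """\
64\tcoffee\tnn1\ttable\tnn1\tin\tii
12\tstone\tnn1\twall\tnn1\tof\tio
83\tthe\tat\tred\tjj\tcar\tnn1
"""

def noun_noun_pairs(lines, min_freq=20):
    """Yield (word1, word2, freq) for 3-grams that start with NOUN + NOUN
    and occur more than min_freq times."""
    for row in csv.reader(lines, delimiter="\t"):
        freq = int(row[0])
        w1, tag1, w2, tag2 = row[1], row[2], row[3], row[4]
        # Assumed CLAWS-style tags: "nn1" singular noun, "nn2" plural noun
        if tag1.startswith("nn") and tag2.startswith("nn") and freq > min_freq:
            yield (w1, w2, freq)

print(list(noun_noun_pairs(io.StringIO(SAMPLE))))
# -> [('coffee', 'table', 64)]
```

The other queries above work the same way: swap the tag test (e.g. tag1 starting with "vv" for VERB + the + NOUN) or match the words themselves against a set of 150-200 candidates.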

