Using Google for informal linguistics experiments

People with an interest in linguistics, academic or casual, often use Google hit counts to measure usage. Google is, after all, a huge, natural, and easily searchable (though un-annotated, unstructured, variously biased) corpus. There are various problems with this, though, including:

1. Google changes your queries “intelligently”: making spelling corrections, trying out various inflections, adapting to the country of your IP address, etc.
2. Google hit counts are rough estimates, and in particular the counts on the first pages are wild guesstimates.
3. Google might be omitting similar pages (which may or may not be what you want); it will only tell you so on the last page, though.

The solution to problem 1 used to be the plus operator (+word), but Google has phased out this feature in favor of double quotes. Even double quotes, however, might not be enough for an exact search. Luckily, Google listened to complaints and there’s now a verbatim search tool.

But notice that verbatim search still doesn’t solve problems 2 and 3; its first-page counts are often wildly incorrect. Always click through to the last results page when counting hits.
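If you run many such comparisons, it can help to build the query URLs consistently. The following is a minimal sketch, not an official API: the function name `build_search_url` is made up for illustration, and the `tbs=li:1` parameter is an assumption based on what the verbatim tool appears to add to the address bar. It is not documented by Google and may change. The sketch only builds a URL to open in a browser; you still have to go to the last results page by hand to read the count.

```python
from urllib.parse import urlencode


def build_search_url(words, exact=True, verbatim=True):
    """Build a Google search URL for the given words (hypothetical helper).

    exact    -- wrap each word in double quotes
    verbatim -- add tbs=li:1, which is ASSUMED to switch on verbatim search
    """
    terms = [f'"{w}"' if exact else w for w in words]
    params = {"q": " ".join(terms)}
    if verbatim:
        # Assumption: undocumented parameter observed for the verbatim tool.
        params["tbs"] = "li:1"
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url(["triceratops", "semiology"]))
```

Passing `exact=False, verbatim=False` gives the plain query, so the same pair of words can be opened in each of the search modes discussed here.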

As an example, I just searched for pages containing the words «triceratops» and «semiology» anywhere on the page, not as a phrase. The results I got were:

| Search tool | Query | Results page | Hit count |
|---|---|---|---|
| Normal search | triceratops semiology | 1 | 59,400 |
| Normal search | triceratops semiology | 10 | 58,600 |
| Normal search | triceratops semiology | 33/97 | 330 (similar results omitted) / 59,200 (included) |
| Normal search | "triceratops" "semiology" | 1 | 2,280 |
| Normal search | "triceratops" "semiology" | 10 | 2,280 |
| Normal search | "triceratops" "semiology" | 25/60 | 248 (similar results omitted) / 2,250 (included) |
| Verbatim search | triceratops semiology | 1 | 20,200 |
| Verbatim search | triceratops semiology | 7/12 | 61 (similar results omitted) / 113 (included) |
| Verbatim search | "triceratops" "semiology" | 1 | 18,400 |
| Verbatim search | "triceratops" "semiology" | 7/12 | 61 (similar results omitted) / 113 (included) |
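To get a feel for how far off the first-page numbers are, one can divide each first-page estimate by the count shown on the last results page. The sketch below does just that with the figures from the table above; the row layout and the helper name `inflation` are my own choices for illustration, and the arithmetic says nothing about why the estimates differ.

```python
# Figures copied from the table above:
# (tool, query, first-page estimate, last page with similar results omitted,
#  last page with similar results included)
rows = [
    ("normal", 'triceratops semiology', 59400, 330, 59200),
    ("normal", '"triceratops" "semiology"', 2280, 248, 2250),
    ("verbatim", 'triceratops semiology', 20200, 61, 113),
    ("verbatim", '"triceratops" "semiology"', 18400, 61, 113),
]


def inflation(first_page, last_page):
    """How many times larger the first-page estimate is than the last-page count."""
    return first_page / last_page


for tool, query, first, omitted, included in rows:
    print(
        f"{tool:8} {query:27} "
        f"x{inflation(first, omitted):6.1f} (omitted)  "
        f"x{inflation(first, included):6.1f} (included)"
    )
```

For the verbatim searches, the first-page estimate comes out at well over a hundred times the last-page count, whichever of the two last-page figures is used.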

These results were taken on 2011-11-16. Some fluctuations seem to be taking place (the numbers varied as I tested). 61/113 seems to be the count closest to reality. Notes: