Google's massive trove of scanned books could be useful for researchers studying the evolution of culture.

In a paper published Dec. 16 in Science, researchers turned part of that vast textual corpus into a 500-billion-word database in which the frequency of words can be measured over time and space.

Their initial analyses, which include the cultural trajectories of prominent modern thinkers and the conjugation of irregular verbs, hint at what might be done.

"There are many more questions, that we could never think of, that this data makes possible," said Harvard University evolutionary dynamicist Jean-Baptiste Michel. "What we present in the paper is our first explorations of what becomes possible when you have this dataset."

The new research is part of an emerging effort to apply rigorous statistical analysis, long a staple of the study of biological evolution, to cultural evolution.

Unlike biological evolution, however, which can be studied through the fossil record and in genomic comparisons, cultural evolution has proved difficult to study.

Researchers have used archaeological documentation of Polynesian canoe shapes and records painstakingly assembled by comparative linguists, but rich and rigorously compiled datasets are rare.

One potential source is Google, which has scanned some 15 million books, or roughly 12 percent of every book ever published. Michel and his colleagues turned one-third of these, selected for legibility and fully documented origins, into a massive word database.

Patterns that emerge from queries of the database are not necessarily answers unto themselves, they say, but a way of illuminating subjects for further investigation.

"It's not just an answer machine. It's a question machine," said study co-author Erez Lieberman-Aiden, a computational biologist at Harvard University. "Think of this as a hypothesis-generating machine."

In the new study, the researchers restricted their queries to single words and names, as more sophisticated querying raised the potential of copyright violation. (Google and book publishers are currently negotiating terms of access to copyrighted material, putting scientific accessibility and legal restrictions at odds.)

Even with these limitations, they were able to show how verbs with irregular endings – dwelt instead of dwelled, burnt instead of burned – have been regularized in different fashion in the United States and the United Kingdom.
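As a rough illustration of the kind of measurement involved, the share of an irregular form can be computed from yearly counts of the two spellings. The sketch below is not the study's code: the function name `irregular_share` and the toy counts are assumptions invented for illustration, not figures from the database.

```python
from typing import Dict, Tuple

# Toy counts invented for illustration only; NOT figures from the study.
# Maps year -> (occurrences of "burnt", occurrences of "burned").
TOY_COUNTS: Dict[int, Tuple[int, int]] = {
    1900: (60, 40),
    1950: (30, 70),
    2000: (10, 90),
}


def irregular_share(counts: Dict[int, Tuple[int, int]]) -> Dict[int, float]:
    """Return, for each year, the fraction of uses that take the irregular form."""
    shares: Dict[int, float] = {}
    for year, (irregular, regular) in sorted(counts.items()):
        total = irregular + regular
        if total == 0:
            # Skip years in which neither form appears, to avoid dividing by zero.
            continue
        shares[year] = irregular / total
    return shares


if __name__ == "__main__":
    for year, share in irregular_share(TOY_COUNTS).items():
        print(year, f"{share:.0%}")
```

Running the same calculation separately on counts from American and British books would give two curves that can be compared, which is the sort of contrast the researchers describe.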

They also traced the prominence of 20th-century thinkers – at least numerically, Freud overtook Darwin shortly after World War II – and quantified the public effects of censorship on intellectuals in China and Nazi Germany.
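A claim that one name overtook another comes down to comparing two frequency series year by year. Here is a minimal sketch, assuming yearly frequencies are already available as dictionaries. The function `first_overtake_year`, the placeholder series and all numbers are hypothetical and are not data from the study.

```python
from typing import Dict, Optional


def first_overtake_year(
    series_a: Dict[int, float], series_b: Dict[int, float]
) -> Optional[int]:
    """Return the first year in which series_a exceeds series_b after
    having been at or below it in an earlier year, or None if that never happens."""
    shared_years = sorted(set(series_a) & set(series_b))
    was_behind = False
    for year in shared_years:
        if series_a[year] > series_b[year]:
            if was_behind:
                return year
        else:
            was_behind = True
    return None


# Hypothetical frequencies (mentions per million words), invented for illustration.
NAME_A = {1930: 1.0, 1940: 2.0, 1950: 3.5}
NAME_B = {1930: 3.0, 1940: 3.0, 1950: 3.0}

if __name__ == "__main__":
    print(first_overtake_year(NAME_A, NAME_B))
```

Only years present in both series are compared, so gaps in one series do not produce a spurious crossover.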

Another analysis found that fame both accrues and fades faster now than it did a century ago, giving quantitative form to an intuitively held sentiment. That example is particularly instructive: the database identified a trend, but the implied social dynamics need to be studied through non-quantitative approaches.

Cultural evolution researchers greeted the database with qualified enthusiasm.

"There's a shortage of datasets. This might add another important database. But how valuable it's going to be is going to require a lot of thought about various biases in how the data is gathered," said Stanford University biologist Paul Ehrlich, whose investigations of Polynesian canoe design were among the first of the new cultural-evolution studies.

Ehrlich cited the frequency of obscenity or the treatment of women as two off-the-cuff examples of topics for which a database of published books may not be a simple indicator of cultural trends.

"How the books reflect society is a major issue that depends a lot on what particular research you're interested in," he said.

Mark Pagel, a University of Reading evolutionary biologist who has studied the evolution of language, called the database "thrilling."

But like Ehrlich, he said the usefulness of the database would become evident only with time, and would require more sophisticated use.

To describe the database's potential for studying cultural evolution, the study authors coined the term "culturomics," which echoes the modern field of genomics.

"There was great promise to genomics, and enormous hype surrounding the completion of the Human Genome Project. It was a few years before people realized that having a list of genes wasn't very useful at all. We now appreciate that it's not genes that matter, but how genes are expressed in bodies," said Pagel.

"I'm not saying the data isn't useful. It's just that the database is not going to cough up simple answers," he said.

The database is freely available for online queries and complete download.

Images: 1) Textual frequencies of influential western thinkers during the 20th century./Science. 2) Contrasting evolution of "burned" and "burnt" in the United States and United Kingdom./Science. 3) Culinary trends./Science.

Citation: "Quantitative Analysis of Culture Using Millions of Digitized Books." By Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, Erez Lieberman Aiden. Science, Vol. 330 Issue 6011, Dec. 17, 2010.