But looking closer at the books originally included in the study, you might start to question the reliability of those results. To begin with, the analysis used not only Robinson Crusoe and A Christmas Carol, but also books such as Notes on Nursing and A History of Art for Beginners. A compilation of Hans Christian Andersen tales was handled as if it were a single story, rather than a series of stand-alone narratives. The book that fit the Icarus arc best was a collection of 196 yoga sutras. Another odd marriage was the ‘Cinderella’ arc and its top fit: Boethius’ The Consolation of Philosophy.

Something is not quite right here, and indeed, this is one of the difficulties of doing automated analysis. It is a touchy business to take a large chunk of information, like all the books available on Project Gutenberg, and filter it so that the answers you get match the question you think you’re asking. Andrew Reagan, the graduate student who is the paper’s lead author, readily agrees: even getting to this hodgepodge of texts took a great deal of weeding on his part. Project Gutenberg, after all, is thick with dictionaries and poems and even the text of the Human Genome Project, all of which had to be removed.

Since June, when he first put the paper online, Reagan has received advice and tips on how to do a better job of filtering the data. For instance, he’s learned how to access the Library of Congress classifications for the books on Project Gutenberg. That’s made all the difference: ‘I was able to use that and select for just full works of English fiction,’ he said, so that his latest, revised version of the paper, put up this September, uses only those.

As it happens, the same categories still show up. And they still cover about 85 percent of the stories. But that goes to show that the patterns aren’t exclusive to works of fiction, as one might have assumed if the group had looked only at verified fiction at the beginning. It’s hard to know how to interpret these arcs without knowing exactly why they exist, or what they might represent from the readers’ perspective.

In the meantime, the Vermont group is working on getting detailed information about texts digitized by Google Books, which should yield more data on stories published during the previous century in the United States. The Google data should make it possible to take books from a certain period and compare them with books from the same place at a different time, or another place at the same time, to see if interesting conclusions can be drawn. And future results might also sketch out the archetypal emotional shapes of certain genres—detective fiction, for instance, or romance.

Stepping back, there is a bigger, overarching question here. Are there, in fact, surprises to be stumbled on in this way? Can using computational tools to digest far more literature than a single human could read in the same amount of time tell us things we’d never have noticed on our own? It’s hard to know. But when you think of the time it would take to read every novel on Project Gutenberg, and the skill and effort required to describe what patterns are there, you can see why some people, at least, think it is worth a try.

This post appears courtesy of Aeon Magazine.
