
Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata, a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic, entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th-century French editions of Le Contrat Social, or for linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.

Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few. (You can find images of most of these on my slides, here. I'm not giving the URLs, since I expect Google will fix most of these particular errors now that they're aware of them.)

And while there may be particular reasons why the 1899 date comes up so much, these misdatings are spread out all over the place. A book on Peter Drucker is dated 1905, a book of Virginia Woolf's letters is dated 1900, Tom Wolfe's The Bonfire of the Vanities is dated 1888, and an edition of Henry James's 1897 What Maisie Knew is dated 1848.



It might seem easy to cherry-pick howlers from a corpus as extensive as this one, but these errors are endemic. Do a search on "internet" in books written before 1950 and Google Scholar turns up 527 hits.

Or try searching on the names of writers or other famous people, restricting your search to works published before the year of their birth. You turn up 182 hits for Charles Dickens, more than 80 percent of them misdated books referring to the writer as opposed to someone else of the same name. The same search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)
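The born-too-late test is easy to automate if you have the records in hand. What follows is only a minimal sketch in Python: the `Record` type, the `anachronisms` function, and the sample data are hypothetical names made up for illustration, not anything Google Books exposes. The birth years are the real ones for the four people named above.

```python
from dataclasses import dataclass
from typing import List

# Birth years of the people named in the text.
BIRTH_YEARS = {
    "Charles Dickens": 1812,
    "Rudyard Kipling": 1865,
    "Greta Garbo": 1905,
    "Barack Obama": 1961,
}


@dataclass
class Record:
    """A hypothetical catalogue entry: title, claimed date, names found in the text."""
    title: str
    claimed_year: int
    mentions: List[str]


def anachronisms(record, birth_years=BIRTH_YEARS):
    """Return the people mentioned in a book dated before they were born."""
    return [
        name
        for name in record.mentions
        if name in birth_years and record.claimed_year < birth_years[name]
    ]


# Illustrative records based on examples from this post.
suspect = Record("Robinson Crusoe", 1719, ["Charles Dickens"])
plausible = Record("Robinson Crusoe", 1919, ["Charles Dickens"])
```

A flagged record is a suspect rather than a proven error: some of the hits really do refer to an earlier person of the same name, so each one still needs a human look.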

A search on books mentioning candy bar that were published before 1920 turns up 66 hits, of which 46, or 70 percent, are misdated. I'd be surprised if that proportion of errors or anything like it held up in general for books in that range, and dating errors are far denser for older works than for the ones Google received from publishers. But even if the proportion is only 5 percent, that suggests hundreds of thousands of dating errors.

In discussion after my presentation, Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries. He was woolgathering, I think. It's true that there are a few collections in the corpus that are systematically misdated, like a large group of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's doing. Of the first ten full-view misdated books turned up by a search for books published before 1812 that mention "Charles Dickens", all ten are correctly dated in the catalogues of the Harvard, Michigan, and Berkeley libraries they were drawn from. Most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text. For example, the 1604 date from a 1901 auction catalogue is drawn from a bookmark reproduced in the early pages, and the 1574 dating (as of this writing) on a 1901 book about English bookplates from the Harvard Library collections is clearly taken from the frontispiece, which displays an armorial bookplate dated 1574.



Similarly, the 1719 date on a 1919 edition of Robinson Crusoe in which Dickens' name appears in an advertisement is drawn from the line on the title page that says the book is reprinted from the author's edition of 1719. And the 1774 date assigned to an 1890 book called London of to-day is derived from a front-matter advertisement for a firm that boasts it was founded in that year.
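Google hasn't said how its dates are extracted, so the following Python is only a hypothetical sketch of the sort of heuristic that would produce exactly these errors: take the first plausible four-digit year in the front matter. The function names and the snippets of "OCR text" are invented for illustration. Only the titles and the pairs of dates come from the examples above.

```python
import re

# Any four-digit year from 1500 to 2099.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def naive_pub_year(front_matter):
    """Return the first plausible year found in the text, or None."""
    match = YEAR.search(front_matter)
    return int(match.group(1)) if match else None


def disagreements(books):
    """List (title, extracted year, catalogue year) wherever the two differ."""
    out = []
    for title, front_matter, catalogue_year in books:
        guess = naive_pub_year(front_matter)
        if guess is not None and guess != catalogue_year:
            out.append((title, guess, catalogue_year))
    return out


# Invented front-matter text; the third field stands in for the library record.
SAMPLE = [
    ("Robinson Crusoe",
     "ROBINSON CRUSOE. Reprinted from the author's edition of 1719. London, 1919.",
     1919),
    ("London of To-day",
     "Advertisement: a firm founded in 1774. LONDON OF TO-DAY. 1890.",
     1890),
]
```

The comparison in `disagreements` is the useful part. Wherever a library catalogue date exists, a mismatch with the extracted date is a cheap signal that the extraction grabbed a bookplate, a reprint notice, or an advertisement.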

Then there are the classification errors. William Dwight Whitney's 1891 Century Dictionary is classified as "Family & Relationships," along with Mencken's The American Language. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as "Antiques and Collectibles." An edition of Moby Dick is classed under "Computers"; a biography of Mae West is classified as "Religion"; The Cat Lover's Book of Fascinating Facts falls under "Technology & Engineering." A 1975 reprint of a classic topology text is "Didactic Poetry"; the medievalist journal Speculum is classified "Health & Fitness."





And a catalogue of copyright entries from the Library of Congress is listed under "Drama," though I had to wonder if maybe that was just Google's little joke.

Here again, the errors are endemic, not simply sporadic. Of the first ten hits for Tristram Shandy, four are classified as fiction, four as "Family & Relationships," one as "Biography & Autobiography," and one is not classified. Other editions of the novel are classified as "Literary Collections," "History," and "Music." The first ten hits for Leaves of Grass are variously classified as "Poetry," "Juvenile Nonfiction," "Fiction," "Literary Criticism," "Biography & Autobiography," and mystifyingly, "Counterfeits and Counterfeiting."

Various editions of Jane Eyre are classified as "History," "Governesses," "Love Stories," "Architecture," and "Antiques & Collectibles" ("Reader, I marketed him").

In his response on the panel, Dan Clancy said that here, too, the libraries were to blame, along with the publishers. But the libraries can't be responsible for books mislabeled as "Health and Fitness" and "Antiques and Collectibles," for the simple reason that those categories are drawn from the BISAC codes that the book industry uses to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And inasmuch as BISAC classifications weren't in use until about 20 years ago, only Google could be responsible for their misapplications on books published earlier than that: the 1904 edition of Leaves of Grass assigned to "Juvenile Nonfiction"; the 1919 edition of Robinson Crusoe assigned to "Crafts & Hobbies"; the 1845 number of the Edinburgh Review assigned to "Architecture"; the 1907 edition of Sir Thomas Browne's 1658 Hydriotaphia: Urne-Buriall, or a discourse of the sepulchrall urnes lately found in Norfolk assigned to "Gardening"; and countless others.

Google's fine Bayesian hand reveals itself even in the classifications of works published after the BISAC categories came into use, such as the 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture and the Body (misdated 1899), which is assigned to "Health & Fitness" — not a classification you could imagine coming from the University of California Press, though you can see how a probabilistic classifier could come up with it, like the "Religion" tag on the Mae West biography subtitled "Icon in Black and White."

But whether it gets the BISAC categories right or wrong, the question is why Google decided to use those headings in the first place. (Clancy denies that they were asked to do so by the publishers, though this might have to do with their own ambitions to compete with Amazon.) The BISAC scheme is well suited to organizing the shelves of a modern 35,000-square-foot chain bookstore or a small public library where ordinary consumers or patrons are browsing for books on the shelves. But it's not particularly helpful if you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example, the BISAC "Juvenile Nonfiction" subject heading has almost 300 subheadings, including separate categories for books about "New Baby," "Skateboarding," and "Deer, Moose, and Caribou." By contrast, the "Poetry" subject heading has just 20 subdivisions in all. That means that Bambi and Bullwinkle get a full shelf to themselves, while Schiller, Leopardi, and Verlaine have to scrunch together in the lone subheading reserved for "Poetry/Continental European." In short, Google has taken the great research collections of the English-speaking world and returned them in the form of a suburban mall bookstore.

These don't exhaust the metadata errors by any means. There are a number of mismatches of titles and texts. Click on the link from the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voices of the Heart, whereas the link on a misdated number of Dickens' Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. The link from the title Supervision and Clinical Psychology takes you to a book called American Politics in Hollywood Film. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James."

More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones' having been the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled.

For the present, then, linguists, humanists and social scientists will have to forgo their visions of using Google Books to assemble all the early nineteenth-century book sale catalogues mentioning Alexander Pope or to track the use of "Gentle Reader" in Victorian novels: the metadata and classifications are simply too poor.

Google is certainly aware of many of these problems (if not on this scale) and they've pledged to fix them, though they've acknowledged that this isn't a priority. I don't doubt their sincere desire to get this stuff right. But it isn't clear whether they plan to go about this in the same way they're addressing the many scanning errors that users report, correcting them one-by-one as they're notified of them. That isn't adequate here: there are simply too many errors. And while Google's machine classification will certainly improve, extracting metadata mechanically simply isn't sufficiently reliable for scholarly purposes. After some early back-and-forth, Google decided it did want to acquire the library records for scanned books along with the scans themselves, and now it evidently has them, but I understand the company hasn't licensed them for display or use, which presumably explains the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file.

In our panel discussion, Dan Clancy suggested that it should fall on users to take some of the responsibility for fixing these errors, presumably via some kind of independent cataloguing effort. But there are hundreds of thousands of errors to pick up on here, not to mention an even larger number of files with simply poor metadata or virtually no metadata at all. Beyond clearing up the obvious errors, the larger question is whether Google's engineers should be trusted to make all the decisions about metadata design and implementation for what will probably wind up being the universal library for a long time to come, with no contractual obligation, and only limited commercial incentives, to get it right. That's probably one of the questions the Antitrust Division of the Justice Department should be asking as it ponders the Google Books Settlement over the coming month.

Some of the slack here may be picked up by the HathiTrust, a consortium of a number of participating libraries that is planning to make available several million of the books that Google scanned along with their WorldCat records. But at present HathiTrust is only going to offer the out-of-copyright books, which are about 25 percent of the Google collection, since libraries have no right to share the orphan works. And it isn't clear what search functionalities they'll be offering, or to whom — or, in the current university climate, for how long. In any event, none of this should let Google off the hook. Google Books is unquestionably a public good, but as Pam Samuelson pointed out in her remarks at another panel, a great public good also implies a great public trust.
