Over the course of history, humanity has suffered some horrifying damage to our collective cultural legacy in the form of books and other text lost to accident or neglect. The digitization of text holds out the promise of permanently preserving the written word in an archive that can be distributed widely and kept safe from accidental damage. This presents archivists with a challenge: the works that are most in need of preservation are likely to already be damaged or distorted, making automated scanning and text processing less likely to succeed. Researchers are now reporting on a successful way to identify the words that computers can't handle: turn them into CAPTCHAs, and get people to do the work.

For those who haven't heard the term, CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. In practical terms, a CAPTCHA takes the form of a string of characters subjected to distortions that make it difficult for computerized character recognition to identify them. Humans, who have a visual recognition capacity that vastly outperforms even the best computers, generally do pretty well in identifying these distorted characters. That has made the CAPTCHA a useful tool (although the bad guys are catching up) for keeping spam bots from harvesting e-mail addresses or posting spam-filled messages to public forums.

Researchers at Carnegie Mellon noticed a while back that there are parallels between CAPTCHAs and the problem words in scanned works: in both cases, the letters are distorted to the point where computers aren't capable of recognizing the underlying word. So, they created a system, reCAPTCHA, in which words that weren't recognized by character recognition software were distorted slightly and converted into CAPTCHAs. We covered the announcement of the system over a year ago. Today's issue of Science contains a paper describing the results, which, by all measures, appear to be a resounding success.

According to the authors, humans handle over 100 million CAPTCHAs every day. "This mental effort is precious," they write, "since deciphering CAPTCHAs requires people to perform a task that computers cannot." Their automated system attempts to harvest this precious effort. Scanned text is subjected to analysis by two optical character recognition programs; in cases where the programs disagree, the questionable word is converted into a CAPTCHA. It, along with a control word of known identity (used to catch cases where a bot is trying to crack the CAPTCHA), is then distributed to participating websites. Currently, over 40,000 sites are using reCAPTCHA.
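The workflow just described can be sketched in a few lines. This is only an illustration of the idea, not the real reCAPTCHA code; every function and variable name below is an assumption made for the example.

```python
# Illustrative sketch of the pipeline described above: two OCR
# engines read each word, disagreements are queued as CAPTCHAs,
# and each suspect word is served alongside a control word whose
# answer is already known. All names here are hypothetical.

def select_suspect_words(ocr_a_words, ocr_b_words):
    """Return (position, reading_a, reading_b) for every word on
    which the two OCR programs disagree."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(ocr_a_words, ocr_b_words))
            if a != b]

def check_submission(control_answer, known_control, suspect_answer):
    """Accept the answer for the unknown word only if the solver
    also got the control word right -- the guard against bots."""
    if control_answer.strip().lower() != known_control.lower():
        return None  # failed the control word: discard the answer
    return suspect_answer

a = ["the", "quick", "brovvn", "fox"]   # OCR engine A's output
b = ["the", "quick", "brown", "fox"]    # OCR engine B's output
print(select_suspect_words(a, b))       # [(2, 'brovvn', 'brown')]
```

In practice, only the disagreements reach human solvers, which is what makes the approach cheap: the vast majority of words are settled by the OCR engines alone.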

The identification performed by each computer program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification hits 2.5 points, the word is considered called. Those words that are consistently given a single identity by human judges are recycled as control words.
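The scoring scheme above amounts to a simple weighted vote, which can be sketched as follows. This is a reconstruction from the description in the paper, not the actual implementation; the function names are made up for the example.

```python
# Hypothetical sketch of the vote-tallying scheme: each OCR
# program's reading is worth 0.5 points, each human answer a full
# point, and a reading is accepted once it reaches 2.5 points.
from collections import Counter

ACCEPT_THRESHOLD = 2.5

def tally(ocr_readings, human_readings):
    """Return the accepted spelling, or None if no reading has
    accumulated 2.5 points yet."""
    scores = Counter()
    for reading in ocr_readings:    # each OCR guess: 0.5 points
        scores[reading] += 0.5
    for reading in human_readings:  # each human answer: 1 point
        scores[reading] += 1.0
    if not scores:
        return None
    best, points = max(scores.items(), key=lambda kv: kv[1])
    return best if points >= ACCEPT_THRESHOLD else None

# One OCR reading (0.5) plus one agreeing human (1.0) is not enough:
print(tally(["deign"], ["deign"]))            # None
# A second agreeing human pushes it to 2.5 and the word is called:
print(tally(["deign"], ["deign", "deign"]))   # deign
```

Under this weighting, a word one OCR program got right needs only two agreeing humans to be called, while a word both programs misread needs more human votes to settle.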


The researchers tested the system using a random sampling of 50 New York Times articles from different eras, in which the identity of every word was confirmed by two independent transcription experts. Each OCR program on its own managed about 84 percent accuracy, but when their output was combined with the reCAPTCHA system, overall accuracy shot up to 99.1 percent. That's actually within the bounds of professional transcription services that use two independent experts to generate copies that are then examined by a third party. The few remaining problems typically came when the OCR software missed word breaks.

The authors also tested software designed to crack CAPTCHAs against images created using reCAPTCHA, and found that it failed completely. The authors ascribe this to the fact that the letters in scanned images contain distortions that are not the result of a clean mathematical transformation. User response times were also measured; there was no significant difference between the time it took users to handle traditional systems and the time required to use reCAPTCHA.

There are still a few limits to the system: short words aren't recognized as accurately; results from countries where English is a second language and non-English keyboards are common tend to be spotty; and users are very casual about capitalization, punctuation, and spelling. Still, any improvements that address these limits will be starting from 99.1 percent accuracy, so the gains are likely to be marginal.

The best news for those running sites that use the reCAPTCHA system is that their users seem to like it, since it makes the process something more than a mindless security measure. It's great to see what the authors term "wasted human processing power" put to use in a way that makes the processors feel good about contributing.

Science, 2008. DOI: 10.1126/science.1160379