Carnegie Mellon researchers have launched a new service that will not only protect e-mail addresses on the web from spambots, but also help digitize a backlog of old books, magazines, and newspapers so that they can eventually be computer searchable. The service, called reCAPTCHA, hopes to use the eyeballs of millions of Internet users to identify thousands of words for the Internet Archive.

The service repurposes technology from Completely Automated Public Turing Test to Tell Computers and Humans Apart (CAPTCHA)—a human-filtering method originally developed by Carnegie Mellon for Yahoo to prevent computers from registering bogus e-mail accounts. It's something we're all familiar with: any given site (such as Yahoo, Hotmail, PayPal, Joost, etc.) may present you with a box containing an image of distorted letters or words. You then have to type the letters accurately into a text box in order to proceed.

reCAPTCHA works the same way, except that it presents two words: one that the computer knows (a CAPTCHA), and one that it doesn't (text that has stumped an optical character recognition, or OCR, scanner for whatever reason). By pairing the two, the system can first establish from the known word that the user is human, which in turn gives it greater confidence that what the user entered for the OCR word is correct. If enough users solve the same OCR-generated word the same way, the system can then deduce that those people know what they're talking about and digitize that word.
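The two-word check and the agreement step can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not the reCAPTCHA team's code: the class name, the method names, the case-insensitive comparison, and the vote threshold are all assumptions, since the article says only that "enough" users must agree.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: the article only says "enough users" must agree.
MIN_AGREEING_VOTES = 3


class WordPairChallenge:
    """Toy model of the known-word / unknown-word check (names are made up)."""

    def __init__(self, min_votes=MIN_AGREEING_VOTES):
        self.min_votes = min_votes
        # Maps an unknown-word id to a tally of the answers typed for it.
        self.votes = defaultdict(Counter)

    @staticmethod
    def _normalize(text):
        # Assumption: answers are compared ignoring case and outer whitespace.
        return text.strip().lower()

    def submit(self, known_answer, known_guess, unknown_id, unknown_guess):
        """Return True if the user got the known word right (treated as human)."""
        if self._normalize(known_guess) != self._normalize(known_answer):
            # Failed the control word, so the other guess is thrown away.
            return False
        self.votes[unknown_id][self._normalize(unknown_guess)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed transcription, or None if there is no consensus yet."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        answer, n = counts.most_common(1)[0]
        return answer if n >= self.min_votes else None
```

A guess for the unknown word only counts when the same user also typed the known word correctly, which is what lets the system trust answers from people it cannot otherwise verify.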

The Carnegie Mellon team, headed up by computer science professor Luis von Ahn, hopes to replace as many traditional CAPTCHAs with reCAPTCHAs as possible in order to harness the "work" already being done by the public on the Internet every day. "It is estimated that 60 million or more CAPTCHAs are solved each day, with each test taking about 10 seconds," director of the Internet Archive, Brewster Kahle, said in a statement. "That's more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with reCAPTCHAs."
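The figure in the statement is easy to check. A quick calculation using only the two numbers quoted (the variable names are ours) shows that "more than 150,000" is, if anything, conservative:

```python
# Figures as quoted in the statement.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10

# 3600 seconds in an hour.
hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / 3600
print(round(hours_per_day))  # 166667
```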

But the reCAPTCHA team wants to make sure that the public can get some use out of the digitization efforts too. That's why they're offering the reCAPTCHA Mailhide, a service that hides a user's full e-mail address on any web page until a CAPTCHA has been solved. "Many sites display e-mails like bmaurer [at] foo [dot] com or use hacks with tables, javascript or encodings to get the same effect," wrote Ben Maurer, an undergraduate student on the project, on his personal blog. "Spammers are getting smarter and figuring out these tricks."

To stay ahead of spammers, a user can employ reCAPTCHA Mailhide to present an e-mail address like this: jac...@arstechnica.com. Someone looking for my e-mail address would then have to click on the "..." and enter a CAPTCHA to prove that he or she is human before getting the full address. And of course, the user trying to access my address is actually entering two CAPTCHAs and helping digitize words for the Internet Archive in the process.
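For a rough idea of what the masking involves, here is a small Python sketch. It is not Mailhide's actual code or API: the function names and the choice to show three leading characters are assumptions made for illustration, and the sample address is the one from Maurer's quote.

```python
def mask_email(address, visible=3):
    """Hide most of the part before the @, keeping the domain readable."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("expected an address of the form local@domain")
    return f"{local[:visible]}...@{domain}"


def reveal_email(address, captcha_passed):
    """Return the full address only after the CAPTCHA has been solved."""
    return address if captcha_passed else mask_email(address)


print(mask_email("bmaurer@foo.com"))          # bma...@foo.com
print(reveal_email("bmaurer@foo.com", True))  # bmaurer@foo.com
```

In a real deployment the full address would have to stay on the server until the challenge is passed. Masking it only in the page markup would leave it readable to a spambot.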

"This is an example of why having open collections in the public domain is important," Kahle said. "People are working together to build a good, open system." The reCAPTCHA project is being run by Carnegie Mellon with the help of server hardware donated by Intel and Suse Linux Enterprise Server support subscriptions donated by Novell.