This morning, the Official Google Blog announced that the search giant has acquired reCAPTCHA. The company provides a service that combines two things that Google would be very interested in: it verifies that information provided to a server has been entered by a human and, in the process, helps identify difficult-to-decipher text from book digitization projects. As such, it's a natural fit for Google.

The reCAPTCHA service rests on two related computer science problems. Book digitization efforts rely on optical character recognition (OCR) software to extract the text from a scanned image of a page. For a variety of reasons (damage to a book, improperly placed pages, unusual fonts, etc.), this process fails at a certain rate, leaving an incomplete digitization.

The problem of identifying distorted or partially obscured characters faced by the OCR software is precisely the same problem faced by botnets that attempt to propagate spam by opening e-mail accounts with online services or posting ads in comments on blogs. To do so, they have to overcome CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart, for the curious) by identifying characters that have been distorted and placed over a complex background. In short, would-be spammers also need to overcome an OCR problem.

reCAPTCHA is an attempt to solve both problems at once. The service takes images of text that were not successfully processed by OCR programs, and repurposes them for use as CAPTCHAs, since they're already known to fool the sort of software that's used by botnets. If actual humans can decipher them, the results are fed back to the book digitization project, filling in the blanks in older texts.
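
To make that feedback loop concrete, here is a minimal Python sketch. It is a toy model, not reCAPTCHA's actual implementation: it assumes each challenge pairs a word whose answer is already known (the control) with a word that OCR failed on, and that a transcription is accepted once a fixed number of verified humans agree on it. The class name, method names, and threshold are all hypothetical.

```python
from collections import Counter, defaultdict


class WordPool:
    """Toy model: OCR-failed word images collect human guesses until enough agree."""

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed      # hypothetical agreement threshold
        self.guesses = defaultdict(Counter)   # image id -> tally of normalized guesses
        self.resolved = {}                    # image id -> accepted transcription

    def submit(self, control_answer, control_truth, unknown_id, unknown_guess):
        """Return True if the user passes the challenge.

        Only the control word decides pass/fail; the guess for the unknown
        word is recorded solely when the control word was typed correctly.
        """
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False  # likely a bot (or a typo): discard the unknown guess
        if unknown_id not in self.resolved:
            guess = unknown_guess.strip().lower()
            self.guesses[unknown_id][guess] += 1
            # Accept the transcription once enough verified humans agree.
            if self.guesses[unknown_id][guess] >= self.votes_needed:
                self.resolved[unknown_id] = guess
        return True


if __name__ == "__main__":
    pool = WordPool(votes_needed=2)
    pool.submit("morning", "morning", "img-1", "overlooks")
    pool.submit("morning", "morning", "img-1", "overlooks")
    print(pool.resolved)
```

In this sketch, a submission that fails the control word contributes nothing, so automated guesses cannot pollute the transcriptions that flow back to the digitization project.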

The service started out as an academic project, and its lead, Luis von Ahn, has published papers that describe its success in defeating botnets and aiding in digitization. He also launched an extension of the service into the audio realm that uses recordings of historic radio broadcasts.

von Ahn told Ars that, although reCAPTCHA started as an academic project, it was later spun out as a company. Now, the whole reCAPTCHA team will be moving into Google, where von Ahn will become a Google employee while remaining on the faculty at Carnegie Mellon. "I will continue advising graduate students, but on projects not related to Google," he told Ars.

When asked whether reCAPTCHA was being used for Google's book digitization efforts, von Ahn responded, "Not until now." Still, it's a natural fit, given that Google will happily consume both services provided by reCAPTCHA: the human verification and the improved digitization. Given that Google may soon be a major purveyor of older books, there's no reason to think that it won't continue to offer the service to third parties, as well.

Listing image from the reCAPTCHA blog.