OCR has been used to scan books and other printed documents for years, but it’s not well suited for the material in the Secret Archives. Traditional OCR breaks words down into a series of letter-images by looking for the spaces between letters. It then compares each letter-image to the bank of letters in its memory. After deciding which letter best matches the image, the software translates the letter into computer code (ASCII) and thereby makes the text searchable.

This process, however, really only works on typeset text. It’s lousy for anything written by hand—like the vast majority of old Vatican documents. Here’s an example from the early 1200s, written in what’s called Caroline minuscule script, which looks like a mix of calligraphy and cursive:

In Codice Ratio

The main problem in this example is the lack of space between letters (so-called dirty segmentation). OCR can’t tell where one letter stops and another starts, and therefore doesn’t know how many letters there are. The result is a computational deadlock, sometimes referred to as Sayre’s paradox: OCR software needs to segment a word into individual letters before it can recognize them, but in handwritten texts with connected letters, the software needs to recognize the letters in order to segment them. It’s a catch-22.
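The segment-then-match pipeline described above, and the way it stalls on connected letters, can be sketched in a few lines of Python. This is only a toy illustration, not any real OCR engine: the `TEMPLATES` bank, the `#`/`.` pixel encoding, and the function names are all assumptions made up for the example.

```python
# Toy "bank of letters": each glyph is a tuple of row strings,
# where '#' is ink and '.' is blank. (Hypothetical data for illustration.)
TEMPLATES = {
    "I": ("#", "#", "#"),
    "L": ("#.", "#.", "##"),
    "T": ("###", ".#.", ".#."),
}

def split_on_blank_columns(rows):
    """Cut a line image into letter-images wherever a column has no ink."""
    width = len(rows[0])
    glyphs, start = [], None
    for x in range(width + 1):
        # Treat the position just past the right edge as a blank column.
        has_ink = x < width and any(row[x] == "#" for row in rows)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append(tuple(row[start:x] for row in rows))
            start = None
    return glyphs

def best_match(glyph):
    """Return the template letter that agrees with the glyph on the most pixels."""
    def score(template):
        if len(template[0]) != len(glyph[0]):
            return -1  # different width: not comparable in this toy version
        return sum(
            a == b
            for t_row, g_row in zip(template, glyph)
            for a, b in zip(t_row, g_row)
        )
    return max(TEMPLATES, key=lambda letter: score(TEMPLATES[letter]))

def recognize(rows):
    """Segment first, then match each letter-image against the bank."""
    return "".join(best_match(g) for g in split_on_blank_columns(rows))

# Typeset-style input: blank columns separate T, I, and L.
spaced = ("###.#.#.", ".#..#.#.", ".#..#.##")
print(recognize(spaced))  # TIL

# "Handwritten" input: I and L touch, so no blank column separates them.
touching = ("##.", "##.", "###")
print(len(split_on_blank_columns(touching)))  # 1
```

With the touching input, the splitter returns a single blob instead of two letters, so the matching step never gets a fair chance. That is the deadlock in miniature.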

Some computer scientists have tried to get around this problem by developing OCR to recognize whole words instead of letters. This works fine technologically; computers don’t “care” whether they’re parsing words or letters. But getting these systems up and running is a bear, because they require gargantuan memory banks. Rather than a few dozen letters of the alphabet, these systems have to recognize images of thousands upon thousands of common words, which means you need a whole platoon of scholars with expertise in medieval Latin to go through old documents and capture images of each word. In fact, you need several images of each, to account for quirks in handwriting, bad lighting, and other variables. It’s a daunting task.

In Codice Ratio sidesteps these problems through a new approach to handwritten OCR. The four main scientists behind the project (Paolo Merialdo, Donatella Firmani, and Elena Nieddu at Roma Tre University, and Marco Maiorino at the VSA) skirt Sayre’s paradox with an innovation called jigsaw segmentation. This process, as the team recently outlined in a paper, breaks words down not into letters but into something closer to individual pen strokes. The OCR does this by dividing each word into a series of vertical and horizontal bands and looking for local minima: the thinner portions, where there’s less ink (or really, fewer dark pixels). The software then carves the word at these joints. The end result is a series of jigsaw pieces:

In Codice Ratio
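The idea of cutting at local minima of ink can be illustrated with a deliberately simplified sketch. It is not the team's actual algorithm: it looks only at vertical bands one pixel wide (a column-by-column ink count) and ignores the horizontal bands, and the function names and the `#`/`.` pixel encoding are assumptions for the example.

```python
def column_ink(rows):
    """Count ink pixels ('#') in each one-pixel-wide vertical band."""
    return [sum(row[x] == "#" for row in rows) for x in range(len(rows[0]))]

def local_minima(profile):
    """Columns with less ink than the column to the left and no more than
    the column to the right. A flat run of thin columns yields one cut,
    at its left edge."""
    return [
        x
        for x in range(1, len(profile) - 1)
        if profile[x] < profile[x - 1] and profile[x] <= profile[x + 1]
    ]

def cut_pieces(rows):
    """Split the word image into pieces at each local minimum."""
    cuts = [0] + local_minima(column_ink(rows)) + [len(rows[0])]
    return [tuple(row[a:b] for row in rows) for a, b in zip(cuts, cuts[1:])]

# Three vertical strokes joined by a thin line along the top,
# a bit like a handwritten "m".
word = ("#######", "#..#..#", "#..#..#")
print(column_ink(word))       # [3, 1, 1, 3, 1, 1, 3]
print(len(cut_pieces(word)))  # 3
```

The thick columns are the pen strokes and the thin ones are the connecting joints, so the cuts fall where one stroke hands off to the next, without the software having to know which letters it is looking at.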

By themselves, the jigsaw pieces aren’t tremendously useful. But the software can chunk them together in various ways to make possible letters. It just needs to know which groups of chunks represent real letters and which are bogus.
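To make the "chunking" step concrete, here is a minimal sketch that enumerates every way of merging consecutive pieces into candidate letters. It is an illustration under stated assumptions, not the project's code: the `max_per_letter` cap and the caller-supplied `is_letter` check (a stand-in for whatever decides that a group is a real letter) are hypothetical.

```python
from itertools import combinations

def groupings(pieces, max_per_letter=3):
    """Yield every way to merge consecutive pieces into candidate letters.

    Each grouping is a list of groups; each group is a run of adjacent
    pieces that might together form one letter.
    """
    n = len(pieces)
    for k in range(n):
        # Choose k break points among the n - 1 gaps between pieces.
        for breaks in combinations(range(1, n), k):
            edges = (0, *breaks, n)
            groups = [pieces[a:b] for a, b in zip(edges, edges[1:])]
            if all(len(g) <= max_per_letter for g in groups):
                yield groups

def plausible_readings(pieces, is_letter, max_per_letter=3):
    """Keep only the groupings in which every group passes the letter check."""
    return [
        g
        for g in groupings(pieces, max_per_letter)
        if all(is_letter(group) for group in g)
    ]

# Three pieces can be grouped in four ways.
for g in groupings(["p1", "p2", "p3"]):
    print(g)
```

With n pieces there are up to 2^(n-1) groupings, so the count grows quickly for long words. That is part of why the check for real versus bogus letters matters: it prunes most of the candidates.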