Machines reading the archive: handwritten text recognition software

Any researcher who has used online newspaper archives, repositories of digitised books or even resources like The National Archives’ Cabinet Papers Online will recognise the revolution created by optical character recognition (OCR) technology. It is this technology which enables us to search not just the title or date of a book, newspaper or archival document, but the actual words written inside it. OCR has transformed the way many scholars conduct their research and opened up huge areas of scholarly endeavour which were previously unimaginable. For those of us who work on archival collections this revolution has always come with a caveat – OCR does not work on handwritten documents. It is for this reason that we are so excited by the new platform called Transkribus, developed by the EU-funded READ Project. This offers, for the first time, the potential to use computers to ‘read’ handwritten documents.

The technology behind Transkribus is still very new and The National Archives has been running a pilot project to test out the feasibility of using this type of handwritten text recognition (HTR) software. For this project we decided to focus on our collection of PROB 11 wills. The reasons behind this were largely driven by the technology; these volumes contain clerks’ copies of wills, so the handwriting style is very uniform, and they are legal documents and therefore have structured language patterns. They are also an extraordinary collection of documents containing details of people, places, material goods, social and economic networks, and other factors across time and space. However, as anyone who has used these documents will tell you, they are not the easiest things to read – for this reason they appeared to offer an excellent test for this new technology.

The Transkribus software works by training a model on accurate transcriptions of documents. Researchers upload images of some of their documents and then match up a correct transcription with the text in the images. This allows the model to learn the style of hand and language patterns. This training data is referred to as ‘ground truth’. The trained model can then be used to automatically transcribe similar types of documents in terms of language, handwriting, and so on. As you would expect, the more training data you feed in, the better the results you can achieve from your model.
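Transkribus manages ground truth internally, but the underlying idea – pairing each line of an image with its verified transcription – can be sketched in a few lines of Python. All the file names, field names and sample text below are invented for illustration; they are not Transkribus’s actual data format:

```python
from dataclasses import dataclass

@dataclass
class GroundTruthLine:
    image_path: str     # scanned page image (hypothetical file name)
    region_id: int      # which text region on the page
    line_number: int    # line within that region
    transcription: str  # the verified transcription for that line

# A tiny illustrative training set: each line image is paired with its text.
ground_truth = [
    GroundTruthLine("PROB11_page_001.jpg", 1, 1, "In the name of God Amen"),
    GroundTruthLine("PROB11_page_001.jpg", 1, 2, "I John Smith of London"),
]

def word_count(dataset):
    """Total words of ground truth -- the usual measure of training-set size."""
    return sum(len(line.transcription.split()) for line in dataset)

print(word_count(ground_truth))  # 11
```

Measuring training data in words, as in the figures later in this post, is simply a matter of summing the words across all transcribed lines.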

The first stage of the HTR process is to upload images of your documents onto the platform, and then carry out a task called segmentation. This entails defining the ‘text regions’ and lines of text. Basically this tells the software where to look for text. This process is largely automated, but it is sometimes necessary to check and amend the results. Once this is complete you can either upload your training data, or once you have a model, run the HTR software to produce an automatic transcription.
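To make the segmentation step concrete, here is a simplified sketch of the kind of structure involved: a text region is a box on the page, and each line within it is traced by a baseline. The representation and the checking function below are assumptions for illustration, not Transkribus’s real internals:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TextLine:
    baseline: List[Tuple[int, int]]  # polyline of (x, y) points along the line

@dataclass
class TextRegion:
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) of the region
    lines: List[TextLine] = field(default_factory=list)

def lines_outside_region(region: TextRegion) -> List[TextLine]:
    """Flag auto-detected lines whose baseline strays outside the region --
    the kind of error a human checker would amend by hand."""
    x, y, w, h = region.bbox
    return [line for line in region.lines
            if any(not (x <= px <= x + w and y <= py <= y + h)
                   for px, py in line.baseline)]

region = TextRegion(bbox=(100, 100, 800, 400), lines=[
    TextLine(baseline=[(120, 150), (880, 152)]),  # inside the region: fine
    TextLine(baseline=[(120, 520), (880, 525)]),  # strays below: needs amending
])
print(len(lines_outside_region(region)))  # 1
```

Automated layout analysis produces these regions and baselines; the manual check described above amounts to scanning for and correcting cases like the flagged line.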

We began experimenting with the software a little while ago, and had some good results from a model trained on a relatively small set of training data (roughly 15,000 words). The accuracy of OCR and HTR transcriptions is measured in terms of Word Error Rate (WER) and Character Error Rate (CER). Our first model achieved a WER of 39% and a CER of 21%. Encouraged by these figures, we produced some more training data and developed a new model based upon roughly 37,000 words. Happily this showed a major step forward in accuracy, with a WER of 28% and a CER of 14%.
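Both WER and CER are based on edit distance: the minimum number of insertions, deletions and substitutions needed to turn the automatic transcription into the correct one, divided by the length of the correct text. A minimal sketch in Python (the sample will text is invented for illustration):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / words in the reference."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

ref = "In the name of God Amen"
hyp = "In the namme of God Amn"
print(round(wer(ref, hyp), 2))  # 0.33  (2 of 6 words wrong)
print(round(cer(ref, hyp), 2))  # 0.09  (2 character edits out of 23 characters)
```

This is why CER is always lower than WER for the same output: a single wrong character makes the whole word count as an error, so our model’s 28% WER corresponds to a much smaller 14% CER.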

This represented a good result – but clearly, with over a quarter of all words being incorrect, there was still some way to go. The trouble was that transcribing large quantities of these wills is a difficult and time-consuming operation. Thus we turned to our community of online volunteers to help develop a larger set of training data. Thanks to the amazing work of a number of dedicated individuals we have rapidly accumulated an additional 60,000 words of transcriptions, which are currently being used to train a new and improved model.

We have high hopes of what this new model will be able to achieve, but I think it is fair to say that it is going to be some time before we can rely solely on computers to read all of these tricky handwritten documents for us. In the meantime, this type of technology offers other potential opportunities, most notably in terms of keyword searching, which may have a more profound impact on archival collections in the short term. Put simply, you can use this type of technology to search handwritten documents even when the level of accuracy is not really good enough to produce a transcription. This is because a transcription can only show one possibility for a word on a page, whereas the software itself throws up multiple possibilities for each word. Using clever tools you can search these multiple options with a far greater likelihood of finding the correct word.
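The difference between searching a transcription and searching the recogniser’s full output can be sketched as follows. Here each word position keeps several candidate readings with confidence scores – a much-simplified, invented stand-in for the software’s internal output, not its actual format:

```python
# Each word position holds alternative readings with confidence scores.
page = [
    [("In", 0.95), ("Jn", 0.03)],
    [("the", 0.99)],
    [("name", 0.60), ("namme", 0.25), ("nanne", 0.10)],
    [("of", 0.98)],
    [("God", 0.55), ("Good", 0.30), ("Cod", 0.05)],
]

def transcription(page):
    """A plain transcription keeps only the top candidate per word."""
    return " ".join(max(candidates, key=lambda c: c[1])[0]
                    for candidates in page)

def keyword_hits(page, query, min_conf=0.05):
    """Keyword search checks every candidate, not just the best one."""
    return [i for i, candidates in enumerate(page)
            if any(word.lower() == query.lower() and conf >= min_conf
                   for word, conf in candidates)]

print(transcription(page))        # In the name of God
print(keyword_hits(page, "Cod"))  # [4] -- a hit even though 'God' tops the list
```

A search for a lower-ranked reading still finds the word, which is why keyword search can be useful even when the single best transcription would have missed or mangled it.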

This type of technology has the potential to revolutionise the way researchers engage with archival collections, and we are really excited to be experimenting with this. This work is, however, only possible because of the commitment and dedication of our volunteers who have done much of the legwork in terms of transcriptions. This serves to highlight once again the interconnections between exciting new digital technologies and more traditional archival practices.

We are continuing this work using HTR and will report back soon on the progress with our new model.