This is easy to understand, right?

How about this? A bit harder?

Are you able to decipher this one at all?



courtesy of Faris Algosaibi

The first example can be easily recognized by most character recognition algorithms. However, as your text gets progressively more complex, this seemingly simple task becomes more and more difficult for even the best machine learning algorithms to successfully complete.

Most text found in the wild isn’t as clear as scanned, printed documents like the first example. A picture of something as mundane as a stop sign can be incredibly difficult for most OCR algorithms to understand. Here is an example from Google’s vision API:

Algorithms which have been trained for parsing documents and structured text do not generalize well to natural scene images such as the stop sign above. Most commercially available tools are designed only for document parsing, which means it can be difficult (if not impossible) to find a tool that can work on natural text.

But all is not lost. At Algorithmia, we’ve implemented three hot new algorithms which can assist you in obtaining high quality OCR results on a variety of image styles.

TextDetectionCTPN

One of the tricks we’ve discovered that really improved our OCR performance was that by focusing on just the text in an image, we were able to obtain much higher recognition accuracy – even with models trained on documents!

To accomplish this, we used a pretrained model designed to detect and locate text in both natural scene and document scan images, and hosted it on our platform as the algorithm TextDetectionCTPN. This deep learning model is great at locating text in complex and multi-faceted scenes. It even works when the text is obscured or partially hidden!

The source code used to make CTPN is available, along with the original paper.

Tesseract

Tesseract is a great general purpose OCR tool that, while trained to recognize text in documents, is also capable of working on a large variety of problems. Like many other models, it requires that images be pre-cropped to contain only text – which means that it works extremely well when combined with a text-isolation algorithm such as TextDetectionCTPN!

This algorithm works not only on english text, but in over 100 different languages! As the source project improves, this algorithm will evolve as well, becoming more and more accurate. Cool, eh?

If you’d like to check the tesseract project out for your self, take a look at the source code.

NaturalTextNet

Last but not least, NaturalTextNet has been trained to recognize text in natural images, like signs or billboards. It may not have the general accuracy that tesseract has, but when Tesseract fails because an image is too noisy, this algorithm is an excellent fallback.

For more information on the algorithm itself, take a look at the source code, or the original CRNN paper.

Next steps

Character recognition is a hard problem, and even harder to find publicly available solutions. Algorithmia is here to help. Keep your eyes peeled for our followup post, in which we’ll describe a way to combine all three of these algorithms to create a powerful composition we call SmartTextExtraction.