What a text recognition system actually sees

Some insights into the neural network “black box” of a text recognition system


The performance of modern text recognition systems implemented as neural networks is amazing. They can be trained on medieval documents and then read them with only very few mistakes. Such a task would be very difficult for most of us: look at Fig. 1 and give it a try!

Fig. 1: Hard to read for most people, but easy for a text recognition system trained on this dataset.

How do these systems actually work? Which parts of the image do they look at to identify the text? Do they exploit some clever patterns? Or do they cheat by using short-cuts such as dataset-specific patterns? In the following, we’ll run two experiments to get a better understanding of what’s happening inside such a neural network.

First experiment: pixel relevance

For our first experiment, we ask the following question: given an input image and the correct class (ground-truth text), which pixels in the input image vote for and which vote against the correct text?

We can compute the influence of a single pixel on the result by comparing the score of the correct class in two scenarios:

1. The pixel is included in the image.
2. The pixel is excluded from the image (by marginalizing over all possible gray values of the pixel).

By comparing these two scores, we can tell whether a pixel votes for or against the correct class. Fig. 2 shows the relevance of the pixels in an image with the ground-truth text “are”. Red pixels vote for the text “are”, blue pixels vote against it.
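The marginalization described above can be sketched as follows. Here `score_fn` is a hypothetical stand-in for the recognizer’s score of the ground-truth text, and the set of substituted gray values is an assumption; a real implementation would query the trained network instead.

```python
import numpy as np

def pixel_relevance(image, score_fn, gray_values=range(0, 256, 32)):
    """Relevance of each pixel: the score with the pixel included minus the
    score averaged over substituted gray values (marginalization)."""
    base = score_fn(image)
    relevance = np.zeros(image.shape, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            original = image[y, x]
            marginal = 0.0
            for v in gray_values:
                image[y, x] = v  # temporarily substitute the gray value
                marginal += score_fn(image)
            image[y, x] = original  # restore the pixel
            relevance[y, x] = base - marginal / len(gray_values)
    return relevance

# Toy stand-in for the recognizer: the score is high when the center
# pixel is dark, so only that pixel should get a nonzero relevance.
toy = np.full((3, 3), 255)
toy[1, 1] = 0
score = lambda img: 1.0 - img[1, 1] / 255.0
rel = pixel_relevance(toy, score)
```

Positive entries in `rel` correspond to the red (voting for) pixels in Fig. 2, negative entries to the blue ones.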

Fig. 2: Top: input image. Bottom: pixel relevance and blended input image. Red pixels vote for, blue pixels against the correct text “are”.

We can now look at some critical regions (dark red, dark blue) to get an idea which image features are important for the neural network to get to its decision:

1. The red region above the “a” is white in the input image and is very important for the correct result “are”. As you might guess, if a black dot appeared above the vertical line of the “a”, that line could be interpreted as an “i”.
2. The “r” is connected to the “e”, which confuses the neural network, as indicated by the blue region. If these two characters were disconnected, the score for “are” should increase.
3. The gray pixels inside the “a” (lower left inner part) slightly vote against “are”. If the hole inside the “a” were completely white, the score should increase.
4. In the upper right part of the image there is a region that is important for the correct vote. It is unclear how this region can be interpreted.

Let’s investigate whether our assumptions 1–3 are correct, and what the meaning of 4 is, by changing some pixel values inside these regions. Fig. 3 shows the original and changed images, the score for the correct text, and the recognized text. The first row shows the original image with a score of 0.87 for the text “are”.

1. If we draw a dot over the vertical line of the “a”, the score for “are” decreases by a factor of 10 and we get the text “aive” instead. So the neural network relies heavily on the superscript dot to decide whether a vertical line is an “i” or something else.
2. Removing the connection between “r” and “e” increases the score to 0.96. Even though the neural network is able to implicitly segment characters, disconnected characters seem to simplify the task.
3. The hole inside the “a” is important for detecting the “a”, so replacing the gray pixels with white ones slightly improves the score to 0.88.
4. When we draw some gray pixels into the upper right region of the image, the system recognizes “ane” and the score for “are” decreases to 0.13. Here the system has obviously learned features which don’t have anything to do with text.
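Each of these edits follows the same pattern: set the pixels of a critical region to a new value and re-score the image. A minimal sketch, again with `score_fn` as a hypothetical placeholder for the recognizer’s score and a made-up region and score function in the toy usage:

```python
import numpy as np

def perturb_and_score(image, region, value, score_fn):
    """Set all pixels in a rectangular region (y0, y1, x0, x1) to `value`
    and return the new score together with the edited image."""
    edited = image.copy()
    y0, y1, x0, x1 = region
    edited[y0:y1, x0:x1] = value
    return score_fn(edited), edited

# Toy usage: blank 4x4 image, darken the top-left 2x2 region, and re-score
# with a stand-in score function (the mean gray value of the image).
toy = np.full((4, 4), 255)
score_after, edited = perturb_and_score(toy, (0, 2, 0, 2), 0,
                                        lambda im: float(im.mean()))
```

The original image is left untouched, so the scores before and after the edit can be compared directly, as in Fig. 3.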

Fig. 3: Change some pixels inside critical regions and observe what happens.

To conclude our first experiment: the system has learned some meaningful text features like the superscript dot to identify the character “i”. But it has also learned features which do not make any sense to us. However, these features still help the system to recognize text in the dataset it was trained on: these features let the system take (easy) short-cuts instead of learning real text features.

Second experiment: translation invariance

A translation invariant text recognition system is able to correctly recognize text independent of its position in the image. Fig. 4 shows three different horizontal translations of a text. We would like the neural network to be able to recognize “to” in all three positions.

Fig. 4: Three horizontal translations of a text.

Let’s again take our image from the first experiment containing the text “are”. We will shift it pixel by pixel to the right and look at the score of the correct class and also at the predicted text as shown in Fig. 5.

Fig. 5: Score for text “are” while shifting the text pixel by pixel to the right. The labels on the x-axis show both the number of pixels the image is shifted and the recognized text (using best path decoding).

As can be seen, the system is not translation invariant. The original image has a score of 0.87. Shifting the image one pixel to the right decreases the score to 0.53, and one more pixel decreases it to 0.28. The neural network is able to recognize the correct text up to a translation of four pixels. Beyond that, the system occasionally outputs a wrong result, starting with “aare” at a shift of five pixels.
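The shifting procedure itself is simple: pad with white on the left and crop on the right, then re-score at every shift. A sketch, with `score_fn` again standing in for the recognizer’s score of the ground-truth text:

```python
import numpy as np

def shift_right(image, pixels, fill=255):
    """Shift the image content `pixels` to the right, padding the left
    with white (gray value `fill`) and cropping on the right."""
    if pixels == 0:
        return image.copy()
    shifted = np.full_like(image, fill)
    shifted[:, pixels:] = image[:, :-pixels]
    return shifted

def scores_under_translation(image, score_fn, max_shift=16):
    """Score of the ground-truth text for each horizontal shift 0..max_shift."""
    return [score_fn(shift_right(image, s)) for s in range(max_shift + 1)]

# Toy usage: the stand-in score function just reads the top-left pixel,
# which becomes white padding as soon as the image is shifted.
toy = np.arange(12).reshape(3, 4)
scores = scores_under_translation(toy, lambda im: int(im[0, 0]), max_shift=2)
```

Plotting such a score list against the shift gives a curve like the one in Fig. 5.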

The neural network was trained on the IAM dataset, in which all words are left-aligned. Therefore the system has never learned to handle images with white-space on the left. As obvious as ignoring the white-space might seem to us, it is an ability that has to be learned. And if the system was never forced to handle such situations, why should it learn it at all?

Another interesting property of the score function is its periodicity of four pixels. These four pixels equal the downsampling factor of the convolutional network, which reduces a width of 128 pixels to a sequence length of 32. Further investigations are needed to explain this behavior, but it might be caused by the pooling layers with their discontinuities: when a pixel is shifted one position to the right, it may stay within the same pooling window, or it may step over into the next one, depending on its position.
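The pooling discontinuity is easy to demonstrate in isolation. In this sketch (not the article’s network, just an illustration of the mechanism), a single “ink” pixel is shifted across a 1-D input with non-overlapping max pooling of size 4: the pooled output only changes when the pixel crosses a window boundary, i.e. every four positions, matching the observed period.

```python
import numpy as np

def max_pool_1d(x, size=4):
    """Non-overlapping 1-D max pooling (length must be divisible by size)."""
    return x.reshape(-1, size).max(axis=1)

# Shift a single "ink" pixel across positions 0..7 of a 16-pixel input
# and record the pooled output for each position.
pooled = []
for pos in range(8):
    x = np.zeros(16)
    x[pos] = 1.0
    pooled.append(tuple(max_pool_1d(x)))
```

Positions 0–3 all map to the same pooled output, as do positions 4–7; the output only jumps at the boundary between windows.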

Conclusion

A text recognition system learns whatever is helpful to increase the accuracy in the dataset it is trained on. If some random looking pixels help to identify the correct class, then the system will use them. And if the system only has to handle left-aligned text, then it will not learn any other type of alignment. Sometimes, it learns features which also we humans find useful for reading and which generalize to a wide range of text styles, but sometimes it learns short-cuts which are only useful for one specific dataset.

We have to provide diverse data (e.g. mix multiple datasets or use data augmentation) to ensure that the system really learns text features and not just some cheats.
