In the previous blog post, I described a deep convolutional neural network (CNN) method to identify the location of the handwritten passage (page segmentation). In the current blog post, the passage of handwritten text is segmented line by line (line segmentation) so that each line can be used for handwriting recognition.

Methods

The input to the model is an image that only contains handwritten text. The outputs are bounding boxes that correspond to each line of the text (see Figure 1). The problem is similar to the object detection problem in computer vision. In this method, we utilized the single shot multibox detector (SSD) architecture to detect the position of each line of the passage.

The SSD architecture essentially takes image features and repeatedly downsamples the features (to account for different scaling factors). At each downsample step, the features are fed into two CNNs: one to estimate the locations of bounding boxes relative to anchor points, and one to estimate the probability of the bounding box encompassing an object. See blog post [1], tutorial [2], or the original paper [3] for more details of the SSD.
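To make the first of these two heads concrete, here is a minimal NumPy sketch of how a typical SSD head decodes per-anchor offset predictions into bounding boxes. The center/size offset parameterization below is the common SSD convention and is an assumption for illustration, not code from the tutorial [2]:

```python
import numpy as np

def decode_boxes(anchors, offsets):
    """Convert predicted offsets (dx, dy, dw, dh) relative to anchors
    (cx, cy, w, h) into corner-format boxes (xmin, ymin, xmax, ymax)."""
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]  # shift center x
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]  # shift center y
    w = anchors[:, 2] * np.exp(offsets[:, 2])           # scale width
    h = anchors[:, 3] * np.exp(offsets[:, 3])           # scale height
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

anchors = np.array([[0.5, 0.5, 0.4, 0.1]])  # one wide anchor (cx, cy, w, h)
offsets = np.zeros((1, 4))                  # zero offsets recover the anchor
boxes = decode_boxes(anchors, offsets)
print(boxes)  # [[0.3 0.45 0.7 0.55]] -- the anchor itself, in corner format
```

Because every anchor at every downsampled scale produces one such prediction, the second head's scores are needed later to filter this large set of candidate boxes.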

SSD was implemented using MXNet based on tutorial [2] and was adapted to this application by altering the following:

Network architecture

Anchor points

Data augmentation

Non-maximum suppression

Network architecture

Figure 2: Network architecture.

The main difference between the current network architecture and the one described in the tutorial [2] is the use of ResNet-34 to extract image features. The first convolutional layer of a pre-trained ResNet-34 (which expects RGB input) was replaced with a 1-channel convolutional layer (grayscale) whose weights were obtained by averaging the weights of the RGB channels.
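The channel-averaging step can be sketched as follows. The shapes assume ResNet-34's 7×7 first convolution with 64 output filters; this is a minimal NumPy illustration of the weight transformation, not the actual MXNet code:

```python
import numpy as np

# Pre-trained first-layer weights: (out_channels, in_channels, kH, kW)
rgb_weights = np.random.randn(64, 3, 7, 7)

# Average over the three RGB input channels to get a single grayscale
# channel, so each filter responds to intensity rather than color.
gray_weights = rgb_weights.mean(axis=1, keepdims=True)

print(gray_weights.shape)  # (64, 1, 7, 7)
```

A grayscale image convolved with the averaged filter produces the same response a neutral-gray RGB image would have produced with the original filter, so the pre-trained features remain usable.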

Anchor points

The bounding boxes encompassing lines of handwritten text are mostly restricted to horizontal rectangles. On the other hand, the bounding boxes required for general object detection vary dramatically in size (Figure 3-a). Therefore, rectangles with aspect ratios > 1 were chosen for the current application, as shown in Figure 3-b. We also utilized two additional anchor points compared to the blog post.

Figure 3: a) Anchor boxes used in the tutorial [2]. b) Anchor boxes used in the current application.
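Anchor box shapes can be derived from a base size and a set of aspect ratios. The width/height convention below follows the one commonly used by SSD anchor generators (e.g., MXNet's MultiBoxPrior); the specific sizes and ratios are illustrative assumptions, not the exact configuration of this work:

```python
import numpy as np

def anchor_shapes(size, ratios):
    """Return (width, height) pairs for a base size and aspect ratios.
    A ratio > 1 yields a wide, flat box suitable for a line of text."""
    ratios = np.asarray(ratios, dtype=float)
    widths = size * np.sqrt(ratios)
    heights = size / np.sqrt(ratios)
    return np.stack([widths, heights], axis=1)

# Wide aspect ratios (> 1) match horizontal lines of handwriting.
shapes = anchor_shapes(0.2, [2, 4, 8])
print(shapes)
```

Note that width × height equals size² for every ratio, so all anchors at one scale cover the same area while varying in elongation.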

Data augmentation

The paper [3] emphasized the importance of data augmentation when training the SSD model (the tutorial [2] did not include data augmentation). The authors used random translations, cropping, and flipping. For the current application, however, random cropping and flipping are not appropriate: cropping compromises the continuity of the text, and flipping reverses the writing direction. In this work, we similarly used random translations, and we also introduced a method that randomly removes lines (which may play a role similar to random cropping). Specifically, each line was removed with probability p (p = 0.15 in this work): the area under its bounding box was filled with the background color of the document, and the bounding box was removed from the labels (see Figure 4).

Figure 4: Example of data augmentation by randomly replacing lines with the background color of the document. Dotted lines are the predicted bounding boxes and solid lines are the labelled bounding boxes.
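A minimal NumPy sketch of this line-removal augmentation is shown below. Here the background color is estimated as the median pixel value, which is an assumption for illustration; the original implementation may estimate it differently:

```python
import numpy as np

def remove_random_lines(image, boxes, p=0.15, rng=None):
    """Randomly erase text lines: fill each removed box with the
    background color and drop it from the label list."""
    if rng is None:
        rng = np.random.default_rng()
    background = np.median(image)  # assume the background color dominates
    kept = []
    for (x0, y0, x1, y1) in boxes:
        if rng.random() < p:                     # remove this line
            image[y0:y1, x0:x1] = background
        else:                                    # keep line and its label
            kept.append((x0, y0, x1, y1))
    return image, kept

img = np.zeros((100, 100))        # toy page: 0 = background
img[10:20, 5:95] = 1.0            # one "line" of ink
out, labels = remove_random_lines(img, [(5, 10, 95, 20)], p=1.0)
print(len(labels))  # 0 -- with p=1.0 the line is always removed
```

Passing an explicit `rng` makes the augmentation reproducible across training runs.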

Non-maximum suppression

The network predicts numerous overlapping and redundant bounding boxes (as shown in Figure 5-a). To obtain more meaningful results, the box_nms (box non-maximum suppression) function [4] was applied to the output of the network. Three parameters were varied and tuned: the overlap threshold (overlap_thresh), the number of top-scoring boxes to keep (topk), and the minimum score threshold (valid_thresh). The values overlap_thresh=0.1, topk=150, and valid_thresh=0.01 were selected; the results are shown in Figure 5-b (note that the results shown in Figure 4 were also passed through the non-maximum suppression algorithm).

Figure 5: Demonstration of the non-maximum suppression algorithm. a) without non-maximum suppression, b) with non-maximum suppression (overlap_thresh=0.1, topk=150, and valid_thresh=0.01).
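For readers unfamiliar with the operation, the core suppression logic can be sketched as a greedy loop in NumPy. MXNet's box_nms additionally handles class ids, batching, and output formatting; this simplified version shows only the three tuned parameters:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, corner format."""
    x0 = np.maximum(box[0], boxes[:, 0])
    y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2])
    y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, overlap_thresh=0.1, valid_thresh=0.01, topk=150):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it by more than overlap_thresh, and repeat."""
    order = np.argsort(scores)[::-1][:topk]      # best-scoring first
    order = order[scores[order] > valid_thresh]  # drop weak detections
    keep = []
    while order.size:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= overlap_thresh]
    return keep

boxes = np.array([[0, 0, 10, 2], [0, 0.5, 10, 2.5], [0, 5, 10, 7]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 and is suppressed
```

The small overlap_thresh of 0.1 is well suited to this task: text lines rarely overlap, so almost any overlap between predicted boxes indicates redundancy.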

Results

The final results are shown in Figure 6. As shown in Figure 6-a, the predicted bounding boxes overlap substantially with the labelled bounding boxes (train IoU = 0.593, test IoU = 0.573). We also observed that the network learns bounding boxes much more easily for handwriting with distinct lines than for large handwriting whose lines overlap. Several interesting examples are also presented in Figure 6-b and c.