Handwriting detection

The handwriting detection component takes an input image containing a line of text and returns a matrix of character probabilities. Specifically, the matrix is of size (sequence_length × character vocabulary size), where each row is a probability distribution over the characters at that position in the sequence.

Previous works utilize multidimensional LSTMs to recognize handwriting [3]; however, more recent work suggests that similar performance can be achieved with a 1D-LSTM [4]. Briefly, [4] utilized a CNN for image feature extraction, fed the features into a bidirectional LSTM, and trained the network to optimize the Connectionist Temporal Classification (CTC) loss (shown in Figure 3). Intuitively, the CNN generates image features that are spatially aligned with the input image. The image features are then sliced along the direction of the text and sequentially fed into an LSTM. This network is denoted CNN-biLSTM. The CNN-biLSTM was selected mainly because multidimensional LSTMs are substantially more computationally expensive.
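
As a rough illustration of the slicing step (the shapes here are made up, not those of the actual network), a CNN feature map of height H, width W, and C channels becomes a length-W sequence of feature vectors, one per image column:

```python
# Hypothetical feature-map shape; the real network's dimensions differ.
H, W, C = 4, 32, 8

# A CNN feature map represented as nested lists: feature_map[h][w][c].
feature_map = [[[0.0] * C for _ in range(W)] for _ in range(H)]

# Slice along the direction of the text (the width axis): each column
# becomes one feature vector, flattening height and channels, yielding
# a length-W sequence that is fed step by step into the biLSTM.
sequence = [
    [feature_map[h][w][c] for h in range(H) for c in range(C)]
    for w in range(W)
]

assert len(sequence) == W            # one time step per image column
assert len(sequence[0]) == H * C     # flattened feature vector per step
```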

Figure 3: General framework of the CNN-biLSTM

I extended the implementation described in [4] by providing multiple downsamples of the image features (Figure 4). Multiple downsamples were provided to assist in recognizing images of handwritten text that vary in size (e.g., lines that contain only 1 word vs. lines that contain 7 words). Note that a pre-trained ResNet was used as the image feature extractor.

Figure 4: Our implementation of the downsampled CNN-biLSTM

The output of the CNN-biLSTM is fed into a decoder to predict a probability distribution over the characters for each vertical slice of the image. In the next section, methods to extract the most probable sentence from this matrix using a language model are discussed.

Language models

Three methods to extract a string of words given a matrix of probabilities were explored: 1) greedy search, 2) lexicon searching, and 3) beam searching + lexicon searching with a simple language model.

Greedy search

The greedy search method simply iterates over the probability matrix and obtains the most probable character at each time step.

This method doesn’t use any external language model or dictionary to alter the results.
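
A minimal sketch of greedy CTC decoding (the vocabulary and blank index below are illustrative, not those of the actual model):

```python
# Illustrative vocabulary; index 0 is the CTC blank symbol.
VOCAB = ["<blank>", "a", "b", "c"]
BLANK = 0

def greedy_decode(prob_matrix):
    """prob_matrix: one row of per-character probabilities per time step."""
    # Most probable character index at each time step.
    best = [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]
    # Standard CTC post-processing: collapse repeats, then drop blanks.
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(VOCAB[idx])
        prev = idx
    return "".join(out)

probs = [
    [0.10, 0.80, 0.05, 0.05],  # "a"
    [0.10, 0.70, 0.10, 0.10],  # "a" (repeat, collapsed)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "b"
]
assert greedy_decode(probs) == "ab"
```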

Lexicon searching

The lexicon search model uses the output of the greedy method and attempts to match each word against a dictionary. Specifically, the following procedure was taken: 1) decontraction of the string, 2) tokenization of the string, 3) for each word: 3.1) check if the proposed word is an English word and, if not, suggest possible words, and 3.2) choose the suggestion with the shortest weighted edit distance, and 4) contraction of the text back to its original form. Each component is described and pseudocode is provided below.

1. Decontraction of the string

Contractions are shortened versions of a word or of a spoken form of a word (e.g., did not → didn’t, might have → might’ve; for a more comprehensive list, see here). To account for contractions, pycontractions was used. This library provides both contraction and expansion of strings. For expanding ambiguous cases (e.g., ain’t → am not/are not), the library utilizes a language model (Google News) to select the correct form.
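
A simplified, dictionary-based stand-in for the expansion step (the mapping below is a tiny illustrative subset; the real library covers many more cases and resolves ambiguous ones with a language model):

```python
# Tiny illustrative mapping; pycontractions handles far more contractions
# and uses a language model for ambiguous expansions.
CONTRACTIONS = {"didn't": "did not", "might've": "might have"}

def decontract(text):
    # Replace each token found in the mapping; leave other tokens as-is.
    return " ".join(CONTRACTIONS.get(token, token) for token in text.split())

assert decontract("I might've left") == "I might have left"
```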

2. Tokenization of the string

Tokenization separates a string containing a sentence into distinct words. The tokenize package from the nltk toolkit was used to perform this step. Each resulting word was then processed in turn.

3.1 Suggest words if it’s not a word

For each tokenized word, the pyenchant module is used to check whether the word is an actual English word. If it is, no changes are made; otherwise, the Norvig spell checker is used to suggest possible words.
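
The candidate-generation core of the Norvig spell checker can be sketched as follows (the small dictionary here is illustrative; the full checker also ranks candidates by word frequency and considers two-edit candidates):

```python
import string

def edits1(word):
    """All strings one edit away from `word`:
    deletes, transposes, replaces, and inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, dictionary):
    # Keep only candidates that are actual dictionary words.
    return sorted(edits1(word) & dictionary)

assert suggest("teh", {"the", "tea", "ten"}) == ["tea", "ten", "the"]
```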

3.2 Select the suggested word with the shortest weighted edit distance

Given a list of suggested words, the weighted edit distance between the given word and each suggested word was calculated. Specifically, the edit distance was weighted by the visual similarity between handwritten characters.

Figure 5: Examples of characters that are similar in terms of handwriting

Character pairs like those shown in Figure 5 should have low weights, while visually different characters (e.g., “w” and “o”) should have large weights.

Character similarity was modeled using the output of the CNN-biLSTM on the training examples. The frequency of differences (insertions, deletions, and substitutions) between the actual words and the predicted words was counted: the more frequent a mistake, the more visually similar the characters are, and the smaller the weight should be. The main limitation is that the weighted edit distance only handles substitutions of one character for another; in handwriting recognition, it is common for multi-character sequences to look like a single character (e.g., “rn” and “m”).
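
A sketch of the weighted edit distance. The substitution weights below are made-up placeholders; in the described approach they would be derived from the confusion frequencies (e.g., something like weight = 1 / (1 + count)):

```python
# Illustrative substitution weights: visually similar pairs (as in
# Figure 5) get low cost, all other substitutions cost 1.
SUB_COST = {("i", "l"): 0.2, ("l", "i"): 0.2, ("o", "0"): 0.3, ("0", "o"): 0.3}

def weighted_edit_distance(a, b, ins=1.0, dele=1.0):
    # Standard edit-distance dynamic program; only substitution is weighted.
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else SUB_COST.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + dele,   # deletion
                          d[i][j - 1] + ins,    # insertion
                          d[i - 1][j - 1] + sub)  # (weighted) substitution
    return d[m][n]

assert weighted_edit_distance("cat", "cat") == 0.0
assert abs(weighted_edit_distance("il", "ll") - 0.2) < 1e-9  # cheap i -> l swap
```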

4. Contract the text back to its original form

This step simply detokenizes the words and contracts them (if the string was originally decontracted).

As a whole, the lexicon spell checker works best for predicted strings that are very close to the actual value. However, if the output of the greedy algorithm doesn’t provide close enough string proposals, the lexicon spell checker can actually reduce accuracy instead of improving it. Therefore, a final check was performed to ensure that the input and output sentences are similar.
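
Putting the four steps together, the lexicon search can be sketched with hypothetical helper functions (every helper name below is a stand-in for the libraries discussed above, not their actual APIs):

```python
def lexicon_search(sentence, expand, tokenize, is_word, suggest,
                   weighted_edit_distance, contract):
    """Sketch of the four-step procedure; all helpers are hypothetical
    stand-ins (pycontractions, nltk, pyenchant, Norvig checker)."""
    expanded = expand(sentence)               # 1) decontraction
    tokens = tokenize(expanded)               # 2) tokenization
    corrected = []
    for word in tokens:                       # 3) per-word correction
        if is_word(word):                     # 3.1) dictionary check
            corrected.append(word)
        else:
            candidates = suggest(word)
            corrected.append(min(             # 3.2) closest suggestion
                candidates,
                key=lambda c: weighted_edit_distance(word, c),
            ) if candidates else word)
    return contract(" ".join(corrected))      # 4) re-contraction

result = lexicon_search(
    "hello wrold",
    expand=lambda s: s,
    tokenize=str.split,
    is_word=lambda w: w in {"hello", "world"},
    suggest=lambda w: ["world"],
    weighted_edit_distance=lambda a, b: 1.0,
    contract=lambda s: s,
)
assert result == "hello world"
```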

Beam searching + lexicon search + language model

To alleviate the issue of poor proposals from the greedy algorithm, one can iterate through the probability matrix to obtain multiple proposals. However, ranking all possible proposals is computationally expensive. Graves et al. [5] proposed that the beam search algorithm can be used to generate the K most probable string proposals from the probability matrix.
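
A simplified, path-level beam search over the probability matrix (unlike the full CTC beam search of Graves et al., this sketch does not merge distinct paths that decode to the same string; the vocabulary is illustrative):

```python
from heapq import nlargest

def beam_search(prob_matrix, vocab, blank=0, k=3):
    """Keep the k most probable decoded paths at each time step."""
    beams = [("", 1.0, None)]  # (decoded string, path probability, last index)
    for row in prob_matrix:
        candidates = []
        for text, p, last in beams:
            for idx, p_char in enumerate(row):
                new_text = text
                # CTC-style collapse: only emit on a non-blank change.
                if idx != blank and idx != last:
                    new_text = text + vocab[idx]
                candidates.append((new_text, p * p_char, idx))
        beams = nlargest(k, candidates, key=lambda b: b[1])
    return [text for text, _, _ in beams]

vocab = ["<blank>", "a", "b"]
matrix = [[0.1, 0.6, 0.3],
          [0.2, 0.1, 0.7]]
assert beam_search(matrix, vocab, k=2)[0] == "ab"
```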

In our implementation, the K sentences are then fed into the lexicon spell checker to ensure that each word can be found in a dictionary (otherwise it appears as &lt;unk&gt;). Each of the K proposals is then tokenized and fed into a language model to evaluate its perplexity. The sentence proposal with the lowest perplexity is chosen.
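
The final ranking step can be sketched with a hypothetical unigram language model (the real implementation would use a trained LM, and the probabilities below are made up):

```python
import math

def perplexity(tokens, unigram):
    """Perplexity under a toy unigram model `unigram` (token -> probability);
    unseen tokens get a tiny floor probability."""
    log_p = sum(math.log(unigram.get(t, 1e-8)) for t in tokens)
    return math.exp(-log_p / len(tokens))

def best_proposal(proposals, unigram):
    # Choose the proposal the language model finds least surprising.
    return min(proposals, key=lambda s: perplexity(s.split(), unigram))

lm = {"the": 0.5, "cat": 0.3, "sat": 0.2}  # made-up unigram probabilities
assert best_proposal(["the cat sat", "teh cat sat"], lm) == "the cat sat"
```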

Results

The algorithm was qualitatively and quantitatively evaluated. Examples of the predicted text from four forms are shown in Figure 6.