Text normalization is an important process in conversational AI. If an Alexa customer says, “book me a table at 5:00 p.m.”, the automatic speech recognizer will transcribe the time as “five p m”. Before a skill can handle this request, “five p m” will need to be converted to “5:00PM”. Once Alexa has processed the request, it needs to synthesize the response — say, “Is 6:30 p.m. okay?” Here, 6:30PM will be converted to “six thirty p m” for the text-to-speech synthesizer. We call the process of converting “5:00PM” to “five p m” text normalization and its counterpart — converting “five p m” to “5:00PM” — inverse text normalization.

ASR = automatic speech recognition; NLU = natural-language understanding; DM = dialogue management; NLG = natural-language generation; TTS = text-to-speech synthesis

In the example above, time expressions live two lives inside Alexa, to meet an individual skill’s needs and to optimize the system’s performance, even though end users are unaware of such internal format switches. Many other types of expressions receive similar treatment, such as dates, e-mail addresses, numbers, and abbreviations.

To do text normalization and inverse text normalization in English, Alexa currently relies on thousands of handwritten rules. As the range of possible interactions with Alexa increases, authoring rules becomes an intrinsically error-prone process. Moreover, as Alexa continues to move into new languages, we would rather not rewrite all those rules from scratch.
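To give a sense of what a single handwritten rule involves, here is a minimal sketch of one rule that verbalizes 12-hour clock times. The pattern, the lookup tables, and the function names are all hypothetical and are not taken from Alexa's rule set; the sketch assumes valid times with hours 1 to 12 and minutes 00 to 59.

```python
import re

# Hypothetical lookup tables, written for this sketch only.
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

# One rule: times written like "5:00PM" or "7:05 AM".
TIME_RULE = re.compile(r"\b(\d{1,2}):(\d{2})\s?([AP])M\b")


def number_to_words(n):
    """Spell out an integer from 1 to 59."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return (TENS[tens] + " " + ONES[ones]).strip()


def verbalize_time(match):
    """Turn one regex match into its spoken form."""
    hour, minute = int(match.group(1)), int(match.group(2))
    words = [number_to_words(hour)]
    if 0 < minute < 10:
        words.append("oh " + number_to_words(minute))
    elif minute >= 10:
        words.append(number_to_words(minute))
    words.append(match.group(3).lower() + " m")
    return " ".join(words)


def normalize(text):
    """Apply the single time rule to a sentence."""
    return TIME_RULE.sub(verbalize_time, text)


print(normalize("Is 6:30PM okay?"))  # Is six thirty p m okay?
```

Even this one rule silently ignores inputs such as "6.30 pm" or "18:30", which hints at why the rule count grows quickly and why each new variant is an opportunity for error.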

Consequently, at this year’s meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), my colleagues and I will report a set of experiments in using recurrent neural networks to build a text normalization system.

By breaking words in our network’s input and output streams into smaller strings of characters (called subword units), we demonstrate a 75% reduction in error rate relative to the best-performing neural system previously reported. We also show a 63% reduction in latency, or the time it takes to receive a response to a single request.

By factoring in additional information, such as words’ parts of speech and their capitalizations, we demonstrate a further error rate reduction of 81%.

What makes text normalization nontrivial is the ambiguity of its inputs: depending on context, for instance, “Dr.” could mean “doctor” or “Drive”, and “2/3” could mean “two-thirds” or “February third”. A text normalization system needs to consider context when determining how to handle a given word.

To that end, the best previous neural model adopted a window-based approach to textual analysis. With every input sentence it receives, the model slides a “window” of fixed length — say, five words — along the sentence. Within each window, the model decides only what to do with the central word; the words on either side are there for context.
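The windowing step can be sketched as follows. This is an illustration only: the padding token and the function name are assumptions, and the window size is assumed to be odd so that there is a single central word.

```python
def sliding_windows(tokens, size=5, pad="<pad>"):
    """Return one (window, central_word) pair per input token.

    The sentence is padded on both sides (an assumption of this sketch)
    so that the first and last words can also sit at a window's center.
    """
    half = size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [(padded[i:i + size], padded[i + half]) for i in range(len(tokens))]


tokens = "meet me at 2/3 Main Dr.".split()
for window, center in sliding_windows(tokens):
    print(center, "<-", window)
```

Note that a sentence of n words yields n windows of five words each, so every word is read several times as context for its neighbors.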

But this is time-consuming, because each word is read again as context in every window that overlaps it. In principle, it would be more efficient to process the words of a sentence individually, rather than in five-word chunks. In the absence of windows, the model could gauge context using an attention mechanism. For each input word, the attention mechanism would determine which previously seen words should influence its interpretation.
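As a rough illustration of the idea, the sketch below computes attention weights with a dot product followed by a softmax. The scoring function and the toy vectors are assumptions made for this example; the post does not specify which form of attention the model uses.

```python
import math


def attention_weights(query, keys):
    """Softmax over dot-product scores between one query and each key."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy vectors standing in for three previously seen words.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
print(attention_weights(query, keys))
```

Words whose vectors align with the query receive larger weights, so they contribute more to the interpretation of the current word.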

The activation pattern of an attention mechanism, during the normalization of the input “archived from the original on 2011/11/11”

In our experiments, however, a sentence-based text normalization system, with attention mechanism, performed poorly compared to a window-based model, making about 2.5 times as many errors. Our solution: break inputs into their subword components before passing them to the neural net and, similarly, train the model to output subword units. A separate algorithm then stitches the network’s outputs into complete words.
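The stitching step might look like the sketch below. It assumes that a subword unit ending in "@@" continues into the next unit, a marker convention used by some subword toolkits; the post does not say which convention the system actually uses, and the function name is hypothetical.

```python
def stitch(subwords, marker="@@"):
    """Join subword units into words.

    Assumption of this sketch: a unit ending in `marker` is glued to the
    unit that follows it; any other unit ends the current word.
    """
    words, current = [], ""
    for unit in subwords:
        if unit.endswith(marker):
            current += unit[:-len(marker)]
        else:
            words.append(current + unit)
            current = ""
    if current:  # a trailing continuation with nothing after it
        words.append(current)
    return " ".join(words)


print(stitch(["pine@@", "apple@@", "s", "are", "sweet"]))  # pineapples are sweet
```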

The big advantage of subword units is that they reduce the number of inputs that a neural network must learn to handle. A network that operates at the word level would, for instance, treat the following words as distinct inputs: crab, crabs, pine, pines, apple, apples, crabapple, crabapples, pineapple, and pineapples. A network that uses subwords might treat them as different sequences of four inputs: crab, pine, apple, and the letter s.
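The example can be checked with a small sketch that splits each word by greedy longest match against a four-unit inventory. Greedy matching is an assumption chosen for brevity here, not necessarily how the system segments its input.

```python
def segment(word, inventory):
    """Split a word into the longest inventory units, scanning left to right."""
    units, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in inventory:
                units.append(word[start:end])
                start = end
                break
        else:
            # No unit matches: fall back to a single character.
            units.append(word[start])
            start += 1
    return units


inventory = {"crab", "pine", "apple", "s"}
words = ["crab", "crabs", "pine", "pines", "apple", "apples",
         "crabapple", "crabapples", "pineapple", "pineapples"]
for word in words:
    print(word, "->", segment(word, inventory))
```

All ten words are covered by sequences drawn from the same four units, so the network has four input symbols to learn instead of ten.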

Using subword units also helps the model decide what to do with input words it hasn’t seen before. Even if a word isn’t familiar, it may have subword components that are, and that could be enough to help the model decide on a course of action.

To produce our inventory of subword units, we first break all the words in our training set into individual characters. An algorithm then combs through the data, identifying the most commonly occurring two-character units, three-character units, and so on, adding them to our inventory until it reaches capacity.
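One way to realize a procedure like this is a byte-pair-encoding-style merge loop, sketched below. Treat it as an assumption: the post does not name the algorithm, and the function name and tie-breaking behavior are choices made for this example.

```python
from collections import Counter


def build_inventory(words, capacity):
    """Grow a subword inventory by repeatedly merging the most frequent pair."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    inventory = {ch for symbols in corpus for ch in symbols}

    while len(inventory) < capacity:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single unit
        (a, b), _ = pairs.most_common(1)[0]
        inventory.add(a + b)

        # Rewrite the corpus with the chosen pair merged into one symbol.
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return inventory


print(sorted(build_inventory(["aab"] * 3 + ["aac"] * 2, capacity=4)))
```

Because longer units are built from shorter ones that are already frequent, common two-character strings enter the inventory first, followed by longer strings, until the capacity is reached.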

We tested six different inventory sizes, starting with 500 subword units and doubling the size until we reached 16,000. We found that an inventory of 2,000 subwords worked best.

We trained our model using 500,000 examples from a public data set, and we compared its performance to that of a window-based model and a sentence-based model that does not use subword units.

The baseline sentence-based model had a word error rate (WER) of 9.3%, meaning that 9.3% of its word-level output decisions were wrong. With a WER of 3.8%, the window-based model offered a significant improvement. But the model with subword units reduced the error rate still further, to 0.9%. It was also the fastest of the three models.
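The relationships among these figures can be checked with a few lines of arithmetic. The sketch uses only the rounded error rates quoted above, so its results approximate percentages that were computed from unrounded values; the function name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate that the improved model removes."""
    return (baseline - improved) / baseline


sentence_wer, window_wer, subword_wer = 9.3, 3.8, 0.9  # percent, as reported

# How many times more errors the sentence-based model makes than the window-based one.
print(round(sentence_wer / window_wer, 2))

# Relative error reduction of the subword model over the window-based model.
print(round(100 * relative_reduction(window_wer, subword_wer), 1))
```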

Once we had benchmarked our system against the two baselines, we re-trained it to use not only subword units but additional linguistic data that could be algorithmically extracted from the input, such as parts of speech, position within the sentence, and capitalization.

That data can help the system resolve ambiguities. For instance, if the word “resume” is tagged as a verb, it should simply be copied verbatim to the output stream. If, however, it’s tagged as a noun, it’s probably supposed to be the word “résumé,” and accents should be added. Similarly, the character strings “us” and “id” are more likely to be one-syllable nouns if lowercase, two-syllable abbreviations if capitalized.
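A minimal sketch of how such features could be attached to each token follows. The feature names and casing categories are hypothetical, and the part-of-speech tags are assumed to come from an external tagger that is not shown here.

```python
def casing(token):
    """Classify a token's capitalization pattern."""
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token.istitle():
        return "title"
    return "other"  # mixed case, or no letters at all (for example "2/3")


def token_features(tokens, pos_tags):
    """Pair each token with its tag, sentence position, and casing."""
    return [
        {"token": tok, "pos": tag, "position": i, "casing": casing(tok)}
        for i, (tok, tag) in enumerate(zip(tokens, pos_tags))
    ]


# Tags supplied by hand here; in practice a tagger would produce them.
for features in token_features(["show", "my", "ID"], ["VERB", "PRON", "NOUN"]):
    print(features)
```

With features like these, "ID" and "id" reach the model as different inputs even though they share the same letters.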

With the addition of the linguistic data, the model’s WER dropped to just 0.2%.

Acknowledgments: Courtney Mansfield, Ankur Gandhe, Björn Hoffmeister, Ryan Thomas, Denis Filimonov, D. K. Joo, Siyu Wang, Gavrielle Lent