Part One: Visual Speech Recognition (Lip Reading)

Previous work from the team detailed some of the many advancements within the field of Computer Vision. In practice, research isn't siloed into isolated fields and, with this in mind, we present a short exploration of an intersection between Computer Vision (CV) and Natural Language Processing (NLP): Visual Speech Recognition, more commonly known as lip reading.

Similar to the advancements seen in Computer Vision, NLP has seen a comparable influx and adoption of deep learning techniques, especially with the development of techniques such as Word Embeddings[6] and Recurrent Neural Networks (RNNs)[7]. Moreover, the drive to tackle complex, cross-domain problems using a combination of inputs has produced much to be excited about. One source of excitement for us comes from seeing the skill of lip reading move from human-dominance to machine-dominance in the accuracy rankings; another comes from the method by which this was accomplished.

It was not so long ago that lip reading was heralded to be a difficult problem, much like the difficulty ascribed to the game of Go; albeit not quite as well-known. In addition to solving this problem, advancements in lip reading may potentially enable several new applications. For instance, dictating messages in a noisy environment, dealing with multiple simultaneous speakers better, and improving the performance of speech recognition systems in general. Conversely, extracting conversations from video alone may be an area of concern in the future.

Our focus on this niche application, one hopes, is both illustrative and informative. A relatively small body of deep learning work on lip reading was enough to upset the traditional primacy of the expertly-trained lip reader. Meanwhile, the combinatorial nature of AI research and the technologies at the centre of these advancements blend the demarcations between fields in a scintillating way. Where, if ever, such advancements plateau is the question on everyone’s lips.

Framing the problem

The task of predicting innovations and advancements in technology is notoriously difficult, and best reserved for small wagers between colleagues and friends. Where estimates are made, one usually compares a machine's performance on tasks that humans are already good at, e.g. walking, writing, playing sports, etc. It surprised us to learn two things with regard to lip reading: firstly, that machines managed to surpass expert humans only recently, and secondly, that expert humans weren't that accurate to begin with.

Irrespective of the bar set by the expert, we think it best to delve into what makes this a tough challenge to master. Visemes, analogous to the lip-movements that comprise a lip reading alphabet, pose a clear challenge to those who’ve ever attempted to apply them. Namely, that multiple sounds share the same shape. There exists a level of ambiguity between consonants, which cannot be dispensed with — a problem well documented by Fisher in his extensive study on visemes [8].

Figure 1: Viseme Examples

Note: This picture shows that several sounds, such as ‘p’ and ‘b’, have the same viseme, which means that the corresponding phonemes are very difficult to determine for a person whose only information source is visual.

For example, saying the words “elephant juice” appears identical on the lips to “I love you”.

Source: Originally created by Dr Mary Allen and available from her site [9].

Since there are only so many shapes that one's mouth can make in articulation, mapping those shapes accurately to the underlying words is challenging [10], especially since spoken communication carries much of its information in sound rather than in what is visible. Hence, achieving high accuracy without the context of the speech [11] is extremely difficult for people and machines alike.

Early results

With these limitations it’s not surprising that early studies focused on simplified versions of the problem. Initially, feature engineering produced improvements using facial recognition models which placed bounding boxes around the mouth, and extracted a model of the lips independent from the orientation of the face. Some common features used were the width-height ratio of a bounding box for detecting mouths, the appearance of the tongue (pixel intensity in the red channel of the image) and an approximation of the amount of teeth from the ‘whiteness’ in the image [12].
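As a sketch of what such hand-engineered features might look like, the snippet below assumes a cropped mouth region as an RGB array; the 'whiteness' threshold is an illustrative assumption, not a value taken from [12]:

```python
import numpy as np

def mouth_features(roi):
    """Hand-engineered mouth features of the kind described above.
    `roi` is an H x W x 3 uint8 RGB crop around the mouth."""
    h, w, _ = roi.shape
    aspect_ratio = w / h                  # width-height ratio of the bounding box
    red = roi[:, :, 0].astype(float)
    tongue_score = red.mean() / 255.0     # red-channel intensity as a tongue proxy
    # 'Whiteness' proxy for visible teeth: pixels bright in every channel.
    # The threshold of 180 is an arbitrary illustrative choice.
    teeth_score = (roi.min(axis=2) > 180).mean()
    return aspect_ratio, tongue_score, teeth_score
```

A classifier would then be trained on sequences of such per-frame feature vectors, one vector per video frame.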

Figure 2: Extracting Lips as a Feature

Note: The correlation between the mouth appearance and its ratio extracted independently from facial orientation.

Source: Hassanat (2011) [13]

These approaches obtained impressive results (over 70% word accuracy) in tests where classifiers were trained and tested on the same speaker. But performance degraded sharply when trying to lip read individuals not included in the training set. Lip detection was also harder in males with moustaches, and performance on such cases was correspondingly poor. Hence, the feature engineering approaches, while an improvement, ultimately failed to generalise well.

Following this, using different viseme classification methods with defined language models improved state-of-the-art (SOTA) performance.[14] Language models help filter out results that are obviously incorrect and improve predictions by selecting from only plausible options, e.g. 'n' for the 4th character in "soon" rather than "soow" or "soog". Greater improvements still were made by "fine-tuning" the viseme classifier for phoneme classification, which enabled classifiers to deal with multiple possible solutions for words containing the same visemes in similar intervals. This improved accuracy relative to previous approaches.
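As a toy illustration of how a language model can prune implausible completions, the sketch below uses a word list in place of a real language model; the vocabulary and candidate sets are invented for the example:

```python
from itertools import product

# Toy "language model": a vocabulary of known words.
VOCAB = {"soon", "seen", "sing"}

def plausible_words(candidates):
    """candidates: one set of viseme-derived characters per position.
    Returns only those completions that form known words."""
    return {"".join(chars) for chars in product(*candidates)} & VOCAB

# The lip shapes for 'n', 'w' and 'g' are easily confused; the language
# model resolves the ambiguity in the final position.
print(plausible_words([{"s"}, {"o"}, {"o"}, {"n", "w", "g"}]))  # {'soon'}
```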

These early techniques brought performance to roughly 19% accuracy on an unseen test set, an improvement over the prior best of 17% (+/- 12%) accuracy achieved by a sample of hearing-impaired lip readers, a group which on average outperforms the general population.[15]

McGurk and MacDonald argue in their 1976 paper[16] that speech is best understood as bimodal, that is, taking both visual and audio inputs, and that comprehension in individuals may be compromised if either of these two channels is absent. Intuitively, many of us can recall mishearing speech while on the phone, or the difficulty of pairing sound and lips in a noisy environment. The requirement of bimodal inputs, as well as contextual constraints, hampers the ability of people and machines to read lips with accuracy. This pointed to the need for further studies on the use of these combined information sources, a direction which brings us into the most recent epoch of lip reading approaches.

The arrival of Deep Learning

It is with this point that we introduce recent work from Assael et al. (2016) — “LipNet: End-to-End Sentence-level Lipreading.”[17] “LipNet” introduces the first approach for an end-to-end lip reading algorithm at sentence level. Earlier work by Wand, Koutník and Schmidhuber[18] applied LSTMs[19] to the task, but only for word classifications. However, their earlier advances, including end-to-end[20] trainability, were undoubtedly valuable to the body of work in the space. For those wishing to know more about LSTMs and their variants, Christopher Olah provides an intuitive and detailed explanation of their use here.[21]

Figure 3: LipNet Example at Sentence Level

Note: The speaker is saying “place blue”, and the “p” and “b” there have the same viseme, which means that the corresponding phonemes are very difficult to accurately determine for a person whose only information source is visual.

Source: GIF created by The M Tank, originally from LipNet video.[22]

At a high level, the frames extracted from a video sequence are processed in small sets by a Convolutional Neural Network (CNN),[23] while an LSTM-variant runs sequentially over the CNN output to generate output characters. More precisely, a 10-frame sequence is grouped together in a block (width x height x 10); sequence lengths may vary, but the consecutive nature of these frames is what makes this a Spatiotemporal CNN.

Then the output of this LSTM-variant, called a Gated Recurrent Unit (GRU),[24] is processed by a multi-layered perceptron (MLP) to output values for the different characters derived from the Spatiotemporal CNN. Lastly, a Connectionist Temporal Classification (CTC) performs final processing on the sequence outputs, collapsing them into precise outputs, i.e. words and sentences. This approach allows information to be passed through the time periods comprising both words and, ultimately, sentences, improving the accuracy of network predictions.
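To make the frame-grouping concrete, here is a minimal NumPy sketch of packing consecutive frames into a spatiotemporal block; the dimensions are illustrative, not the paper's exact ones:

```python
import numpy as np

# Illustrative dimensions only, not the exact ones used in the paper.
T, H, W, C = 75, 50, 100, 3           # frames, height, width, channels
video = np.zeros((T, H, W, C))

# Group 10 consecutive frames into one block so that a 3D (spatiotemporal)
# convolution can see motion across time as well as space.
block = video[:10]                    # shape (10, H, W, C)
stcnn_input = np.moveaxis(block, 0, 2)
print(stcnn_input.shape)              # (50, 100, 10, 3): height x width x 10, per channel
```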

The authors note that 'LipNet addresses the issues of generalisation across speakers,' i.e. the variance problems seen in earlier approaches, 'and the extraction of motion features', originally classed as open problems in Zhou et al. (2014).[25][26] The approach in LipNet, we feel, is interesting and exciting outside of the narrow confines of accuracy measures alone. The combination of CNNs and RNNs in the network, itself harking back to our comments around the lego-like approach of deep learning research, is, perhaps, more evidence for the soon-to-be-primacy of differentiable programming. Deep Learning is dying. Long live Differentiable Programming![27]

LipNet also makes use of an additional algorithm typically used in speech recognition systems — a Connectionist Temporal Classification (CTC) output. After the classification of framewise characters, which in combination with more characters define an output sequence, CTC can group the probabilities of several sequences (e.g. “c__aa_tt” and “ccaaaa__t”) into the same word candidates (in this case “cat”) for the final sentence prediction. Thus the algorithm is alignment-free. CTC solves the problem of matching sequences where timing is variable.
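The collapsing rule is simple enough to sketch directly: merge runs of repeated characters, then delete the blanks.

```python
from itertools import groupby

BLANK = "_"

def ctc_collapse(path):
    """Map a framewise character path to a word: merge repeated
    characters, then remove the CTC blank symbol."""
    merged = (ch for ch, _ in groupby(path))     # "ccaaaa__t" -> c, a, _, t
    return "".join(ch for ch in merged if ch != BLANK)

print(ctc_collapse("c__aa_tt"))   # cat
print(ctc_collapse("ccaaaa__t"))  # cat
print(ctc_collapse("to__o"))      # too: the blank preserves genuine double letters
```

Full CTC decoding sums the probabilities of every path that collapses to the same word; this sketch shows only the collapsing step itself.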

Figure 4: CTC in Action

Note: CTC maps both representations shown in the image to the same sentence, "the cat", despite differences in spacing and the elongation of letters/syllables. There is no need to find the perfect match or alignment between framewise outputs and classified characters; the CTC loss function provides a unified, consistent output.

Source: Brueckner (2016)[28]

By predicting the alphabet characters and an additional “_” (space) character, it’s possible to generate a word prediction by removing repeated letters and empty spaces, as can be seen in fig. 5 for the classification of the word “please”. In practical terms this means that elongated pronunciations, variations in emphasis and timings, as well as pauses between syllables and words can still produce consistent predictions using the CTC for outputs.

Figure 5: Saliency map of “Please”

Note from source: Saliency shows the places where LipNet has learned to attend, i.e. the phonologically important regions. The pictured transcription is given by greedy CTC decoding. CTC blanks are denoted by ‘⊔’.

Source: Assael et al. (2016, p. 7).

CTC provides both an output alignment and a loss function based on that alignment, and is independent of the CNN and LSTM-variants used. One can think of CTC as similar to a softmax in that it converts the raw output of a network (e.g. raw class scores or, in our case, characters) into the expected output (e.g. a probability distribution or, in this case, words and sentences). CTC makes matching single-character outputs to the word level possible. Awni Hannun provides an excellent dynamic publication explaining how CTC operates; available here.[29]

There is a great video which covers some of LipNet’s functionality, as well as a specific use case — operating within autonomous vehicles. Seeing LipNet in operation ties together much of what we’ve discussed about the system so far.[30]

Assael, Y. (2016). LipNet in Autonomous Vehicles | CES 2017 (See reference [30])

Architecture and results

A hallmark of this method is that the output labels are not conditioned on each other. For example, the letter 'a' in 'cat' is not conditioned on 'c' or 't'. Instead this relation is extracted by three spatiotemporal convolutions, followed by two GRUs which process a set number of the input images. The output from the GRUs then goes through an MLP to compute the CTC loss (see fig. 6).

Figure 6: LipNet Architecture

Note from source: LipNet architecture. A sequence of T frames is used as input, and is processed by 3 layers of STCNN, each followed by a spatial max-pooling layer. The features extracted are processed by 2 Bi-GRUs; each time-step of the GRU output is processed by a linear layer and a softmax. This end-to-end model is trained with CTC.

Note: Bidirectional GRUs (Bi-GRUs) are an RNN variant that process the input in forward and reverse order.

Source: Assael et al. (2016, p. 5)

The architecture of LipNet was deemed an empirical success, achieving a prediction accuracy of 95.2% on sentences from the GRID dataset, an audiovisual sentence corpus for research purposes.[31] However, literature on deep speech recognition (Amodei et al., 2015)[32] suggested that further performance improvements would inevitably be achieved with more data and larger models. Commentators, reminded of earlier difficulties with generalisability and moustache-handling, expressed concern over the unusual sentences taken from GRID which formed the LipNet example video. The limited nature of GRID produced fears of overfitting; but how would LipNet fare in the real world?

Figure 7: LipNet and other approaches

Note from source: Existing lipreading datasets and the state-of-the-art accuracy reported on these.

Source: Assael et al. (2016, p. 3)

The arrival of richer data

Not long after LipNet, DeepMind released 'Lip Reading Sentences in the Wild',[33] which addressed some of the concerns around LipNet's generalisability. Taking inspiration from both CNNs for visual feature extraction[34] and the use of LSTMs for speech transcription,[35] the authors present an innovative approach to the problem of lip reading. By adding individual attention mechanisms for each of the input types, and combining them afterwards to produce character outputs, improvements in both the accuracy and generalisability of the original LipNet architecture were realised.

Attention mechanisms, discussed at length in part two of this piece, refer to a technique for focusing on specific parts of the input or previous layer(s) within neural networks. A somewhat-recent technique, taking inspiration from earlier work but popularised by Alex Graves in 2013/2014, its adoption grew partly from his memory-related work: the now-famous sequence generation paper[36] along with his work on Neural Turing Machines.[37]

Attention mechanisms have been an enabler of some of the recent success within deep learning, due to more efficient and clever processing of data. They also make models more interpretable: if asking why a network thinks a certain image is a dog, it is often hard to look at the internals of the network to find out why. Attention allows the network to highlight the salient parts of the image used in its prediction, e.g. a snout and pointed ears. Attention has become such a common technique that it spawned papers like "Attention Is All You Need", which forgoes convolution and recurrence entirely for the problem of machine translation.
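A minimal sketch of the underlying mechanic, plain dot-product attention over a toy encoder sequence (real models typically add learned projections and scaling):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, encoder_outputs):
    """Minimal dot-product attention: score each encoder time step against
    the decoder's query vector, normalise the scores, and return the
    attention-weighted context vector."""
    scores = encoder_outputs @ query        # one score per time step, shape (T,)
    weights = softmax(scores)               # where to "look" along the sequence
    context = weights @ encoder_outputs     # weighted sum of encoder features
    return context, weights

# Toy example: 3 time steps of 2-dim features; the query matches step 0,
# so nearly all of the attention mass lands there.
enc = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
query = np.array([10.0, 0.0])
context, weights = attend(query, enc)
print(weights.round(3))   # ~[1, 0, 0]
```

The `weights` vector is exactly the "highlight" referred to above: it can be inspected to see which inputs drove the prediction.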

Returning to "Lip Reading Sentences in the Wild", Chung et al. (2017) present their WLAS network, composed of three main submodules (Watch, Listen, Spell), with attention sprinkled into the Spell module. The system is as follows:

Watch (image encoder): Takes images and encodes them into a deep representation to be processed by further modules.

Listen (audio encoder): Allows the system to take in audio as optional help to lip reading. This directly processes 13-dimensional MFCC features (see next section).

Spell (character decoder): This module incorporates the information from all previous modules. Each encoder above transforms its respective input sequence into a fixed-dimensional state vector and a sequence of encoder outputs. The character decoder, which is an LSTM transducer, then reads the fixed state and attention vectors from both encoders and produces a probability distribution over the output character sequence. Finally, the attention vectors are fused with the output states to produce the context vectors that contain the information required to produce the next step output.

Attend (independent regulation of audio and video attention mechanisms): Attends to what is important in each specific input signal/stream, i.e. audio or video. Without attention the model gets word error rates of over 100% and seems to forget the input signal. This shows that the dual-attention mechanism truly allowed this technique to work end-to-end. It also allows the network to handle out-of-sync audio/video (different sampling rates), including an absent stream.
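As a toy sketch of why one attention mechanism per stream helps, the snippet below attends separately over a video encoding and a longer audio encoding, then fuses the two context vectors; the dimensions, random features, and concatenation fusion are illustrative simplifications of the paper's mechanism:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context(query, enc):
    """Attention-weighted summary of one encoder's output sequence."""
    weights = softmax(enc @ query)
    return weights @ enc

rng = np.random.default_rng(0)
D = 8                                        # toy feature dimension
video_enc = rng.standard_normal((20, D))     # 20 video time steps
audio_enc = rng.standard_normal((80, D))     # 80 audio frames: a different rate is fine
decoder_state = rng.standard_normal(D)

# One attention mechanism per stream; the decoder fuses the two context
# vectors, so it can lean on whichever stream is informative at this step.
fused = np.concatenate([context(decoder_state, video_enc),
                        context(decoder_state, audio_enc)])
print(fused.shape)  # (16,)
```

Because each stream is summarised independently before fusion, nothing forces the two sequences to share a length or sampling rate, which is what lets the model cope with out-of-sync or absent streams.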

WLAS functionality; greater details from more data

Watch is a VGG-M[38] that extracts a framewise feature representation to be consumed by an LSTM, which generates a state vector and an output signal. In practice, the Watch module looks at each frame in the video, and a regular VGG-M CNN extracts the relevant features the module has learned to look for, i.e. certain lip movements/positions, outputting a feature representation per frame.

This sequence of feature representations is then fed into a regular LSTM, which generates a state vector (or cell state) and an output signal. With LSTMs and GRUs there is an output and a "state" passed to the next cell. The output is a character prediction (or a probability distribution over predicted characters), while the state encodes "the past", i.e. what the LSTM has computed and stored so far, which is used to predict the next output.
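LipNet's recurrent units are GRUs, whose state doubles as the output. A minimal sketch of the standard GRU update (Cho et al., 2014) shows how that state threads through time; the weights below are random and the dimensions toy-sized, purely for illustration, not the trained LipNet/WLAS cells:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step: consumes input x and previous state h, returns the
    new state, which doubles as the output at this time step."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)          # update gate: how much to rewrite
    r = sigmoid(Wr @ x + Ur @ h)          # reset gate: how much past to use
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde      # state carries "the past" forward

# Threading the state through a 10-step sequence:
rng = np.random.default_rng(1)
D = 4
params = [rng.standard_normal((D, D)) * 0.1 for _ in range(6)]
h = np.zeros(D)
for x in rng.standard_normal((10, D)):
    h = gru_step(x, h, params)
print(h.shape)  # (4,)
```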

Figure 8: Watch, Listen, Attend and Spell architecture

Note from source: “Watch, Listen, Attend and Spell architecture (WLAS). At each time step, the decoder outputs a character yi, as well as two attention vectors. The attention vectors are used to select the appropriate period of the input visual and audio sequences.”

Source: Chung et al. (2017, p. 3)

The Listen module takes Mel-frequency cepstral coefficients (MFCCs)[39] as its input. These coefficients represent the short-term power spectrum of a sound via a series of signal transformations, scaled to a frequency axis that approximates the sensitivity of human hearing. Following this, the independent attention mechanisms in the Attend module for each of the audio and video inputs are combined, and passed through the Spell module. With a multi-layered perceptron (MLP) at each time step, the output from the LSTM ends up in a softmax defining the probabilities of the output characters.
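The frequency scaling behind MFCCs follows a standard formula; a quick sketch (the HTK-style constants 2595 and 700 are the conventional choice, independent of this paper):

```python
import math

def hz_to_mel(f):
    """HTK-style mel mapping: compresses high frequencies so equal steps
    on the mel axis roughly match equal steps in perceived pitch."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# 1000 Hz sits near 1000 mel by construction; higher frequencies are
# increasingly compressed, mirroring human hearing.
print(round(hz_to_mel(1000)))   # ~1000
```

A full MFCC pipeline would follow this with a mel filterbank, a log, and a discrete cosine transform, keeping the first 13 coefficients, which is the 13-dimensional input the Listen module consumes.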

With this, we return to similar themes of progress alluded to in our previous work: data availability and network stack-ability. Neural network-based approaches are typically characterised by heavy data demands. Concomitant to the progress in lip reading is the creation of a unique dataset for training and testing the network. Previously, research in lip reading was hampered by the available datasets and their small vocabularies. One only has to look at the desirable characteristics of Chung et al.’s (2016/2017) datasets, the LRW and the LRS, as expressed by Stafylakis and Tzimiropoulos (2017, p. 2), to understand the value of such data in improving research efforts:

“We chose to experiment with the LRW database, since it combines many attractive characteristics, such as large size (∼500K clips), high variability in speakers, pose and illumination, non-laboratory in-the-wild conditions, and target-words as part of whole utterances rather than isolated.”[40]

Chung et al. (2017) created a pipeline to automatically generate the dataset(s)[41] from BBC recordings and their accompanying closed captions, which enabled progress in a data-intensive research area. Their creation is the 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television.

Figure 9: Pipeline to generate LRW/LRS dataset

Steps: (A) Video preparation: (1) shot boundaries are detected by comparing colour histograms across consecutive frames. (2) Face detection and grouping of the same face across frames. (3) Facial landmark detection. (4). (B) Audio and text preparation: (1) The subtitles in BBC videos are not broadcast in sync with the audio. (2) Force-align them, filter and fix. (3) AV sync and speaker detection. (4) Sentence extraction. The videos are divided into individual sentences/phrases using the punctuations in the transcript, e.g. by full-stops, commas, question marks, etc.

Source: Chung et al. (2017, p. 5)

The authors also corrupt said data with storm noises (i.e. weather storms[42]), demonstrating the network's ability to use distorted and low-volume audio, or to discard the audio completely for prediction purposes, determining whether or not listening adds value to the prediction. For those wishing to see more, Joon Son Chung presents a fantastic overview of the authors' work at CVPR.[43]

Joon Son Chung presenting Lip Reading Sentences in the Wild @ CVPR 2017. See reference [43].

Although movements towards lower data requirements are pressing on, this paradigm has yet to shift; and it's likely that it shall remain this way for some time to come. As for stackability, the very nature of the LipNet and Lip Reading in the Wild architectures illustrates the lego-like nature of neural nets, e.g. CNNs plugged into RNNs with attention techniques.[44] While it's true that this is a gross oversimplification, as a heuristic we find it increasingly useful in interpreting and understanding the rapid advancements across a lot of existing, and new, AI research.

Here this last point extends outside of the architecture itself, inscribing the potential stacking of inputs into our heuristic also. A great contribution of these works is the creation of an end-to-end architecture capable of using audio, video, or combinations of both as inputs to generate a text prediction as output: creating a truly multimodal model. Multiple input sequences resulting in a singular output sequence. Solving this multi-modal problem, and others like it, potentially opens new paths to explore in connecting video, audio and language systems.

New paths of exploration

Curious as to what would follow the approaches detailed previously, we turn our attention to some of the most recent work in this space. Although not exhaustive, here’s a smattering of the best improvements we came across in this domain:

Combining Residual Networks with LSTMs for Lipreading[45]: Improves on the original LRW paper by using a spatiotemporal convolutional network, a residual network and stacked bidirectional Long Short-Term Memory networks (BiLSTMs), the latter of which process the input features in forward and reverse order like the BiGRU mentioned earlier. Their approach improves word accuracy on LRW from 76.2% to 83%,[46] and also improves accuracy on the GRID dataset. They do not, at present, extend to sentences on the LRS dataset.

Improving Speaker-Independent Lipreading with Domain-Adversarial Training[47]: Helps improve the performance on a target speaker with only a small amount of data. It is less effective in instances where the model is trained on a lot of data; hence, we would be interested to see its performance on LRS, which has over 1,000 speakers.

End-to-End Multi-View Lipreading[48]: Achieves classification of non-frontal lip views, also utilising bidirectional Long Short-Term Memory networks (BiLSTMs). For each viewpoint the authors create an identical encoding MLP architecture (a stream of video processing), which enables the network to train with multiple views simultaneously (see fig. 10).

Figure 10: Architecture

Note from source: Overview of the end-to-end visual speech recognition system. One stream per view is used for feature extraction directly from the raw images. Each stream consists of an encoder which compresses the high dimensional input image to a low dimensional representation. The ∆ and ∆∆ features are also computed and appended to the bottleneck layer. The encoding layers in each stream are followed by a BiLSTM which models the temporal dynamics. A BiLSTM is used to fuse the information from all streams and provides a label for each input frame.

Note: ∆ and ∆∆ refer to the first and second derivatives from the encoded feature maps right before each BiLSTM, respectively. Using these representations forces the encoding MLPs to create encoded features with meaningful information on its derivatives as opposed to the use of derivatives on the image level.[49]

Source: Petridis et al. (2017, p. 3)
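A minimal sketch of appending such derivative features to an encoded sequence; `np.gradient` differencing stands in here for the regression-based deltas often used in speech pipelines:

```python
import numpy as np

def add_deltas(feats):
    """Append first (delta) and second (delta-delta) temporal differences
    to a (T x D) feature sequence, as described for the bottleneck layer."""
    d1 = np.gradient(feats, axis=0)    # delta: frame-to-frame rate of change
    d2 = np.gradient(d1, axis=0)       # delta-delta: acceleration of the features
    return np.concatenate([feats, d1, d2], axis=1)

feats = np.arange(12, dtype=float).reshape(6, 2)   # toy (T=6, D=2) encoding
out = add_deltas(feats)
print(out.shape)   # (6, 6)
```

For this linear toy input the deltas are constant and the delta-deltas vanish; on real encodings they carry the temporal dynamics the BiLSTM consumes.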

Visual Speech Enhancement using Noise-Invariant Training[50]: Tackles a somewhat related problem by providing a method for enhancing the voice of visible speakers in noisy environments. The approach uses the audio-visual inputs seen previously to disentangle the voice from background noise by matching lip movements. Although it differs from the other approaches, the idea itself is novel and, frankly, pretty cool, especially since it makes use of a lipreading dataset for this task.

“Visual speech enhancement is used on videos shot in noisy environments to enhance the voice of a visible speaker and to reduce background noise. While most existing methods use audio-only inputs, we propose an audio-visual neural network model for this purpose. The visible mouth movements are used to separate the speaker’s voice from the background sounds”

Update August 30th 2019: Two new papers from the authors of LipNet and Lip Reading in the Wild respectively.