Data Augmentation for End-to-End Speech Translation

How to solve data scarcity with audio and text augmentation techniques


End-to-end (or direct) speech translation is an approach to speech translation (ST) that has gained considerable interest from the research community in the last few years. It consists in using a single deep learning model that learns to generate the translation of the input audio in an end-to-end fashion. Its surge in popularity is due both to the scientific challenge of such a difficult task and to its expected effectiveness in practical applications.

The use of a single model is appealing for many reasons:

Decoding with a single model is faster than decoding with a pipeline of (at least) an automatic speech recognition (ASR) and a machine translation (MT) system. The absence of a transcription step can prevent the propagation of early bad decisions (error propagation), possibly resulting in superior quality. A single system is easier to manage and use than a cascaded system, and could save memory and GPU resources when deployed.

There is one big problem, though. End-to-end speech translation requires different data from ASR and MT, and while those two tasks have been studied for decades and have plenty of data at their disposal (at least for some languages), very little is publicly available for the new approach. In ST, we only have a few hundred thousand segment pairs for the best-resourced languages, while a neural machine translation system (an easier task, because the input is textual) is usually trained on tens or hundreds of millions of sentence pairs.

How, then, can we train a system with so little data? The answer so far is data augmentation: creating synthetic data through transformations of the existing data.

Data augmentation is a commonly used technique in deep learning, as this family of approaches is famously data-hungry and models can significantly improve their quality when more data is added. The classic approach is to alter the input sample while keeping its class label fixed. One example from computer vision is rotating images: the input image changes because it is viewed from a different angle, but the content stays the same. A dog stays a dog and a cat stays a cat, though rotated.

In this post, I want to focus on text and audio augmentation techniques that have been proposed for speech translation but can also be used for other tasks involving these types of data. Obviously, rotating and shifting pixels do not apply to text and audio, so we need more sophisticated techniques, sometimes involving the use of other machine learning systems.

Audio Augmentation

The first approach is similar to what happens with images: alter the input but keep the labels. An approach also widely used in the ASR community is speed perturbation, which consists in using a tool like SoX to perturb the audio speed while keeping the translation fixed. It was used, for example, in the winning submission at IWSLT 2019 [1], an international workshop on speech translation that organizes a competition every year. One common practice is to multiply the speed (or the time duration, which is equivalent for our purpose) by a random factor in the range [0.9, 1.1], which usually produces audio that still sounds human but different from the original. This kind of transformation is usually applied offline, when the dataset is being built.
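In practice this is typically a single SoX invocation (for instance `sox in.wav out.wav speed 1.05`), but the effect can be sketched in NumPy as a simple resampling of the waveform. This is a minimal illustration, not what SoX does internally:

```python
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform so it plays `factor` times faster.

    factor > 1 shortens the audio, factor < 1 lengthens it; pitch
    shifts along with speed, as with SoX's `speed` effect.
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal to sample from.
    positions = np.linspace(0, len(waveform) - 1, num=n_out)
    return np.interp(positions, np.arange(len(waveform)), waveform)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)   # 1 second of (random) audio at 16 kHz
factor = rng.uniform(0.9, 1.1)       # random factor in [0.9, 1.1]
perturbed = speed_perturb(audio, factor)
```

The translation paired with `perturbed` stays the same as the one paired with `audio`.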

Other types of audio transformation happen online. Only some transformations can be applied online, because current speech translation systems receive spectrograms as input, not waveforms. To occupy less disk space and save computational time during training, the datasets are usually stored as spectrograms, which are more compact than the waveforms at the cost of some upfront computation.

Wave form (above) and spectrogram (below) for the same utterance. The wave form is a time series, while the spectrogram is like a matrix: the x-axis is for time, the y-axis is for frequencies.

SpecAugment [2] is a popular spectrogram augmentation technique that was proposed in 2019 and met with great success. It consists of three steps:

i) time warp is a complex and computationally expensive algorithm that shifts a portion of the spectrogram along the time axis; ii) frequency masking applies a horizontal mask that covers some frequencies for the whole temporal dimension; iii) time masking applies a vertical mask that covers all the frequencies for some adjacent time steps. Time and frequency masking are the two most effective components of SpecAugment: they essentially force the model to predict the target sequence while being deaf to some frequencies or portions of the audio. The two kinds of mask have different widths and positions at every iteration, so the model learns to exploit the whole input.

SpecAugment applied to the original input (topmost image). From the SpecAugment paper [2]
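The two masking steps can be sketched in a few lines of NumPy (time warp is omitted here, since the masks are the most effective part; the mask counts and maximum widths below are illustrative, not the paper's values):

```python
import numpy as np

def spec_augment(spec: np.ndarray, n_freq_masks: int = 2, max_f: int = 8,
                 n_time_masks: int = 2, max_t: int = 20, rng=None) -> np.ndarray:
    """Apply frequency and time masking to a (time, frequency) spectrogram."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_steps, n_freqs = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)           # mask width (may be 0)
        f0 = rng.integers(0, n_freqs - f + 1)    # mask start
        out[:, f0:f0 + f] = 0.0                  # horizontal band, all time steps
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, n_steps - t + 1)
        out[t0:t0 + t, :] = 0.0                  # vertical band, all frequencies
    return out
```

Because the masks are redrawn at every call, each training epoch sees a differently corrupted version of the same spectrogram.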

A similar idea was proposed more recently by Nguyen and colleagues [3] to perturb the audio speed in an online fashion. Their method, called time stretch, divides the input spectrogram into fixed-width windows and shrinks or stretches each window by a different random factor picked from [0.8, 1.25]. With this technique, every spectrogram has its windows perturbed in time in different, even contrasting, ways, and according to the authors it can replace offline speed perturbation.
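A minimal NumPy sketch of the idea, assuming a (time, frequency) spectrogram; the window size is illustrative, and the per-window resampling is done with simple linear interpolation:

```python
import numpy as np

def time_stretch(spec: np.ndarray, window: int = 50,
                 low: float = 0.8, high: float = 1.25, rng=None) -> np.ndarray:
    """Stretch or shrink each fixed-width time window by its own random factor."""
    if rng is None:
        rng = np.random.default_rng()
    out = []
    for start in range(0, len(spec), window):
        chunk = spec[start:start + window]            # shape (w, n_freqs)
        factor = rng.uniform(low, high)
        n_out = max(1, int(round(len(chunk) * factor)))
        # Linear interpolation along time, independently per frequency bin.
        src = np.linspace(0, len(chunk) - 1, num=n_out)
        idx = np.arange(len(chunk))
        stretched = np.stack([np.interp(src, idx, chunk[:, f])
                              for f in range(chunk.shape[1])], axis=1)
        out.append(stretched)
    return np.concatenate(out, axis=0)
```

Unlike offline speed perturbation, the output length changes at every call, so padding and batching must account for variable durations.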

Transfer Learning

The most direct way to transfer knowledge between tasks, usually when one of the two has more data available, is so-called pre-training: first, train a model on the high-resource task; then, use the same deep learning architecture for the second task and initialize its weights with the ones learned on the first. This is exactly one of the first approaches proposed for transferring knowledge from MT and ASR systems to direct ST systems [4,5,6]. Train an ASR system using the same architecture (at least on the encoder side) that will be used for direct ST. Train an MT system with the same decoder architecture that will be used in direct ST. Finally, initialize the direct ST system with the encoder weights learned in ASR and the decoder weights learned in MT. This approach leads to faster convergence and better translation quality; however, decoder pre-training appears to be less effective than encoder pre-training. To overcome this problem, Bahar et al. [7] showed that decoder pre-training is more effective if one additional “adapter” layer is put on top of the pre-trained encoder.
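The initialization scheme can be sketched with plain dictionaries standing in for framework checkpoints (parameter names and values below are toy examples; in a real setup these would be tensors loaded from saved ASR and MT models):

```python
# Toy checkpoints: flat dicts mapping parameter names to weights.
asr_checkpoint = {"encoder.layer0": [0.1, 0.2], "decoder.layer0": [0.3, 0.4]}
mt_checkpoint  = {"encoder.layer0": [0.5, 0.6], "decoder.layer0": [0.7, 0.9]}
st_model       = {"encoder.layer0": [0.0, 0.0], "decoder.layer0": [0.0, 0.0]}

def init_from_pretrained(st: dict, asr: dict, mt: dict) -> dict:
    """Encoder weights come from the ASR model, decoder weights from MT."""
    for name in st:
        if name.startswith("encoder."):
            st[name] = asr[name]
        elif name.startswith("decoder."):
            st[name] = mt[name]
    return st

st_model = init_from_pretrained(st_model, asr_checkpoint, mt_checkpoint)
```

This only works if the ST architecture matches the pre-trained ones parameter by parameter, which is exactly why the same encoder and decoder architectures are reused across the three tasks.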

Two-Stage Decoding

Kano et al. [8] proposed two-stage decoding as a way to better use pre-trained components. It is a network consisting of three components: one encoder and two cascaded decoders. The first decoder generates text in the source language, similarly to an ASR system, while the second decoder generates text in the target language. The second decoder computes its attention over the states of the first decoder, taken right before they are used to select the output symbols. In this sense, two-stage decoding is similar to a cascaded system, but it does not discretize the source sequence, in order to prevent errors. The two-stage decoding model is initialized from pre-trained ASR and MT models for increased effectiveness.

Sperber et al. [9] proposed an “attention-passing” model, which improves on two-stage decoding for more effective pre-training. They identify a problem in two-stage decoding: it passes the first decoder's states as attention input to the second decoder, and since those states already contain the information used to select the next word to generate, a form of error propagation appears again in the model. To overcome it, they propose to use the context vectors produced by the attention between the first decoder and the encoder as input to the second decoder's attention. This way, the second decoder is connected directly to the relevant audio portions, and no error is introduced by early decisions made by the first decoder. The model is also more data-efficient, thanks to a clever multitask learning scheme involving ASR, ST, and MT in the same training.

Schematic comparison between cascade, direct ST, attention-passing model and two-stage decoding model. Credit: Sperber et al. [9]
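The difference between the two models boils down to what the second decoder attends to. A toy NumPy sketch with random states (all dimensions are illustrative; a real model would compute these states with trained layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention: one context vector per query."""
    scores = softmax(queries @ keys.T / np.sqrt(keys.shape[1]))
    return scores @ values

rng = np.random.default_rng(0)
enc_states   = rng.standard_normal((50, 8))   # encoder: 50 audio frames
dec1_states  = rng.standard_normal((12, 8))   # first decoder: 12 source tokens
dec2_queries = rng.standard_normal((10, 8))   # second decoder: 10 target tokens

# Two-stage decoding [8]: the second decoder attends to the states of the
# first decoder, taken right before output symbols are selected.
ctx_two_stage = attend(dec2_queries, dec1_states, dec1_states)

# Attention-passing [9]: the first decoder's attention over the encoder
# produces context vectors, and the second decoder attends to those, staying
# connected to the audio rather than to the first decoder's decisions.
ctx_dec1 = attend(dec1_states, enc_states, enc_states)
ctx_attention_passing = attend(dec2_queries, ctx_dec1, ctx_dec1)
```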

Knowledge Distillation

If your data is organized in audio-transcript-translation triplets, the knowledge of an MT model can be transferred to the direct ST model through knowledge distillation [10, 11]. For each triplet, the audio is given as input to the direct ST model and the transcript as input to the MT model; the direct ST model is then trained against the distribution generated by the MT model instead of the ground truth. This kind of training is useful in two ways: 1) the soft distribution generated by the MT model is easier to learn than the one-hot ground truth; and 2) the MT distribution contains relations between symbols that the MT model knows but that may not be present in the ST training data. This way, the direct ST model can learn a larger vocabulary than the one present in its training data [10].
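The distillation objective can be sketched in NumPy: the ST model's output scores (hypothetical logits here) are trained towards the MT model's softmax output instead of a one-hot label:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(st_logits: np.ndarray, mt_logits: np.ndarray) -> float:
    """Cross-entropy of the ST predictions against the MT soft targets,
    averaged over target positions. Shapes: (positions, vocabulary)."""
    teacher = softmax(mt_logits)               # soft distribution from MT
    log_student = np.log(softmax(st_logits))   # ST log-probabilities
    return -(teacher * log_student).sum(axis=-1).mean()

def nll_loss(st_logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Standard one-hot cross-entropy against ground-truth token ids."""
    log_student = np.log(softmax(st_logits))
    return -log_student[np.arange(len(target_ids)), target_ids].mean()

st_logits = np.zeros((3, 100))   # 3 target positions, vocabulary of 100
mt_logits = np.zeros((3, 100))
loss = distillation_loss(st_logits, mt_logits)
```

In practice the two losses are often interpolated, so the student still sees the ground truth while benefiting from the teacher's softer targets.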

Weak Supervision

Jia et al. [12] proposed to create synthetic parallel audio-translation data using existing systems for other tasks: 1) when only audio-transcript data is available, use an MT system to translate the transcripts into the target language; 2) when only parallel source-target texts are available, use a text-to-speech (TTS) system to generate the source audio. With this approach, the training data for direct ST can be augmented by orders of magnitude, and the quality of modern MT and TTS systems keeps the synthetic data reasonably good. They also showed that using MT is more effective than using TTS, possibly because synthesized audio is not very similar to human voice in real conditions. Similar results were confirmed by Pino et al. [13], who also showed that this kind of data augmentation outperforms all the other proposed methods and, in some cases, even makes pre-training superfluous.
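The MT-based variant amounts to a very simple pipeline. In this sketch, `translate` is a stand-in for a real MT system (hypothetical; any MT API would do), and the corpus is a list of (audio, transcript) pairs:

```python
def translate(text: str) -> str:
    """Placeholder for a real MT system producing target-language text."""
    return f"<translation of: {text}>"

def augment_with_mt(asr_corpus):
    """Turn (audio, transcript) ASR pairs into synthetic (audio, translation)
    ST pairs by machine-translating the transcripts."""
    return [(audio, translate(transcript)) for audio, transcript in asr_corpus]
```

The TTS variant is symmetric: keep the target text and synthesize the source audio from the source text instead.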

Multilingual ST

A final technique for data augmentation is to go multilingual. As multilingual MT increases translation quality for low-resource languages, Inaguma et al. [14] and we [15] proposed, in two parallel works, to leverage data from different languages. The objective is twofold: 1) more data to train the audio encoder, and 2) a positive transfer between similar languages that can share grammar or vocabulary, like English and French. The results in both studies are positive, but it remains to be seen whether the gains from multilingual training hold when combined with the other data augmentation techniques.

Conclusion

Many techniques have been proposed to let direct ST leverage more data than the little available for this task. My personal recommendation for building a state-of-the-art system is to always use an encoder pre-trained with a strong ASR model, augment the training set by generating synthetic translations with a strong MT system, and use SpecAugment during training. In our experiments, this combination was always effective, or at least never harmful. The other techniques are also useful, but their effectiveness can shrink when they are combined with other methods, which may be a problem if they increase the training time.

I hope that this roundup of the existing data augmentation methods for direct ST is useful if you are approaching this field now, but also if you work in related fields based on audio and/or text. In particular, great improvements can come from generating data with components trained on larger datasets for other tasks, or from online augmentation techniques that distort the original data.

If you are interested in direct ST but do not know where to start to work with it, check out my tutorial:

References

[1] Potapczyk, Tomasz, et al. “Samsung’s System for the IWSLT 2019 End-to-End Speech Translation Task.” Proceedings of IWSLT 2019.

[2] Park, Daniel S., et al. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.” Proc. Interspeech 2019 (2019): 2613–2617.

[3] Nguyen, Thai-Son, et al. “Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation.” arXiv preprint arXiv:1910.13296 (2019).

[4] Weiss, Ron J., et al. “Sequence-to-Sequence Models Can Directly Translate Foreign Speech.” Proc. Interspeech 2017 (2017): 2625–2629.

[5] Bérard, Alexandre, et al. “End-to-end automatic speech translation of audiobooks.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[6] Bansal, Sameer, et al. “Pre-training on High-Resource Speech Recognition Improves Low-Resource Speech-to-Text Translation.” Proceedings of NAACL-HLT. 2019.

[7] Bahar, Parnia, Tobias Bieschke, and Hermann Ney. “A comparative study on end-to-end speech to text translation.” Proceedings of ASRU (2019).

[8] Kano, Takatomo, Sakriani Sakti, and Satoshi Nakamura. “Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation.” (2017).

[9] Sperber, Matthias, et al. “Attention-passing models for robust and data-efficient end-to-end speech translation.” Transactions of the Association for Computational Linguistics 7 (2019): 313–325.

[10] Liu, Yuchen, et al. “End-to-End Speech Translation with Knowledge Distillation.” Proc. Interspeech 2019 (2019): 1128–1132.

[11] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

[12] Jia, Ye, et al. “Leveraging weakly supervised data to improve end-to-end speech-to-text translation.” ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[13] Pino, Juan, et al. “Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade.” Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT). 2019.

[14] Inaguma, Hirofumi, et al. “Multilingual end-to-end speech translation.” Proceedings of ASRU (2019).

[15] Di Gangi, Mattia Antonino, Matteo Negri, and Marco Turchi. “One-to-many multilingual end-to-end speech translation.” Proceedings of ASRU (2019).