Today, we are excited to announce STACL (Simultaneous Translation with Anticipation and Controllable Latency), the first simultaneous machine translation system with anticipation capabilities and controllable latency. It is an automated system that produces high-quality translation between two languages while the speaker is still talking. STACL represents a major breakthrough in natural language processing, given the challenges posed by word order differences between the source and target languages and by the latency requirements of real-world simultaneous translation or interpretation.

Historically, there have been two types of interpretation:

In consecutive interpretation, the interpreter waits until the speaker pauses (usually at sentence boundaries) before translating, which doubles the time needed. In simultaneous interpretation, the interpreter begins translating just a few seconds into the speaker’s speech and finishes just a few seconds after the speaker ends.

Thanks to its speed, simultaneous interpretation has been widely used for intergovernmental summits, multilateral negotiations and many other occasions. Its benefits have created a huge demand for the service, but there are not nearly enough simultaneous interpreters available, and each interpreter can sustain the work for only a short period, after which error rates grow exponentially. That’s why simultaneous interpreters work in teams of two or three and usually alternate every 20–30 minutes.

Therefore, there is a critical need to develop automated systems that expand access to simultaneous translation.

Creating an automated system for reliable simultaneous translation has been a long-standing challenge in the field, especially because of word order differences between the source and target languages. For example, in the Chinese sentence Bùshí Zǒngtǒng zài Mòsīkē yǔ Pǔjīng huìwù, which means “President Bush meets with Putin in Moscow”, the Chinese verb huìwù (“meet”) appears at the very end, similar to a German or Japanese verb. In the English translation, however, the verb “meets” appears much earlier. This variation in word order across human languages has been a major hindrance both to human simultaneous interpreters and to the development of reliable simultaneous machine translation systems. As a result, virtually all commercial “real-time” translation systems still use conventional full-sentence (i.e., non-simultaneous) translation methods, which introduce an undesirable latency of at least one sentence and leave the user out of sync with the speaker.

We tackled this challenge using an idea inspired by human simultaneous interpreters, who routinely anticipate material that the speaker is going to cover a few seconds later. Unlike human interpreters, however, our model does not predict the source language words in the speaker’s speech. Instead, it directly predicts the target language words in the translation and, more importantly, seamlessly fuses translation and anticipation in a single “wait-k” model. In this model, the translation is always k words behind the speaker’s speech, which allows some context for prediction. We train our model to use the available prefix of the source sentence at each step (along with the translation so far) to decide the next word in the translation. In the aforementioned example, given the Chinese prefix Bùshí Zǒngtǒng zài Mòsīkē (“Bush President in Moscow”) and the English translation so far, “President Bush”, which is k=2 words behind the Chinese, our system accurately predicts that the next translation word must be “meets”, because Bush is likely “meeting” someone (e.g., Putin) in Moscow, long before the Chinese verb appears. Just as human interpreters need to become familiar with the speaker’s topic and style beforehand, our model needs to be trained on a vast amount of data with similar sentence structures in order to anticipate with reasonable accuracy.
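The wait-k schedule described above can be illustrated with a minimal Python sketch. It is not the STACL implementation: the function names `wait_k_translate` and `toy_predict` are hypothetical, and the "model" is a canned lookup that simply replays the reference translation, standing in for a trained network that would have to anticipate words such as “meets”. The sketch only shows which source prefix is visible when each target word is produced.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "</s>"  # hypothetical end-of-sentence marker


def wait_k_translate(
    source: Sequence[str],
    k: int,
    predict_next: Callable[[List[str], List[str]], str],
    max_len: int = 50,
) -> List[Tuple[int, str]]:
    """Emit one target word per step while staying k source words behind.

    Returns (number of source words read, target word emitted) pairs.
    """
    steps: List[Tuple[int, str]] = []
    target: List[str] = []
    read = min(k, len(source))  # wait for the first k source words
    while len(target) < max_len:
        # The predictor sees only the source prefix read so far
        # plus the translation produced so far.
        word = predict_next(list(source[:read]), target)
        if word == EOS:
            break
        target.append(word)
        steps.append((read, word))
        # Read one more source word, if the speaker has not finished.
        read = min(read + 1, len(source))
    return steps


# Toy stand-in for a trained model (assumption): replays a fixed reference.
REFERENCE = "President Bush meets with Putin in Moscow".split()


def toy_predict(source_prefix: List[str], target_so_far: List[str]) -> str:
    i = len(target_so_far)
    return REFERENCE[i] if i < len(REFERENCE) else EOS


source = ["Bùshí", "Zǒngtǒng", "zài", "Mòsīkē", "yǔ", "Pǔjīng", "huìwù"]
steps = wait_k_translate(source, k=2, predict_next=toy_predict)
for read, word in steps:
    print(f"read {read} source words -> emit {word!r}")
```

With k=2, the third target word (“meets”) is emitted after only four source words have been read, that is, before the verb huìwù has been heard. A real model must anticipate that word from the prefix alone.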

STACL is also flexible in terms of the latency-quality trade-off: the user can specify an arbitrary latency requirement (e.g., a one-word or five-word delay). Between closely related languages such as French and Spanish, the latency can be set lower because even word-by-word translation works very well. However, for distant languages such as English and Chinese, and for languages with different word order such as English and German, higher latency should be allowed in order to cope with the word order differences. Translation quality typically suffers under low latency requirements, but our system incurs only a small loss in quality compared to conventional full-sentence (i.e., non-simultaneous) translation. We are continuing to improve translation quality under low latency requirements.

The best human simultaneous interpreters are reported to cover about 60% of the source material (with a delay of about three seconds). The new simultaneous translation system from Baidu, at a comparable delay, scores about 3.4 BLEU points lower than conventional full-sentence translation. BLEU is the standard evaluation metric for translation quality; it compares a machine translation result with a human reference translation. In Chinese-to-English simultaneous translation with a wait-3-words model (where the English translation lags behind the Chinese speech by 3 Chinese words, or about 1.5–2 seconds), the translation quality has a single-reference BLEU score of 15.3. Conventional full-sentence (non-simultaneous) translation scores about 5 BLEU points higher. This gap shrinks to around 3.4 BLEU points if we allow a delay of 5 words (or about 3 seconds).
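To give a rough sense of what a BLEU score measures, here is a simplified Python sketch of single-reference BLEU for one sentence: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only. The function name `sentence_bleu` and the whitespace tokenization are assumptions made here, and the scores reported above would have been computed over a whole test set with standard evaluation tooling, not with this code.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference BLEU on a 0-100 scale, without smoothing."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "President Bush meets with Putin in Moscow"
print(sentence_bleu(reference, reference))
print(sentence_bleu("President Bush meets Putin in Moscow", reference, max_n=2))
```

An exact match scores 100, and dropping a single word already lowers the score noticeably. On single sentences the unsmoothed metric is very strict, which is one reason BLEU is normally reported at the corpus level.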

Even with this latest advance, we are fully aware of the many limitations of a simultaneous machine translation system. The release of STACL is not intended to replace human interpreters, who will continue to be depended upon for their professional services for many years to come, but rather to make simultaneous translation more accessible.