Neural machine translation (NMT) is a machine translation approach that utilizes an artificial neural network to predict the likelihood of a sequence of words. With the development of deep learning, NMT is playing a major role in machine translation and has been adopted by Google, Microsoft, IBM and other tech giants. A Google AI research team recently published the paper Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges, proposing a universal neural machine translation (NMT) system trained on over 25 billion examples that can handle 103 languages.

Paper Abstract: We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to achieving quality and practicality in universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research. (arXiv)

Synced invited Graham Neubig, an Assistant Professor at the Language Technologies Institute of Carnegie Mellon University who works on natural language processing, specifically multilingual models and natural language interfaces, to share his thoughts on the universal neural machine translation (NMT) system.

How would you describe the universal neural machine translation (NMT) system?

This paper does not propose a new technique for NMT per se. Rather, it continues in a line of recent and, in my opinion, exciting work on building machine translation systems that can translate to or from multiple languages in the same model. Specifically, these models work by taking a neural machine translation system and training it on data from many different languages at the same time. This line of work started out in 2015, and was further popularized by Google’s impressive paper in the space in 2017.

Recently, there has been a move towards “massively multilingual” systems that translate many languages at once, including our paper training on 59 languages, and Google’s recent paper training on over 100 languages. The advantages of these systems are two-fold: (1) they can improve accuracy by learning from many different languages, and (2) they can reduce the computational footprint of deploying models by having only one model trained to translate many languages, as opposed to one model per language.
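A common way such single-model multilingual systems are built, following the style of Google’s 2017 paper, is to prepend a token to each source sentence indicating the desired target language, then mix all language pairs into one training stream. The sketch below is illustrative only; the token format and function names are assumptions, not code from the paper.

```python
# Sketch: preparing multilingual training data by tagging each source sentence
# with a target-language token, so a single model learns all directions.

def tag_example(src_sentence, tgt_lang):
    """Prepend a target-language token telling the model where to translate."""
    return f"<2{tgt_lang}> {src_sentence}"

def build_mixed_corpus(corpora):
    """Merge per-language-pair corpora into one stream of tagged examples.

    corpora: dict mapping (src_lang, tgt_lang) -> list of (src, tgt) pairs.
    """
    mixed = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            mixed.append((tag_example(src, tgt_lang), tgt))
    return mixed

corpora = {
    ("en", "fr"): [("Hello", "Bonjour")],
    ("en", "de"): [("Hello", "Hallo")],
    ("de", "en"): [("Hallo", "Hello")],
}
mixed = build_mixed_corpus(corpora)
# e.g. the en->fr example becomes ("<2fr> Hello", "Bonjour")
```

Because every direction shares one encoder and decoder, parameters learned from high-resource pairs can transfer to low-resource ones, which is the source of advantage (1) above.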

Why does this research matter?

The significance of this new paper is the sheer scale at which they performed experiments, and some of the insights gleaned therefrom. The scale of the data is 25 billion sentences, which is several orders of magnitude larger than previous multilingual models. It is also a realistic reflection of the “actual data” that is available on the web, so any limitations of the translation results achieved therein are not ones that can simply be solved by adding more data, and instead are ones that will require serious research work to solve.

Regarding the insights, some that stood out to me were:

Figure 1: Per language pair data distribution of the training dataset used for our multilingual experiments. The x-axis indicates the language pair index, and the y-axis depicts the number of training examples available per language pair on a logarithmic scale. Dataset sizes range from 35k for the lowest resource language pairs to 2 billion for the largest.

Figure 1 demonstrates the huge disparities in the amount of data available across languages, with some datasets having as many as 2 billion parallel sentences, and some having as few as 30–50 thousand.

Figure 2: Quality (measured by BLEU) of individual bilingual models on all 204 supervised language pairs, measured in terms of BLEU (y-axes). Languages are arranged in decreasing order of available training data from left to right on the x-axes (pair ids not shown for clarity). Top plot reports BLEU scores for translating from English to any of the other 102 languages. Bottom plot reports BLEU scores for translating from any of the other 102 languages to English. Performance on individual language pairs is reported using dots and a trailing average is used to show the trend.

Figure 2 demonstrates that even with Google-scale resources, languages beyond the top 50 are major problems.

Figure 3: Effect of sampling strategy on the performance of multilingual models. From left to right, languages are arranged in decreasing order of available training data. While the multilingual models are trained to translate both directions, Any→En and En→Any, performance for each of these directions is depicted in separate plots to highlight differences. Results are reported relative to those of the bilingual baselines (Figure 2). Performance on individual language pairs is reported using dots and a trailing average is used to show the trend. The colors correspond to the following sampling strategies: (i) Blue: original data distribution, (ii) Green: equal sampling from all language pairs. Best viewed in color.

Figure 3 demonstrates that, interestingly, training these large multilingual models is a very effective way to improve translation performance into English, but not exceedingly effective when translating into other languages.
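The two strategies contrasted in Figure 3 — sampling in proportion to each pair’s data size versus sampling all pairs equally — are the extremes of temperature-based sampling, a standard interpolation in this line of work where each pair’s probability is proportional to its data size raised to 1/T. The sketch below is a minimal illustration with invented dataset sizes, not numbers from the paper.

```python
# Sketch of temperature-based sampling over language pairs: at T=1 we sample
# in proportion to data size (Figure 3's "original data distribution"), and as
# T grows the distribution approaches equal sampling from all pairs.

def sampling_probs(sizes, temperature=1.0):
    """p_i proportional to size_i ** (1/T); T=1 -> proportional, large T -> uniform."""
    weights = [s ** (1.0 / temperature) for s in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# Invented sizes echoing the skew in Figure 1 (2B down to 35k examples).
sizes = [2_000_000_000, 500_000_000, 35_000]

probs_t1 = sampling_probs(sizes, temperature=1.0)
probs_t5 = sampling_probs(sizes, temperature=5.0)
# Raising the temperature increases the share of the low-resource pair,
# trading some high-resource quality for low-resource transfer.
```

This trade-off is exactly what the figure visualizes: up-weighting low-resource pairs helps them, but at the cost of the highest-resource directions.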

Other than that, there are a number of insights that may be useful to MT practitioners (e.g. the importance of large models, and the effects of some techniques to select how large to make the model vocabularies).

What impact might this work bring to the field?

I think this is an important data point for translation researchers, as it demonstrates the practical promise and limitations of multilingual translation given the scale of resources that a large company like Google can bring to bear. It demonstrates that some problems can be relatively effectively solved by big data and compute, and some cannot be, and the ones that cannot be will define future research directions.

Can you identify any bottlenecks in the research?

There are a couple of limitations in the experimental results. First, it is not clear which points in the figures correspond to which languages, so it is hard to get finer-grained takeaways about which types of languages are benefiting from this type of training. Second, there are no qualitative results or translation examples, only results measured using automatic metrics such as BLEU score. Because of this, it is hard to tell which of these systems have reached a practical level.

Can you predict any potential future developments related to this research?

I think one thing this paper makes clear is that the problem of low-resource translation has definitely not been solved, particularly translation into low-resource languages. I think this indicates that there will need to be significant work in this area to raise the bar for translation into languages where even Google cannot harvest large datasets.

The paper Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges is on arXiv.