From Jan. 29 to Feb. 4 the Moscow Institute of Physics and Technology (MIPT) will host DeepHack.Babel, a hackathon organized by the Institute’s Neural Networks and Deep Learning Lab. The focus of the event is on neural machine translation, an approach that is gaining in popularity among researchers and that is already used in commercial products. This hackathon is special in that the participants are supposed to train their machine translation systems using nonparallel data. In machine learning terms, this is known as unsupervised learning. Read on to find out what this is all about.

Before neural networks

Until recently, before neural networks gained popularity, machine translation systems were essentially long lists of terms — words or phrases — in the source language with their possible translations into the target language. These possible translations came from a translation corpus — a massive bilingual collection of texts where each text in one language has a matching translation in the other language. An algorithm would analyze the co-occurrence of words and expressions to identify which of them are mutual translations. To render a sentence into another language, translations of individual words and phrases had to be combined in the most likely way.
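As a toy illustration of the co-occurrence idea, here is a hypothetical two-sentence German–English corpus and a count of how often word pairs appear in aligned sentences (the data and counts are purely illustrative, not from a real system):

```python
from collections import Counter

# Hypothetical aligned German-English corpus (toy example).
parallel = [
    ("das haus", "the house"),
    ("das buch", "the book"),
]

# Count how often each (source, target) word pair co-occurs
# in aligned sentence pairs.
cooc = Counter()
for src, tgt in parallel:
    for s in src.split():
        for t in tgt.split():
            cooc[(s, t)] += 1

# "das" co-occurs with "the" in both sentence pairs, making it the
# most frequent pairing -- a crude signal that they are translations.
best = max(cooc, key=cooc.get)
```

Real systems refine such raw counts with alignment models, but the underlying signal is the same: words that keep appearing together across aligned sentences are likely translations of each other.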



These are the possible English translations of the words and phrases in the German sentence Er geht ja nicht nach Hause. The quality of each possible translation is given by a weighted sum of feature values, such as the probabilities p(e|f) and p(f|e), where f and e denote the source and target expressions, respectively. But choosing appropriate translations is not everything: they also need to be put in the right order. This image was taken from a presentation by Philipp Koehn.
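The weighted-sum scoring described above can be sketched in a few lines (the feature values and weights below are invented for illustration; real systems tune the weights on held-out data):

```python
import math

# Hypothetical feature values for candidate translations of "Haus".
candidates = {
    "house":    {"p_e_given_f": 0.80, "p_f_given_e": 0.70},
    "home":     {"p_e_given_f": 0.15, "p_f_given_e": 0.20},
    "building": {"p_e_given_f": 0.05, "p_f_given_e": 0.10},
}

# Illustrative feature weights (tuned on held-out data in practice).
weights = {"p_e_given_f": 1.0, "p_f_given_e": 0.5}

def score(features):
    # Weighted sum of log feature values, phrase-based SMT style.
    return sum(w * math.log(features[name]) for name, w in weights.items())

best = max(candidates, key=lambda c: score(candidates[c]))
```

Working in log space turns the product of probabilities into a sum, which is both numerically stable and easy to weight.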

To order the translations of individual words, a machine translation system was supplied with a probabilistic language model. A classical variant of such a model is based on n-grams. The principle behind it is similar to how mutual translations are identified by their co-occurrence, but here the model estimates the probability that a given word will occur after the n − 1 words that precede it. The greater this probability for each subsequent word in a generated sentence — that is, the less “surprised” the model is — the more natural the sentence is deemed to sound, and, by implication, the more likely it is judged to be the correct translation. This may sound simplistic, but the technique actually produced translations of fairly high quality. One reason may be that a probabilistic language model is trained on a monolingual corpus rather than on translations, so it can be trained on a truly massive amount of language data and thus learn how people actually write.
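A minimal bigram (n = 2) language model can be built from raw counts; the corpus and probabilities below are purely illustrative:

```python
from collections import Counter

# Toy monolingual corpus (hypothetical); <s> marks sentence start.
corpus = [
    "<s> he goes home".split(),
    "<s> he goes out".split(),
    "<s> she goes home".split(),
]

bigrams = Counter()
contexts = Counter()
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def p(word, prev):
    # Maximum-likelihood estimate of P(word | prev).
    return bigrams[(prev, word)] / contexts[prev]

def sentence_prob(sent):
    # Probability of a whole sentence as a chain of bigram probabilities.
    prob = 1.0
    for prev, word in zip(sent, sent[1:]):
        prob *= p(word, prev)
    return prob
```

A sentence the model has seen word orderings for ("he goes home") scores higher than a rarer combination ("she goes out"), which is exactly the signal used to rank candidate orderings. Production models add smoothing so unseen n-grams do not get zero probability.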

Neural networks transformed machine translation. It is now carried out by first encoding a sentence into a vector that represents the general idea of the sentence in a language-independent form and then decoding this meaning vector into words in the target language. These transformations are often achieved using recurrent neural networks, which are designed for processing sequences of objects — in this case, strings of words.



Principle of operation of a three-layer encoder-decoder model. The encoder (red) generates a representation of a sentence by combining each subsequent input word with the representation of the words that preceded it. The representation of the sentence as a whole is shown in blue. The decoder (green) outputs a word in the target language based on the representation of the source sentence and the preceding model-generated word. This image was taken from a tutorial on neural machine translation from the 2016 meeting of the Association for Computational Linguistics.

With each subsequent step, the neural network combines a new input word — or rather, its vector representation — with the information on the words that preceded it. The parameters of the neural network determine how much “remembering” and “forgetting” is needed at each step. As a result, the general representation of the entire sentence contains the most important information. The encoder-decoder architecture has become a staple of machine translation. You can find out more about it in [1].
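The remember-and-combine loop can be sketched with a toy, scalar-valued recurrent step (real encoders use vectors, weight matrices, and learned parameters; everything below is illustrative):

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=0.5):
    # One recurrent step: blend the running summary h with the new
    # input x. The weights control how much is "remembered" (w_h)
    # versus how much the new word contributes (w_x).
    return math.tanh(w_h * h + w_x * x)

def encode(inputs):
    # Fold a sequence of (toy, one-dimensional) word representations
    # into a single summary value, the way an encoder builds a
    # sentence vector word by word.
    h = 0.0
    for x in inputs:
        h = rnn_step(h, x)
    return h

sentence = [0.2, -0.4, 0.9]   # hypothetical 1-dim word embeddings
summary = encode(sentence)
```

The tanh squashing keeps the summary bounded no matter how long the input is, which is also why naive recurrent cells struggle with long sentences: information from early words gets repeatedly squashed and diluted.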

In practice, the basic version of this system is not as good as one might expect, so certain extra tweaks are needed to ensure quality translation. One pitfall is the tendency of a recurrent network with ordinary cells toward exploding or vanishing gradients: the gradients either grow to extreme values or shrink toward zero, and the network can no longer be trained effectively. To combat this, cells with a different internal structure were proposed, including long short-term memory (LSTM) cells and gated recurrent units (GRUs). Another problem is that if a sentence is long, the system forgets its beginning by the time it reaches the end. To address this, bidirectional networks, which read the sentence from both ends at once, are used [4]. It also proved effective to match individual source elements with their translations, the way it is done in statistical machine translation. To do this, attention mechanisms, which had already been applied to other tasks, were introduced into machine translation in [4] and [5], among others. With an attention mechanism, the system is told at each decoding step which particular source word it is supposed to translate at that point.



Attention mechanism implemented in the encoder-decoder architecture. The current state of the decoder (B) is multiplied by each of the states of the encoder (A) to find out which of the source words is the most relevant for the decoder at that particular point. (Other similarity operations can be substituted for multiplication.) The result is transformed into a probability distribution by the softmax function, which returns the weight of each individual input word for the decoder. The weighted sum of the encoder states is then supplied to the decoder. This image was taken from a post by Chris Olah.
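The steps in the caption — score, softmax, weighted sum — can be sketched directly; the encoder and decoder states below are hypothetical two-dimensional vectors:

```python
import math

def softmax(scores):
    # Turn raw similarity scores into a probability distribution.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # Dot-product attention: score each encoder state against the
    # current decoder state, normalise with softmax, and return the
    # weighted sum of encoder states (the "context" for the decoder).
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(len(decoder_state))]
    return weights, context

# Hypothetical 2-dimensional states for a three-word source sentence.
enc = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
dec = [1.0, 0.0]
weights, context = attend(dec, enc)
```

Because the decoder state here points in the same direction as the first encoder state, the first source word receives the largest attention weight, which is exactly the "which word am I translating now?" signal the caption describes.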

All of these supplementary techniques ensure that neural machine translation is significantly more effective than purely statistical systems. For example, a recent machine translation contest saw systems based on neural networks triumph in virtually all language pairs.

Statistical models may be outdated, but they have a certain feature that so far has not been reproduced in neural networks: They accommodate the use of massive amounts of nonparallel data — that is, untranslated text.

What is the point, one might wonder, of using nonparallel data, considering that neural systems are reasonably good without it? It turns out that quality translation relies on enormous amounts of training data, which may not be available: neural networks are quite demanding in terms of how much data they need. With a small dataset, any classical technique — say, support vector machines — will do better than a neural network. There is no shortage of data for such popular language pairs as English ⇔ any other widely spoken European language, English ⇔ Chinese, or English ⇔ Russian, so for them neural architectures produce very good results. But as soon as less than a couple of million sentences' worth of parallel data is available, neural networks become all but useless, and this is the situation for the majority of language pairs. However, a great many of these languages have an endless supply of monolingual data: news, blogs, social media, official documents. New texts are constantly being created and could be used to improve the quality of neural machine translation, the same way they were used to improve statistical systems. Unfortunately, no established techniques are available so far to do this.

Training neural networks on monolingual texts has, in fact, been attempted. For example, [6] describes a combination of the encoder-decoder architecture and a probabilistic language model, and [7] relies on the model itself to generate the missing translations for a monolingual corpus. This shows that monolingual corpora can be used to boost the quality of neural machine translation, but the approach remains uncommon. Moreover, we do not know how to use nonparallel data optimally: which approach is the most effective, and whether different approaches suit different language pairs, architectures, and so on. It is these questions that we wish to address by holding the DeepHack hackathon.

We are planning a number of controlled experiments: The participants will have a small parallel dataset to train a neural machine translator. They will then be tasked with improving translation quality using monolingual data. The conditions for all participants are the same: identical datasets, identical constraints on the size and training time of models. This will help us pinpoint the most effective methods used by the contestants, taking us one step closer to better machine translations for uncommon language combinations. In fact, we will have experiments featuring a range of language pairs of varying complexity to see whether some of the solutions are language-specific.

What is more, we will finally come to grips with an even more ambitious task, which would be insurmountable with statistical methods alone — namely, translating without any parallel data. Translator training technology using parallel texts is tried and tested, yet questions remain even in that regard. Comparable corpora — that is, those containing texts on one topic in two languages — are also used widely in machine translation [8]. This approach enables researchers to tap into resources such as Wikipedia, where articles in different languages may not repeat one another but still mention roughly the same things. That said, sometimes we just cannot tell if two texts match each other in their content. An example would be a bilingual corpus of news stories, all of them dated with the same year: We may feel fairly confident that the same events were mentioned in the news, which implies that for most of the words in one language, there is a corresponding term in the other language, but matching individual texts — let alone sentences — proves impossible without additional information.

Is it possible to make use of such data? While the idea may seem far-fetched, the authors of several recent publications are in fact doing just that. For example, a system based on denoising autoencoders was recently described in [9]. By reproducing these methods, the hackathon participants will hopefully be able to beat translation systems trained on parallel texts.



Operation of a machine translation system trained without parallel texts. Autoencoder (left): The model is trained to reconstruct sentence x from its corrupted version C(x), producing a reconstruction x̂. Translator (right): The model is trained to translate a sentence into the other language; it receives a corrupted translation generated by the previous iteration of the model. The first iteration uses a word-for-word translation vocabulary that is likewise learned without parallel data. By combining the two models, translation quality comparable to that of systems trained on parallel data is achieved. The image was taken from [9].
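A noise function C(x) in this spirit might drop some words and lightly shuffle the rest; the function below is a sketch with illustrative parameters, not the exact procedure from [9]:

```python
import random

def corrupt(words, drop_prob=0.1, shuffle_window=3, seed=0):
    # Toy noise function C(x): randomly drop words, then lightly
    # shuffle the survivors by sorting on jittered positions, so
    # each word moves only a few places from where it started.
    # Parameters are illustrative.
    rng = random.Random(seed)
    kept = [w for w in words if rng.random() > drop_prob]
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]

sentence = "er geht ja nicht nach hause".split()
noisy = corrupt(sentence)
```

Training the model to undo this corruption forces it to learn what well-formed sentences in each language look like, which is what makes the autoencoder half of the system useful without any parallel data.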

The hackathon will feature a series of lectures on machine translation research. The speakers are leading experts, including Kyunghyun Cho, a research scientist at Facebook AI Research (FAIR); Ruslan Salakhutdinov from Carnegie Mellon University; and Andre Martins from Unbabel.

The lectures will be delivered in English. They are open to the public, but it is necessary to register. Some of the lectures will be streamed on DeepHack’s YouTube channel. The program of the event and lecture descriptions are available on the hackathon’s dedicated website.

Bibliography: