Facebook AI Research recently posted a paper proposing a Convolutional Neural Network (CNN) architecture for machine translation in place of the Recurrent Neural Network (RNN) architectures that have been the convention until now. In this post we will explain why a CNN-based architecture might become the standard for machine translation (and even other NLP tasks) in the future.

Convolutional Neural Networks basics

Let’s first understand how CNNs work in order to explain why they have certain advantages over RNNs.

The basic concept underlying CNNs is that we compute a vector for every possible phrase. Take the string “my favorite AI blog”, for instance. We compute a vector representation for each of its phrases: “my favorite”, “favorite AI”, “AI blog”, “my favorite AI” and “favorite AI blog”. We start with the bigram vectors and keep combining adjacent vectors, layer by layer, until we reach a single top vector.

Example of the computation of the bigram vectors for a very simple sentence until the top vector is reached. Retrieved from the Association for Computational Linguistics.
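To make the idea concrete, here is a minimal sketch of that bottom-up computation. The word vectors are made-up numbers and the compose function is a stand-in we chose for illustration (a tanh of the element-wise mean); a trained CNN would apply learned kernel weights instead.

```python
import math

def compose(left, right):
    # Hypothetical composition function: tanh of the element-wise mean.
    # A real CNN would use learned convolution weights here.
    return [math.tanh((a + b) / 2.0) for a, b in zip(left, right)]

def build_pyramid(word_vectors):
    """Return every layer of vectors, from the words up to the single top vector."""
    layers = [word_vectors]
    while len(layers[-1]) > 1:
        current = layers[-1]
        # Each adjacent pair is independent of the others,
        # so the pairs of one layer could all be computed in parallel.
        layers.append([compose(current[i], current[i + 1])
                       for i in range(len(current) - 1)])
    return layers

sentence = ["my", "favorite", "AI", "blog"]
# Toy 2-dimensional word vectors (made-up values for illustration only).
toy_vectors = {
    "my": [0.1, 0.3],
    "favorite": [0.7, -0.2],
    "AI": [0.5, 0.9],
    "blog": [-0.4, 0.6],
}
pyramid = build_pyramid([toy_vectors[word] for word in sentence])
for depth, layer in enumerate(pyramid):
    print("layer", depth, "holds", len(layer), "vector(s)")
```

With four words the layers shrink from four vectors to three, two and finally one, which is the top vector.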

The computation of these bigrams (or n-grams in more complex but convenient CNN architectures) can be parallelized, whereas the time-step-based computations in an RNN architecture can’t. This hierarchical structure has been really successful in classifying images. Take the image of a cat as an example: the network first processes clusters of pixels, then recognizes shapes, then recognizes parts of the image (ears, legs, tail, etc.) and finally recognizes that there is a cat in the picture.

Convolutional Sequence to Sequence Learning

Gehring et al. (2017) proposed using CNNs for three reasons: contrary to RNNs, computation can be parallelized; optimization is easier since the number of non-linearities is fixed and independent of the input length; and the resulting model outperforms the accuracy of the LSTM setup in Wu et al. (2016). In addition, capturing dependencies within a window of n elements takes O(n/k) convolution operations with kernels of width k, instead of the O(n) steps an RNN needs, thanks to the hierarchical structure.
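A quick back-of-the-envelope sketch shows where the n/k scaling comes from. It assumes plain stacked convolutions with stride 1 and no dilation, and the function names are ours, not the paper's.

```python
import math

def receptive_field(num_layers, kernel_width):
    """Number of input positions that one output position can see."""
    return 1 + num_layers * (kernel_width - 1)

def conv_layers_needed(window, kernel_width):
    """Smallest number of stacked layers whose receptive field covers `window`."""
    return math.ceil((window - 1) / (kernel_width - 1))

print(receptive_field(6, 5))      # 25
print(conv_layers_needed(25, 5))  # 6
```

So six stacked layers with kernels of width 5 connect elements that are 25 positions apart, whereas a recurrent network would need on the order of 25 sequential steps to relate the same two elements.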

Convolutions have been known to have several advantages since the early days, such as the ones presented by Waibel et al. (1989) and LeCun & Bengio (1995), but they only create representations for fixed-size contexts. RNNs made it possible to create representations for variable-size contexts, and LSTMs and GRUs tackled the problem of RNNs not capturing long-range dependencies. These were the reasons why RNNs became the standard for machine translation while CNNs became more widely adopted in fixed-size contexts such as image processing. The convolutional architecture presented by Gehring et al., however, tackles these problems and outperforms RNNs.

Convolutional Architecture

In this section we’ll summarize the fully convolutional architecture proposed by Gehring et al. The first step is embedding the input elements in a distributional space. To give the model a sense of order, the absolute position of each input element is embedded as well. Both vectors are then combined, and we proceed in the same way for the output elements that were already generated by the decoder network.

The combination is an element-wise sum of the two vectors, and the result represents the input embeddings in the source language.
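As a rough sketch, and assuming the two vectors are combined by an element-wise sum, the input representation could be built as follows. The lookup tables and their values are invented for illustration; in a real model both tables are learned.

```python
def embed_inputs(tokens, word_table, position_table):
    """Add each token's word embedding to the embedding of its absolute position."""
    embedded = []
    for position, token in enumerate(tokens):
        word_vec = word_table[token]
        pos_vec = position_table[position]
        embedded.append([w + p for w, p in zip(word_vec, pos_vec)])
    return embedded

# Hypothetical 2-dimensional lookup tables (made-up values).
word_table = {"my": [0.5, -1.0], "blog": [1.5, 0.25]}
position_table = [[0.0, 0.5], [0.25, 0.25]]

print(embed_inputs(["my", "blog"], word_table, position_table))
print(embed_inputs(["blog", "my"], word_table, position_table))
```

The same word ends up with a different vector depending on where it appears, which is what gives the otherwise order-blind convolutions their sense of position.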

Based on these input elements, intermediate states are computed for both the encoder and decoder networks. The computation of each of these states is called a block, and each block contains a one-dimensional convolution followed by a non-linearity. The non-linearity is the Gated Linear Unit proposed by Dauphin et al. (2016), which implements a gating mechanism over the output of the convolution. Ultimately, the softmax function is used to compute a distribution over the T possible next target elements.

Each convolution kernel takes X (a concatenation of k input elements embedded in d dimensions) as input and outputs Y, which has twice the dimensionality of the input elements (2d).
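The sketch below illustrates one such block for a single window, with k = 3 and d = 2. The weights are made-up numbers, and the gating follows the Gated Linear Unit definition: Y is split into two halves A and B, and the output is A multiplied element-wise by sigmoid(B), which brings the dimensionality back to d.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv_kernel(window, weights, bias):
    """Map k input vectors of size d (flattened to k*d values) to an output of size 2d."""
    flat = [x for vec in window for x in vec]
    return [sum(w * x for w, x in zip(row, flat)) + b
            for row, b in zip(weights, bias)]

def glu(y):
    """Gated Linear Unit: split y into halves A and B and return A * sigmoid(B)."""
    half = len(y) // 2
    a, b = y[:half], y[half:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

# k = 3 input elements, each embedded in d = 2 dimensions (toy values).
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical kernel parameters: 2d = 4 rows of k*d = 6 weights each.
weights = [[0.1 * (row + 1)] * 6 for row in range(4)]
bias = [0.0] * 4

y = conv_kernel(window, weights, bias)
state = glu(y)
print(len(y), len(state))
```

The second half of Y acts as a gate that decides how much of the first half is passed on to the next block.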

The top decoder output is transformed by a linear layer with weights Wo and bias bo before the softmax is applied.
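A minimal sketch of this output step, using a toy vocabulary of T = 3 target elements and invented values for Wo and bo:

```python
import math

def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(h, W_o, b_o):
    """Apply the linear layer W_o h + b_o, then the softmax."""
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_o, b_o)]
    return softmax(logits)

# Toy top decoder output and hypothetical output-layer parameters.
h = [1.0, 2.0]
W_o = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
b_o = [0.0, 0.0, 0.5]

probs = next_token_distribution(h, W_o, b_o)
print(probs)
```

Each entry of the result is the probability assigned to one candidate target element, and the entries sum to one.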

Then we proceed to compute attention. We combine the current decoder state with an embedding of the previous target element to get the current decoder state summary. The attention score for a given decoder layer, decoder state and source element is computed by taking the dot-product between the decoder state summary and the corresponding output of the last encoder block.
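Here is a small sketch of that computation for a single decoder state. The projection parameters W_d and b_d and all the vectors are toy values we made up, and the dot-product scores are normalized with a softmax over the source positions, which is how we understand the paper to turn scores into attention weights.

```python
import math

def decoder_summary(h, W_d, b_d, g):
    """Project the decoder state h and add g, the previous target element's embedding."""
    projected = [sum(w * x for w, x in zip(row, h)) + b
                 for row, b in zip(W_d, b_d)]
    return [p + gi for p, gi in zip(projected, g)]

def attention_weights(summary, encoder_outputs):
    """Dot-product score per source position, normalized with a softmax."""
    scores = [sum(s * z for s, z in zip(summary, enc)) for enc in encoder_outputs]
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values: one decoder state, an identity projection, three source positions.
h = [1.0, 0.0]
W_d = [[1.0, 0.0], [0.0, 1.0]]
b_d = [0.0, 0.0]
g = [0.5, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

summary = decoder_summary(h, W_d, b_d, g)
weights = attention_weights(summary, encoder_outputs)
print(summary, weights)
```

Source positions whose encoder output points in the same direction as the decoder state summary receive the larger weights.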

Next, the conditional input for the current decoder layer is computed as a weighted sum, using the attention weights, of the encoder outputs added to the input element embeddings. Once this is computed, it’s added to the output of the corresponding decoder layer.
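Continuing with toy numbers, the conditional input could be sketched like this. The attention weights are given by hand here rather than computed, and all vectors are illustrative.

```python
def conditional_input(attention, encoder_outputs, input_embeddings):
    """Weighted sum over source positions of (encoder output + input embedding)."""
    dim = len(encoder_outputs[0])
    context = [0.0] * dim
    for weight, enc, emb in zip(attention, encoder_outputs, input_embeddings):
        for i in range(dim):
            context[i] += weight * (enc[i] + emb[i])
    return context

def add_to_decoder_output(decoder_output, context):
    """Add the conditional input to the output of the decoder layer."""
    return [h + c for h, c in zip(decoder_output, context)]

# Hand-picked attention weights over two source positions (they sum to 1).
attention = [0.5, 0.5]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
input_embeddings = [[1.0, 1.0], [1.0, 1.0]]

context = conditional_input(attention, encoder_outputs, input_embeddings)
updated = add_to_decoder_output([0.5, -0.5], context)
print(context, updated)  # [1.5, 1.5] [2.0, 1.0]
```

Adding the input embeddings to the encoder outputs lets the decoder see both the wider context captured by the encoder and the specific source element at each position.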

Finally, a normalization strategy and a careful weight initialization are applied to ensure that the variance across the network doesn’t change dramatically, which stabilizes learning. With all of this in place, the model produces the desired translation in the target language.

Experimental Setup and Results

Facebook AI Research used three major WMT translation tasks to compare both architectures. Translations were benchmarked with the BLEU algorithm, which evaluates how closely a machine-translated text corresponds to a professional human translation. For English-Romanian, the convolutional architecture came out ahead by 1.8 BLEU. For English-French the difference was 1.5 BLEU, and for English-German it was 0.5 BLEU.
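To give a feel for what a BLEU point measures, here is a simplified sentence-level sketch with a single reference translation. It combines clipped n-gram precisions (up to 4-grams) with a brevity penalty and reports the score on a 0 to 100 scale. Published results use a corpus-level score computed with standard tooling, so this is only an illustration and will not reproduce the numbers above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference (counts clipped)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped precisions times a brevity penalty, scaled to 0-100."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return 100.0 * brevity * math.exp(log_mean)

reference = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), reference))  # 100.0
print(simple_bleu("a dog ran".split(), reference))               # 0.0
```

On this scale, a gain of 1.5 or 1.8 points is a measurable improvement in n-gram overlap with the human references.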