Lab41 recently created a novel dataset for the 2020 Conference on Machine Translation (WMT), co-located with the upcoming Conference on Empirical Methods in Natural Language Processing (EMNLP). We focused on Russian→English translation of colloquial text, complementing other researchers’ efforts across six other language pairs (high-, medium-, and low-resource).

With seven language pairs, this year’s quality estimation task has more than three times the data used in last year’s shared task! That is tremendous progress in an important and fast-growing subfield of machine translation (MT). Because these datasets require human labeling and linguistic expertise for every sentence and language pair, they are very time-intensive to create.

The Hermitage Museum in St. Petersburg, where Russian is the official language

These datasets were created to train machine translation quality estimation models that do not rely on human reference translations. As machine translation becomes increasingly accurate, quality estimation is key to identifying mistranslations and driving further improvement. And for automated post-editing (a method for correcting errors produced by an unknown MT system), any deficiency in identifying errors propagates as the algorithm modifies the translations, making strong quality estimation a crucial first step.
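To make the idea concrete, here is a toy sketch of reference-free quality estimation: a regressor that predicts a DA-style score from features of the source sentence and MT output alone, with no human reference in sight. The features and training pairs below are invented for illustration; real QE systems build on pretrained multilingual encoders rather than hand-crafted features.

```python
# A toy sketch of reference-free quality estimation, not a real system:
# predict a DA-style score (0-100) from the (source, MT output) pair alone.
# The features and training triples below are invented for illustration.

from sklearn.ensemble import GradientBoostingRegressor

def features(source, mt_output):
    src, mt = source.split(), mt_output.split()
    return [
        len(mt) / max(len(src), 1),                # length ratio
        len(set(mt)) / max(len(mt), 1),            # type/token ratio
        sum(len(t) for t in mt) / max(len(mt), 1), # mean token length
    ]

# Hypothetical (source, MT output, DA score) training triples.
train = [
    ("Эрмитаж сегодня открыт", "The Hermitage is open today", 92.0),
    ("музей закрыт по понедельникам", "museum are closing of mondays", 38.0),
    ("билеты можно купить онлайн", "tickets can be bought online", 88.0),
    ("вход со двора", "entrance with the yard", 45.0),
]

X = [features(src, mt) for src, mt, _ in train]
y = [score for _, _, score in train]
model = GradientBoostingRegressor().fit(X, y)

print(model.predict([features("кассы работают до шести", "cash desks work till six")]))
```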

We hope this dataset drives progress in the field, and we encourage you to use it for this year’s quality estimation task at WMT. This year’s shared task includes the high-resource language pairs Russian→English, English→German, and English→Chinese. The medium-resource pairs are Romanian→English and Estonian→English, and the low-resource pairs are Sinhala→English and Nepali→English.

The quality scores used in these datasets are all Direct Assessment (DA) scores, rather than the Human-Targeted Translation Edit Rate (HTER) scores of previous years. HTER reflects the minimal number of edits required to turn the MT output into an accurate translation of the source text. In contrast to HTER’s word-level approach, DA scores are a more holistic measure of a translation’s quality. Consider a scenario where an error in a single word completely changes the meaning of the sentence: the HTER rating would still look rather good, since only one edit is needed, whereas the DA score would capture the erroneous term’s functional effect on translation quality.
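The contrast is easy to see in code. Below is a minimal sketch (not the official WMT scoring implementation) of an HTER-style score: token-level edit distance between the MT output and a human-targeted reference, normalized by reference length. A single meaning-flipping word barely moves the number.

```python
# Minimal sketch of an HTER-style score: token-level Levenshtein distance
# against a human-targeted reference, divided by reference length.
# 0.0 = perfect; higher = worse.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (rolling row)."""
    dp = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tok_b in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                 # delete a token from the MT output
                dp[j - 1] + 1,             # insert a reference token
                prev + (tok_a != tok_b),   # substitute (free if tokens match)
            )
    return dp[-1]

def hter(mt_output, reference):
    mt, ref = mt_output.split(), reference.split()
    return edit_distance(mt, ref) / len(ref)

# One wrong word flips the meaning, yet the edit rate stays small:
ref = "the museum is not open on mondays"
mt  = "the museum is now open on mondays"
print(hter(mt, ref))  # ~0.14 -- "good" by HTER, yet the meaning is inverted
```

A DA annotator, by contrast, would rate the hypothesis low on the 0–100 scale, because the translation now asserts the opposite of the source.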

Examples

Here are a few samples from the training dataset, highlighting the range of DA translation quality scores (0–100). Note that, while the dataset does not include a ground truth (human) translation, I have included one below for non-Russian speakers.
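If you would like to browse beyond the handful of samples shown here, a few lines of pandas will do. Note that the file name and column names in this sketch are assumptions about a typical QE TSV layout, not a documented schema, so check the README that ships with the data.

```python
# Browse the training TSV with pandas. NOTE: the file name and column
# names below are assumptions, not a documented schema -- consult the
# README included with the data release.

import pandas as pd

df = pd.read_csv("train.ruen.df.short.tsv", sep="\t", quoting=3)  # quoting=3: ignore quote chars

# Assumed columns: "original" (Russian source), "translation" (MT output),
# "mean" (mean DA score across annotators, 0-100).
worst = df.sort_values("mean").head(3)
for _, row in worst.iterrows():
    print(f"{row['mean']:.1f}\t{row['original']}\t{row['translation']}")
```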