The man versus machine translation debate goes as far back as the 1950s, possibly even earlier, and it is both fascinating and tiring. The rapid progress of neural machine translation (NMT) over the past two years has brought on a resurgence of discussion, leading even big tech companies like Microsoft to publish research papers with bold (if not misleading) titles like “Achieving Human Parity on Automatic Chinese to English News Translation.”

Granted, the Microsoft authors did temper their claims. According to their paper, human parity is achieved “if there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations.”

In other words, if a bilingual human evaluator judges the quality of human and machine translations as equal (the difference in scores is statistically insignificant), “then the machine has achieved human parity.”
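To make that definition concrete, here is a minimal sketch of how such a check could look in Python. It assumes a paired sign-flip permutation test on per-sentence scores and a 0.05 significance threshold; both are illustrative choices, not necessarily the test or threshold the Microsoft authors used, and the function names are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_p_value(human_scores, machine_scores,
                               n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Illustrative only: the choice of test is an assumption of this sketch.
    """
    if len(human_scores) != len(machine_scores):
        raise ValueError("score lists must have the same length")
    diffs = [h - m for h, m in zip(human_scores, machine_scores)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


def parity_by_definition(human_scores, machine_scores, alpha=0.05):
    """True when no statistically significant score difference is found."""
    return paired_permutation_p_value(human_scores, machine_scores) >= alpha
```

Note what the definition rewards: "parity" here means only that the test failed to find a difference, so a small or noisy sample of ratings makes it easier to reach.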


Enter Läubli, Sennrich, and Volk

Now, a group of researchers argues that many of their peers and industry experts have been looking at the issue from the wrong angle.

In a paper titled “Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation,” Samuel Läubli, PhD candidate at the University of Zurich, and co-authors Dr. Rico Sennrich, Assistant Professor at the University of Edinburgh’s School of Informatics, and Dr. Martin Volk of the Institute of Computational Linguistics at the University of Zurich argue that research should focus on document-level context instead of comparing output at the sentence level.

The authors had professional human translators evaluate the output of the NMT engine Microsoft used on the Conference on Machine Translation (WMT) 2017 Chinese to English news task. They used pairwise ranking (side-by-side comparison of human versus machine translation) and took document-level context into account when gauging both translation adequacy and fluency.

Microsoft’s human parity claim, by their definition, held water, but only because they used current MT research standards, which, according to Läubli’s paper, have become unsuitable for effectively evaluating NMT.

Läubli, Sennrich, and Volk’s methodology corrected a few problems with the evaluation used in the Microsoft research paper.

“Knowing about strengths and weaknesses of NMT, we could hardly imagine that [Microsoft’s] system had really reached the quality of professional human translators,” Läubli told Slator via email.

He explained that Microsoft followed current research standards in their methodology, where usually, “raters see single sentences – one by one, from any of the test documents, in random order – and rate their adequacy and fluency on a scale from 0 to 100.”

However, in this process, Läubli said it would be “impossible” for evaluators to detect certain translation errors, and thus they were unable to properly take these into account.

He pointed out some of the major problems in Microsoft’s process, among others:

Evaluators were bilingual crowd workers, not necessarily professional translators.

Evaluators only assessed adequacy, not fluency.

Evaluators “never directly compared human to machine translation.” They looked at them separately and assigned scores.

To address the direct comparison problem, Läubli said “we used pairwise ranking in our experiments. Raters always saw human and machine translation of a certain source text at the same time, and chose the better of the two.”
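Pairwise ranking produces counts of preferences rather than scores, so a different kind of significance check applies. The sketch below, in Python, assumes an exact two-sided sign test with ties discarded, which is one common way to analyze such data. The rating labels ("human", "mt", "tie") and the function names are hypothetical and are not taken from the paper's data.

```python
from collections import Counter
from math import comb


def count_preferences(ratings):
    """Count how often raters preferred each side; ties are ignored.

    `ratings` is a list of hypothetical labels: "human", "mt", or "tie".
    """
    counts = Counter(ratings)
    return counts["human"], counts["mt"]


def sign_test_p_value(human_wins, machine_wins):
    """Exact two-sided sign test under the null of no preference (p = 0.5)."""
    n = human_wins + machine_wins
    if n == 0:
        return 1.0
    k = min(human_wins, machine_wins)
    # Probability of a split at least this lopsided in one direction...
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # ...doubled for the two-sided test and capped at 1.
    return min(1.0, 2 * tail)


# Illustrative, made-up ratings (not results from the paper).
example = ["human"] * 10 + ["tie"] * 3
wins_human, wins_mt = count_preferences(example)
p_value = sign_test_p_value(wins_human, wins_mt)
```

Because each rater sees both versions at once, a consistent preference shows up directly in the win counts instead of being diluted across two separately assigned scores.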

“Let me assure you that the MT community doesn’t think that NMT reached the level of professional translators yet,” he concluded.

Not Microsoft’s Fault

Läubli, Sennrich, and Volk’s results point out a few interesting things.

One major finding was that professional human translators strongly preferred human translations compared to NMT output when provided with the context of the entire document and not just single sentences.

Furthermore, the same professional evaluators preferred the fluency of human translations. There was no statistically significant preference either way when it came to adequacy at the sentence level, however.

Document-level context is currently a priority for NMT research. It is one of the next major problems discussed in Slator’s NMT Report 2018 and highlighted by the subject matter experts interviewed, who included Läubli and Sennrich.

“It’s not their fault,” Läubli told Slator, referring to Microsoft. “The procedure they used is standard practice in the MT community.”

“Microsoft isn’t to blame for their system evaluation. It followed ‘best practice’ in the community based on evaluating sentences, not entire documents, and we’re arguing that MT has now reached a level of quality where this ‘best practice’ needs to change: we should use full documents to judge MT quality,” he said.

Indeed, in the conclusion to their paper, the authors wrote that “if we accept our interpretation that human translation is indeed of higher quality in the dataset we tested, this points to a failure of current best practices in machine translation evaluation.”

In his email, Läubli did add, however, that Microsoft’s team could have handled the title a bit better. “The title of their paper was a bit bold,” he said. “It should have read something like: Bilingual non-professionals give isolated sentences produced by our system and professional translators similar scores.”

NMT Evaluation Needs to Change

In their paper’s conclusion, Läubli, Sennrich, and Volk explain that NMT is currently at a level of fluency where BLEU (bilingual evaluation understudy) scores based on a single model translation, and even evaluations of sentence-level output by non-professional human translators, are no longer enough.

“As machine translation quality improves, translations will become harder to discriminate in terms of quality, and it may be time to shift towards document-level evaluation, which gives raters more context to understand the original text and its translation,” the paper’s conclusion read. It further explained that document-level evaluation shows translation errors otherwise “invisible” in a sentence-level evaluation.

Läubli advised caution when presenting breakthroughs in MT research. “Spreading rumours about human parity is dangerous for both research and practice: funding agencies may not want to fund MT research anymore if they think that the problem is ‘solved’ and translation managers are not going to be willing anymore to have professionals revise MT output at all,” he said.

Läubli’s team is not the first to point out that current MT research community standards need to change.

In Slator’s NMT Report 2018, experts pointed out the limitations of current BLEU scoring standards, and offered some better alternatives. In his own research paper, Professor Andy Way, Deputy Director of ADAPT Centre for Digital Content Technology, said “n-gram-based metrics such as BLEU are insufficient to truly demonstrate the benefits of NMT over [phrase-based, statistical, and hybrid] MT.”

“If NMT does become the new state-of-the-art as the field expects, one can anticipate that further new evaluation metrics tuned more precisely to this paradigm will appear sooner rather than later,” Way wrote in his paper.
