Published online 7 November 2006 | Nature | doi:10.1038/news061106-6

News

Program learns languages by comparing documents.

Software can learn to translate Chinese without any prior information. (Image: Punchstock)

Google has built an English translation tool for Chinese and Arabic texts — using a team that speaks neither of the two languages.

The system, which last week topped an international exercise to find the best Chinese and Arabic translation technology, is symbolic of a shift in approach to computer translation. Current software, such as the industry leader in Arabic-English translation, made by Cairo-based company Sakhr, draws on knowledge of vocabulary and grammar to translate documents.

But Google's software, which is available in an experimental version, learns to translate by comparing the same document in different languages — such as English and Arabic versions of a newspaper article.

Beginning from perfect ignorance, as its data set grows, Google's software learns to match strings of Arabic or Chinese characters to their English counterparts. This produces a raw translation, which the software tidies up by rearranging the words into fluent English using patterns it has learnt from studying English texts.
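The two stages described above can be illustrated with a toy sketch. It is not Google's system, whose details the article does not give: the first stage is swapped for a simple word-level expectation-maximization aligner (in the style of IBM Model 1), and the second for a crude bigram count over English text. The tiny German-English corpus, the function names and the scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations

def train_word_table(pairs, iterations=20):
    """Estimate table[f][e] ~ P(e | f) from sentence pairs by EM (IBM Model 1 style)."""
    tgt_vocab = {e for _, tgt in pairs for e in tgt.split()}
    src_vocab = {f for src, _ in pairs for f in src.split()}
    # Start from "perfect ignorance": every pairing is equally likely.
    t = {f: {e: 1.0 / len(tgt_vocab) for e in tgt_vocab} for f in src_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for src, tgt in pairs:
            fs, es = src.split(), tgt.split()
            for e in es:
                # Share each English word among the source words in its sentence.
                norm = sum(t[f][e] for f in fs)
                for f in fs:
                    frac = t[f][e] / norm
                    count[f][e] += frac
                    total[f] += frac
        # Re-estimate; pairs that never co-occur drop out of the table.
        t = {f: {e: count[f][e] / total[f] for e in count[f]} for f in count}
    return t

def train_bigrams(sentences):
    """Count adjacent word pairs in monolingual English text."""
    counts = defaultdict(int)
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            counts[(a, b)] += 1
    return counts

def tidy(words, bigrams):
    """Pick the ordering of `words` whose adjacent pairs were seen most often."""
    def score(order):
        seq = ["<s>", *order, "</s>"]
        return sum(bigrams.get((a, b), 0) for a, b in zip(seq, seq[1:]))
    return max(permutations(words), key=score)

def translate(src, table, bigrams):
    # Stage 1: raw word-for-word translation. Stage 2: reorder into fluent English.
    raw = [max(table[f], key=table[f].get) for f in src.split()]
    return " ".join(tidy(raw, bigrams))

# Hypothetical toy data: three translated sentence pairs plus a little English text.
PAIRS = [("das haus", "the house"), ("das buch", "the book"), ("ein buch", "a book")]
ENGLISH = ["the house", "the book", "a book", "a house"]

TABLE = train_word_table(PAIRS)
BIGRAMS = train_bigrams(ENGLISH)

if __name__ == "__main__":
    # A deliberately scrambled source order stands in for a language
    # whose word order differs from English.
    print(translate("buch das", TABLE, BIGRAMS))
```

In this sketch the aligner never sees a dictionary; it learns that "das" pairs with "the" only because the two keep appearing in the same sentence pairs. Trying every permutation is feasible only for very short sentences, and words absent from the training pairs would raise a `KeyError`.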

The approach can produce impressive results, but requires no knowledge of the languages involved. Philipp Koehn, a machine-translation expert at the University of Edinburgh, UK, who entered the evaluation using a similar approach to Google's, says that when he began working on software to translate Arabic into English his computer did not even have the software to display Arabic text.

Top dog

Statistical software such as Google's has begun to dominate the yearly evaluations, funded originally by the US Defense Advanced Research Projects Agency. The results from this year's comparison, organized by the National Institute of Standards and Technology (NIST) in Gaithersburg, Maryland, were released on 31 October.

Out of 40 entrants, Google finished top or joint top in all but one of the 36 different translation tasks. The NIST evaluation measures the degree to which each system's output matches a reference document produced by a human translator. Google's highest score, achieved for Arabic-to-English translations of newswire text, had 50% similarity to the reference, comparing favourably with the 60% that a different human translator of the same document might achieve.
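One common way to score how closely machine output matches a human reference is to count overlapping word sequences. The sketch below is a simplified illustration of that idea (clipped n-gram precision against a single reference); it is not the official NIST scoring procedure, and the function names and example sentences are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also appear in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference, so repeating a matching word does not inflate the score.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / total

if __name__ == "__main__":
    machine = "the cat sat on the mat"   # hypothetical system output
    human = "the cat is on the mat"      # hypothetical reference translation
    print(clipped_precision(machine, human, 1))
    print(clipped_precision(machine, human, 2))
```

A measure of this kind also explains why a second human translator does not score 100%: two good translations of the same text rarely use exactly the same words in the same order.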

Rivals attribute Google's dominance in part to its ability to recruit the leaders in machine translation. Franz Och left the University of Southern California in Los Angeles for the company's offices in Mountain View, California, two years ago. Och, who was already building top-ranking translation systems, had tens of machines at his disposal in academia, recalls Peter Norvig, Google's director of research. "I asked him if he could do better with a couple of thousand."

At Google, Och also had access to huge numbers of translated documents. Many research groups draw on sources such as the European Union and the United Nations, which translate diplomatic documents. Some academic organizations also collect translations from sources such as newspapers and share them with other researchers.


But Google can also draw on the documents it collects as it indexes the Web. This gives it a particular advantage in the second, tidying-up stage of translation, experts say.

The result is a system that can begin to resemble human translation if it is given enough examples of a particular style of text — such as newspaper stories — to study. But all systems still struggle with informal writing, such as newsgroup conversations.

In the long run, experts expect statistical and rules-based approaches to be combined. In German, for example, verbs often come at the end of sentences, which can trip up statistical approaches such as Google's. Applications Technologies of McLean, Virginia, have developed a system that uses such a hybrid approach; it was the sole system to relegate Google to second place in any test.
