Imagine discovering a secret language spoken only online by a knowledgeable and learned few. Over a period of weeks, as you begin to tease out the meaning of this curious tongue and ponder its purpose, the language appears to shift in subtle but fantastic ways, remaking itself daily before your eyes. And just when you are poised to share your findings with the rest of the world, the entire thing vanishes.

This fairly describes my roller coaster of curiosity, wonder and disappointment over the past few weeks, as I’ve worked alongside security researchers in an effort to understand how “lorem ipsum,” the common placeholder text on countless Web sites, could be transformed into so many apparently geopolitical and startlingly modern phrases when translated from Latin to English using Google Translate. (If you have no idea what “lorem ipsum” is, skip ahead to the brief primer under “A Brief History of Lorem Ipsum.”)

Admittedly, this blog post would make more sense if readers could fully replicate the results described below using Google Translate. However, as I’ll explain later, something important changed in Google’s translation system late last week that currently makes the examples I’ll describe impossible to reproduce.

CHINA, NATO, SEXY, SEXY

It all started a few months back when I received a note from Lance James, head of cyber intelligence at Deloitte. James pinged me to share something discovered by FireEye researcher Michael Shoukry and another researcher who wished to be identified only as “Kraeh3n.” They noticed a bizarre pattern in Google Translate: When one typed “lorem ipsum” into Google Translate, the default results (with the system auto-detecting Latin as the language) returned a single word: “China.”

Capitalizing the first letter of each word changed the output to “NATO” — the acronym for the North Atlantic Treaty Organization. Reversing the words in both lower- and uppercase produced “The Internet” and “The Company” (the “Company” with a capital “C” has long been a code word for the U.S. Central Intelligence Agency). Repeating and rearranging the word pair with a mix of capitalization generated even stranger results. For example, “lorem ipsum ipsum ipsum Lorem” generated the phrase “China is very very sexy.”
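For readers who want to run this kind of experiment systematically, the input variations described above (capitalization, word order and repetition) can be enumerated with a few lines of Python. This is a minimal sketch, not the researchers’ actual tooling: the function names are hypothetical, and the code only builds the input strings without contacting any translation service.

```python
from itertools import product


def case_variants(word):
    """Return the lowercase and capitalized forms of a word."""
    return [word.lower(), word.capitalize()]


def phrase_variants(words, max_len=3):
    """Yield every ordering, repetition and capitalization mix
    of the given words, for phrases of 1 to max_len words."""
    pool = [v for w in words for v in case_variants(w)]
    for n in range(1, max_len + 1):
        for combo in product(pool, repeat=n):
            yield " ".join(combo)


# All one- and two-word inputs built from "lorem" and "ipsum".
variants = list(phrase_variants(["lorem", "ipsum"], max_len=2))
```

Raising `max_len` to 5 would cover longer inputs such as “lorem ipsum ipsum ipsum Lorem,” at the cost of a list that grows fourfold with each added word.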

Kraeh3n said she discovered the strange behavior while proofreading a document for a colleague, a document that had the standard lorem ipsum placeholder text. When she began typing “l-o-r..e..” and saw “China” as the result, she knew something was strange.

“I saw words like Internet, China, government, police, and freedom and was curious as to how this was happening,” Kraeh3n said. “I immediately contacted Michael Shoukry and we began looking into it further.”

And so the duo started testing the limits of these two words using a mix of capitalization and repetition. Below is just one of many pages of screenshots taken from their results:

The researchers wondered: What was going on here? Has someone outside of Google figured out how to map certain words to different meanings in Google Translate? Was it a secret or covert communications channel? Perhaps a form of communication meant to bypass the censorship erected by the Chinese government with the Great Firewall of China? Or was this all just some coincidental glitch in the Matrix?

For his part, Shoukry checked in with contacts in the U.S. intelligence industry, quietly inquiring if divulging his findings might in any way jeopardize important secrets. Weeks went by and his sources heard no objection. One thing was for sure: the results were subtly changing from day to day, and it wasn’t clear how long these two common but obscure words would continue to produce the same results.

“While Google translate may be incorrect in the translations of these words, it’s puzzling why these words would be translated to things such as ‘China,’ ‘NATO,’ and ‘The Free Internet,’” Shoukry said. “Could this be a glitch? Is this intentional? Is this a way for people to communicate? What is it?”

When I met Shoukry at the Black Hat security convention in Las Vegas earlier this month, he’d already alerted Google to his findings. Clearly, it was time for some intense testing, and the clock was already ticking: I was convinced (and, unfortunately, correct) that many of these odd translations would disappear at any moment.

A BRIEF HISTORY OF LOREM IPSUM

Search the Internet for the phrase “lorem ipsum,” and the results reveal why this strange phrase has such a core connection to the lexicon of the Web. Its modern origins are murky, but according to multiple sites that have attempted to chronicle the history of this word pair, “lorem ipsum” was taken from a scrambled and altered section of “De finibus bonorum et malorum” (“On the Ends of Good and Evil”), a 1st-century B.C. Latin text by the great orator Cicero.

According to Cecil Adams, curator of the Internet trivia site The Straight Dope, the text from that Cicero work was available for many years on adhesive sheets in different sizes and typefaces from a company called Letraset.

“In pre-desktop-publishing days, a designer would cut the stuff out with an X-acto knife and stick it on the page,” Adams wrote. “When computers came along, Aldus included lorem ipsum in its PageMaker publishing software, and you now see it wherever designers are at work, including all over the Web.”

This pair of words is so common that many Web content management systems deploy it as default text. Case in point: lorem ipsum even shows up on healthcare.gov. According to a story published Aug. 15 in the Daily Mail, more than a dozen apparently dormant healthcare.gov pages carry the dummy text. (If you skipped ahead to this section, head back to the top to pick up the story.)

FURTHER TESTING

Things began to get even more interesting when the researchers started adding other words from the Cicero text from which the “lorem ipsum” bit was taken, including: “Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit . . .” (“There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain …”).

Adding “dolor” and “sit” and “consectetur,” for example, produced even more bizarre results. Translating “consectetur Sit Sit Dolor” from Latin to English produces “Russia May Be Suffering.” “sit sit dolor dolor” translates to “He is a smart consumer.” An example of these sample translations is below:

Latin is often dismissed as a “dead” language, and whether or not that is fair or true, it seems pretty clear that there should not be Latin words for “cell phone,” “Internet” and other mainstays of modern life in the 21st century. However, this incongruity helps to shed light on one possible explanation for such odd translations: Google Translate simply doesn’t have enough Latin texts available to have thoroughly learned the language.

In an introductory video titled Inside Google Translate, Google explains how the translation engine works, the sources of the engine’s intelligence, and its limitations. According to Google, its Translate service works “by analyzing millions and millions of documents that have already been translated by human translators.” The video continues:

“These translated texts come from books, organizations like the United Nations, and Web sites from all around the world. Our computers scan these texts looking for statistically significant patterns. That is to say, patterns between the translation and the original text that are unlikely to occur by chance. Once the computer finds a pattern, you can use this pattern to translate similar texts in the future. When you repeat this process billions of times, you end up with billions of patterns, and one very smart computer program.”

Here’s the rub:

“For some languages, however, we have fewer translated documents available, and therefore fewer patterns that our software has detected. This is why our translation quality will vary by language and language pair.”

Still, this doesn’t quite explain why Google Translate would include so many references specific to China, the Internet, telecommunications, companies, departments and other odd couplings in translating Latin to English.

In any case, we may never know the real explanation. Just before midnight, Aug. 16, Google Translate abruptly stopped translating the word “lorem” into anything but “lorem” from Latin to English. Google Translate still produces amusing and peculiar results when translating Latin to English in general.

A spokesman for Google said the change was made to fix a bug with the Translate algorithm (aligning ‘lorem ipsum’ Latin boilerplate with unrelated English text) rather than a security vulnerability.
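To see how boilerplate could end up aligned with unrelated English text, consider a toy statistical model. The sketch below is an illustration only and is not Google’s algorithm: the tiny parallel corpus is invented, the function names are hypothetical, and the scoring uses a simple Dice coefficient over word co-occurrence. It assumes a few “Latin” pages that contain nothing but placeholder text while their English counterparts contain real content.

```python
from collections import Counter
from itertools import product

# Invented parallel corpus of (source, target) page pairs. The first
# three "Latin" pages are placeholder text whose English versions
# happen to be real pages mentioning China.
CORPUS = [
    ("lorem ipsum", "china internet"),
    ("lorem ipsum", "china company"),
    ("lorem ipsum dolor", "china pain"),
    ("dolor magnus", "great pain"),
]


def train(corpus):
    """Count, per page pair, how often words appear and co-occur."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count


def best_translation(word, model):
    """Pick the target word with the highest Dice score for `word`.
    Words never seen in training are returned unchanged."""
    src_count, tgt_count, pair_count = model
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else word


model = train(CORPUS)
```

With this made-up data, `best_translation("lorem", model)` picks “china,” because the placeholder word only ever appears alongside pages about China, while a word with a genuine translation in the corpus, such as “dolor,” still maps to “pain.” The fewer real translations a model has seen, the more weight a handful of mismatched pages like these can carry.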

Kraeh3n said she’s convinced that the lorem ipsum phenomenon is not an accident or chance occurrence.

“Translate [is] designed to be able to evolve and to learn from crowd-sourced input to reflect adaptations in language use over time,” Kraeh3n said. “Someone out there learned to game that ability and use an obscure piece of text no one in their right mind would ever type in to create totally random alternate meanings that could, potentially, be used to transmit messages covertly.”

Meanwhile, Shoukry says he plans to continue his testing for new language patterns that may be hidden in Google Translate.

“The cleverness of hiding something in plain sight has been around for many years,” he said. “However, this is exceptionally brilliant because these templates are so widely used that people are desensitized to them, and because this text is so widely distributed that no one bothers to question why, how and where it might have come from.”
