
Meet google_translate_diff, a Ruby gem for everyone who uses the Google Translation API to handle long texts on multilingual websites. See how it helps us spend three times less on machine translations at eBaymag.

Spoiler: it has to do with NLP and caching.

No humans involved

Thanks to breakthroughs in AI, automated translation services keep improving at a steady pace. In some cases, Google Cloud Translation produces texts that are indistinguishable from human work. Product descriptions are a perfect example: their translations do not have to be creative, they just have to be exact.

Same product on global and French versions of eBay. Descriptions translated automatically by Google.

At eBaymag, vendors can place their products on multiple international versions of eBay with a click of a button. Every product description is translated into five languages on the fly. There is a downside to this magic though:

Every small change in a single product description (fixing a typo or changing attributes) triggers a retranslation of the whole text into all supported languages.

Google charges us per character. That means we have to pay for several pages of text every time a user changes a single character. There must be a way to handle this more efficiently. Let’s find it!

Doing some math

A typical product description looks like this:

"There are 6 pcs <b> Neumann Gefell </b> tube mics MV 101 with MK 102 capsule. It is working with much difference capsules from neumann / gefell.

Additionally…”

On a given day we have:

Total characters: 41,458,854 (525,682,297 including raw HTML);

Duplicates: 25,084,381 characters;

Average description: 1,774 characters;

Median length: 1,140 characters.

Every minor change makes us pay for 1,140 * 5 = 5,700 characters: the median description length times five target languages.

We have also noticed that around 3% of descriptions change daily. So we end up paying for 5 to 6 million characters every day, or around 180 million characters every month. Google translates that into a $3,600 bill.
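To sanity-check the numbers, the arithmetic above can be sketched in a few lines of Ruby. The $20-per-million-characters rate is an assumption based on Google's published Cloud Translation pricing at the time:

```ruby
# Rough monthly cost of naive retranslation, using the day's stats.
TOTAL_CHARS    = 41_458_854       # all descriptions, markup stripped
AVG_LENGTH     = 1_774            # average description length
LANGUAGES      = 5
CHURN          = 0.03             # ~3% of descriptions change daily
PRICE_PER_CHAR = 20.0 / 1_000_000 # assumed: $20 per 1M characters

descriptions    = TOTAL_CHARS / AVG_LENGTH # ≈ 23,370 descriptions
chars_per_day   = (descriptions * CHURN).round * AVG_LENGTH * LANGUAGES
chars_per_month = chars_per_day * 30

puts "Characters per day:   #{chars_per_day}"   # ≈ 6.2 million
puts "Characters per month: #{chars_per_month}" # ≈ 186 million
puts "Monthly bill:         $#{(chars_per_month * PRICE_PER_CHAR).round}"
```

The result lands in the same ballpark as the $3,600 bill above.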

Saving some money

Another observation we have made:

Descriptions from a single seller often share the same boilerplate: shipping policies, terms of service, etc. Descriptions from different sellers may also share fragments if they are copied from the same external source.

It looks like we need a way to extract identical fragments from descriptions and cache their translations. A basic structural element of a text is a sentence. After crunching more data, we found out that we have:

964,455 sentences;

Only 180,791 of them are unique;

That makes for 9,807,456 characters of unique content.

41.5 million vs. 9.8 million characters. Looks promising!
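The measurement above boils down to comparing the total length of all sentences against the length of the unique ones. In Ruby terms (with made-up toy sentences):

```ruby
# Estimating the savings: with caching, each distinct sentence
# is paid for only once, no matter how often it repeats.
sentences = [
  "Ships worldwide.",            # boilerplate repeated across listings
  "Ships worldwide.",
  "Tube mic in mint condition.",
]

total_chars  = sentences.sum(&:length)      # what we pay without caching
unique_chars = sentences.uniq.sum(&:length) # what we pay with caching

puts "#{unique_chars} of #{total_chars} characters are unique"
```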

Also, most of our descriptions contain poorly generated HTML that looks like this:

```css
.png");
}
#navbar {
  position: absolute;
  top: 86px;
}
#efooter {
  background: none repeat scroll 0 0 #EAEAEA;
  margin: auto;
  text-align: center;
}
#footwrap {
  color: #666666;
  font-family: 'Roboto';
  font-size: 12px;
  font-weight: bold;
  margin: 14px auto auto;
  padding: 20px;
  text-align: center;
  width: 980px;
}
#toplinks a {
  text-decoration: none;
  margin: 0px 5px;
}
.yuinav {
  border-left: 1px solid #CCCCCC;
  float: left;
  margin: 10px 0 0;
  padding: 0;
}
.yuinav li {
  cursor: pointer;
}
.yuinav li {
```

Is there anything we can do about it? For our initial analysis, we stripped all markup with ActionView’s #strip_tags. In a real application, though, we need to keep the tags. Although you can translate HTML with Google, splitting the resulting character soup into sentences is next to impossible. Also, changes in markup should not lead to retranslation.

Only text changes matter, so we need to consider markup separately from the content.

What if we could have something like the following?

[[ "There are 6 pcs " , :text ], [ "<b>" , :markup ], [ "Neumann Gefell" , :text ], [ "</b>" , :markup ], [ " tube mics MV 101 with MK 102 capsule. " , :text ], [ "It is working with much difference capsules from neumann / gefell.

" , :text ], [ "Additionally…" , :text ]]

That way, every text chunk can be handled separately: markup will be cached right away, text sentences—translated and then cached.
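A naive, regex-based sketch of what such a tokenizer does (only to illustrate the shape of the output; the gem itself parses HTML properly):

```ruby
# Split HTML into alternating markup and text chunks.
# Capture group in the regex keeps the tags in the result.
def tokenize(html)
  html.split(/(<[^>]+>)/).reject(&:empty?).map do |chunk|
    [chunk, chunk.start_with?("<") ? :markup : :text]
  end
end

p tokenize("There are 6 pcs <b>Neumann Gefell</b> tube mics")
# => [["There are 6 pcs ", :text], ["<b>", :markup],
#     ["Neumann Gefell", :text], ["</b>", :markup],
#     [" tube mics", :text]]
```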

The ox gem is an excellent, fast SAX parser, and the punkt-segmenter gem is perfect for sentence splitting.

To achieve this, we ended up writing a special tokenizer. Now we can:

Immediately tell text from markup;

Translate only sentences we have never seen before. Prior translations will be loaded from cache;

Join markup and text back together after we are done.
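The second step is essentially a cache keyed by a digest of the sentence text. A minimal sketch, with the Google API call stubbed out (not the gem's real internals):

```ruby
require "digest"

# Translate a chunk, hitting the cache when this exact text was seen before.
def translate_chunk(text, cache, &api_call)
  cache[Digest::SHA1.hexdigest(text)] ||= api_call.call(text)
end

cache = {}
calls = 0
fake_api = ->(text) { calls += 1; "[translated: #{text}]" } # stand-in for Google

translate_chunk("Ships worldwide.", cache, &fake_api)
translate_chunk("Ships worldwide.", cache, &fake_api) # cache hit, no API call

puts calls # => 1
```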

From an idea to a gem

We have decided to share our tool with the world: google_translate_diff is now open source, and you can use it too. Here’s how:

s = "There are 6 pcs <b>Neumann Gefell</b> tube mics MV 101 with MK 102 capsule. It is working with much difference capsules from neumann / gefell.

Additionally…" GoogleTranslateDiff . translate ( s , from: :en , to: :es ) # => Tokenize [ [ "There are 6 pcs " , :text ], [ "<b>" , :markup ], [ "Neumann Gefell" , :text ], [ "</b>" , :markup ], [ " tube mics MV 101 with MK 102 capsule." , :text ], [ "It is working ... / gefell.

" , :text ], # NOTE: Separate sentence [ "Additionally…" , :text ] # NOTE: Also, separate sentence ] # => Load from cache and translate missing pieces [ [ "Ci sono 6 pezzi " , :text ], # <== cache [ "<b>" , :markup ], [ "Neumann Gefell" , :text ], # <== Google ==> cache [ "</b>" , :markup ], [ " Tubi MV 101 con ... " , :text ], # <== Google ==> cache [ "Sta lavorando cn ... / gefell.

" , :text ], # <== cache [ "Inoltre…" , :text ] # <== cache ] # => Join back "Ci sono 6 pezzi <b>Neumann Gefell</b> Tubi MV 101 con capsula MK 102. Sta lavorando con molte capsule di differenza da neumann / gefell.

Inoltre"

Trade-offs

As always, there is no silver bullet, so here are a few things to take into consideration:

Casing: If you try to translate something like ["Sequence", "of words"] into German via the Google API, you will get ["Sequenz", "Der Worte"]. Because Google treats every element of an array as a separate sentence, the definite article “der” is capitalized, which is wrong.

Context loss: Context is important to Google. When the text is split into small parts, context is lost, and translation quality drops in some cases.

Losing punkt-segmenter trained data between requests: The library’s README recommends training the splitting algorithm separately for every language and marshaling the trained data to a file. Currently, this is not happening, but it can be easily implemented. PRs are welcome!
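Persisting trained data between runs could look roughly like this; the trained object here is a placeholder, not punkt-segmenter’s actual data structure:

```ruby
require "tempfile"

# Stand-in for trained segmenter parameters (hypothetical shape).
trained = { abbreviations: %w[pcs no vol] }

# Marshal the trained data to a file once...
file = Tempfile.new("punkt_en")
File.binwrite(file.path, Marshal.dump(trained))

# ...and load it back on subsequent requests instead of retraining.
restored = Marshal.load(File.binread(file.path))

puts restored == trained # => true
```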

Our average monthly volume dropped from 180–200 million to 58 million characters, without any significant decrease in translation quality. We ended up spending three times less: $1,200 instead of $3,600.

Keep in mind that the same approach will work for any translation service that bills you per character. Feel free to fork the google_translate_diff repository and adapt it to your use case!