Unlocking the Black Box of Translation Memory Files

Did you know that a translation memory (TM) file is just a structured text file? Maybe you saw the file extension (.tmx or .xliff) and assumed it could only be opened with special software. In reality, all you need is a standard text editor (such as TextEdit on Mac or Notepad on Windows). You also don't need a PhD or a computer science degree to read it or understand its structure. For the most part, a translation memory file is straightforward and functional. It is not a black box.

So what exactly is a translation memory file?

A translation memory file holds translation and linguistic data in a structured format. More specifically, it is typically an XML (Extensible Markup Language) file: still plain text, but with a well-defined structure that can represent complicated data.

What type of information does it store?

The main info:

Segments (source and target)

Language

Creation dates and times

Additional data it might store:

Author

Usage count

Change dates and times

Creation tool

Domain (field)

Alternate translations

Notes

What are the typical formats of translation memory files?

The two most popular file types in the industry are XLIFF and TMX, both of which are XML files. However, translation memory can also be stored in spreadsheet files such as Excel (XLS) or even in plain comma-separated values (CSV) text files. Although XLS and CSV files tend to be smaller, the downside is that they store less data about each translation unit (typically only the segment and its language).
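To make that trade-off concrete, here is a minimal Python sketch that reads a CSV-style translation memory. The two-column layout, with language codes as the header row, is an assumption made for illustration; CSV has no standard layout for translation memory, so real exports may look different.

```python
import csv
import io

# Hypothetical CSV export: the header row holds language codes,
# and each later row holds one segment pair.
CSV_TEXT = "EN-US,JA-JP\nThis is a pen.,これはペンです。\n"

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

for row in rows:
    # Each row maps a language code to segment text and nothing else:
    # no creation date, author, or tool information is available.
    print(row)
```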

Why do XLIFF and TMX both use the XML format? There are a few advantages XML provides over a raw text file:

It is easy to parse because it has a well-defined structure.

The structure and tags of an XML file often help indicate what the data means (e.g. semantic tags such as <segment> and <trans-unit>).

There are many software tools built around the XML format to validate, import, parse, search, etc.

Different applications and systems can interact and exchange data because an XML file typically has a well-defined structure.
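As a small illustration of the tooling point above, the sketch below uses Python's built-in xml.etree.ElementTree module to check whether a piece of text is well-formed XML. The function name is_well_formed is made up for this example, and the check only covers well-formedness (matching tags and so on), not validation against the TMX or XLIFF specification.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<tu><seg>This is a pen.</seg></tu>"
bad = "<tu><seg>This is a pen.</tu>"  # <seg> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```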

Let's break it down

What's in a translation memory file? First comes the header, which contains metadata about the file and the localization process. Let's look at an example header for each file type. As you will see, the semantic naming of XML tags makes the files human-readable: even without reading the actual specification, you can probably understand most of the fields.

TMX Header

<?xml version="1.0"?>
<tmx version="1.4">
  <header
    creationtool="TRADOS Translator's Workbench for Windows"
    creationtoolversion="Edition 8 Build 863"
    segtype="sentence"
    adminlang="EN-US"
    srclang="EN-US"
    creationdate="20131117T140541Z"
  >
  </header>
</tmx>
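Because the header is ordinary XML, a few lines of code can read it. The following is a minimal Python sketch using the standard library, with the sample header above embedded as a string. The variable names are illustrative, and the date format string assumes the compact UTC timestamp style shown in the sample.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

TMX_HEADER = """<tmx version="1.4">
  <header
    creationtool="TRADOS Translator's Workbench for Windows"
    creationtoolversion="Edition 8 Build 863"
    segtype="sentence"
    adminlang="EN-US"
    srclang="EN-US"
    creationdate="20131117T140541Z">
  </header>
</tmx>"""

header = ET.fromstring(TMX_HEADER).find("header")

# Attributes come back as plain strings.
print(header.get("creationtool"))
print(header.get("srclang"))

# Assumed timestamp layout: YYYYMMDD, a literal T, HHMMSS, a literal Z (UTC).
created = datetime.strptime(
    header.get("creationdate"), "%Y%m%dT%H%M%SZ"
).replace(tzinfo=timezone.utc)
print(created.isoformat())
```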

XLIFF Header

In an XLIFF file the metadata attributes are stored in the <file> element.

<?xml version="1.0"?>
<xliff version="1.1">
  <file
    original="ABC Company Brochure.docx"
    source-language="EN-US"
    datatype="plaintext"
    target-language="JA-JP"
    tool="Transdraft"
    date="2013-09-04"
  >
  </file>
</xliff>
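The same approach works for the XLIFF metadata. One caveat: XLIFF files produced by real tools usually declare an XML namespace on the root element, which the simplified sample above leaves out. The sketch below therefore strips any namespace before comparing tag names. The helper names local_name and file_metadata are made up for this example.

```python
import xml.etree.ElementTree as ET

XLIFF_HEADER = """<xliff version="1.1">
  <file
    original="ABC Company Brochure.docx"
    source-language="EN-US"
    datatype="plaintext"
    target-language="JA-JP"
    tool="Transdraft"
    date="2013-09-04">
  </file>
</xliff>"""


def local_name(tag: str) -> str:
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def file_metadata(xliff_text: str) -> dict:
    """Return the attributes of the first <file> element as a dict."""
    root = ET.fromstring(xliff_text)
    for element in root.iter():
        if local_name(element.tag) == "file":
            return dict(element.attrib)
    return {}


metadata = file_metadata(XLIFF_HEADER)
print(metadata["source-language"], "->", metadata["target-language"])
```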

After the header comes the body. This is the section that contains the most important data - the translation units and segments. Let's look at an example body for both a TMX and XLIFF file.

TMX Body

<?xml version="1.0"?>
<tmx version="1.4">
  <body>
    <tu creationdate="20110510T103323Z">
      <tuv xml:lang="EN-US">
        <seg>This is a pen.</seg>
      </tuv>
      <tuv xml:lang="JA-JP">
        <seg>これはペンです。</seg>
      </tuv>
    </tu>
  </body>
</tmx>
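To show how little code it takes to get at the segments, here is a minimal Python sketch that turns each <tu> in the sample body above into a dictionary keyed by language. It is only a sketch: the variable names are illustrative, and it ignores inline markup inside <seg>, which real files may contain.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX_BODY = """<tmx version="1.4">
  <body>
    <tu creationdate="20110510T103323Z">
      <tuv xml:lang="EN-US"><seg>This is a pen.</seg></tuv>
      <tuv xml:lang="JA-JP"><seg>これはペンです。</seg></tuv>
    </tu>
  </body>
</tmx>"""

root = ET.fromstring(TMX_BODY)

units = []
for tu in root.iter("tu"):
    # One dict per translation unit: language code -> segment text.
    units.append(
        {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
    )

print(units)
```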

XLIFF Body

<?xml version="1.0"?>
<xliff version="1.1">
  <file>
    <body>
      <trans-unit id="1">
        <source xml:lang="EN-US">This is a pen.</source>
        <target xml:lang="JA-JP">これはペンです。</target>
      </trans-unit>
    </body>
  </file>
</xliff>
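The XLIFF body can be read in much the same way. The sketch below collects (id, source, target) tuples from the sample above. It relies on the sample having no namespace declaration; files from real tools usually have one, in which case the plain tag names used here would need a namespace prefix.

```python
import xml.etree.ElementTree as ET

XLIFF_BODY = """<xliff version="1.1">
  <file>
    <body>
      <trans-unit id="1">
        <source xml:lang="EN-US">This is a pen.</source>
        <target xml:lang="JA-JP">これはペンです。</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

root = ET.fromstring(XLIFF_BODY)

# findtext returns None when a unit has no <target> yet (untranslated).
pairs = [
    (unit.get("id"), unit.findtext("source"), unit.findtext("target"))
    for unit in root.iter("trans-unit")
]

print(pairs)
```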

Why are translation memory files so important?

Improves efficiency

Translation memory files are typically used by translators in their CAT (computer-assisted translation) tool or TEnT (translation environment tool) to help them translate more efficiently. Loading a translation memory file into a translation tool allows a translator to leverage their prior work. If a segment in the current job has been translated before (or even partially translated before), the tool automatically alerts the translator to the match (or partial match).
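The matching idea can be sketched in a few lines of Python. This example uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure; actual CAT tools use their own scoring methods, so treat the scores, the 0.7 threshold, the best_match helper, and the second memory entry as assumptions made for illustration.

```python
from difflib import SequenceMatcher

# A tiny in-memory TM: source segment -> stored translation.
memory = {
    "This is a pen.": "これはペンです。",
    "Where is the station?": "駅はどこですか。",
}


def best_match(segment, memory, threshold=0.7):
    """Return (source, target, score) for the closest stored segment,
    or None if nothing scores at or above the threshold."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if best is None or score > best[2]:
            best = (source, target, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


print(best_match("This is a pen.", memory))     # exact match
print(best_match("This is a pencil.", memory))  # partial match
```

A score of 1.0 means the segment is already in the memory; anything lower is a partial match that the translator would review and adapt.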

Ensures consistency

Translation memory files will also help you maintain consistency as a translator. Over your career you will work on many different projects for many different clients. Some projects might require specific terminology or phrases. Utilizing "client-based" or "project-based" translation memory will allow you to ensure accuracy and stay consistent with every translation you work on.

What are the differences between TMX and XLIFF?

TMX and XLIFF are both industry-standard, XML-based file types. The two formats have a lot in common, including some inline markup elements; however, each has a slightly different structure and set of elements. The following are a few key differences between the two file types:

The TMX and XLIFF formats were created for slightly different purposes. XLIFF was created as a format to store extracted text and carry the data from one step to another in the localization process, while TMX was created as a format to exchange translation memory data from one tool to another. [1]

TMX allows any number of languages in the same document. XLIFF is designed to work with one source and one target language.

TMX uses only the encapsulation method for inline codes (where native codes are enclosed within different elements), while XLIFF provides both the encapsulation method (using elements very similar to TMX's) and the placeholder method (where the native codes are moved to the skeleton file and replaced by a short element that refers to them, using elements very similar to OpenTag's). [2]

In a TMX file, a collection of <tu> elements has no specific order and contains no mechanism to rebuild the original file.

XLIFF adds a few data types and fields that are not present in a TMX file, such as pretranslation and history, versioning, and binary objects. [3]

TMX files can store date and time data at the translation unit level, while XLIFF files cannot.
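The multi-language difference in the list above is easy to see in code. In this Python sketch, a single TMX translation unit carries three languages, and the hypothetical helper split_by_target breaks it into one list of source/target pairs per target language, which is roughly the shape a one-source, one-target format expects. The French variant was added for illustration and is not part of the earlier sample.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MULTILINGUAL_TMX = """<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="EN-US"><seg>This is a pen.</seg></tuv>
      <tuv xml:lang="JA-JP"><seg>これはペンです。</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>C'est un stylo.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def split_by_target(tmx_text, source_lang):
    """Group (source, target) pairs by target language code."""
    pairs = defaultdict(list)
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {
            tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")
        }
        source = segments.pop(source_lang, None)
        if source is None:
            continue  # no variant in the source language; skip this unit
        for lang, target in segments.items():
            pairs[lang].append((source, target))
    return dict(pairs)


for lang, lang_pairs in split_by_target(MULTILINGUAL_TMX, "EN-US").items():
    print(lang, lang_pairs)
```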

Which format is better - TMX or XLIFF?

Both TMX and XLIFF are strong choices. Both are industry-standard file formats, and both are supported by the majority of translation software tools. Which one you end up using often depends on the project or tool, and sometimes a TM file is simply provided to you for a particular job. Either format can get the job done well, and using translation memory in any format is far better than not using it at all. Often you don't have to choose, because you can download your translation memory from your tool in either format.

However, given the choice on a new translation project, I would prefer TMX for two main reasons:

Translation units are (or at least can be) timestamped, which lets you run a productivity analysis on your work later.

Multiple target languages can be stored in one file.

On the other hand, if using the TM file to reconstruct or rebuild the original file is important to you, the XLIFF format is much more powerful in this regard.
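As a rough idea of what a productivity analysis on timestamped translation units could look like, the Python sketch below counts how many units were created on each day. Counting per day is just one possible analysis, the helper name units_per_day is made up, and all timestamps except the first are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

# Assumed timestamp layout: YYYYMMDD, a literal T, HHMMSS, a literal Z (UTC).
TMX_DATE_FORMAT = "%Y%m%dT%H%M%SZ"

SAMPLE = """<tmx version="1.4">
  <body>
    <tu creationdate="20110510T103323Z"><tuv xml:lang="EN-US"><seg>This is a pen.</seg></tuv></tu>
    <tu creationdate="20110510T104101Z"><tuv xml:lang="EN-US"><seg>This is a book.</seg></tuv></tu>
    <tu creationdate="20110511T090000Z"><tuv xml:lang="EN-US"><seg>This is a desk.</seg></tuv></tu>
    <tu><tuv xml:lang="EN-US"><seg>No timestamp here.</seg></tuv></tu>
  </body>
</tmx>"""


def units_per_day(tmx_text):
    """Count translation units by creation day, skipping units without a date."""
    counts = Counter()
    for tu in ET.fromstring(tmx_text).iter("tu"):
        stamp = tu.get("creationdate")
        if stamp is None:
            continue
        counts[datetime.strptime(stamp, TMX_DATE_FORMAT).date()] += 1
    return counts


for day, count in sorted(units_per_day(SAMPLE).items()):
    print(day.isoformat(), count)
```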