The need for a format to serialize data is as old as networking itself. In the early days of data processing, the problem was attacked with binary protocols, that is, protocols whose data was not human-readable. These were frequently custom-defined on an ad hoc basis: the sender and receiver had to agree on where fields were located and what they contained in order to exchange data. These schemes eventually gave way, in part, to emerging standards such as ASN.1.

One of the most successful of these early protocols came from the UNIX world in the mid-1980s, as servers needed some way to exchange data. The resulting XDR format, proposed by Sun, solved one key problem: how to exchange binary data when systems used different endian schemes, a constant problem in the UNIX heyday. XDR was quietly successful and is still found today in NFS and other protocols, as well as in modern products such as Mozilla's SpiderMonkey JavaScript engine, where it's used for serializing compiled JavaScript.
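To make the endian problem concrete, here is a minimal Python sketch of how XDR puts integers and strings on the wire: every item is big-endian and padded to a multiple of four bytes. The helper names (`xdr_pack_int`, `xdr_unpack_int`, `xdr_pack_string`) are my own, chosen for illustration; they are not part of any XDR library.

```python
import struct


def xdr_pack_int(value: int) -> bytes:
    """Encode a signed 32-bit integer as XDR does: 4 bytes, big-endian."""
    return struct.pack(">i", value)


def xdr_unpack_int(data: bytes) -> int:
    """Decode a 4-byte big-endian signed integer."""
    return struct.unpack(">i", data)[0]


def xdr_pack_string(text: str) -> bytes:
    """Encode a string as a 4-byte length, the bytes, then zero padding
    up to the next multiple of four."""
    raw = text.encode("ascii")
    padding = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding


# The same value laid out in the two byte orders that had to interoperate.
print(xdr_pack_int(1).hex())        # big-endian, as sent on the wire
print(struct.pack("<i", 1).hex())   # little-endian, as some hosts store it
print(xdr_pack_string("hello").hex())
```

Because both sides agree on the wire order, each host converts to and from its native layout only at the boundary, and neither needs to know what kind of machine is on the other end.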

By the mid-1990s, under pressure from the rapidly growing Internet, new standards were needed. XDR, for example, was not human-readable, and it was generally felt that a human-readable representation in keeping with SGML (the markup superset from which HTML is derived) would be a good thing. This turned out to be XML, and by the end of the century it was already in wide use. All major languages had XML libraries, and the format was used whenever and wherever any kind of human-readable representation of data was required. It proved so popular that it moved into areas for which it was never intended, such as text markup (in DocBook, for example). The addition of secondary XML technologies, such as XSLT, enabled this.

However, for all its popularity, XML has significant drawbacks. The first is the complexity of schemata, which require specialized skills to implement correctly. The second, and by far the biggest, is performance: XML is wordy and slow to process. A senior architect at a financial services firm told me recently that in order to optimize the performance of their key business logic servers, they'd done a deep analysis of what exactly was happening with each transaction. They discovered, to their dismay, that almost 50% of their server CPU cycles were consumed encoding and decoding XML. Other organizations have surely recognized, at various points, the significant processing overhead that XML imposes.

Predictably, a smaller alternative emerged over the last few years as the JavaScript revolution has reshaped software development: JSON. Standard JSON can be read as JavaScript, and it has the additional benefit of being widely supported by tools and libraries. However, as its use has been extended to new areas, such as databases, it's become clear that it lacks some desirable traits. Two examples: it has no date data type, and it doesn't support comments. These shortcomings have already led to variants, such as BSON, the binary JSON format devised by 10gen and used in its MongoDB NoSQL database in place of JSON.
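Both gaps are easy to see with Python's standard `json` module. The sketch below is only an illustration: the record contents and the `encode_dates` helper are made up, and writing dates as ISO 8601 strings is one common workaround, not something the JSON format itself defines.

```python
import json
from datetime import date

record = {"name": "release", "shipped": date(2013, 3, 1)}

# 1. No date type: the encoder has nothing to map a date onto.
try:
    json.dumps(record)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)


def encode_dates(obj):
    """Hypothetical fallback: write dates as ISO 8601 strings."""
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


encoded = json.dumps(record, default=encode_dates)
decoded = json.loads(encoded)
# The date comes back as a plain string; the type information is lost.
print(type(decoded["shipped"]).__name__)

# 2. No comments: a strict parser rejects them outright.
try:
    json.loads('{\n  // listening port\n  "port": 8080\n}')
    comment_rejected = False
except json.JSONDecodeError:
    comment_rejected = True
print(comment_rejected)
```

The receiver has to know, by some agreement outside the data, that `shipped` should be turned back into a date. That is the same kind of side agreement the old ad hoc binary protocols depended on.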

Frustration with JSON has spurred the examination and proposal of entirely new schemes. Perhaps one of the most interesting is TOML from Tom Preston-Werner, a cofounder of GitHub. It has the brevity of JSON, although it uses a different notational scheme, one akin to configuration files, with key-value pairs specified one per line and grouped under bracketed item names. It's a take-off on the format of .ini files first popularized by Microsoft, but with many conveniences added. While there are already libraries in several languages supporting it, it's not clear whether TOML will gain sufficient traction. Nor is TOML the only alternative under development. For example, Protocol Buffers, developed at Google and widely used there, is a low-overhead, high-speed data exchange format of particular appeal to C and C++ programmers.

In my estimation, the one standard that seems to have almost all the desirable features is YAML. While it is a big standard (some 80 pages), it is remarkably concise in practice and highly readable. It borrows Python's use of whitespace to indicate the start and end of blocks and subblocks. YAML mostly avoids quotation marks, brackets, braces, and open/close tags, which enhances its readability. It also supports references (anchors and aliases, in YAML's terminology), which are ways to refer to a previously defined element. So, if an element is repeated later in a YAML document, you can simply refer to it using a shorthand name. Finally, YAML supports all the standard data types and maps easily to lists, hashes, or individual data items.
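The following sketch shows these features together: indentation-based blocks, a reference to an earlier element, a native date, and a list. It assumes the third-party PyYAML package is installed (`pip install pyyaml`), and the document contents are invented for illustration.

```python
import datetime

import yaml  # third-party PyYAML package, assumed to be installed

DOCUMENT = """
# Comments are allowed.
defaults: &defaults      # "&defaults" names this block for later reuse
  host: localhost
  port: 8080

staging:
  server: *defaults      # "*defaults" refers back to the block above
  shipped: 2013-03-01    # parsed as a date, not a string
  tags:
    - beta
    - internal
"""

data = yaml.safe_load(DOCUMENT)

print(data["staging"]["server"])
print(isinstance(data["staging"]["shipped"], datetime.date))
print(data["staging"]["tags"])
```

I used `safe_load` rather than `load` because it builds only plain data types such as dictionaries, lists, strings, numbers, and dates, which is the sensible default when the document comes from outside your own code.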

YAML is widely supported by libraries in all the principal languages. Its biggest drawback seems to be political rather than technical; namely, that it has not gained the kind of mindshare that would give it the wide acceptance any such protocol needs. Still, if your goal is to elegantly solve the problem of data serialization, especially for internal use, YAML might be exactly the solution you're looking for.

— Andrew Binstock

Editor in Chief

Twitter: platypusguy