Do you ever have an idea, then somebody else has the same idea, but they screw it up? Well, I just had that happen to me. I stumbled on this article on Fast Infoset today. Fast Infoset is a specification for a binary version of XML. I had this same idea a couple of years ago, and the execution is amazingly similar, but, of course, they screwed it up.

Essentially, everybody is finally realizing that while XML is the first widely accepted data markup format, it's a pig. It's verbose and redundant, which makes it store poorly, transmit slowly, and parse, well, like a pig. Don't get me wrong, the original idea for XML was actually fine, but everybody has taken it way over the top and things like SOAP are just an abomination.

Well, one way to help XML while still retaining XML semantics is to make a binary version of it. While a compression program like gzip can reduce the overall transfer and storage size of an XML document, an XML parser still has to deal with the XML textual format on either end of a transfer. The textual format forces the decoder to examine each and every byte to determine its significance in the document, even when the decoder doesn't understand vast expanses of the XML schema being parsed. And the textual format actually expands binary data carried in an XML document by forcing it into Base64, a 3-goes-to-4 encoding that turns every 3 bytes of binary into 4 bytes of ASCII.
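
If you want to see that expansion for yourself, here's a throwaway Python snippet (nothing to do with any spec, just the arithmetic made visible):

    import base64, os

    payload = os.urandom(3000)           # 3,000 bytes of arbitrary binary data
    encoded = base64.b64encode(payload)  # what you'd have to embed in the XML

    print(len(payload))   # 3000
    print(len(encoded))   # 4000 -- every 3 bytes of binary become 4 bytes of ASCII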

So, my idea, and that of Fast Infoset, is to notice that most XML documents are extremely redundant with tag and attribute information. First, you have all those < and > symbols surrounding every tag. That's four bytes of redundant info for every opening and closing tag pair. Then you have the tag name itself, which gets repeated at least twice, once in the opening tag and once in the closing tag. Then, you have the fact that many XML documents are recursive tree structures where nodes at any given level of the hierarchy share a lot of the same type of info. For instance, an XML document storing a list of books would use TITLE, AUTHOR, and ISBN tags for each book. Add in namespace prefixes and attributes and you have a lot of redundancy, even before you start talking about SOAP. In short, the information density of your average XML document has Claude Shannon spinning in his grave.
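
To make the redundancy concrete, here's the sort of book list I'm talking about (the titles and ISBNs are made up on the spot). Count how many times the same tag names and angle brackets show up:

    <BOOKS>
      <BOOK>
        <TITLE>Some Title</TITLE>
        <AUTHOR>Somebody Famous</AUTHOR>
        <ISBN>0000000000</ISBN>
      </BOOK>
      <BOOK>
        <TITLE>Another Title</TITLE>
        <AUTHOR>Somebody Else</AUTHOR>
        <ISBN>1111111111</ISBN>
      </BOOK>
    </BOOKS>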

Luckily, it's pretty easy to compress this information right out of an XML document. You simply create a tag/attribute hash table and assign each unique tag or attribute in the document a unique number. Rather than writing TITLE everywhere, you would instead use the number 0; 1 for AUTHOR; 2 for ISBN; and so on. So, rather than "<AUTHOR>" (8 bytes), we would simply have 0x0001 or something (2-4 bytes). By eliminating the end tag and encoding the content inside the AUTHOR tag with a 4-byte length, we have a net savings of 8 + 9 - 2 - 4 = 11 bytes every time we would otherwise use an AUTHOR tag: the 8-byte opening tag and 9-byte closing tag go away, replaced by a 2-byte code and a 4-byte length. Do the same thing for all the other tags in your document and it adds up pretty quickly. Finally, by encoding binary data as binary octets, rather than Base64, we eliminate the "ASCII tax" imposed by a textual format on binary data.
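
Here's a rough sketch of the idea in Python. To be clear, this is my back-of-the-envelope layout, not Fast Infoset and not a finished spec: tag names become 2-byte codes from a table, text content gets a 4-byte length prefix, end tags vanish, and I've added a 2-byte child count so nesting survives (the paragraph above only talks about lengths, so treat the exact layout as illustrative). Attributes, namespaces, and shipping the tag table along with the document are left as exercises:

    import struct
    import xml.etree.ElementTree as ET

    def encode(root):
        """Pack an ElementTree element: 2-byte tag code, 4-byte content length,
        raw content bytes, 2-byte child count, then the children. No end tags."""
        tag_table = {}                  # tag name -> small integer code
        out = bytearray()

        def code_for(tag):
            if tag not in tag_table:
                tag_table[tag] = len(tag_table)
            return tag_table[tag]

        def walk(elem):
            out.extend(struct.pack(">H", code_for(elem.tag)))  # 2-byte tag code
            text = (elem.text or "").strip().encode("utf-8")
            out.extend(struct.pack(">I", len(text)))           # 4-byte content length
            out.extend(text)                                    # content, no quoting
            out.extend(struct.pack(">H", len(elem)))            # 2-byte child count
            for child in elem:
                walk(child)

        walk(root)
        return tag_table, bytes(out)

    doc = ET.fromstring("<BOOK><TITLE>Some Title</TITLE>"
                        "<AUTHOR>Somebody Famous</AUTHOR>"
                        "<ISBN>0000000000</ISBN></BOOK>")
    table, packed = encode(doc)
    print(len(ET.tostring(doc)), "bytes of XML vs", len(packed), "bytes packed")

Even that trivial one-book element comes out smaller in the packed form, and the savings compound as the same tags repeat down a long document.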

The interesting thing is that we still fundamentally have an XML document. This means that you can run this new format through a SAX-like or DOM-like decoder and produce exactly the same data that an XML-based application expects, and the application is none the wiser. Everything is just smaller and faster. Because the binary format parser doesn't have to actually look at the characters that make up all these tags and try to match them with other strings, it can whip through a document at light speed.
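
Here's the other half of that sketch: a decoder that walks the packed bytes and fires SAX-flavored events. The startElement/characters/endElement handler interface is just my stand-in for a real SAX ContentHandler, and table/packed are the outputs of the encoder sketch above. The point is that the parser never scans or compares tag strings; it just looks up a 2-byte code:

    import struct

    def decode(tag_table, packed, handler):
        """Replay the packed form from the encoder sketch as
        startElement/characters/endElement events, SAX-style."""
        names = {code: tag for tag, code in tag_table.items()}
        pos = 0

        def read_element():
            nonlocal pos
            (code,) = struct.unpack_from(">H", packed, pos);     pos += 2
            (length,) = struct.unpack_from(">I", packed, pos);   pos += 4
            text = packed[pos:pos + length].decode("utf-8");     pos += length
            (children,) = struct.unpack_from(">H", packed, pos); pos += 2

            handler.startElement(names[code])
            if text:
                handler.characters(text)
            for _ in range(children):
                read_element()
            handler.endElement(names[code])

        read_element()

    class PrintHandler:
        def startElement(self, name): print("start", name)
        def characters(self, text):   print("text ", text)
        def endElement(self, name):   print("end  ", name)

    decode(table, packed, PrintHandler())

An application sitting on top of those events sees exactly the same stream it would have gotten from a textual SAX parse of the original document.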

So, that was my theory. I even had a catchy name for it, BCX: Binary Coded XML.

Along comes Fast Infoset. Great minds think alike. Obviously, the need is there. Only they screwed it up. Rather than just keeping things simple, they decided to encode the whole thing in ASN.1 with the Packed Encoding Rules (ASN.1/PER). Now, for those of you who don't know, ASN.1 is about the most complex, ugliest data format ever designed. And it's no wonder: it was created by an international committee (ISO) as part of the Open Systems Interconnection protocols (OSI; anybody remember FTAM?). The various encoding rules used to actually serialize ASN.1 (and there are several, which is part of the problem) typically do a lot of bit-twiddling, which slows down encoders and decoders. It also makes them buggy. Anybody remember some of the recent buffer overruns attributed to ASN.1 endecs?

Well, it looks like Fast Infoset is being standardized in ISO (in fact, jointly by ISO/IEC JTC 1 and ITU-T, which is about as ugly as it gets in the international standards committee world), so they probably had to use ASN.1 or people would wonder why they weren't supporting their own standards. ASN.1 is used by a handful of protocols in common use today, including SNMP, X.509, and LDAP. That said, most IETF-originated protocols (you know, the ones that move all your web, ftp, email, etc., traffic around) use either straightforward text encodings or far simpler binary encodings.

Fast Infoset does manage to get pretty good compression (a 20-80% reduction, depending on document size and content). Throughput for some documents is 2-3 times what standard XML delivers. Overall, these are good numbers.

But, it could have been even faster. Frankly, I'm partial to Sun's old XDR format, used by ONC-RPC and NFS, which was very fast, if not quite as tightly packed as some other formats. I was also recently reading about Joe Armstrong's Universal Binary Format (UBF), created to give Erlang a streamlined wire protocol.

In short, marrying XML and ASN.1/PER is like tying two stones together and hoping they will float.