Some years ago I authored the XML parsing library XML::Bare. It originally took the form of a very basic C library that stored all of its parsed information in linked lists. I wrote it because I did not want to include the massive Microsoft XML library, which was over 5 MB by itself, and also because I did not want to learn how to use the other complex libraries out there.

Thinking: "Parsing basic XML shouldn't require more than 50 lines of C code," I got to work, and a couple of hours later had a very basic "XML" parser that was less than 50 lines of code. Over the years I kept expanding that code and eventually wrote the CPAN module XML::Bare using that parser as the core. The released module worked well and I have maintained it over the years since.

The downside of creating my own XML library has been that I am constantly plagued by the notion that the library is not good enough and could be improved. I've ended up researching all manner of data structures, trying to come up with faster, more efficient ways to parse XML.

One of the various things I invented in the process is the "shared hash table." The basic concept is to use one or more large hash tables instead of many tiny ones. In order to reduce collisions when the same keys are used over and over, the string hashing algorithm is started at a different seed for each "mini hash" that hashes into the same large table. A working implementation of shared hash tables can be seen here:

https://github.com/nanoscopic/xml-bare/blob/master/sh_hash.c
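To make the idea concrete, here is a minimal sketch of the concept rather than the actual code from sh_hash.c: one large table shared by many "mini hashes," where the mini hash id acts as the seed so that identical keys owned by different mini hashes land in different slots. The table size, hash constants, and entry layout are all illustrative.

#include <stdint.h>
#include <string.h>

#define SHARED_SLOTS 65536   /* one large table shared by all mini hashes; size is illustrative */

typedef struct entry {
    uint32_t      owner;     /* id of the mini hash that owns this entry */
    const char   *key;
    void         *value;
    struct entry *next;      /* chain for collisions */
} entry;

static entry *shared_table[ SHARED_SLOTS ];

/* Seeded string hash. Each mini hash uses its own seed, so the same key
   hashed on behalf of two different mini hashes maps to different slots,
   which keeps the collision rate down even with heavily repeated keys. */
static uint32_t seeded_hash( const char *key, uint32_t seed ) {
    uint32_t h = seed;
    while( *key ) h = h * 101 + (unsigned char) *key++;
    return h % SHARED_SLOTS;
}

static void *sh_lookup( uint32_t mini_id, const char *key ) {
    uint32_t slot = seeded_hash( key, mini_id );   /* the mini hash id doubles as the seed here */
    for( entry *e = shared_table[ slot ]; e; e = e->next )
        if( e->owner == mini_id && strcmp( e->key, key ) == 0 )
            return e->value;
    return NULL;
}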

Testing shared hash tables shows that they work well for some uses of XML. They do have one specific drawback: there is no way to iterate across the keys within a specific small hash, because all of the hashes are mixed together into one large table. One could simply store a list of the keys for each small hash as well, but it would cost more computation time to do so.

You may ask, "What problem are shared hash tables solving? Why not just use many small hash tables or binary trees?" Tiny hash tables that are near the size of the number of keys hashed into them do not work well because they have a high collision rate. Medium-sized hash tables don't work well because they increase the amount of memory needed for each table, and with XML you inevitably need a large number of them. Binary trees work alright, but they have a higher storage and lookup cost compared to hash tables if it is possible to keep the collision rate low. Shared hash tables make it possible to have many small hash tables at near optimal size memory-wise.

Hash tables rely on a category of algorithms called string hashing algorithms. The relevant ones here are those that take a sequence of bytes and generate a numerical hash from those bytes that can then be used with a tree or a hash table.

Some of the more popular string hashing algorithms are these:

Jenkins Hash

FNV ( Fowler, Noll, Vo ); most commonly fnv1a

CityHash

MurmurHash

SpookyHash

One of the better articles I have seen discussing hashing functions is here:

http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html

The article is a bit dated at this point, but works well as an introduction to string hashing algorithms.

After reviewing the existing string hashing algorithms at the time, I wrote my own variant utilizing my knowledge of primes and some guesswork. You can see the new algorithm here:

https://github.com/nanoscopic/xml-bare/blob/master/sh_hash_func.c#L13

The algorithm is very simple. It does the following for each character in the string:

Add the raw numerical value of the character.

Multiply the current sum by a small, near-byte-size prime.

Prevent overflow of the sum by wrapping back to 0 at the closest prime to the maximum value of the sum.

The algorithm does not have a "mixing phase". It is not meant to be difficult to reverse. The aim is to do the minimal work possible to compress a series of string bytes into a lesser form. The primes are chosen specifically so that the multiplication does not produce recurring values.
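As a rough sketch of that loop, with illustrative primes rather than the exact constants in sh_hash_func.c:

#include <stdint.h>

/* Sketch of the described loop. The primes here are illustrative:
   251 is the largest prime that fits in one byte, and 4294967291 is
   the largest prime below 2^32. */
static uint32_t bare_hash( const char *str, int len ) {
    uint64_t sum = 0;
    for( int i = 0; i < len; i++ ) {
        sum += (unsigned char) str[i];  /* add the raw character value          */
        sum *= 251;                     /* multiply by a small near-byte prime  */
        sum %= 4294967291u;             /* wrap at the closest prime below 2^32 */
    }
    return (uint32_t) sum;
}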

The new hash function works well for the specific purpose of XML parsing. It is not something standard, but it works and it works well. That is one of the fundamental principles that we follow at Carbon State. If something works well, even if it is strange or unconventional, then it is good. Meeting the actual need is important. The value of an algorithm is in how well it meets the business need, not in how well it fits "proper" design methods.

Another way to speed up XML parsing is to use perfect hashing. If the node name keys and attribute keys are known before reading the XML, then it becomes possible to write code that reads just enough characters of each key to do perfect hashing on them. Doing this, it becomes easy to create a dense set of C structures that represent the XML. There is almost no need for hash tables at all if you have the schema of the XML being parsed. There are, in fact, a number of XML parsers that use this technique to parse XML faster than all other standard XML parsers, including my own. Typically these parsers have to use code generation in order to create a parser instance capable of parsing an XML file with a specific schema.
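As a hypothetical illustration of the "just enough characters" idea (the node names and the dispatch function here are invented for the example, not taken from any particular generated parser):

/* If the schema guarantees that the only node names at some level are
   "item", "index", and "total", then the first two characters already
   form a perfect hash over them, and a generated parser can dispatch on
   those characters directly instead of hashing full names. */
enum node_kind { NODE_ITEM, NODE_INDEX, NODE_TOTAL, NODE_UNKNOWN };

static enum node_kind classify_node( const char *name ) {
    switch( name[0] ) {
        case 'i': return ( name[1] == 't' ) ? NODE_ITEM : NODE_INDEX;
        case 't': return NODE_TOTAL;
        default:  return NODE_UNKNOWN;
    }
}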

If you don't have a schema for your XML file, it is possible to generate a loose schema representing it by parsing through it once. Such a schema can be very tiny and can sit right next to the XML file to accelerate future parsing of the file. If the XML data changes and the schema changes in the process, though, the schema file would have to be updated as well. If the XML file is generated programmatically via a serializer, the serializer itself could write out the generated schema and allow accelerated parsing.

Another idea is to not store textual XML in a file at all. If XML data is parsed into a structure that is localized to a specific region of memory ( paged memory allocation ), and internal pointers are made with offsets within the paged memory instead of as global pointers, then it becomes possible to simply write the in-memory parsed form of the XML directly to disk as a binary blob. If that is done, there is no longer any need to parse. You can simply read the entire blob into memory and use it just as before.
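Here is a minimal sketch of the offset trick, with a node layout invented for the example rather than taken from XML::Bare:

#include <stdint.h>

/* All nodes and strings live in one contiguous arena. Links are stored as
   byte offsets from the start of the arena rather than as raw pointers, so
   the arena can be written to disk as-is and read back in later at any
   base address without fixing anything up. Offset 0 means "none". */
typedef struct {
    uint32_t name_off;      /* offset of the node's name string  */
    uint32_t first_child;   /* offset of the first child node    */
    uint32_t next_sibling;  /* offset of the next sibling node   */
} packed_node;

static packed_node *node_at( char *arena, uint32_t off ) {
    return (packed_node *)( arena + off );
}

static const char *name_of( char *arena, const packed_node *n ) {
    return arena + n->name_off;
}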

Doing so would require the in-memory form of the XML DOM to be as dense as possible, though. Often in-memory forms of parsed XML are 5-10 times larger than the XML source. As a result I have yet to implement such a feature. While it is a cute idea, it would require extensive optimization of the in-memory structure to work practically.

A friend of mine suggested looking into Apache Arrow since it has similar concepts ( sharing data structures between multiple different languages without serialization/deserialization ). It does, but as far as XML is concerned, Apache Arrow is a columnar data store. The attributes within each row are strictly defined and take up specific numbers of bytes. It works for more rigid data sets, but it does not handle the flexible data contained within XML well, especially XML that has no fixed schema.

More recently, I have worked with Apache Avro. Avro is similar to XML in that it has a schema and you can somewhat flexibly store the same things in an Avro record as you can in an XML file. In fact, if you use the C version of Avro and generate a schema on the fly, you can store schema-less data in Avro relatively easily. Unfortunately, the only benefit of using Avro over XML or JSON is that Avro can be written into a dense binary form. This one advantage is also something Google Protocol Buffers does better. You can think of Apache Avro as a form of JSON that has a schema and supports a binary form.

After altering the core of my XML parser so much over the years, and looking at so many different related technologies, the core portion of my XML parser has become much more complex than it was originally. The core C code that drives it has grown from the initial 50 lines of code up to around 600. 600 lines of code is still not a significant amount, but it is enough code that it is easy to make mistakes and hard to optimize. The important next step for the parser was to find a way to replace the C code entirely with something more manageable. The C code is, after all, a state machine.

I looked into writing my own state machine language for a while. What would be needed is something that allows replacement of boilerplate state code and transitions with a state machine definition, while still interspersing the state machine definition with the needed C code to actually create an XML DOM structure while the state machine is running. I thought about this for a couple of months and even began writing such a system. It is a non-trivial task though and I never had enough time to complete the work.

What happened is that I stumbled upon the Ragel library. Ragel is essentially exactly what I wanted. It is a state machine system that takes a state machine definition together with chunks of code to be run on state transitions, and generates code in various languages for you based on that information. I began tinkering with it immediately upon discovering it, attempting to replicate the state machine within my XML parser. This turned out to be difficult, because the way states are defined in Ragel is different from the way they are handled in my XML parser. Ragel is much more formal. Ragel is, to some degree, a language in and of itself. You have to learn Ragel before you can really write anything complex with it.
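To give a flavor of what Ragel looks like, here is a minimal sketch of a Ragel machine with a C host, invented purely for illustration and nowhere near a real XML grammar; it just runs a C action whenever it sees something shaped like an opening tag, and would be run through ragel -C before being handed to a C compiler:

#include <stdio.h>

%%{
    machine tiny_tag;

    # C fragment run on the transition that completes a tag
    action on_tag { printf( "saw a tag\n" ); }

    tag  = '<' [a-zA-Z]+ '>' @on_tag;
    main := ( tag | space )*;
}%%

%% write data;

void scan( const char *buf, int len ) {
    const char *p  = buf;        /* Ragel's required cursor variables */
    const char *pe = buf + len;
    int cs;

    %% write init;
    %% write exec;
}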

As a result, since discovering Ragel several years ago, I only ever created a handful of test Ragel schemas and did not create anything serious with it. That is, not until Carbon State LLC was founded. While creating the core application framework to drive Carbon State ECM, we made a metaprogramming system. The metaprogramming system takes XML as input and generates Perl code corresponding to the high-level XML input. XML works acceptably for this, but the style of XML is not perfect for all sorts of input. We also need to pass data back and forth between the front end of the system and the back end. The preferred data format for JavaScript is JSON, not XML. Why not merge JSON and XML?

JSON and XML are very different formats on the surface, but if you think about it they represent basically the same structure. They represent a tiered tree of schemaless data. It would be interesting to be able to switch back and forth between JSON and XML syntax within the same data file. It would, after all, be pretty handy to be able to have an XML node like this:

<node numbers=[1,2] />

It would also be nice to be able to drop some JSON directly into the middle of XML:

<root> <b={ "x": 1, "y": 2 }> </root>

If JSON could be placed in the middle of XML, why not vice versa?

<root={ "x": 1, "y": <a>10</a> <b>20</b> <, "z": 3 }>

I ran with this idea and created a Ragel schema/grammar that matches it. Combine that grammar with the previous code I have written for XML parsing, and we have a parser that can parse a mix of XML and JSON. Of course, the newly created data language is not XML. It is not JSON either. Really it is something loosely inspired by the concepts of XML and JSON. We call the new structured data storage language "XJR". XJR will be used for configuration and metaprogramming within Carbon State ECM.

What I have learned after spending 15 years working with structured data formats is that utility trumps technique. It is nice for things to work in the most efficient way possible. At scale, in enterprise situations, such efficiency saves a lot of money. Still, the most useful data format is the one that lets you utilize it as you need. An eternity can always be spent refining the performance of a system, but the base design cannot be easily changed. If flexibility is desired, the easiest way to get it is to design the flexibility in at the outset.