This is an insanely long and gnarly essay about implementing, then optimizing, the low-level bits of a pure-Ruby XML parser. If you obsess about XML reading, deterministic finite automata, or Ruby code optimization, you may find some part of it interesting. There may perhaps be six people in the world who care about all three; others are warned that an attempt to read this end to end may lead to general paralysis and perhaps even clinical brain-death. [He’s not kidding. -Ed.] By way of compensation, I’ve tried to be offensive wherever the opportunity presented. [Update: Outstanding comment from Avi Bryant below, which he repeats and expands here.]

The Problem · The state of XML parsing in Ruby is not that great. The incumbent default is REXML. Let me start out by saying that its XPath implementation is first-rate, and you can do almost all day-to-day network message processing—for example, implementing the APP—by throwing XPaths at your message buffer.

Also, REXML comes with a DOM mode, a pull parser, and an event-stream parser, all of which are Good Things.

Those aside, REXML has problems. To start with, it’s non-conforming, busted, incorrect, you pick the imprecation, because it happily eats lots of mangled, non-well-formed byteglobs which claim to be XML but aren’t, in direct contravention of this rule. [Isn’t this a little anal/pedantic? -Ed. And your point is? -Tim]

Another irritant is that the API, and the documentation for the API, take great care never to describe the parts of an XML document using the terms defined in the XML specification. Consider for example, in the event-stream parser, the DOCTYPE declaration handler, which was apparently designed on Mars.

Finally, REXML is regarded by some as slow.

There are alternatives; you can get XML handlers based on libxml2 and expat, the two most widely-used C-language XML processors. At that recent RubyConf I attended, REXML complaints were usually dismissed with a brusque “So use libxml already”.

I have three problems with those approaches. First, they mean you have to download, compile, and install new software, which sucks. Second, all of this software is, uh, “lightly maintained” (REXML too). Third, I think that key Ruby libraries should, insofar as possible, be written in Ruby. There are lots of different efforts to re-implement Ruby, any of which could end up being dramatically faster and better than the current C implementation in some set of circumstances, and to the extent that the libraries are in Ruby, they’ll come along for free. Also, C programming rots your brain. [Really? -Ed. Your spleen too. -Tim]

Lark · Back in 1996, I wrote the world’s first reasonably-conformant XML 1.0 processor, named Lark, in Java. It’s called “Lark” because on our honeymoon, my wife Lauren ripped her knee up and we had to spend a few days camped out with her leg raised and immobilized, so I used that time to do the coding; thus Lark for “Lauren’s right knee”. [Hey, your tenth anniversary is coming up. -Ed. But you digress. -Tim]

As a side effect of writing Lark, I learned Java, using Microsoft’s “Visual J++”, which in its early releases was a terrific product, before they ruined it. Lark was quite successful and fairly widely deployed in production, but after a while there were parsers on the market from Microsoft, IBM, and James Clark, and I just didn’t feel like maintaining it as a permanent sideline, so I let it lapse in 1998 sometime. [Is there an attention-span issue here? -Ed. What? -Tim]

I’m going to have to give a fairly detailed description of how Lark worked if the rest of this is going to make any sense.

Parser Flavors · There are a lot of standard parser architectures: recursive-descent, LR, LALR, state-machine, and so on. Most XML parsers are some flavor of recursive descent, but Lark was a deterministic finite automaton (DFA).

A character-driven DFA couldn’t be simpler. You keep track of what state you’re in, and read the next character. Each state has a list of transitions for particular characters, so you look up your character and find the transition; if there isn’t one, that’s a syntax error.

Since you’d like your parser to actually do something, not just parse, it’s common to associate “events” or “actions” with transitions. It’s also common to enhance a DFA slightly with a push-down stack, so you can jump to another state and remember where to come back to at some future point.

The simplest way to implement a DFA is a simple two-dimensional array; each state has a number, which indexes the rows of the array, and you use the numeric character value to index into a per-state array of transitions. So the inner loop of your code would end up looking something like this:

state = dfa[state][character]

You can use a similar trick to look up the actions or events associated with a transition.

Of course, in practice, the arrays could get very big (Unicode has potentially a million or so characters), so DFAs are often implemented using lists or sparse-matrix tricks.
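Sketched in Ruby, the sparse approach looks like this; it's a toy, not Lark's actual tables, and the state names and the collapse-all-letters trick are my inventions for illustration:

```ruby
# A toy sparse DFA, not Lark's actual tables: each state owns a Hash
# of transitions, and a missing entry is a syntax error. This one
# accepts plain text interleaved with "<name>" tags.
TOY_DFA = {
  text:    { '<' => :saw_lt, 'a' => :text },
  saw_lt:  { 'a' => :in_name },
  in_name: { 'a' => :in_name, '>' => :text }
}

def toy_accepts?(input)
  state = :text
  input.each_char do |c|
    c = 'a' if ('a'..'z').cover?(c)  # collapse all letters into one class
    state = TOY_DFA[state][c] or return false  # no transition: syntax error
  end
  state == :text
end
```

The per-state Hash here is the sparse-matrix trick in miniature: only the characters that matter in a given state get entries.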

The great virtue of a DFA parser is that you only have to look at each character once. Lark, in its day, was blindingly fast, and only James Clark managed to produce a faster parser, using all sorts of brilliantly sleazy I/O tricks; I was pretty sure I could have leapfrogged his performance if I’d gone back and stolen some of them. [Coulda been a contender, eh? -Ed.]

The Machine · The way Lark worked was, I hand-wrote the automaton in a language I made up in a hurry for the purpose. [Not XML? -Ed. Why would I use XML? -Tim] Here’s a sample, describing the state where you’re cruising along through the text and have seen a < , signaling the start of a start-tag, end-tag, empty-element-tag, or processing instruction.

1. State SawLT BustedMarkup {after <}
2. T ? Push(InPI,InDoc) !ColdStart
3. T ! MDO
4. T / ETAGO
5. T $NameStart StagGI !HotStart

Line 1 gives the state name, “SawLT”, and then there’s another state, “BustedMarkup”, to fall back into for error-reporting purposes if the next character isn’t valid. The curly braces enclose some text to help build an error message when things go wrong. The rest of the lines, starting with “T”, describe this state’s transitions.

Let’s do line 3 before line 2, because it’s simpler. It says that if you’re in “SawLT” and you see a ! , you should do a transition to the state named “MDO”; the name is taken from the XML spec production, in this case standing for Markup Declaration Open. In fact, this has to be the start of a comment or a CDATA section.

Line 2 says that when you’re in “SawLT” and see a ? , this has to be the start of a processing instruction, so you push to the “InPI” state, leaving the “InDoc” state on the stack; a subsequent Pop when you see the end of the PI will take you back there. Also, this transition will fire the action called ColdStart , which means the parser needs to start saving the characters going by, starting with the next character.

Line 4 is left as an exercise for the reader.

Line 5 has a couple more wrinkles. First of all, the transition isn’t on a character, it’s on $NameStart , which is short-hand for “any character which can be used to start an XML name”. Also it fires the HotStart action, which turns on character saving with this character.

By the way, you can have more than one action associated with a transition.
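The stack machinery behind Push and Pop is tiny; here’s a hedged sketch (class and method names are mine, not Lark’s):

```ruby
# Sketch of the pushdown extension to the DFA: Push jumps into a
# sub-machine state while remembering where to resume; Pop resumes
# at the remembered state.
class ToyPushdown
  def initialize
    @stack = []
  end

  def push(to, back)  # e.g. push(:in_pi, :in_doc) on seeing "<?"
    @stack.push(back)
    to                # the state the sub-machine runs in
  end

  def pop             # e.g. on seeing the "?>" that ends the PI
    @stack.pop
  end
end
```

So the line-2 transition amounts to something like `state = machine.push(:in_pi, :in_doc)`, and the matching Pop at the end of the processing instruction restores `:in_doc`.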

Little Tables · It turns out that the automaton had fewer than 256 states, so the state number would fit in a byte. Cool! But the characters, being Unicode, obviously wouldn’t. So I was driven to trickery. For most of the parser states, notably when you’re reading text, any old Unicode character that’s not a magic syntax character like < or & will do, so you save the real character, but substitute something like ~ to drive the automaton; it’ll have the same effect. For some states, the value of the character matters, as in the example above; in which case you have to check that the character you saw is an XML NameStart , but if it is, you can slap in an “a” or any other valid name-starter and that’ll drive the automaton fine. So all you need is a per-state table of which character-classes apply in which states. It turns out that there aren’t any states in which more than one class has a valid transition, and that in most states, none of them matter. Thus the cost of all this jiggery-pokery is low.
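In Ruby, the substitution might look like this; a sketch with my own names, where `name_start?` is a crude stand-in for the real XML NameStart test:

```ruby
# Sketch of the class-surrogate trick: save the real codepoint, but
# drive the automaton with a small stand-in so the per-state tables
# can stay 128 entries wide.
def name_start?(cp)
  cp == 0x5F || (0x41..0x5A).cover?(cp) ||
    (0x61..0x7A).cover?(cp) || cp > 0x7F  # '_', A-Z, a-z, and a hand-wave
end

def surrogate_for(cp, expecting_name)
  return cp if cp == 0x3C || cp == 0x26             # '<' and '&' stay magic
  return 0x61 if expecting_name && name_start?(cp)  # collapse to 'a'
  0x7E                                              # everything else is '~'
end
```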

The core of the Lark machine was a transitions table indexed by byte-sized state number, each state being a byte array indexed by the next input character or a class surrogate. Similarly, there was an actions table indexed the same way. Thus, the main loop of the parser ends up looking something like this:

state = START_STATE
while true
  c = next_char
  break if c == nil
  next_state = machine[state][c]
  if next_state == 0
    handle_syntax_error
  end
  action = actions[state][c]
  if action
    call_action_routines(action)
  end
  state = next_state
end

RX · So I Ruby-fied some of Lark, for the moment calling it RX because I had to think up a name for the development directory. I really hadn’t (and still haven’t) worked out the goals; if it turned out to be both more correct and faster than REXML then maybe it might be worth seeing if it could be slipped in underneath, preserving REXML’s crunchy XPath goodness. But really, I mostly did it because I could.

The implementation was pretty well pure fun; particularly the part that read the automaton and wrote the machine: class MachineBuilder . I stored both the transitions and actions tables in Ruby strings, which conveniently act like byte arrays when you throw a [number] at them. I had to generate some Ruby code; there are hand-written Ruby methods for each of the actions that can fire, so I wrote a little do_action routine which used a giant case on the action number to select which routines to call. Also that file contained the grungy logic for looking in the XML character-class tables to determine whether something was a NameStart or NameChar or Digit or whatever.
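The String-as-byte-array trick depended on the Ruby behavior of the day, where `str[number]` returned the byte value; in modern Ruby the equivalent is `getbyte`. A miniature, with invented table contents:

```ruby
# Miniature of storing a transition row in a String: pack byte-sized
# state numbers in, index them back out. (Ruby 1.8's str[i] returned
# the byte directly; today you ask for it with getbyte.)
row = [0, 5, 9, 2].pack('C*')  # transitions for four input classes
next_state = row.getbyte(2)    # input class 2 sends us to state 9
```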

The actual parsing code was really simple, way under 500 lines. The input-handling code was also short but harder; it had to do the character encoding auto-detection and turn UTF-8, UTF-16 (I also did ISO-8859-1 and ASCII) into streams of Unicode integer code points.
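The auto-detection part follows the recipe in Appendix F of the XML spec; here’s a hedged, BMP-only sketch, not RX’s actual code:

```ruby
# Sniff the first few bytes for a byte-order mark, or for "<?" spelled
# out in 16-bit units, before committing to a decoder.
def sniff_encoding(bytes)
  case
  when bytes[0, 3] == [0xEF, 0xBB, 0xBF]       then 'UTF-8'     # BOM
  when bytes[0, 2] == [0xFE, 0xFF]             then 'UTF-16BE'  # BOM
  when bytes[0, 2] == [0xFF, 0xFE]             then 'UTF-16LE'  # BOM
  when bytes[0, 4] == [0x00, 0x3C, 0x00, 0x3F] then 'UTF-16BE'  # "<?"
  when bytes[0, 4] == [0x3C, 0x00, 0x3F, 0x00] then 'UTF-16LE'  # "<?"
  else 'UTF-8'  # the spec's default when nothing else says otherwise
  end
end
```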

On my first pass through, I entirely resisted the premature-optimization temptation, implementing everything in the most simple-minded way possible; this was easier than you might think because my Ruby experience is brief enough that I have few misleading intuitions about what’s fast and what’s slow.

As I write this, the MachineBuilder reports “110 states, 28416 bytes, 69 actions”, and the generated Ruby file has 683 lines of really boring code.

The machine is a whole lot smaller than the version in Lark, and to explain why, a digression is necessary. [This is already too long. -Ed. A man’s gotta do what he’s gotta do. -Tim]

Network-Safe XML · As I reviewed the Lark code, my mind boggled at the high proportion that existed to handle what XML calls “General Entities”, that is, the entities you can define yourself in your DTD. XHTML, for example, has hundreds of these, like &yen; and &copy; and so on and so on.

The problem is that in the general case, XML lets you define the entities right there in the document, and furthermore lets them include other entities, so that the parser has to have a fairly sophisticated stack manager and input stream manager.

As I looked at this and thought about the work I’ve recently been doing with Atom, I got more and more nervous and unhappy. I thought of the “billion laughs” attack, and other things bad guys could do with DTDs.

So I threw out the internal subset. This seems like a good solution to me. [Are you raping XML? -Ed. Go away. -Tim] The parser reports entity references, so if you know you’re handling, for example, XHTML, you can provide the appropriate replacement values. But I just couldn’t get comfortable with arbitrary macro processing in-file. RX still parses the DOCTYPE, if present and tells you what the public and system identifiers are, but if it sees an internal subset it stops cold.
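For instance (my sketch, not RX’s actual API), an application that knows it’s handling XHTML could keep its own replacement table and answer the parser’s entity-reference reports out of it:

```ruby
# Hypothetical application-side table: the parser reports an entity
# reference by name, and the application supplies the replacement text.
XHTML_ENTITIES = {
  'amp'  => '&',
  'lt'   => '<',
  'yen'  => "\u00A5",
  'copy' => "\u00A9"
}

def resolve_entity(name)
  XHTML_ENTITIES.fetch(name) { raise "unknown entity: #{name}" }
end
```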

Yes, arguably I’m fucking with the basic nature of XML here, but I suspect I’m right. In practical terms, REXML doesn’t really handle XML general entities either, so I’m no worse. More discussion is probably called for.

First Cut · I have unit tests for the input manager and the QName wrangler (thankfully, there are lots of XML test suites out there). I was pondering how to test the actual parser; an automaton is a great big glob of pretty abstract code, and a high-level unit-testing approach wasn’t obvious. I cared not only about XML conformance detection, but also about the data passed to the app using the parser.

So I implemented REXML’s StreamListener event-stream interface, and figured that for my first debugging pass, if I could get the same results as REXML on some nontrivial XML files, that’d be a start. So I wrote this little chunk of code that counts the number of processing instructions, elements, paragraph elements, img elements that have a src= attribute with a value ending in jpg , and occurrences of the word “the” in text content.

It wasn’t too long before I was getting the same results as REXML on quite a few different chunks of XML, most notably 2½ meg of ongoing articles. [Is my blog eating my life? -Ed. And that 2½M was before I started writing the tome you are now reading. -Tim]

Unfortunately, the first cut was approximately ten times slower than REXML.

The Profiler · So, I broke out the Ruby profiler. [Cue crazed laughter in the background.] Which is perfectly fine, except that it’s, well, slow. As in really slow. As in one particular run which normally took 0.826 seconds of user time burned 134.36 seconds with the profiler’s help.

The profiler revealed a whole bunch of things. First, my naïve input implementation predictably sucked rocks. [“Sucked rocks”? -Ed. Uh, kind of weak, isn’t it? -Tim] It was using byte-at-a-time input reads to drive character-at-a-time parser input. So I hacked in a fairly standard buffered I/O setup. If I had a nickel for every time I’ve juiced up a program by fixing the I/O buffering I’d have, well, a couple of bucks anyhow.
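The fix, sketched here with invented names rather than RX’s actual classes, is the standard one: read the underlying IO in big chunks and dole bytes out of memory:

```ruby
# Standard buffered-input fix: one big read per CHUNK bytes instead
# of one tiny read per character.
class BufferedReader
  CHUNK = 16 * 1024

  def initialize(io)
    @io  = io
    @buf = ''
    @pos = 0
  end

  def next_byte
    if @pos >= @buf.bytesize
      @buf = @io.read(CHUNK) or return nil  # nil means end of input
      @pos = 0
    end
    b = @buf.getbyte(@pos)
    @pos += 1
    b
  end
end
```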

The next obvious sinner was that do_action routine, which had a 69-way case selecting among all the possible sets of actions that a transition could fire. I was seeing major time vanishing there and, via it, in Fixnum#=== . Gack. I’m an old C/Java hacker, and I had this notion that most languages’ case constructs were there to provide rapid selection among a bunch of alternatives. Ruby’s, on the other hand, provides all sorts of cool features not found elsewhere, but they don’t include rapid selection.

OK, so for each of the 69 action combinations, I generate a method; here’s one of them:

def action_5(c)
  result = false
  result ||= a_Push(c, 60)
  result ||= a_ColdStart(c)
  result
end

Then I loaded up an array like so:

def load_actions
  @@action_dispatcher = [ nil,
    method(:action_1),
    method(:action_2),
    ...

Then do_action became:

def do_action(i, c)
  @@action_dispatcher[i].call(c)
end

All this made a remarkable difference. [This sucks. -Ed. I’m not sure. -Tim] But I was still way behind REXML.

Second Cut · After spending more quality time waiting for the profiler, it became obvious that my per-char main loop was killing me. Here’s the actual code that’s run, with the input character living in c and the automaton driver (see above) in b :

@buf << c if @saving
index = (@state * 128) + b
@to = nil
action = do_action(@@actions[index], c) if @@actions[index] != 0

# turn crank unless action already set the next state
@to = @@machine[index] unless @to
@state = @to
...

It was burning the vast majority of my CPU. Not having any brilliant ideas how to optimize it, it struck me that I ought to do it less.

At this point I had an inspiration (which could have been used with the original Java-flavored Lark, too). An XML parser’s performance is especially interesting when the input is large. Large XML documents typically contain large runs of text content, and the state machine is going to spin its way stupidly through these, saying, for each character, “is this < or & ? Nope? Save it and move right along, nothing to see here.”

So I refactored away, changing the input subsystem to turn its input buffer into a list of buffers, some of them one character in length and containing either < or & , and others, any length at all, guaranteed not to contain either magic syntax character.

Then the main parser loop, each time around, would check whether it was in the well-known plain-text-reading-state, and if so, was it more than one character into the input buffer, and if so, ask the input system for the whole buffer, known not to contain any syntax.

This successfully cut down the number of times through the main loop. Unfortunately, it also made everything slower. Why?

Third Cut · Let’s have fun waiting for the profiler again, shall we? Gack! Array#index and Fixnum#== are eating my brain! You see, I’d exploded the input string, be it UTF-8 or UTF-16 or whatever, into an array of Unicode character integers, and was using index to find the syntax characters. Silly, silly me. At this point I reflected on the downside of making classes like Fixnum and Array open, and took a couple of days off to think.

Fourth Cut · Well, REXML is faster, I thought, and it’s regexp-based, so regexps must be fast, unlike case constructs and Array#index . So I buckled down and made my input subsystem much more complex; in the case where the input is something Ruby regular expressions can handle (as in, most single-byte encodings) I used regular expressions to chop up the input strings.

@bufs_waiting = buf.scan /<|&|[^<&]+/mu
@bufs_waiting.map! { |b| b.unpack('U*') }

This made a huge difference; I was now within spitting distance of REXML. It turns out that unpack is real fast. Pity about the people who use UTF-16; I had to write horrid per-byte expansion logic for that case, and I haven’t measured it, but I bet it’s gonna suck.
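The big-endian case can at least lean on unpack too; here’s my BMP-only sketch (real code also has to pair up surrogates, which this ignores):

```ruby
# Expand UTF-16BE bytes into integer code points with unpack; 'v*'
# would handle little-endian. Surrogate pairs are not handled in this
# sketch, so it only covers the Basic Multilingual Plane.
def utf16be_codepoints(buf)
  buf.unpack('n*')
end
```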

Fifth Cut · Another run with the profiler [Row, row, row, your boat, gently down the stream/merrily, merrily, merrily, merrily, life is but a dream] suggested that Ruby method dispatch is simply not up to processing any nontrivial text on a per-character basis, so I changed the interface to the input subsystem from next_char to next_chars . Yep, that helped.

Conclusions · For C and Java programmers like me, Ruby is a new territory and things may not be what they seem. I’m OK with the case construct being what it is, and with using regexes to pick things apart. But I’m irritated about the lousy performance of integer arrays.

Here’s the thing: Inside the computer, arrays of numbers is all there are. Over the years, I’ve developed an instinct: if you can find a way to represent a problem so that you store and manipulate the data using arrays of integers, you usually win, big-time. This is one reason why finite automata can be made to run so fast. Apparently, in the world of Ruby, my instinct is wrong. The problem is really the open classes; because I might choose to override Array#index or Fixnum#== , the runtime isn’t allowed to notice that these are just integer arrays and emit low-level indirection-free code to treat them that way. So maybe Ruby needs a FrozenArray or FrozenFixnum class, to make bit-bangers like me happy.

Except that everyone says Smalltalk VMs can do this kind of thing real fast. Gilad Bracha, where are you?

As Of Now · The code is here. And here’s a profiler run showing everything that used 0.5% or more of the CPU time. RX is still only about half as fast as REXML. I haven’t given up yet, and I’ve learned a few things.