This fragment is mostly a note to myself and a placeholder, and might prove useful to someone slashing through the XML undergrowth with bleeding-edge Ruby. Briefly: I revived my “RX” Ruby tokenizer (see here, here, and here) to contribute to Antonio Cangiano’s proposed Ruby benchmark suite, which I think is a Really Good Idea. I had a bit of pain getting the code to run on both Ruby 1.8 and 1.9, and then when I tried sanity-checking the output by comparing it to REXML on 1.9, REXML blew chunks. There are, apparently, issues between REXML and 1.9. Read on for details in the unlikely event that you care about any of this.

Benchmarking · The problem is that there are now a lot of plausible-looking Ruby implementations (MRI, YARV, JRuby, Rubinius, IronRuby, MagLev), and it would be nice to compare their performance. I was talking to some of the implementers about this and someone (Charles Nutter, I think) said “Problem is, there’s this huge gap between running fib() and running Rails.” So, for example, how do we find out how fast MagLev will run Rails without going through all the pain of making MagLev run Rails?

Antonio Cangiano sensibly proposed “Let’s create a Ruby Benchmark Suite”; when Avi Bryant told me he’d tried my RX code on MagLev, it occurred to me that it might make an interesting benchmark.

RX refresher: It’s a pure automaton-based XML tokenizer whose performance depends almost entirely on the efficiency of dereferencing integer arrays, and it turns out that mainstream Ruby really sucks at this. To make it a little more competitive with REXML, the de facto standard Ruby XML parser, I had kludged it all over the place with regex preprocessing to cut down on the array traffic.
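To make the array-dereferencing dependence concrete, here’s a minimal sketch of the table-driven technique; the states, character classes, and transition table below are invented for illustration, and RX’s real tables are far larger. The point is that the hot loop does nothing but index into integer arrays:

```ruby
# Hypothetical two-state automaton: count "name" tokens (runs of
# lowercase letters) in a string. Not RX's actual tables.

# Character classes: 0 = name character, 1 = anything else.
CLASS_OF = Array.new(256, 1)
('a'.ord..'z'.ord).each { |b| CLASS_OF[b] = 0 }

# TRANS[state][char_class] => next state.
# State 0 = outside a name, state 1 = inside a name.
TRANS = [
  [1, 0],  # outside: a name char enters a name, anything else stays out
  [1, 0],  # inside:  a name char stays in, anything else leaves
]

def count_names(str)
  state = 0
  names = 0
  str.each_byte do |b|
    nxt = TRANS[state][CLASS_OF[b]]
    names += 1 if state == 0 && nxt == 1  # just entered a name
    state = nxt
  end
  names
end

count_names("foo bar baz")  # => 3
```

Every character costs two array dereferences, so an implementation that’s slow at that is slow at all of RX.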

So I asked Antonio whether, if I de-optimized RX into a pure array benchmark, it would be interesting for his suite; he said yes, so I did.

1.8.6 vs. 1.9 · Perhaps the single most visible difference between today’s Ruby and tomorrow’s is in the low-level string-handling API. Well, an XML parser lives entirely right there, so boy did I ever learn all about it. I had previously converted RX to run on 2006-vintage YARV, but I wanted one version of the code that would run in both 1.8.6 and 1.9. Sigh.
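The headline change, for what it’s worth: indexing a String with an integer returns a one-character String in 1.9, where 1.8.6 returned the byte value as a Fixnum. On a 1.9-or-later Ruby:

```ruby
s = "abc"

s[0]          # => "a"  (Ruby 1.8.6 returned 97, the byte value)
s.getbyte(0)  # => 97   (1.9's byte-oriented accessor)
s.bytesize    # => 3    (byte length, as opposed to character length)
```

An XML parser cares about exactly this distinction on every character it reads.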

Here’s one of the detail issues, to give a feeling for the problems. Suppose you know that your input stream is in UTF-8, you’ve read a buffer-full of data, and you want to turn it into Unicode integer characters for the parser. The problem is that the buffer might end in the middle of a multi-byte UTF-8 character. Easy enough; a glance at the trailing bytes will diagnose that. The question is, how do you pull out the unsigned-integer value of the last byte of a buffer, without processing through the whole (potentially large) buffer, with code that runs in both Rubies?

I poked around on IRC and Eric Hodel managed to improve on my original suggestion. Read it and weep:

def byte_at(s, i)
  s[i, 1].unpack('C')[0]
end
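For what it’s worth, here’s a sketch (not RX’s actual code) of putting byte_at to work on the buffer-boundary problem: walk back over UTF-8 continuation bytes (10xxxxxx), then check whether the lead byte promises more bytes than the buffer holds. It assumes the buffer was read in binary mode, so that string indexing is by byte in both Rubies:

```ruby
def byte_at(s, i)
  s[i, 1].unpack('C')[0]
end

# Total byte length of a UTF-8 character, judged from its lead byte.
def utf8_len(lead)
  return 1 if lead < 0x80
  return 2 if lead < 0xE0
  return 3 if lead < 0xF0
  4
end

# How many trailing bytes to hold back until the next read;
# 0 means the buffer ends on a character boundary.
def incomplete_tail(buf)
  i = buf.bytesize - 1
  i -= 1 while i >= 0 && (byte_at(buf, i) & 0xC0) == 0x80
  return 0 if i < 0  # nothing but continuation bytes; malformed input
  have = buf.bytesize - i
  utf8_len(byte_at(buf, i)) > have ? have : 0
end

incomplete_tail([0x61, 0xC3].pack('C*'))        # => 1 ("a" plus half an "é")
incomplete_tail([0x61, 0xC3, 0xA9].pack('C*'))  # => 0 (a complete "aé")
```

Only the last few bytes get examined, so the cost is constant no matter how big the buffer is.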

REXML Ouch · RX has a primitive unit-test suite; what I do to sanity-check it at a high level is feed a nontrivial XML document to both it and REXML and check that they find the same number of elements, PIs, paragraphs, img elements with a src= attribute whose value ends in .jpg, and occurrences of the word “the” in running text.
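The real harness isn’t shown here, but the REXML side of such a cross-check might look something like this sketch (the toy document and the particular counts are invented for illustration):

```ruby
require 'rexml/document'

# Count a few things any correct XML parser must agree on: total
# elements, img elements whose src ends in .jpg, and occurrences
# of the word "the" in text content.
def rexml_counts(xml)
  doc = REXML::Document.new(xml)
  {
    elements: REXML::XPath.match(doc, '//*').size,
    jpg_imgs: REXML::XPath.match(doc, '//img[@src]').count { |e|
      e.attributes['src'].end_with?('.jpg')
    },
    the: REXML::XPath.match(doc, '//text()').sum { |t|
      t.value.scan(/\bthe\b/).size
    }
  }
end

xml = '<doc><p>the cat sat on the mat</p>' \
      '<img src="a.jpg"/><img src="b.png"/></doc>'
rexml_counts(xml)  # counts for the toy document above
```

Run the same counts over RX’s token stream, and any disagreement means one of the two parsers is wrong.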

Well, when I finally got it running in Ruby 1.9, and started the sanity check, REXML blew up on my document, 2.8 Meg of the input to this blog.

With a bit of poking around, I ascertained that:

Well, there you go. By the way, in Ruby 1.9’s favor, it runs the (simplified, de-optimized) RX about three times as fast as 1.8.6. Any other implementers want a whack at it?