Thanks to the commenters on the previous RX piece who recommended ruby-prof (it installs as a gem), a much faster and thus better profiler than the built-in one. With it, I learned a few more things.

The commenter who hit closest to the mark was Aristotle Pagaltzis, who references Perl lore: “The fastest way to do something in Perl is frequently the one that implements the most costly step in the fewest ops.” The reasons will be obvious on a little thought, but the PerlMonks piece is worth reading.

Looping Blues · When I run the parser over 2,477,645 bytes of XML, the buffer-skipping trick means it only has to look at 484,021 characters individually, and that loop still burns 25% of the 13 seconds it takes on my PowerBook. I’ve looked at it pretty hard, sliced and diced the loop two or three different ways with the profiler, and squeezed a few improvements out, but at the end of the day, it’s pretty simple, sensible code. For now, my conclusion is that the current Ruby implementation is just not gonna be fast enough for any algorithm that requires looking individually at a nontrivial proportion of the characters.
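To make the contrast concrete, here is a minimal sketch of the two styles of scanning. It is not RX’s actual loop; the method names and the sample text are made up for illustration. The first version touches every code point from Ruby, while the second hands the whole search to a single built-in call, which is the “most costly step in the fewest ops” idea.

```ruby
LT = '<'.ord

# Per-character scan: every code point is examined by interpreted Ruby code.
# (Hypothetical helper, not taken from RX.)
def next_markup_loop(codepoints, start)
  i = start
  while i < codepoints.length
    return i if codepoints[i] == LT
    i += 1
  end
  nil
end

# Bulk scan: one call into String#index does the whole search.
# (Hypothetical helper, not taken from RX.)
def next_markup_bulk(text, start)
  text.index('<', start)
end

text = 'some character data<elem>more</elem>'
codepoints = text.unpack('U*')

first_loop = next_markup_loop(codepoints, 0)
first_bulk = next_markup_bulk(text, 0)
puts first_loop == first_bulk
```

Both find the same position; the difference is only in how many times the interpreter has to go around a loop to get there.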

Packing Blues · The other problem is that RX internally processes XML text as an array of numeric values identifying Unicode characters; but APIs are going to want to deal with Ruby strings, so the arrays have to be run through pack before handing data to whatever API is in use. This hurts; Array#pack was burning 11% of my time. What puzzled me was that Integer#to_int (a no-op) was getting called 2,354,147 times and burning another 6.59%.

So I poked around and it turns out that any_array.pack('U*') calls to_int on each element; duck-typing culture at work.
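The conversion is easy to see with a small experiment. This is a sketch, assuming a made-up wrapper class (CountingCodePoint is not part of RX or of Ruby) whose only job is to count how often pack asks it for an integer.

```ruby
# Hypothetical wrapper used only for this experiment.
class CountingCodePoint
  @calls = 0

  class << self
    attr_accessor :calls
  end

  def initialize(value)
    @value = value
  end

  # Array#pack converts each element with to_int when packing as 'U'.
  def to_int
    self.class.calls += 1
    @value
  end
end

wrapped = [0x52, 0x58].map { |cp| CountingCodePoint.new(cp) }
packed = wrapped.pack('U*')

puts packed
puts CountingCodePoint.calls
```

Anything that answers to_int is accepted, which is the duck-typing at work. Whether plain integers still pay for that method dispatch depends on the Ruby version in use.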

If there were such a notion as an array known to contain only Fixnums, or a pack-type operation that was allowed to throw an exception if its arguments weren’t Fixnums, things might be better.

Now, in the case where the input is already UTF-8, there’s a chance for a special-purpose hack to avoid the String->Array->String round-trip, but that feels both brutal and discriminatory.
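For what it’s worth, the shape of that hack would be roughly as follows. This is only a sketch, assuming the original UTF-8 text is still around and that offsets are kept in characters; the sample string and offsets are invented, and it relies on an encoding-aware String, which not every Ruby version has.

```ruby
# Sample UTF-8 input; in real use this would be the buffer that was read in.
text = "caf\u00e9 <b>na\u00efve</b>"

# The round-trip route: explode into code points, then pack a slice back.
codepoints = text.unpack('U*')
via_pack = codepoints[5, 3].pack('U*')

# The shortcut: slice the original string at the same character offsets,
# so no array and no pack are involved.
via_slice = text[5, 3]

puts via_pack == via_slice
```

The shortcut only works when the input encoding and the API’s string encoding happen to match, which is exactly what makes it discriminatory.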

Conclusion · The notion of picking one of the libxml- or expat-based Ruby libraries, maintaining it properly, and blessing it as the “right” way to do XML in Ruby is looking better and better.