XML is a bouncing, thriving five-year-old now, and yet I've been feeling unsatisfied with it recently, particularly in my capacity as a programmer.

During the process of setting up ongoing, for the first time in a year or more I wrote a bunch of code to process arbitrary incoming XML, and I found it irritating, time-consuming, and error-prone.

Some other recent data points:

The programmers here at Antarctica had to write Flash code to use the MX built-in parser to read the XML that Visual Net generates, and found it sufficiently slow and irritating that we forked the XML output format, and there's another version designed for the Flash client.

Adam Bosworth, a programming titan (his résumé includes Quattro Pro, Access, and IE4), recently wrote convincingly about the undue hardship programmers face in dealing with XML.

At about the time I was setting up ongoing, Joe Gregorio posted a lengthy and compelling rant on the same subject, which got me thinking about writing this.

Programming Baskets · Some more background. Serious programming these days more or less all falls into three baskets:

The scripting tribe: Perl, Python and their friends, beloved of input-data wranglers and website gluers everywhere.

The O-O factory, now chiefly represented by Java and C#, where the Big Company Programmers building Big Systems on Big Iron live.

The close-to-the-metal gang, which is still mostly C and some C++. This is where you live if you write Apache modules or Linux infrastructure or Perl/Python extensions.

I think all of these communities are having more trouble than they really ought to with XML. Oddly enough, the problem isn't in writing the XML processor, which isn't that hard; just look at the number that are out there. The difficulty is in using one.

An XML-Oriented Programming Language? · One response has been a suggestion that we need a language whose semantics and native data model are optimized for XML. That premise is silly on the face of it; here are two reasons why:

Some decades after the advent of the relational database, we have not seen programming languages center themselves around normalized data models; in fact, the movement away from the C struct-centered worldview to O-O code+data encapsulation is really a move away from the tabular paradigm. You can embed SQL in most languages now, but normally you don't implement any serious business logic in it. If this hasn't happened after decades in the relational world, why would we expect it to happen in the XML world?

The notion that there is an "XML data model" is silly and unsupported by real-world evidence. The definition of XML is syntactic: the "Infoset" is an afterthought and in any case is far indeed from being a data model specification that a programmer could work with. Empirical evidence: I can point to a handful of different popular XML-in-Java APIs, each of which has its own data model and each of which works. So why would you think that there's a data model there to build a language around?

Life in the Scripting Basket · As regards XML, I've been living in the land of scripting generally and Perl specifically in recent times; the internals of the Antarctica runtime codebase are all C, the back end has Java and C++, but these all build and manage internal data structures that look nothing like XML, and the XML we generate is via the venerable printf()-plus-markup-escaping approach.

That leaves input data munging, which I do a lot of, and a lot of input data these days is XML. Now here's the dirty secret: most of it is machine-generated XML, and in most cases, I use the Perl regexp engine to read and process it. I've even gone to the length of writing a prefilter to glue together tags that got split across multiple lines, just so I could do the regexp trick.
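To make the prefilter idea concrete, here is a minimal sketch of what such a thing might look like. It is not the actual Antarctica code. It assumes well-behaved machine-generated input in which `<` and `>` never appear inside attribute values, comments, or CDATA sections (real XML allows `>` in attribute values, so this is no general solution), and the name `glue_split_tags` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: joins physical lines until every '<' seen so far has
# a matching '>', so that each returned line contains only whole tags.
sub glue_split_tags {
    my @lines = @_;
    my (@glued, $pending);
    $pending = '';
    for my $line (@lines) {
        chomp $line;
        if ($pending ne '') {
            $line =~ s/^\s+//;           # drop the continuation indent
            $pending .= " $line";
        }
        else {
            $pending = $line;
        }
        my $opens  = () = $pending =~ /</g;
        my $closes = () = $pending =~ />/g;
        next if $opens > $closes;        # a tag is still open: keep gluing
        push @glued, $pending;
        $pending = '';
    }
    push @glued, $pending if $pending ne '';   # unterminated leftovers
    return @glued;
}

# Made-up sample input with one tag split across two lines.
my @out = glue_split_tags(
    qq{<entry id="1"\n},
    qq{       title="Hello">\n},
    qq{<p>text</p>\n},
);
print "$_\n" for @out;
```

Once each tag sits on one line, the usual line-at-a-time regexp loop can be pointed at the output, with all the fragility that implies for input that breaks the assumptions above.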

The reasons are not complicated: if I use any of the Perl+XML machinery, it wants me either to let it read the whole thing and build a structure in memory, or go to a callback interface.

Since we're typically reading very large datasets, and typically looking at the vast majority of it, preloading it into a data structure would be impractical, not to say stupid. Thus we'd be forced to use parser callbacks of one kind or another, which is sufficiently non-idiomatic and awkward that I'd rather just live in regexp-land.

When I came to do ongoing, I decided as a matter of principle that the input had to be XML and had to be read with a real XML processor. Since, once again, I was going to be using every byte of every file, I decided that loading it all into an in-memory data structure so I could run through it in order was egregiously stupid, and went with callbacks. Which are irritating.
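For anyone who hasn't met the callback style, here is a minimal sketch of its general shape. It assumes the CPAN module XML::Parser (nothing above names the parser actually used), and the element names, handler names, and sample input are all invented for illustration.

```perl
use strict;
use warnings;
use XML::Parser;    # CPAN module; an assumption, not necessarily what ongoing uses

# State has to live outside the handlers, because each callback sees only
# a single parse event and has no memory of its own.
my ($in_title, $title, @images) = (0, '');

sub on_start {
    my ($expat, $element, %attr) = @_;
    $in_title = 1 if $element eq 'title';
    push @images, $attr{src} if $element eq 'img' && defined $attr{src};
}

sub on_char {
    my ($expat, $text) = @_;
    $title .= $text if $in_title;    # text may be delivered in several pieces
}

sub on_end {
    my ($expat, $element) = @_;
    $in_title = 0 if $element eq 'title';
}

my $parser = XML::Parser->new(
    Handlers => { Start => \&on_start, Char => \&on_char, End => \&on_end },
);

# Made-up sample document, not the real entry format.
$parser->parse(
    '<entry><title>Hello</title><img src="a.jpg"/><img src="b.jpg"/></entry>'
);

print "title: $title\n";
print "image: $_\n" for @images;
```

The control flow belongs to the parser rather than to the program, which is the source of the awkwardness: the logic is scattered across handlers and tied together only through shared variables.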

The program that writes ongoing sets up for processing an entry by initializing a bunch of global state variables, unleashes the XML parser, and stands back. I've been writing Perl since 1993 or so and this just feels awkward and unnecessary. The canonical Perl program, in my idiom anyhow, looks something like:

    my ($state_var1, $state_var2) = (0, '');
    my (%collector1, $collector2);
    while (<STDIN>) {
      next if (/regexp-for-something-I-ignore/);
      if (/something-I'm interested-in/) {
        $state_var1 = &foo($1, $4, \%collector1);
      }
      elsif (/something-else/) {
        $state_var2 = &bar($_, $state_var1);
      }
      elsif (/yet another/) {
        $state_var1 = $state_var2 + $collector1{baz};
      }
      else {
        print;
      }
    }

This may feel primitive to the O-O heavies out there, but it's the way a lot of the Net is stitched together.

I'm not sure what the right solution to the XML awkwardness is in O-O land or close-to-the-metal-ville, but I'm pretty damn sure what I'd like to see in Scripting Village. By example:

    while (<STDIN>) {
      next if (X<meta>X);
      if (X<h1>|<h2>|<h3>|<h4>X) {
        $divert = 'head';
      }
      elsif (X<img src="/^(.*\.jpg)$/i>X) {
        &proc_jpeg($1);
      }
      # and so on...
    }

The idea is that the element-ish and attribute-y syntax in regexps abstracts away all the XML syntax weirdness, ignoring line-breaks, attribute orders, choice of quotemarks and so on. I've invented some Perl syntax off the top of my head, which is a highly dangerous thing to do, particularly in the fraught land of regexps, particularly since the Perloids are re-inventing all that right now in the Perl6 project; so let's be clear that the above is not a serious syntax proposal. But essentially, I want to have my idiomatic regexp cake and eat my well-formed XML goodness too. Too much to ask?

Out of the Scripting Basket · I suspect there are parallel proposals to be made for the people who live in the O-O and close-to-the-metal worlds, but they don't leap to the front of my mind. I will make one slightly-brave prediction though: I think that the stream-processing mode of reading and using XML is going to occupy a substantial part of the landscape no matter which basket you're living in; the costs of the alternatives are frequently going to be just too high.

So I think the key first step is to make XML stream processing idiomatic in as many programming languages as possible. Rumor has it that the .NET CLR is going the right way on this one, but I haven't been there.
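As a rough illustration of what idiomatic stream processing could feel like in a scripting language, here is a sketch of a pull-style loop in which the program asks for the next node instead of being called back. It assumes the CPAN module XML::LibXML::Reader, which is my choice for the example and not something this piece relies on; the sample document and variable names are invented.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # CPAN module; an assumption made for this sketch

# Made-up input: note the attribute order, the single quotes, and the tag
# split across lines, all of which the parser absorbs.
my $xml = <<'XML';
<doc>
  <meta name="x"/>
  <h2>Heading</h2>
  <img alt="pic"
       src='photo.JPG'/>
</doc>
XML

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot set up reader\n";

my ($divert, @jpegs) = ('');
while ($reader->read) {
    # Look only at element start nodes; skip text, whitespace, end tags.
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    my $name = $reader->name;
    next if $name eq 'meta';
    if ($name =~ /^h[1-4]$/) {
        $divert = 'head';
    }
    elsif ($name eq 'img') {
        my $src = $reader->getAttribute('src');
        push @jpegs, $src if defined $src && $src =~ /\.jpg$/i;
    }
}
print "divert=$divert jpegs=@jpegs\n";
```

The loop keeps the familiar while-and-match shape and local state of the line-at-a-time idiom, while the document is still read as a stream and never loaded whole into memory.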

I guess I ought to say in closing that even given the irritation which programmers encounter in dealing with XML, the benefits are sufficient that the current trend toward using it as the interchange format for more or less everything still seems sound. But we can make people's lives easier, I think.