It’s all over the news these days, because it’s A Good Thing: the Web will be smarter and faster and better. And for other reasons involving politics and vituperation. I love parts of HTML5, but it’s clear that other parts are a science project. And as a sometime standards wonk, I’m puzzled by aspects of the way the spec (not the language, the spec for the language) is put together.

What’s Good · I suspect I agree with most external observers: what’s cool are the new elements like video, audio, and canvas. And since I’m a protocols guy, the closely related Web Socket work; more on that below. I’ve also enjoyed how the video element has shone a remorseless and very useful light on the patent-troll infestation standing in the way of better Web multimedia.

Progress is well under way on implementing the pleasing parts of HTML5, and there are people thinking seriously that it may soon remove the need for compiled “native” applications on a variety of platforms.

That’s good!

What’s Bad · The process is clearly hard to manage. On a couple of occasions I’ve tried to take a sip or two of the HTML5 waters, and instantly been overwhelmed by the volume and intensity of the conversation; “drinking from a firehose” applies. It’s something that you really have to do full-time to do at all, I think.

It’s also self-evidently troubled. This week we have HTML5 Editor Ian Hickson publicly accusing Adobe of placing a “secret block” on the HTML5 spec. Adobe hotly denies it. Simon St. Laurent writes up the story and then hostilities break out in his comments.

Not a pretty picture.

Is it possible that they’ll fight through all this swampy stuff and get a good result? We’ll see.

The Networked-Object-Model Experiment · One of the distinguishing features of the Web is that it has never specified APIs or Object Models. Interoperability has been at the level of syntax: I send you these bits, here’s what they are defined to mean, in response you send me those bits, here’s what they’re defined to mean. And so on.

I have always felt that this is why the Internet and the Web took off so well, exceeding by orders of magnitude the deployment of other attempts to build networked application frameworks (CORBA, DCOM, Jini) that were based on objects and APIs. The lesson, it seems to me, is that we just don’t know how to do that, and interoperability should happen at the level of syntax.

The HTML5 draft seems to disagree. It provides detailed algorithms for parsing HTML, even in the face of severe syntax errors, and specifies how the results of parsing should be used to construct the Object Model. Thus, the syntax is ephemeral; the Object Model, interoperable across the network, is what matters.

The theory is that if all the User-Agent providers implement all these algorithms exactly as specified, complete interoperability will be achieved and people who build Web applications need no longer concern themselves with the differences between User Agents. Which would of course be wonderful.

Will it work? Nobody knows; it’s a science experiment. Just because nobody has ever succeeded in specifying a workable networked object model doesn’t mean this project is doomed to fail too. But it does mean that when considering the future of HTML5, we should recognize that this is a very hard problem, and there’s no guarantee that that part of it will come off.

Which may not matter that much; User-Agent implementors are increasingly subject to market pressure to be compatible, plus Web application authors increasingly work at a higher level, thinking in terms of things like Rails or jQuery constructs, thus insulating themselves somewhat from the compatibility nasties.

So for my money, I see little harm in the speculative parts of HTML5 if we get those tasty new elements, even at the current imperfect level of interoperability.

How To Spec? · [Note: At this point, I launch into a detailed discussion of the design of specifications for network protocols; the content will be of interest to a very small group of people, including almost nobody who just wants <video> to be here and work today.]

This was provoked by Joe Gregorio’s recent (amusing) Joel-in-a-box, calling out the excellence of the Web Socket protocol spec, which was produced by the same group and editor as HTML5, and is in a similar style. Joe admired the way it was “clearly directed at someone that is going to be implementing the protocol”, finding it refreshing compared to many other current RFCs. By the way, Joe did an outstanding job as co-editor of RFC5023.

So I went and read the Web Socket protocol and my reaction was more or less the opposite. I like the protocol and I gather it’s already been implemented and works. But I found the spec hard to read, amazingly long and complex for such an admirably simple protocol, and missing information that seemed important.

Like HTML5, it doesn’t just specify the bits to be interchanged and what they mean, it provides detailed algorithmic guidance, and I quote for flavor from Section 4.2 Data framing: “the user agent must run through the following state machine for the bytes sent by the server”. I assumed “must” meant “MUST”, and was relieved to find in Section 2. Conformance: “Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent.” Thus, we understand that the algorithms are provided for their explanatory value.

Let me deep-dive into a couple of the sections to examine the difference between styles of specification. I’ll start with the state machine mentioned above.

Framing · The section describing the data framing has six numbered top-level sections: three steps for receiving data and another three for sending it. The receiving-data part has two sub-lists of seven and five steps respectively. It’s all in an almost-pseudocode style and extends across a page and a half.

Here’s how framing’s done:

Messages sent by either side have to consist of Unicode characters encoded in UTF-8. They have to be framed by a leading 0x00 byte and a trailing 0xFF byte.

Either side has to accept (but discard) message frames whose leading bytes have the high bit set; the low seven bits of those bytes, read as base-128 digits, give the message length. Thus, a frame beginning with 0x81 0x82 0x83 0x04 has a length of (1 * 128 * 128) + (2 * 128) + 3, or 16643 bytes. (Presumably these are artifacts of an earlier version of Web Sockets?)

When clients see busted UTF-8, they replace the damaged text with U+FFFD REPLACEMENT CHARACTER. When servers see busted UTF-8, the behavior is undefined.
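That client-side behavior is, for instance, what Ruby’s String#scrub does; a quick illustration (mine, not the spec’s):

```ruby
# A truncated two-byte UTF-8 sequence: "café" with its last byte cut off.
broken = "caf\xC3".b.force_encoding("UTF-8")

broken.valid_encoding?  # => false
broken.scrub            # damaged byte becomes U+FFFD REPLACEMENT CHARACTER
```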

That’s all.
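Those rules fit in a few lines of Ruby; here’s a sketch of my own (the method names are mine, and this is illustration, not a conforming implementation):

```ruby
# Frame a text message for sending: a 0x00 byte, the UTF-8 bytes, a 0xFF byte.
def frame(message)
  "\x00".b + message.encode("UTF-8").b + "\xFF".b
end

# Length of a counted frame, per the rule above: leading bytes with the
# high bit set are base-128 digits; the first byte without it ends the count.
def counted_length(bytes)
  length = 0
  bytes.each do |b|
    break if (b & 0x80).zero?
    length = length * 128 + (b & 0x7F)
  end
  length
end

counted_length([0x81, 0x82, 0x83, 0x04])  # => 16643
```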

Headers · If you go to the first illustrative example near the top of the spec, 1.7 Writing a simple Web Socket server, you find:

Listen on a port for TCP/IP. Upon receiving a connection request, open a connection and send the following bytes back to the client:

48 54 54 50 2F 31 2E 31 20 31 30 31 20 57 65 62 20 53 6F 63 6B 65 74 20 50 72 6F 74 6F 63 6F 6C 20 48 61 6E 64 73 68 61 6B 65 0D 0A 55 70 67 72 61 64 65 3A 20 57 65 62 53 6F 63 6B 65 74 0D 0A 43 6F 6E 6E 65 63 74 69 6F 6E 3A 20 55 70 67 72 61 64 65 0D 0A 57 65 62 53 6F 63 6B 65 74 2D 4F 72 69 67 69 6E 3A 20

Send the ASCII serialization of the origin from which the server is willing to accept connections. For example: |http://example.com|

Continue by sending the following bytes back to the client:

0D 0A 57 65 62 53 6F 63 6B 65 74 2D 4C 6F 63 61 74 69 6F 6E 3A 20

At this point, I was wide-eyed; exactly what is going on here? Maybe I’m just not supposed to bother my pretty little head about what I’m sending down the pipe? So I poured the hex into a little scrap of Ruby to find out what I’d be sending:
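It needn’t be anything fancy; something like this does it (the dump is abbreviated here):

```ruby
# Convert the spec's space-separated hex dump back into the ASCII it encodes.
hex = "48 54 54 50 2F 31 2E 31 20 31 30 31 20 57 65 62"  # ... rest of the dump elided
puts hex.split.map { |h| h.to_i(16).chr }.join
```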

HTTP/1.1 101 Web Socket Protocol Handshake
Upgrade: WebSocket
Connection: Upgrade
WebSocket-Origin: server to accept connections from
WebSocket-Location: script to run

Gosh, that sure looks familiar. And, in fact, it turns out that the Web Socket protocol handshake is a lot like HTTP, in that the messages back and forth begin with a request or status line just like HTTP’s and continue with CRNL-separated name/value pair headers just like HTTP’s.

So, if I were an implementor, my first question would be “Can I use my existing HTTP header library to read and generate headers?”

The answer turns out to be “probably”. Web Sockets forbid continuation lines (good!) and in some cases require that headers appear in a particular order. It’s possible that your HTTP library might do continuations or store the headers up in a hash and spit them out in a different order.

In fact, if you go to Section 4.1 Handshake, you’ll find an algorithm with 24(!) steps detailing header handling. Steps 15 through 21, with conditionals and GOTOs, detail how to pick apart standard HTTP-header syntax: Name, colon, optional space, value, CRNL. Um, wow.
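For comparison, that same syntax stated declaratively fits in a single pattern; a Ruby sketch (the regex is mine, not the spec’s):

```ruby
# One header line: name, colon, optional space, value, CRNL.
HEADER = /\A([^:\r\n]+):[ ]?([^\r\n]*)\r\n\z/

name, value = HEADER.match("Upgrade: WebSocket\r\n").captures
# name => "Upgrade", value => "WebSocket"
```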

For the Purposes of Comparison · Creating a spec in the HTML5 style seems like a whole lot of work. The Web Socket draft is long, and contains mind-numbing amounts of detail, much of it replicated in multiple places in the draft.

I was a little uncomfortable that the draft leaves many of its design decisions unexplained (e.g. why client and server must read and discard message frames with a leading byte-count). I’m wondering if the extreme difficulty of writing a spec in this style leads to a certain amount of triage on such niceties.

More evidence of the difficulty is that although this is labeled as draft number 75, it’s still at what I’d call an early/middle state of maturity. There are obvious quality/consistency issues here and there: monocasing of header names, whether to give messages as hex byte sequences or ASCII literals, fractured text about error handling, fuzzification in the counted-frame description. Nothing terribly serious (I’ll submit reports in one of the appropriate places), and since there are apparently interoperable implementations, the spec empirically seems to work.

But I still found it strange and counter-intuitive. I think this argument between the traditional and HTML5 style of specification is interesting, maybe important, and will be with us for a while.

So, as a contribution to the discussion, I whipped up an alternate version with the procedural specifications replaced, where I thought reasonable, by declarative versions. I omitted all the sections that are the same as in the original version, and all the closing material. The top-level section numbers are the same, as are the subsections of Section 1, except that I added a Web Sockets and HTTP Headers subsection.

Please note:

I am not proposing to replace or amend the Web Sockets draft; this is purely for comparison of specification styles.

I am not claiming that my version of the spec is complete or accurate; I only put a couple of hours into this. On the other hand, if there are things I got really wrong, that’s useful information: I’m experienced at reading these things, so others are likely to go similarly astray.

The declarative version is much shorter. Is it better or worse? No, that’s the wrong question. For what kinds of specification is the HTML5 style better and for what kinds does the declarative style win?