Response to “Notes on HTML 5”

The W3C Technical Architecture Group ( TAG ) is in the process of reviewing HTML 5. Noah Mendelsohn recently posted his initial, personal, not-speaking-on-behalf-of-TAG notes on HTML 5. Here are my initial, personal, not-speaking-on-behalf-of-WHATWG responses.

Limitations of the XML serialization

This may be old news, but I was surprised to see that document.write() is not supported when parsing the XML serialization. This seems to put the nail in the coffin of XML as a serialization format for colloquial HTML. I understand that there are a variety of issues in making a sensible definition of how this would work, but my intuition is that it could be done reasonably cleanly (albeit not with most off-the-shelf XML parsers).

Many, many things helped drive the nail in the coffin of XML as a serialization format for colloquial HTML. This was probably one of them. Others that come to mind:

Draconian error handling enforced at runtime does not scale to the complexities of modern-day web applications. Ensuring well-formedness becomes increasingly difficult when content is dynamically cobbled together from multiple sources, some of which are beyond your control (user-generated content, third-party ad servers, and so on).

It provides no perceivable benefit to users. Draconianly handled content does not do more, does not download faster, and does not render faster than permissively handled content. Indeed, it is almost guaranteed to download slower, because it requires more bytes to express the same meaning -- in the form of end tags, self-closing tags, quoted attributes, and other markup which provides no end-user benefit but serves only to satisfy the artifical constraints of an intentionally restricted syntax.

IE never supported it, forcing you down the rabbit hole of polyglot documents.

Other applicable specifications

"Authors must not use elements, attributes, and attribute values for purposes other than their appropriate intended semantic purpose. Authors must not use elements, attributes, and attribute values that are not permitted by this specification or other applicable specifications." This is one of the most important sentences in the entire specification, but it's somewhat vague. If "other applicable specifications" means: any specification that anyone claims is applicable to HTML 5 extension, then we can extend the langauge with most any element without breaking conformance; if "applicable specifications" is a smaller (or empty) set, then this may be saying that HTML 5 has limited (or zero) extensibility.

The phrase "or other application specifications" is indeed quite important, and quite intentional. It explicitly allows something that previous HTML specifications only implicitly allowed, or disallowed-but-everyone-ignored-that-part: future specifications which extend the vocabulary of previous specifications. Ian Hickson uses RDFa as an example: "If an RDFa specification said that text/html could have arbitrary xmlns:* attributes, then the HTML5 specification would (by virtue of the above-quoted sentence) defer to it and thus it would be allowed. ... Of course, if a community doesn't acknowledge the authority of such a spec, and they _do_ acknowledge the authority of the HTML5 spec, then it would be (for them) as if that spec didn't exist. Similarly, there might be a community that only acknowledges the HTML4 spec and doesn't consider HTML5 to be relevant, in which case for them, HTML5 isn't relevant. This is how specs work."

In response to a question about how validators could possible work with such flexible constraints, Ian Hickson writes, "The same way validators work now. The validator implementors decide which specs they think are relevant to their users. The W3C CSS validator, for instance, can be configured to check against CSS2.1 rules or against SVG rules about CSS. It can't be configured to check against CSS2.1 + the :-moz-any-link extension, because the CSS validator implementors have decided that CSS2.1 and SVG are relevant, but not the Mozilla CSS extensions. ... The W3C HTML validator, similarly, supports checking a document against XHTML 1.1, or XHTML + RDFa, or XHTML + SVG, but doesn't support checking it against XHTML + DOCBOOK. So its implementors have decided that RDFa, SVG, and XHTML are relevant, but DOCBOOK is not."

People who think that such lack of constraints can only lead to madness go strangely silent when it is pointed out that they have already violated such constraints, and the world failed to end as a result.

URL terminology

The HTML 5 draft uses the term URL, not URI.

I was under the impression that HTML 5 uses the term "URL" because that's what everyone else in the world uses (outside a few standards wonks who know the difference between URLs, URIs, IRIs, XRIs, LEIRIs, and so on). A few minutes of research supported my impression. Quoth Ian Hickson: "'URL' is what everyone outside the standards world calls them. The few people who understand what on earth IRI, URN, URI, and URL are supposed to mean and how to distinguish them have demonstrated that they are able to understand such complicated terminology and can deal with the reuse of the term 'URL'. Others, who think 'URL' mean exactly what the HTML5 spec defines it as, have not demonstrated an ability to understand these subtleties and are better off with us using the term they're familiar with. The real solution is for the URI and IRI specs to be merged, for the URI spec to change its definitions to match what 'URL' is defined as in HTML5 (e.g. finally defining error handling as part of the core spec), and for everyone to stop using terms other than 'URL'."

It should be noted that not everyone agrees with this. For example, Roy Fielding (who obviously understands the subtle differences between URLs and other things) recently stated: "Use of the term URL in a manner that directly contradicts an Internet standard is negligent and childish. HTML5 can rot until that is fixed." Maciej Stachowiak, recently appointed co-chair of the W3C HTML Working Group, recently stated: "We need to get the references in order first, because whether HTML5 references Web Address, or IRIbis, or something else, makes a difference to what we'll think about the naming issue. We need to decide as a Working Group if it's acceptable to use the term URL in a different way than RFC3986 (while making the difference clear). If it's unacceptable, then we need to propose an alternate term."

As the old saying goes, "There are only two hard problems in Computer Science: cache invalidation and naming things." I personally don't care about the matter either way, but there is obviously a wide spectrum of opinion.

IRI-bis

It's unclear whether the factoring to reference WebAddr and/or IRI-bis will be retained.

"WebAddr" refers to Web Addresses, a now-defunct proposal to split out the definition of "URL" that HTML 5 uses (which intentionally differs from the "official" definition in order to handle existing web content). The work on the "Web Addresses" specification has now been rolled into IRI-bis; "bis" means "next," so "IRI-bis" means "the next version of the IRI specification." According to Ian Hickson, so important definitions were lost in the process of splitting out "Web Addresses" from HTML 5 and subsequently merging "Web Addresses" into IRI-bis. There is also some feedback about newlines within URLs.

Work on IRI-bis is ongoing. As it relates to HTML 5, it is tracked as HTML ISSUE-56.

Content sniffing

HTML 5 calls for user agents to ignore normative Content-type in certain cases.

HTML 5 calls for user agents to ignore normative Content-Type in certain cases because this is required to handle existing web content. Based on [PDF] the research of Adam Barth and others into the content sniffing rules of a certain closed-source market-dominating browser, progress has been made towards reducing the amount of content sniffing on the public web. (Counterpoint: two steps forward, one step back.)

Personally, I have long been opposed to content sniffing, but if sniffing is going to occur, I would vastly prefer documented algorithms to undocumented ones. The "hope" behind documenting the sniffing rules now is that the web community can "freeze" the rules now and forever, i.e. not add any more complexity to an already complex world. I am personally skeptical, since HTML 5 also introduces (or at least promotes-to-their-own-elements) two new media families, audio and video, for which undocumented or underdocumented sniffing may already occur within proprietary browser plug-ins. And with @font-face support shipping or on the verge of shipping in multiple browsers, there may be new sniffing rules introduced there as well. I hope my concerns are unfounded.

Still, having sniffing rules documented in HTML 5 may -- someday soon -- reduce the complexity of a shipping product. And how often does that happen?

Willful violations

HTML 5 acknowledges in several places that it is in "willful violation" of other specifications from the W3C and IETF.

As stated in §1.5.2 Compliance with other specifications, "This specification interacts with and relies on a wide variety of other specifications. In certain circumstances, unfortunately, the desire to be compatible with legacy content has led to this specification violating the requirements of these other specifications. Whenever this has occurred, the transgressions have been noted as 'willful violations'."

This is the complete list of "willful violations" in the August 25th W3C Editor's Draft of HTML 5. (The WHATWG draft changes almost daily, whenever a change is checked in.)

As you can probably guess from these quotes, the HTML 5 community has decided that compatibility with existing web content trumps all other concerns. Other than the use of the term "URL," all of the "willful violations" are instances where other specifications do not adequately describe existing web content. I personally do not understand most of the issues listed here, so I have no opinion on whether the alleged benefit of violating existing standards is worth the alleged cost.

Versioning

In-band global version identifiers, if new implementations handle them reasonably, may be useful for (a) authoring applications that want to track versions used for authoring (b) informative error handling when applications encounter constructs that are apparently 'in error'.

The idea of version identifiers has been hashed and rehashed throughout the 5+ year process of defining HTML 5. Most notably, Microsoft proposed a version identifer shortly after the W3C HTML Working Group reformed around HTML 5. Their proposal generated much discussion, but was not ultimately adopted. Several months later, Microsoft shipped Internet Explorer 8 with a "feature" called X-UA-Compatible, which serves as a kind of IE-specific version identifier. I am personally not a fan of this approach, in part because [PNG] it adds a lot of complexity for web developers who want to figure out why the still-dominant browser doesn't render their content according to standards.

Versioning in HTML 5 is tracked as HTML ISSUE-4.