First for the record, I’m speaking only for myself, not my employer, the W3C, Apple, Google, Microsoft, WWWAC, the DNRC, the NFL, etc.

XML 1.1 failed. Why? It broke compatibility with XML 1.0 while not offering anyone any features they needed or wanted. It was not synchronous with tools, parsers, or other specs like XML Schemas. This may not have been crippling had anyone actually wanted XML 1.1, but no one did. There was simply no reason for anyone to upgrade.

By contrast XML did succeed in replacing SGML because:

1. It was compatible. It was a subset of SGML, not a superset or an incompatible intersection (aside from a couple of very minor technical points no one cared about in practice).
2. It offered new features people actually wanted.
3. It was simpler than what it replaced, not more complex.
4. It put more information into the documents themselves. Documents were more self-contained. You no longer needed to parse a DTD before parsing a document.

To do better we have to fix these flaws. That is, XML 2.0 should be like XML 1.0 was to SGML, not like XML 1.1 was to XML 1.0. That is, it should be:

1. Compatible with XML 1.0 without upgrading tools.
2. Add new features lots of folks want (but without breaking backwards compatibility).
3. Simpler and more efficient.
4. Put more information into the documents themselves. You no longer need to parse a schema to find the types of elements.

These goals feel contradictory, but I plan to show they’re not, and to map out a path forward.



The technical requirement is to maintain compatibility. Existing parsers/tools still work. On a few edge cases, an XML 2.0 parser may report slightly different content than an XML 1.0 parser. E.g. more or less white space. However all XML 2.0 documents are well-formed XML 1.0 documents. It will be possible to write an XML 2.0 parser that is faster, simpler, and easier to use than an XML 1.0 parser. However XML 1.0 parsers will still be able to parse XML 2.0 documents. XML 2.0 parsers will not, however, be able to parse all XML 1.0 documents, just as XML parsers can’t parse arbitrary SGML.

The attitudinal requirement is that XML 2.0 be useful. It has to solve real problems for real users. And it has to solve all the problems XML 1.0 solves too.

Tim Bray’s Skunkworks met goals #1, #3, and #4, but it didn’t offer #2, so there was never a strong desire to implement it. Languages don’t get replaced simply because there’s something new that does the same thing a little more cleanly or a little better. They get replaced when there’s something new that does the old things better and does new things too.

What we take out

Internal DTD subsets

The internal DTD subset is responsible for much of the complexity and security issues with XML. Get rid of it. XML 2.0 documents cannot contain internal DTD subsets.
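As a rough sketch of how a tool could enforce this rule on top of an existing XML 1.0 parser, the following uses the doctype callback in Python's standard `xml.parsers.expat` module. The function name `has_internal_subset` is made up for illustration; only the expat handler itself is a real API.

```python
import xml.parsers.expat


def has_internal_subset(data: bytes) -> bool:
    """Return True if the document's DOCTYPE carries an internal DTD subset."""
    found = False

    def on_doctype(name, system_id, public_id, has_internal):
        # expat reports a true value here when the DOCTYPE contains [ ... ]
        nonlocal found
        found = bool(has_internal)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)
    return found


# A document like this would be rejected under the proposed rule.
print(has_internal_subset(b'<!DOCTYPE r [<!ENTITY e "x">]><r>&e;</r>'))
# A DOCTYPE with only an external identifier would pass.
print(has_internal_subset(b'<!DOCTYPE r SYSTEM "r.dtd"><r/>'))
```

A dedicated XML 2.0 parser would not need the callback at all; it could simply treat `[` inside the DOCTYPE as a syntax error.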

Validity

Validity is defined separately. The DOCTYPE declaration, if present, can point to schemas in any language with any validation rules. Perhaps we can use the public identifier to name the type of the schema and the system identifier to point to it. However validation is optional and outside the scope of the spec.

Namespace well-formedness is required

Build in namespaces. Require namespace well-formedness. All namespaces must be absolute URIs.

Neurotic and psychotic documents

All namespace prefixes must be declared on the root element. No prefix may have two different URIs in two different parts of the document. This may require rewriting of namespace prefixes when combining documents, e.g. with XInclude. This is uncommon, but if we have to do it we’ll do it.

Default namespaces may still be declared on any element.
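To make the rule concrete, here is a minimal sketch of a checker built on `xml.etree.ElementTree.iterparse`. It relies on the fact that `start-ns` events arrive before the `start` event of the element that carries the declaration, so any prefixed declaration seen after the first `start` event must sit below the root. The function name `misplaced_prefixes` is an invention for this example.

```python
import io
import xml.etree.ElementTree as ET


def misplaced_prefixes(data: bytes) -> list:
    """Return namespace prefixes declared anywhere other than the root element."""
    misplaced = []
    seen_root = False
    events = ET.iterparse(io.BytesIO(data), events=("start", "start-ns"))
    for event, payload in events:
        if event == "start":
            seen_root = True
        elif seen_root:
            prefix, _uri = payload
            # The default namespace (empty prefix) may still be declared anywhere.
            if prefix:
                misplaced.append(prefix)
    return misplaced


ok = b'<a:r xmlns:a="http://example.com/a"><a:c/></a:r>'
bad = b'<r><b:c xmlns:b="http://example.com/b"/></r>'
print(misplaced_prefixes(ok))   # []
print(misplaced_prefixes(bad))  # ['b']
```

Once every prefix is declared on the root, the second rule comes for free: an element cannot carry the same `xmlns:prefix` attribute twice, so no prefix can map to two URIs.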

CDATA sections

Eliminate CDATA sections. They’re unnecessary syntactic sugar and just lead to confusion among users and extra work for parser implementers. Ideally I’d also like to eliminate the special treatment of the three-character sequence ]]>. However, that might break existing parsers.

C1 controls

The only reason C1 controls show up in an XML document is because someone has mislabeled a Windows text file. By forbidding these characters we will catch this problem much earlier.
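A small sketch of the check, assuming the decoded text of the document is already in hand. It flags code points U+0080 through U+009F, which is where Windows-1252 punctuation ends up when the bytes are mislabeled as ISO-8859-1. The helper name `find_c1_controls` is hypothetical.

```python
import re

# The C1 control block: U+0080 through U+009F.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def find_c1_controls(text: str) -> list:
    """Return (offset, code point) pairs for each C1 control in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in C1_CONTROLS.finditer(text)]


# Windows-1252 curly quotes are the bytes 0x93 and 0x94.
raw = b"\x93quoted\x94"
print(find_c1_controls(raw.decode("iso-8859-1")))  # [(0, 'U+0093'), (7, 'U+0094')]
print(find_c1_controls(raw.decode("cp1252")))      # []
```

Decoded with the right label the same bytes become ordinary curly quotes, which is why a hard error on C1 controls surfaces the mislabeling at parse time.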

There are likely some other Unicode compatibility characters we should forbid as well.

DOM and the Infoset

Abandon DOM. Abandon the Infoset. They’re confusing and not what users want or need. Folks can still use them–XML 2.0 is a subset of XML 1.0, after all–but they have no normative standing.

Encourage a variety of different APIs and data models appropriate to their respective domains and languages. However be very clear that the actual text of the document is the normative form. The data model is a representation of the text. The text is not a serialization of the data model.

White space preservation

White space is significant inside an element if and only if xml:space="preserve". Otherwise all consecutive white space is collapsed to a single space.

Alternatively, provide a means of identifying elements in which white space should be preserved in the prolog, e.g. through a processing instruction.

There are counter-examples to this–e.g. the HTML pre element–but I think this is what most people want most of the time, so it makes sense to make it the default and make white space preservation the one that requires special casing. Some thought is needed to figure out the algorithm though, especially for white space like this

<foo> bar </foo>

and this

<foo>bar</foo>

and this

The <em>quick</em> <strong>brown</strong> fox jumped over the lazy dog.

Exactly which white space is retained, and to which element is it assigned?
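One possible answer, purely as a starting point for discussion: collapse every run of white space to a single space, leave each piece of text attached to the node where an XML 1.0 parser already puts it, and switch the behavior off under xml:space="preserve". The sketch below applies that rule to an `xml.etree.ElementTree` tree. The function name `collapse` and the choice not to trim leading or trailing spaces are assumptions of this example, not settled design.

```python
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(elem, preserve=False):
    """Collapse white space runs in elem's text content, in place."""
    mode = elem.get(XML_SPACE)
    if mode == "preserve":
        preserve = True
    elif mode == "default":
        preserve = False
    if not preserve and elem.text:
        elem.text = _WS_RUN.sub(" ", elem.text)
    for child in elem:
        collapse(child, preserve)
        # Text after a child belongs to elem's content, so elem's mode applies.
        if not preserve and child.tail:
            child.tail = _WS_RUN.sub(" ", child.tail)
    return elem


doc = ET.fromstring("<p>The <em>quick</em>   <strong>brown</strong>\n fox</p>")
print(ET.tostring(collapse(doc), encoding="unicode"))
# <p>The <em>quick</em> <strong>brown</strong> fox</p>
```

Under this rule the space between the em and strong elements survives as a single space owned by the parent, and `<foo> bar </foo>` keeps one space on each side. Whether those edge spaces should be trimmed instead is exactly the open question.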

Whatever way we go here, use the same rules for all attribute values. Attribute value normalization was a mistake in XML 1.0 anyway. Drop it.

Most character encodings

Use XML 1.0 name rules but base them on Unicode properties for all non-ASCII characters. Provide a means of identifying the Unicode version in use. Default to Unicode 2.0, unless the document declares otherwise. This is the change I’m least interested in, because it may break compatibility, and has no known actual use cases. That is, no one has ever been able to present a document that any actual user wants to produce that cannot be encoded with XML 1.0 name rules.

Require UTF-8 or UTF-16 exclusively. In fact, maybe just require UTF-8. No other encodings are permitted. Use the encoding declaration to identify the Unicode version and recognize XML 2.0? e.g.

<?xml version="1.0" encoding="Unicode_5"?>

Fallback to an XML 1.0 processor if this doesn’t work.

It does feel a little ugly to specify version="1.0" for what I’m calling XML 2.0. However, as long as the document adheres to the XML 1.0 grammar and constraints, this is completely legal. Maybe we shouldn’t even call this a new version of XML but give it a new name. Perhaps YML (because Y comes after X)?

standalone declaration

This means nothing in practice. No one relies on it. Get rid of it.

What we add

More entities

Predefine the HTML 4 character entity set. Otherwise eliminate all general entity references. We can make this work with a required system ID that points to a DTD containing the definitions. Of course XML 2.0 parsers will not actually load this DTD, only XML 1.0 parsers will need to load it.
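As an illustration of what "predefined" could mean for a thin layer sitting on top of today's parsers, the sketch below rewrites HTML 4 entity names as numeric character references before handing the text to an XML 1.0 parser, so no DTD has to be loaded. It uses the HTML 4 table that ships with Python as `html.entities.name2codepoint`. The function name `expand_html4_entities` is hypothetical, and the textual substitution assumes the document has no CDATA sections (which this proposal removes anyway); it would also touch references inside comments.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML 1.0 already predefines are left for the parser.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_html4_entities(text: str) -> str:
    """Rewrite HTML 4 named entities as numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched; the parser will judge it
        return f"&#{name2codepoint[name]};"
    return _ENTITY_REF.sub(replace, text)


source = "<p>caf&eacute;&nbsp;&amp; bar</p>"
print(expand_html4_entities(source))  # <p>caf&#233;&#160;&amp; bar</p>
print(repr(ET.fromstring(expand_html4_entities(source)).text))
```

Any name outside the HTML 4 set is passed through unchanged, so the underlying parser still reports it as an undefined entity.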

Links

Build in xml:base and xml:id .

Build in some form of simple links sufficient for use of HTML. Perhaps just xml:link or xml:href , nothing fancier. This contains a URL, and is normally considered semantically equivalent to an HTML <a href=''> . I.e. it’s a blue underlined thing you click on.

Possibly we should even ditch the namespace prefixes. E.g. base, id, and link / href attributes will simply be defined with these semantics.
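To show how little machinery the built-in link attributes would need, here is a sketch that walks a tree and resolves each link against the inherited xml:base. It assumes the xml:href spelling from the options floated above; that attribute does not exist in any current spec, and the function name `resolve_links` is likewise made up.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def resolve_links(elem, base=""):
    """Yield an absolute URL for every xml:href, honoring inherited xml:base."""
    # A nested xml:base is itself resolved against the base in scope.
    base = urljoin(base, elem.get(XML_NS + "base", ""))
    href = elem.get(XML_NS + "href")  # hypothetical attribute
    if href is not None:
        yield urljoin(base, href)
    for child in elem:
        yield from resolve_links(child, base)


doc = ET.fromstring(
    '<doc xml:base="http://example.com/docs/">'
    '<ref xml:href="ch1.html"/>'
    '<figs xml:base="img/"><ref xml:href="a.png"/></figs>'
    '</doc>'
)
print(list(resolve_links(doc)))
# ['http://example.com/docs/ch1.html', 'http://example.com/docs/img/a.png']
```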

Data Structures and Types

The biggest lack of XML 1.0 is a standard means of encoding basic data structures and types used in programming: lists, sets, maps, structs, ints, floats, etc. This is why JSON is so popular. It’s not that these things can’t be encoded in XML 1.0, just that there are so many ways they can be encoded, and libraries provide no support for decoding them.

To support this use case, we will allow an optional xml:type or perhaps just type attribute on all elements. The value is a name from a type library such as XSD primitive types. Predefine a basic type library with the minimal types: integer, decimal, string, boolean, date, time. For example,

<sku type="string">H647345</sku>
<date type="date">2010-10-12</date>
<quantity type="integer">12</quantity>
<price type="decimal" units="dollars">3.45</price>
<price type="decimal" units="percent">7.25</price>

The default simple type is string. I.e. we can instead write

<sku>H647345</sku>

We do not want the full set of XSD types. In particular, we do not want float, double, short, int, and long. Integers have arbitrary size, and real numbers are expressed in base 10. Parsers may express these with more or less precision as they choose.
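Here is a sketch of the "cooked" values a library could hand back for the six minimal types, written as a layer over an ordinary XML 1.0 parser. Python's `int` is arbitrary precision and `Decimal` is base 10, which matches the intent above. The names `DECODERS` and `decode_scalar`, and the accepted boolean spellings, are assumptions made for this example.

```python
from datetime import date, time
from decimal import Decimal
import xml.etree.ElementTree as ET


def _boolean(text):
    # Assumption: accept the same lexical forms as xsd:boolean.
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


DECODERS = {
    "string": str,
    "integer": int,        # arbitrary size
    "decimal": Decimal,    # base 10, no binary rounding
    "boolean": _boolean,
    "date": date.fromisoformat,
    "time": time.fromisoformat,
}


def decode_scalar(elem):
    """Convert an element's text according to its type attribute."""
    kind = elem.get("type", "string")  # the default simple type is string
    text = elem.text or ""
    if kind != "string":
        text = text.strip()
    return DECODERS[kind](text)


print(decode_scalar(ET.fromstring('<quantity type="integer">12</quantity>')))
print(decode_scalar(ET.fromstring('<price type="decimal" units="dollars">3.45</price>')))
print(decode_scalar(ET.fromstring('<date type="date">2010-10-12</date>')))
print(decode_scalar(ET.fromstring('<sku>H647345</sku>')))
```

An unknown type name raises `KeyError` in this sketch; a real library would need a policy for that case.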

We also want list and map types and maybe set:

<crew type="list">
  <name>Fred</name>
  <name>Jane</name>
  <name>Bob</name>
</crew>
<dimensions type="map">
  <width type="decimal" units="cm">34.3</width>
  <length type="decimal" units="cm">120.0</length>
  <height type="decimal" units="cm">3.10</height>
</dimensions>

Here I’ve made the keys simply the element names. Possibly with maps we want to allow or require that the keys also be attributes or even elements, which would enable a broader range of key types.
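With element names as keys, decoding the two collections takes only a few lines. This is a sketch under that assumption; the function name `decode` is invented, and only the string and decimal scalars are handled to keep it short.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET


def decode(elem):
    """Turn a typed element into a Python list, dict, Decimal, or str."""
    kind = elem.get("type", "string")
    if kind == "list":
        return [decode(child) for child in elem]
    if kind == "map":
        # Keys are simply the child element names.
        return {child.tag: decode(child) for child in elem}
    if kind == "decimal":
        return Decimal(elem.text)
    return elem.text or ""


crew = ET.fromstring(
    '<crew type="list"><name>Fred</name><name>Jane</name><name>Bob</name></crew>'
)
print(decode(crew))  # ['Fred', 'Jane', 'Bob']

dimensions = ET.fromstring(
    '<dimensions type="map">'
    '<width type="decimal" units="cm">34.3</width>'
    '<length type="decimal" units="cm">120.0</length>'
    '</dimensions>'
)
print(decode(dimensions))
```

Note that the units attribute is dropped on the floor here. How annotations like that should surface in the decoded value is another detail to work out.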

We probably want to define some sort of simple type definition that can be used by parsers rather than explicitly specifying the type of each element. E.g.

<dimensions type="map" valuetype="decimal" keytype="string">
  <dimension key="width" units="cm">34.3</dimension>
  <dimension key="length" units="cm">120.0</dimension>
  <dimension key="height" units="cm">3.10</dimension>
</dimensions>
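For the keyed form, a decoder could read the value type once from the container and apply it to every entry. In this sketch a per-entry type attribute overrides the container's valuetype, which is a guess at sensible behavior rather than anything decided above. keytype is ignored because only string keys are handled, and the names `SCALARS` and `decode_keyed_map` are hypothetical.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

SCALARS = {"string": str, "integer": int, "decimal": Decimal}


def decode_keyed_map(elem):
    """Decode a map whose entries carry their keys in a key attribute."""
    default_kind = elem.get("valuetype", "string")
    result = {}
    for child in elem:
        kind = child.get("type", default_kind)  # assumed override rule
        result[child.get("key")] = SCALARS[kind](child.text or "")
    return result


doc = ET.fromstring(
    '<dimensions type="map" valuetype="decimal" keytype="string">'
    '<dimension key="width" units="cm">34.3</dimension>'
    '<dimension key="length" units="cm">120.0</dimension>'
    '<dimension key="height" units="cm">3.10</dimension>'
    '</dimensions>'
)
print(decode_keyed_map(doc))
```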

Parsers and APIs are encouraged to make this content available to clients in a more cooked form appropriate to the programming language rather than as raw strings and nodes. However these types are all advisory, not compulsory. Further note that these types could be further parsed by a library that sits on top of an XML 1.0 parser. An XML 2.0 parser is not required.

TBD: should XML 2.0 parsers treat violations of the constraints on the semantic types (e.g. <amount xml:type='int'>Bob</amount>) as a fatal well-formedness error or a non-fatal validity error? If we just use the type attribute it would have to be the latter to avoid breaking compatibility with existing documents.

The details of the type system remain to be worked out. How do we name and define new types? What does the syntax of a type definition look like? Do we need collections other than list and map? Exactly which primitive types do we predefine? Will the world really let us get away with integers and decimals or will they scream for int and float? Much work here remains to be done. But the basic idea is sound. We don’t need to reinvent the same type annotations for every vocabulary. Just as XML 1.0 improved on SGML by eliminating the freedom to use different syntax to delimit elements and attributes, XML 2.0 will improve on XML 1.0 by eliminating the freedom to use different syntax to denote types. We will not limit which types one can express. We will simply specify a standard form for denoting type information.

What we can’t do

There are a few minor changes I haven’t been able to figure out how to make while maintaining backward compatibility. These include:

Allowing -- to appear in comments

Not making ]]> special

version="2.0"

If you can figure out how to make these compatible, please do let me know. I’m almost willing to compromise on these minor points of backwards compatibility to simplify the parsing process, but I’m not sure that’s wise.

version="2.0" is the trickiest one. XML 1.0 parsers are not required to error out on this, but in practice many do. Perhaps we should drop the XML declaration completely? I.e. any document with an XML declaration is ipso facto not an XML 2.0 document. Instead all XML 2.0 documents will be identified with a specific DOCTYPE:

<!DOCTYPE XML2 PUBLIC "application/xml+xsd http://example.com/optional/actualschema.xsl" "http://www.example.com/xml20.dtd">

As mentioned above, the DTD mentioned by the system identifier is a legal XML 1.0 DTD that defines all the HTML entities. XML 2.0 aware processors will ignore this. The public identifier (which may be empty) contains a MIME media type followed by the URL to the actual schema for the document.
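As a sketch of how an XML 2.0 aware tool might read that declaration, the following pulls the identifiers out with the doctype callback of Python's `xml.parsers.expat` and splits the public identifier at the first space. The function name `read_xml2_doctype` and the returned tuple shape are assumptions of this example. expat does not fetch the DTD named by the system identifier unless asked to.

```python
import xml.parsers.expat


def read_xml2_doctype(data: bytes):
    """Return (media_type, schema_url, system_id), or None without a DOCTYPE."""
    result = None

    def on_doctype(name, system_id, public_id, has_internal_subset):
        nonlocal result
        # Public identifier: a MIME media type, a space, then the schema URL.
        media_type, _, schema_url = (public_id or "").partition(" ")
        result = (media_type or None, schema_url or None, system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)
    return result


doc = (b'<!DOCTYPE XML2 PUBLIC '
       b'"application/xml+xsd http://example.com/optional/actualschema.xsl" '
       b'"http://www.example.com/xml20.dtd"><XML2/>')
print(read_xml2_doctype(doc))
```

An empty public identifier yields `None` for both the media type and the schema URL, which fits validation being optional.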

The Way Forward

There’s one other reason XML 1.0 succeeded where XML 1.1 failed: XML 1.0 was designed by a small committee of like-minded folks with a common goal. They didn’t always agree on the route, but they were all driving to the same destination. The W3C pretty much ignored them until they were done. By contrast, XML 1.1 was hobbled by a W3C process that took far longer to accomplish much less. If XML 2.0 is to succeed, it needs to follow XML 1.0, not XML 1.1.

A small group of interested folks should convene outside the W3C and write the spec. One month to agree on goals and requirements; one month to write the first draft. Run it up a flagpole and see who salutes.

Step 2 is to write parsers and APIs for the new draft, and gain some implementation experience. Develop a test suite of sample documents. That will take longer, but is necessary. Work the bugs out of the spec. At the point where the goals seem to be satisfied and the spec is reasonably implemented, present it to the world as a fait accompli. If some organization feels like adopting it as a formal standard, that’s fine, though it’s hardly necessary.

The real goal is to take the lessons of the last 12 years of XML, and apply them to create something even better. Who’s with me?

P.S.

If you want to comment, please be aware that you need to escape < as &lt; and > as &gt;. The comments allow basic HTML but aren’t smart enough to distinguish between plain text and real HTML comments.