The term “Semantic Markup” is bandied about freely, and with every year that passes, it makes me more and more nervous. Herewith an exploration of what, if anything, those two terms mean when placed side by side. (Warning: way too long.)

Visiting Waterloo · In late 1986, I paid a visit to the University of Waterloo for a job interview at the New Oxford English Dictionary Project. At some point, they sat me in front of a screen and showed me some of the text of the dictionary. They started to explain what I was seeing, but it didn't really need explanation; entries were marked with <entry>, etymologies with <etym>, definitions with <def>, and so on, and they all nested neatly within each other.

It immediately occurred to me to wonder why every document in the world wasn't marked up like this, and I still wonder that, and observe with pleasure that more and more of them are.

Eventually I co-founded Open Text and did search engines and drifted into the SGML community, and was nervous about the notion of semantics as early as 1992; a certain proportion of that community asserted that SGML markup was semantic and that the semantics came from the DTD.

I hear continuing echoes of this when people hold forth on the virtues of using “semantic” markup on the Web, that is to say <cite> rather than <i> around the name of a book (which, if you do a view source, you'll see is the case with the reference to the dictionary above). Smart people who care a lot, for example Mark Pilgrim, strive mightily to make markup more semantic, then are stricken with anguish and wonder whether it's all worth it.

So, I wonder:

When is markup semantic?

Is semantic-ness a binary condition?

Where do semantics come from?

Taxonomy of Markup · I use a taxonomy of markup which I'm pretty sure was first advanced in the seminal November 1987 CACM article Markup systems and the future of scholarly text processing, by Coombs, Renear, and DeRose, which was the first place I ever encountered all the good arguments for what became XML all written down in an orderly way in one place. But I don't have a copy handy. Anyhow, I didn't invent these terms.

Presentational Markup · This is what MS Word and other WP systems do; they embed codes right in the text saying “this is in bold”, “this is right-justified Palatino”, and so on.

This contains no levels of indirection and is thus inflexible and offers poor longevity and reusability, but can be used to produce nice-looking pages.

Presentational markup is essentially always used with an authoring system that hides it from its user, such systems often making the essentially-false claim that What You See is What You Will Get.

Procedural Markup · This approach includes venerable unix tools like troff as well as TeX and PostScript. The primitives remain presentational, but they are embedded in a procedural framework so you can have macros and subroutines, and the notion of the current graphics state can be made concrete, as anyone who's gotten comfy with PostScript primitives like gsave and grestore knows.

Procedural markup can be authored directly by humans; to this day a substantial proportion of Physics and Math research is published in hand-constructed TeX or LaTeX.

Also, in the hands of a skilled practitioner, these techniques can produce very beautifully-rendered output.

People who use procedural markup have always tended to build up more and more and more elaborate macros/subroutines. For example, you might take a series of instructions that produce 16-point bold sans-serif text, and name it title. Which brings us to...

Descriptive Markup · This is really the right term for what XML (and its predecessor SGML) are trying to be. The idea is that the markup doesn't tell you what to do with a piece of text, it tells you what it is; as the name suggests, describes it. When I was spending a lot of my time teaching and preaching about XML, the term I finally settled on was “labeling.” All XML does is provide a nice flexible internationalized way to label the elements of a data structure and ship them around with the labels attached.

Descriptive markup was born in the world of publishing technology, and its advantages for serious large-scale publishing are overwhelming, and have been presented more than enough times, so I won't go into them here.

With the application of XML to more or less everything in the world, not just publishing, it's reasonable to ask if the descriptive-markup tag still applies. The answer is, mostly, yes.

Now, some XML tags don't really feel very descriptive at all, for example HTML's <b> and <xslt:apply-templates>. And indeed most serious practitioners are aware of this and feel vaguely guilty in using this kind of thing.

But it turns out (and this is one of the strategically good things about XML) that even if the author had no semantic intent, any programmer can decide to use that tag any way they want in processing the text.

The first time I saw this happen was a dozen years ago on the dictionary project, when we started running statistical analysis of the millions of supporting quotations used to illustrate word usages to track temporal patterns in the arrival into and departure from the language of English words. The dictionary editors were originally horrified (“We didn't design the structure of a definition to support that!”) but it worked fine.
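To make that concrete, here is a minimal Python sketch of the kind of after-the-fact analysis described above. It is not the dictionary project's code or data: <entry> and <def> are the tags mentioned earlier, while <dictionary>, <hw>, <quote> and the year attribute are names I have made up for illustration. The point is only that a program can pick out whichever labels it cares about, regardless of why the markup was designed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data. <entry> and <def> come from the text above;
# <dictionary>, <hw>, <quote> and the year attribute are invented here.
SAMPLE = """
<dictionary>
  <entry>
    <hw>blog</hw>
    <def>A web journal.
      <quote year="1999">I started a blog.</quote>
      <quote year="2003">Everyone has a blog.</quote>
    </def>
  </entry>
  <entry>
    <hw>wireless</hw>
    <def>A radio set.
      <quote year="1923">We listened to the wireless.</quote>
    </def>
  </entry>
</dictionary>
"""


def quotes_per_decade(xml_text):
    """Count <quote> elements by the decade given in their year attribute."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # iter() walks the whole tree, so it does not matter where the
    # quotations sit inside an entry.
    for quote in root.iter("quote"):
        decade = int(quote.get("year")) // 10 * 10
        counts[decade] += 1
    return dict(counts)


if __name__ == "__main__":
    for decade, count in sorted(quotes_per_decade(SAMPLE).items()):
        print(decade, count)
```

Nothing in the sample markup says "use me for statistics"; the function simply decides to treat the quotation label that way.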

The Importance of Labels · It's kind of surprising that XML has more or less taken over as the default method for wrapping up data to ship it from anywhere to anywhere. Smart people have been working on this problem for years, and probably the most ambitious previous attempt in this direction is something called ASN.1.

I personally have had nothing but horrible experiences with ASN.1, but it has been put to use in all sorts of places including important pieces of the Internet infrastructure, and lots of people like it.

When you get an ASN.1 message, you can unpack it and you get the data items and their types. So you know “This is a fraction with 2 digits of precision, this is a 17-character string, this is a non-negative integer” and so on. But, you don't get labels.

XML, on the other hand, tells you “this is called price, this is called Bill-To, and this is called quantity-shipped”, but (by default) tells you nothing about data types.
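A rough Python sketch of the contrast, under loud assumptions: the first half uses Python's struct module as a stand-in for the "types without labels" style and is not an ASN.1 encoding; the <order> wrapper and all the values are invented. Only the three element names come from the sentence above.

```python
import struct
import xml.etree.ElementTree as ET

# Typed but unlabeled. This is NOT ASN.1, just a stand-in for the idea:
# a big-endian unsigned 32-bit integer, an 8-byte string, and an
# unsigned 16-bit integer. The values are made up.
packed = struct.pack(">I8sH", 1999, b"ACME Ltd", 12)
typed_values = struct.unpack(">I8sH", packed)
# The receiver learns the types and sizes, but not which field is which.

# Labeled but untyped. The <order> wrapper is hypothetical; the child
# element names are the ones used in the text.
labeled = ET.fromstring(
    "<order>"
    "<price>19.99</price>"
    "<Bill-To>ACME Ltd</Bill-To>"
    "<quantity-shipped>12</quantity-shipped>"
    "</order>"
)
labeled_values = {child.tag: child.text for child in labeled}
# The receiver learns what everything is called, but every value is a string.

if __name__ == "__main__":
    print(typed_values)
    print(labeled_values)
```

In the first case you have to be told out of band that the second field is the billing name; in the second you have to be told (or guess) that quantity-shipped is an integer.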

To oversimplify, XML is winning and ASN.1 is losing. There are a variety of reasons for this, but one of them is that it seems to be more important to know what something is called than what data type it is. This result is not obvious from first principles, and has to count as something of a surprise in the big picture.

Semantics, Say What? · Semantics as a word is closely related to “meaning”. For me, there are two scenarios where markup has semantics:

When a human understands it in context and may reasonably consider it as a basis for action.

When there is an expectation that there is software which when applied to the markup will produce a useful result.

Let's reason by example. Here is an example of markup that has semantics by both the criteria above:

<a href="http://www.tbray.org/ongoing/">...</a>

Here is an example of markup that probably doesn't have any semantics according to either criterion:

<rm rn="1" />

Here's an example of some markup that has semantics by the first criterion above, but not by the second.

<chunk xmlns="http://www.tbray.org/ns/22">The essays found on the web at <address>http://www.tbray.org</address> are <adj>boring</adj> <adj>pedantic</adj> ravings <adverb>clumsily</adverb> authored by a <adj>self-styled</adj> technologist.</chunk>

I'm quite confident that if there is any software out there that will do anything useful with that markup in that namespace, it hadn't been written when I wrote this. But anyone can read it and see what's up.

For an example of markup that is semantic largely by the second criterion only, check out the “WordML” produced by the new Microsoft Word.

Where Do Semantics Come From? · The first (human-oriented) kind of semantics happens when you apply labels to chunks of text that can reasonably be expected to be meaningful to a person, and when you choose the labels with reasonable care. That sounds awfully simple, but is not to be sneezed at.

The second kind of semantics is not a binary condition. The HTML hyperlink incontrovertibly has it; the knowledge of HTML markup has penetrated the intellectual aether to the point that it can be assumed to be pretty well everywhere. It is quite possible to have markup that is not as widely-known as HTML and provide software for it that does something useful with it, thus conferring semantics on that markup.

For example, the XML for ongoing includes a <finished> element which has semantics provided by the code that publishes ongoing, which exists on only two computers in the world (and is run by only one person). So while <finished> is semantic, it isn't as semantic as <a href="">. (Oops, used a <b> in that paragraph, because I don't like the way the semantic alternative renders. Take me out and shoot me.)

In fact, the XML Namespaces mechanism provides the hope of a future in which you might be able to achieve autodiscovery of software-provided semantics for markup vocabularies, which generally would be a good thing.
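One way to picture software-provided semantics is a lookup from namespace-qualified element names to code. The Python sketch below is only that: a local, hand-built table, not autodiscovery, and nothing the Namespaces specification itself defines. HANDLERS, describe_link and process are hypothetical names; the XHTML namespace URI is the standard one, and the chunk namespace is the one from the example earlier.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def describe_link(element):
    """A stand-in for 'software that does something useful' with a link."""
    return "link to " + element.get("href", "")


# Hypothetical registry: ElementTree spells a qualified name as "{uri}local",
# so that string serves as the key.
HANDLERS = {
    "{%s}a" % XHTML_NS: describe_link,
}


def process(xml_text):
    """Run the registered handler for the root element, or return None
    when no software is known for that namespace and name."""
    element = ET.fromstring(xml_text)
    handler = HANDLERS.get(element.tag)
    if handler is None:
        return None
    return handler(element)


LINK = (
    '<a xmlns="http://www.w3.org/1999/xhtml" '
    'href="http://www.tbray.org/ongoing/">...</a>'
)
CHUNK = '<chunk xmlns="http://www.tbray.org/ns/22">ravings</chunk>'

if __name__ == "__main__":
    print(process(LINK))
    print(process(CHUNK))
```

By the second criterion, the link has semantics here because a handler exists for it, and the chunk does not, at least until somebody writes one and adds it to the table.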

Schemas · The existence of a schema (of whatever flavour) is neither a necessary nor a sufficient condition for attaching semantics to markup. Clearly, schemas can be helpful in many different ways at design-time and run-time.

But they're not where the meaning lives.

In Conclusion ·

Descriptive markup is better than the alternatives.

Descriptive markup is not necessarily semantic.

The phrase “semantic markup” constitutes a claim that needs to be demonstrated before being accepted.

Semantics don't come from schemas.

Perhaps, then, when we're looking for semantics, we ought to base our search on the old IETF mantra: “rough consensus and running code.”