Abstract

This document summarizes the two serializations of the HTML 5 [HTML5] vocabulary. It documents the differences between the two serializations and their associated MIME types. It is intended to be readable as a standalone document.

Editorial note

Editor's Draft 5 July 2009

This document is for review by the HTML Working Group and is subject to change without notice. This document has no formal standing within W3C. Please consult the group's home page and the W3C technical reports index for information about the latest publications by this group.

Introduction

HTML is the publishing language of the World Wide Web. The HTML 5 specification [HTML5] describes an abstract vocabulary for writing Web pages. It defines the semantics: what counts as a title, a list, or a paragraph, among others. It also defines the grammar: how these elements may be nested.

Serializations

The process of writing down actual markup for these defined semantics is called serialization in computer science. The rules for writing the markup can differ depending on the application and processing context. The same language semantics can have multiple serializations (ways of writing it).

The choice of serialization has consequences for the way user agents will process the document.

HTML 5 serializations

The previous versions of the HTML vocabulary (HTML+, HTML 2.0, HTML 3.2) were written using SGML syntax rules. HTML 4 already had two syntaxes: SGML (called HTML 4.01) and XML (called XHTML 1.0).

HTML 5, the abstract language, can be written in at least two different syntaxes: HTML and XML. Which syntax you choose depends on your needs as a developer, your markets, and your application contexts. The rest of this document explains the constraints associated with each choice of serialization.

MIME Types

For the purpose of processing documents in user agents:

A document served with the application/xhtml+xml MIME type is defined as an XHTML document (meaning that conforming user agents will parse it with an XML parser, and process it as a document in the XHTML syntax).

A document served with the text/html MIME type is defined as an HTML document (meaning that conforming user agents will parse it with an HTML parser, and process it as a document in the HTML syntax).
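The two rules above amount to a lookup from the media type to a parsing mode. The following Python fragment is a minimal sketch of that lookup, not the behavior of any particular user agent. The function name parser_for, the "unknown" fallback, and the decision to ignore parameters such as charset are assumptions made for illustration; application/xml is included because the "MIME Types rules for HTML 5" section of this document also lists it as an XML MIME type.

```python
# Media types taken from this document; the set names are illustrative.
XML_TYPES = {"application/xhtml+xml", "application/xml"}
HTML_TYPES = {"text/html"}


def parser_for(content_type: str) -> str:
    """Return "xml", "html", or "unknown" for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_TYPES:
        return "xml"
    if media_type in HTML_TYPES:
        return "html"
    # Assumption: anything else is outside the scope of this sketch.
    return "unknown"


if __name__ == "__main__":
    for header in ("text/html; charset=utf-8",
                   "application/xhtml+xml; charset=utf-8"):
        print(header, "->", parser_for(header))
```

The same markup can therefore end up in either parser; only the header value decides.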

On the desktop

When a document is opened directly in a browser (not sent by a Web server), the MIME type is unknown to the browser. Browsers rely on the filename extension to process the file in a particular mode (HTML or XML).

For processing HTML documents as XML, the filename must end with .xhtml, .xht, or .xhtm.

For processing HTML documents as HTML, the filename must end with .html.

Note: once the file is put on a server, these extensions will trigger the right MIME type (except if they are changed). Most servers are configured to send a specific MIME type depending on the extension of the filename.
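The extension rules above can be sketched as a small table. This is an illustration only: the names EXTENSION_MODES and mode_for_filename are hypothetical, only the extensions listed in this document are included, and treating extensions case-insensitively is an assumption that real browsers and servers may not share.

```python
import os.path

# Extensions listed in this document, mapped to the processing mode.
EXTENSION_MODES = {
    ".xhtml": "xml",
    ".xht": "xml",
    ".xhtm": "xml",
    ".html": "html",
}


def mode_for_filename(filename):
    """Return "xml" or "html" for a local file name, or None if unlisted."""
    extension = os.path.splitext(filename)[1]
    # Assumption: compare extensions without regard to case.
    return EXTENSION_MODES.get(extension.lower())


if __name__ == "__main__":
    for name in ("index.html", "index.xhtml", "notes.txt"):
        print(name, "->", mode_for_filename(name))
```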

MIME Types rules for HTML 5

XHTML must be served with an XML MIME type, such as application/xml or application/xhtml+xml.

HTML must be served as text/html.

Versatile Documents

A versatile document is a document that conforms to both the HTML and XHTML syntactic requirements, and which browsers can process as either, depending on the MIME type used. This works by using a common subset of the syntax that is shared by both HTML and XHTML.

Versatile documents are useful when a document is intended to be served as either HTML or XHTML, depending on the support in particular browsers, or when it is not known at the time of creation which MIME type the document will ultimately be served with.

Consider the following simple document, encoded in UTF-8. It can be served as either text/html or application/xhtml+xml.

HTTP header to send with the document for HTML

Content-Type: text/html; charset=utf-8

HTTP header to send with the document for XHTML

Content-Type: application/xhtml+xml; charset=utf-8

Finally the document itself.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Versatile document</title>
  </head>
  <body>
    <p><q>The medium is the massage.</q> - <cite>Marshall McLuhan</cite> </p>
  </body>
</html>
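One way to check that this document sits in the common subset is to feed the same text to an XML parser and to an HTML tokenizer and compare the elements each one reports. The sketch below uses Python's standard library. Note the assumptions: html.parser is a lenient tokenizer, not a conforming HTML 5 parser, so this is only a rough approximation of what a browser does, and the helper names (TagCollector, tags_seen_as_xml, tags_seen_as_html) are invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The versatile document from this section.
DOCUMENT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head> <title>Versatile document</title> </head>
<body>
<p><q>The medium is the massage.</q> - <cite>Marshall McLuhan</cite> </p>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record every start tag the HTML tokenizer reports, in order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_as_xml(text):
    # Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    root = ET.fromstring(text)
    # Strip the XHTML namespace so the names compare with the HTML side.
    return [el.tag.replace("{" + XHTML_NS + "}", "") for el in root.iter()]


def tags_seen_as_html(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(tags_seen_as_xml(DOCUMENT))
    print(tags_seen_as_html(DOCUMENT))
```

Both calls report the same element names in the same order, which is what one would expect from a document restricted to the shared syntax.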

To create and maintain versatile documents successfully, authors need to be familiar with both the similarities and the differences between the two syntaxes. This includes not only syntactic differences, but also differences in the way stylesheets and scripts are handled, and in the way character encodings are detected.

Syntax

Should I reorganize the table into families of issues?

Depending on the MIME type chosen for sending the document on the Web, some syntax rules become critical for the document to be parsed correctly. With both HTML and XML parsing, syntax choices have consequences for the way the document is interpreted.

List of syntax differences between html and xml serializations in HTML 5

| Feature | HTML | XHTML |
|---|---|---|
| doctype | required. The doctype triggers the strict rendering mode. <!doctype html> | optional, if used must be well-formed XML, i.e. <!DOCTYPE html>, with optional PUBLIC and/or SYSTEM identifiers. |
| case sensitivity (attributes, elements, doctype) | not case sensitive: <p CLASS="poem">A text</P> | case sensitive: <p class="poem">A text</p> |
| Non self closing tags (ex: tr) | In HTML, certain elements allow the omission of either or both: html (both), head (both), body (both), li (end tag), dt (end tag), dd (end tag), p (end tag), colgroup (both), thead (end tag), tbody (both), tfoot (end tag), tr (end tag), td (end tag), th (end tag) | Non self closing tags require both a start and an end tag |
| self closing tags (ex: br) | self closing tag syntax (trailing slash) is allowed on void elements (base, link, meta, hr, br, img, embed, param, area, col and input), but forbidden on other elements: <br> <br/>. End tags for void elements are forbidden. | may use either the self closing tag syntax or have an end tag immediately follow the start tag: <br></br> <br/> |
| attribute minimization | allowed: <input disabled> <input disabled=disabled> <input disabled='disabled'> <input disabled="disabled"> | not allowed: <input disabled='disabled'> <input disabled="disabled"> |
| Unquoted attributes | allowed: <p class=straight> | not allowed: <p class="straight">…</p> |
| CDATA section | not allowed | allowed: <![CDATA[<p class="poem">A text</p>]]> |
| Processing instructions | not allowed | allowed |
| Named character references | can be used: &eacute; | cannot be used, except &amp;, &lt;, &gt;, &quot; and &apos; (though you can create your own DTD) |
| Contextual tolerance for < and & characters | < allowed in <script>, <style>, <title>, <textarea>; & allowed in <script>, <style> | not allowed |
| Namespace prefixes | not allowed | allowed |
| xhtml Namespace declaration | allowed: <html xmlns="http://www.w3.org/1999/xhtml"> | required: <html xmlns="http://www.w3.org/1999/xhtml"> |
| Unicode characters | | set of XML 1.0 |
| lang attribute | lang | xml:lang |
| id attribute | id | xml:id and id |
| noscript element | allowed | forbidden |
| xml:base | not allowed | allowed |
| character encoding | recommended to use HTTP Content-Type header. Also possible to use <meta charset="utf-8"> | recommended to use HTTP Content-Type header. Also possible to use <?xml version="1.0" encoding="UTF-8"?> |
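Several rows of the XHTML column can be checked mechanically, because an XML parser rejects markup that is not well-formed. The following sketch runs a few fragments from the table through Python's xml.etree.ElementTree. It only tests XML well-formedness, not HTML 5 conformance, and the helper name is_well_formed_xml is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return True if an XML parser accepts the fragment as one element."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Fragments taken from the table above, grouped by the row they illustrate.
SAMPLES = [
    ("attribute minimization", '<input disabled="disabled"/>'),
    ("attribute minimization", "<input disabled/>"),
    ("unquoted attributes", "<p class=straight>A text</p>"),
    ("case sensitivity", '<p CLASS="poem">A text</P>'),
    ("self closing tags", "<br></br>"),
    ("self closing tags", "<br>"),
    ("named character references", "<p>caf&eacute;</p>"),
    ("named character references", "<p>a &amp; b</p>"),
]

if __name__ == "__main__":
    for row, fragment in SAMPLES:
        verdict = "accepted" if is_well_formed_xml(fragment) else "rejected"
        print(f"{row}: {fragment} -> {verdict}")
```

Every fragment listed here is acceptable in the HTML syntax according to the table, except that the HTML column forbids an end tag on a void element such as br. That difference is why a versatile document has to stay within the forms both columns allow.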

Scripting

Not sure this section should be included.

Stylesheets

Not sure this section should be included.

Acknowledgements

Michael Smith

References