
Beware of XHTML

If you're a web developer, you've probably worked a lot with XHTML, the markup language developed in 1999 to implement HTML as an XML format. Most people who use and promote XHTML do so because they think it's the “next version” of HTML, and they may have heard of some benefits here and there. But there is a lot more to it than you may realize, and if you're using it on your website, even if it validates, you are probably using it incorrectly.

I believe that XHTML has many good potential applications, and I hope it continues to thrive as a standard. This is precisely why I have written this article. The state of XHTML on the Web today is more broken than the state of HTML, and most people don't realize it because the major browsers use classic HTML parsers that hide the problems. Even among the few sites that know how to trigger the XML parser, the authors tend to overlook some important issues. If you really hope for the XHTML standard to succeed, you should read this article carefully.

What is XHTML?

XHTML is a markup language that was originally expected to someday replace HTML on the Web. For the most part, an XHTML 1.0 document differs from an HTML 4.01 document only in the lexical and syntactic rules: HTML is written in its own unique syntax defined by SGML, while XHTML is written in a different SGML-defined syntax called XML. The syntaxes differ in some of the characters that delimit tags and other constructs, whether or not certain types of shorthand markup may be used, and whether or not tag names or character entities are case sensitive, among other small differences.

The Document Type Definition (DTD, which is referenced by the doctype declaration) then defines which elements, attributes, and character entities exist in the language and where those elements may be placed. The DTDs of XHTML 1.0 and HTML 4.01 are nearly identical, meaning that as far as things like elements and attributes go, XHTML 1.0 and HTML 4.01 are basically the same language. The only added benefit of XHTML is that it's written in XML and shares the benefits XML has over HTML's syntax. I'll explain those benefits later in this article, but first I'd like to debunk some of the false benefits you may have heard.

Myths of XHTML

There are many false benefits of XHTML promoted on the Web. Let's clear up some of them at a glance (with details and other pitfalls provided later):

XHTML does not promote separation of content and presentation any more than HTML does. XHTML has all of the same elements and attributes (including presentational ones) that HTML has, and it doesn't offer any additional CSS features. Semantic markup and separation of content and presentation are absolutely possible in HTML, and with equal ease. In terms of semantics, HTML 4.01 and XHTML 1.0 are exactly the same.

Most XHTML pages on the Web are not parsed as XML by today's web browsers. With typical server configurations, browsers will parse your XHTML as HTML “tag soup” instead. The vast majority of XHTML pages on the Web cannot be parsed as XML since they rely on HTML error handling. Even most valid XHTML pages end up disfigured when parsed as XML, since they were only tested in HTML-parsing browsers.

XML parsers do not typically check documents for validity. They only check them for well-formedness, which is a separate concept. If you leave out a required element, use deprecated or nonstandard elements or attributes, or put an element somewhere it isn't allowed, the XML parser will provide no indication of the error, and the browser will have to silently deal with the error like HTML parsers do.

HTML is not deprecated and is not being phased out at this time. In fact, the World Wide Web Consortium recently renewed the HTML working group, which is working to develop HTML 5. The developers of Firefox, Opera, and Safari have pushed very hard for the development of HTML 5 and have largely ignored the development of XHTML 2. The Safari development team has even opted to not take part in the XHTML 2 development process. The CTO of Opera said in an interview, “I don't think XHTML is a realistic option for the masses. HTML5 is it.”

XHTML 1.x is not “future-compatible”. XHTML 2, currently in the drafting stages, is not backwards-compatible with XHTML 1.x. XHTML 2 will have lots of major changes to the way documents are written and structured, and even if you already have your site written in XHTML 1.1, a complete site rewrite will usually be necessary in order to convert it to proper XHTML 2. A simple XSL transformation will not be sufficient in most cases, because some semantics won't translate properly. HTML 4.01 is actually more future-compatible. A valid HTML 4.01 document written to modern support levels will be valid HTML 5, and HTML 5 is where the majority of attention is from browser developers and the W3C.

XHTML does not have good browser support. In typical setups, most browsers simply pretend that your XHTML pages are regular HTML (which presents a number of problems, as I'll explain later). Some major browsers like Firefox, Opera, and Safari may attempt to handle the page as proper XHTML if and only if you include a special HTTP header instructing them to do so. But when you do, Internet Explorer and a number of other user agents will choke on it and won't display a page at all. Even when handling a page as XHTML, the supporting browsers have a number of additional bugs, which I'll also discuss in this article.

Most browsers do not parse valid XHTML dramatically faster than valid HTML, even when they're parsing XHTML correctly. This is partly because most browsers only support a small subset of the HTML/SGML standard to begin with, so the real complexities of proper HTML parsing are mostly ignored anyway. The only major additional complexity of HTML that is well supported is tag omission, but most browsers use hardcoded rules specific to HTML in order to cheat through that with minimal performance impact. The browser can lose some minor shorthand logic with XML, but it now has to use extra logic to confirm that the document is well-formed. Although XHTML, when parsed with an XML parser, may be slightly faster to parse than typical HTML, the difference isn't very significant in most cases. And either way, download speed is usually the bottleneck when it comes to document parsing. Whether it's HTML or XHTML, by the time the page finishes downloading, the whole thing is already parsed. The users won't notice any speed difference.

XHTML is not extensible if you hope to support Internet Explorer or the many other user agents that can't parse XHTML as XML. They will handle the document as HTML and you will have no extensibility benefit.

XHTML source is not necessarily any “cleaner” than HTML source. If you prefer using lower-case tag and attribute names, you can do so in HTML. If you prefer having quotes around all attribute values, you may do so in HTML. If you prefer making sure all of your non-empty elements have end tags, you may use end tags in HTML, too. In fact, these are considered best practice principles in HTML. The only real markup differences between an HTML document following best practices and an XHTML document following the legacy compatibility guidelines are the doctype choice, XHTML's extra required attributes on the html tag, and XHTML's extra / character in empty element tags. Some argue that the availability of HTML's shorthand constructs is what makes HTML “unclean”. But the only HTML shorthand construct that is required is the omitted end tag on elements that have to be empty, and it's common practice in XHTML (alas, even required in many cases) to also use a shorthand construct on those elements: the so-called “self-closing tag” which originates from SGML's “null end tag” shorthand construct. If you prefer to minimize your use of shorthand markup and would like the validator to enforce those restrictions in HTML as well, you can use Web Devout's HTML Good Practice Checker.

Using XHTML does not encourage better support by web browsers and it is not “a vote for a better Web” if you are still supporting Internet Explorer and various search engines and other user agents that require text/html. If you serve it with the typical text/html content type, you are giving all browsers a thumbs-up to treat it exactly like classic HTML, meaning absolutely no progress is made. Even if you use only application/xhtml+xml and shut out Internet Explorer and various other user agents entirely, it won't mean anything: Microsoft already plans to support real XHTML in an upcoming release of Internet Explorer; they just want to make sure they support it correctly from the initial launch. Even then, XHTML 1.x is a dead-end standard, since it's completely incompatible with XHTML 2.0 and all other future HTML/XHTML standards, as explained above, and since the majority of XHTML content on the Web today cannot be safely parsed as XML.

Benefits of XML

XML does have a number of improvements over HTML's syntax:

Although HTML's syntax allowed for a lot of shorthand markup and other flexibility, it proved too difficult to write a correct and fully-featured parser for it, since a truly correct parser would have to support the entire SGML standard. As a result, most user agents, including all of today's major web browsers, make many technically unsound assumptions about the lexical format of HTML documents and don't support a number of shorthand features like Null End Tags (<tag/Content/), unclosed start/end tags (<tag<tag>), and empty tags (<>). XML was designed to eliminate these extra features and restrict documents to a tight set of rules that are more straightforward for user agents to implement. In effect, XML defines the assumptions that user agents are allowed to make, while still resulting in a file that a theoretical fully-featured SGML user agent could parse once pointed to XML's SGML declaration.

It should be noted that an XML parser for the most part is not dramatically easier to write than the level of HTML support offered by most HTML parsers (which will be thoroughly specified in HTML 5). Most of the features that would make HTML more difficult to write a parser for, such as custom SGML declarations, additional marked sections, and most of the shorthand constructs, have negligible use on the Web anyway and generally have poor or absent support in major web browsers. The most significant difference is XML's lack of support for omitted start and end tags, which in theory could amount to complicated logic in HTML for elements not defined as empty. Even so, most browsers don't bother to implement real DTD-based parsing logic, so it isn't quite so complicated in practice.

In hopes of eliminating some error handling logic, XML user agents are told to not be flexible with errors: if a user agent comes upon a problem in the XML document, it will simply give up trying to read it, and the user will be presented with a “parse error” message instead of the webpage. This eliminates the compatibility issues with incorrectly-written markup and browser-specific error handling methods by requiring documents to be “well-formed”, while giving webpage authors immediate indication of the problem. This does, however, mean that a single minor issue like an unescaped ampersand (&) in a URL or a mismatched character encoding in a trackback message would cause the entire page to fail, and so most of today's public web applications can't safely be incorporated in a true XHTML page.

While user agents are supposed to fail on any page that isn't well-formed (in other words, one that doesn't follow the generic XML grammar rules), they do not have to fail on a page that is well-formed but invalid. For example, although it is invalid to have a span element as an immediate child of the body element, most XML-supporting web browsers won't provide indication of the error because the page is still well-formed: the DTD is violated, but not the fundamental rules of XML itself. Some user agents may choose to be “validating” agents and will also fail on validity errors, but they aren't common. There is some worry that people may rely too heavily on the well-formedness checker and forget to also check for validity, which could lead to a higher occurrence of invalid pages even among otherwise standards-conscious developers.

Despite popular assumption, even if an XML page is perfectly valid according to some validators, it still might not be well-formed. Well-formedness involves some requirements not present in the classic SGML definition of validity.
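The difference between well-formedness and validity, and the way a single stray ampersand takes down a whole page, can be sketched with any non-validating XML parser. The example below uses Python's standard library parser as a stand-in for a browser's XML parser, which is an assumption made for illustration; the `page` and `well_formedness_error` helpers and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XHTML = "http://www.w3.org/1999/xhtml"


def page(body_markup):
    """Wrap body markup in a minimal XHTML skeleton (hypothetical helper)."""
    return (
        '<html xmlns="%s"><head><title>Test</title></head>'
        "<body>%s</body></html>" % (XHTML, body_markup)
    )


def well_formedness_error(source):
    """Return None if the source is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(source)
    except ET.ParseError as error:
        return str(error)
    return None


url = "http://example.com/search?q=xhtml&page=2"

# Invalid per the XHTML 1.0 Strict DTD (span directly inside body),
# but still well-formed, so a non-validating parser accepts it silently.
invalid_but_well_formed = page("<span>inline text</span>")

# One raw ampersand in an attribute value breaks well-formedness.
raw_link = page('<p><a href="%s">results</a></p>' % url)

# quoteattr() escapes the ampersand as &amp; and adds the surrounding quotes.
escaped_link = page("<p><a href=%s>results</a></p>" % quoteattr(url))

for label, source in [
    ("invalid but well-formed", invalid_but_well_formed),
    ("raw ampersand", raw_link),
    ("escaped ampersand", escaped_link),
]:
    print(label, "->", well_formedness_error(source))
```

Only the second document is rejected. The first one passes even though a validator would flag it, which is why a clean parse in the browser is no substitute for validation.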

Unlike HTML's SGML-defined syntax, which was specifically made for HTML, XML is a common syntax used in many different languages. This means that a single relatively simple set of parsing logic can handle a number of different languages. It also paved the way for the Namespaces in XML standard, which allows multiple documents in different XML formats to be combined in a single XML document, so that you can have, for example, an XHTML page that contains one or more SVG images that use MathML inside them. The practicality of this on webpages is a subject of debate. Separation of content, presentation, and behavior is a defining characteristic of modern web development, as the modular setup provides many benefits over the mixed alternative. Those benefits also hold true when debating the idea of mixed XML formats. Since (X)HTML provides facilities for embedding other XML formats in a more modular fashion (using elements like object and link ), it's usually better to use the modular approach rather than mixing the files in a single document. Moving the SVG or RSS data into files separate from the (X)HTML allows the user agent to cache them and improve performance while reducing bandwidth cost and easing maintainability.
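To make the namespace mechanism concrete, here is a minimal sketch of a compound document: an XHTML page with an inline SVG image, read by a generic XML parser. The document content and the `namespaces_used` helper are made up for illustration; the two namespace URIs are the standard ones for XHTML and SVG.

```python
import xml.etree.ElementTree as ET

COMPOUND = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Compound document</title></head>
  <body>
    <p>An inline image:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20">
      <circle cx="10" cy="10" r="8"/>
    </svg>
  </body>
</html>"""


def namespaces_used(source):
    """Return the set of namespace URIs of all elements in the document."""
    root = ET.fromstring(source)
    found = set()
    for element in root.iter():
        # ElementTree spells qualified names as "{namespace-uri}localname".
        if element.tag.startswith("{"):
            found.add(element.tag[1:].split("}", 1)[0])
    return found


for uri in sorted(namespaces_used(COMPOUND)):
    print(uri)
```

The same generic parser handles both vocabularies; the namespace on each element is what tells a user agent which language that element belongs to. None of this survives when the page is parsed as HTML.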

Content type is everything

When your website sends a document to the visitor's browser, it adds on a special content type header that lets the browser know what kind of document it's dealing with. For example, a PNG image has the content type image/png and a CSS file has the content type text/css. HTML documents have the content type text/html. Web servers typically send this content type whenever the file extension is .html, and server-side scripting languages like PHP also typically send documents as text/html by default.

XHTML does not have the same content type as HTML. The proper content type for XHTML is application/xhtml+xml. Currently, many web servers don't have this content type reserved for any file extension, so you would need to modify the server configuration files or use a server-side scripting language to send the header manually. Simply specifying the content type in a meta element will not work over HTTP.

When a web browser sees the text/html content type, regardless of what the doctype says, it automatically assumes that it's dealing with plain old HTML. Therefore, rather than using the XML parsing engine, it treats the document like tag soup, expecting HTML content. Because HTML 4.01 and simple XHTML 1.0 are often very similar, the browser can still understand the page fairly well. Most major browsers consider things like the self-closing portion of a tag (as in <br />) as a simple HTML error and strip it out, usually ending up with the HTML equivalent of what the author intended.

However, when the document is treated like HTML, you get none of the benefits XHTML offers. The browser won't understand other XML formats like MathML and SVG that are included in the document, and it won't do the automatic well-formedness checking that XML parsers do. In order for the document to be treated properly, the server would need to send the application/xhtml+xml content type.

The problems go deeper. Comment markers are sometimes handled differently depending on the content type, and when you enclose the contents of a script or style element with basic SGML-style comments, it will cause your script and style information to be completely ignored when the document is treated like XML. Also, any special markup characters used in the inline contents of a style or script element will be parsed as markup instead of being treated as character data like in HTML. To solve these problems, you must use an elaborate escape sequence described in the article Escaping Style and Script Data, and even then there are situations in which it won't work.

Furthermore, the CSS and DOM specifications have special provisions for HTML that don't apply to XHTML when it's treated as XML, so your page may look and behave in unexpected ways. The most common problem is a white gap around your page if you have a background on the body, no background on the html element, and any kind of spacing between the elements, such as a margin, padding, or a body height under 100% (browsers typically have some combination of these by default). In scripting, tag names are returned differently and document.write() doesn't work in XHTML treated as XML. Table structure in the DOM is different between the two parsing modes. These are only a select few of the many differences.

The following are some examples of differing behavior between XHTML treated as HTML and XHTML treated as XML. The anticipated results are based on the way Internet Explorer, Firefox, and Opera treat XHTML served as HTML. Some other browsers are known to behave differently. Also note that Internet Explorer doesn't recognize the application/xhtml+xml content type (see below for an explanation), so it will not be able to view the examples in the second column.
Differences in XHTML handling

text/html        application/xhtml+xml
Example 1        Example 1
Example 2        Example 2
Example 3        Example 3
Example 4        Example 4
Example 5        Example 5
Example 6        Example 6
Example 7        Example 7
Example 8        Example 8
Example 9        Example 9
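The script and style problem described in this section is easy to reproduce outside a browser. The sketch below feeds three script elements to Python's standard XML parser, used here as a stand-in for a browser's XML parser. The CDATA section in the third case is a common XML-side technique and is an assumption for illustration; it is not the full escape sequence from the Escaping Style and Script Data article, which also has to keep HTML parsers happy.

```python
import xml.etree.ElementTree as ET


def script_text(script_markup):
    """Return the character data an XML parser sees inside the element."""
    try:
        return ET.fromstring(script_markup).text
    except ET.ParseError as error:
        return "parse error: %s" % error


# Old habit from HTML: hide the script from ancient browsers with a comment.
COMMENTED = "<script><!-- var ready = true; --></script>"

# A bare "<" is markup to an XML parser, not a less-than operator.
BARE = "<script>if (a < b) { run(); }</script>"

# A CDATA section marks the content as character data.
CDATA = "<script><![CDATA[ if (a < b) { run(); } ]]></script>"

print(repr(script_text(COMMENTED)))  # None: the comment and the script are discarded
print(repr(script_text(BARE)))       # a parse error message
print(repr(script_text(CDATA)))      # the script source, intact
```

In the first case the parser reports no error at all; the script simply vanishes, which makes this failure easy to miss when a page has only been tested as text/html.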

HTML compatibility guidelines

When the XHTML 1.0 specification was first written, there were provisions that allowed an XHTML document to be sent as text/html as long as certain compatibility guidelines were followed. The idea was to ease migration to the new format without breaking old user agents. However, these provisions are now viewed by many as a mistake. The whole point of XHTML is to be an XML alternative to HTML, yet due to the allowance of XHTML documents to be sent as text/html, most so-called XHTML documents on the Web today would break if they were treated like XML (see the real-world examples below). This even includes many valid XHTML documents. Several prominent members of the W3C are now challenging the wisdom of the text/html provisions and advocating that this content type should never be allowed for XHTML.

Many authors incorrectly believe that following the HTML compatibility guidelines and validating the document will guarantee that the document is compatible with both the HTML and XHTML specifications. In reality, if you use even a single self-closing tag in the document (which includes any link, img, or br tag), you are already creating incompatibilities between the two specifications. The reason for this particular issue is explained below. In this article, I have already explained a number of other factors not covered in XHTML 1.0 Appendix C that will also cause the document to run into incompatibilities.

The truth is that the HTML compatibility guidelines do not actually provide true compatibility between HTML and XHTML; they merely attempt to minimize the damage of using text/html for XHTML documents, and that damage control is very limited in effectiveness. XHTML 1.x already makes no provision for the use of text/html when taking advantage of any XHTML features not present in HTML, and the current draft of XHTML 2 expressly forbids it.

Internet Explorer incompatibility

Internet Explorer does not support XHTML. Like other web browsers, when a document is sent as text/html, it treats the document as if it were a poorly constructed HTML document. However, when the document is sent as application/xhtml+xml, Internet Explorer won't recognize it as a webpage; instead, it will simply present the user with a download dialog. This issue still exists in Internet Explorer 7. Although all other major web browsers, including Firefox, Opera, Safari, and Konqueror, support XHTML, the lack of support in Internet Explorer as well as in major search engines and web applications makes its use strongly discouraged.

Content negotiation

Content negotiation is the idea of sending different content depending on what the user agent supports. Many sites attempt to send XHTML as application/xhtml+xml to those who support it, and either XHTML as text/html or real HTML to those who don't. There are two methods generally used to determine what the user agent supports, both based on the Accept HTTP header. Most often, sites use the incorrect method, where they simply look for the string “application/xhtml+xml” in the header value. Some sites use the correct method, where they actually parse the header value, supporting wildcards and ordering by q value. Unfortunately, neither of these methods works reliably.

The first method doesn't work because not all XHTML-supporting user agents actually have the text “application/xhtml+xml” in the Accept header. Safari and Konqueror are two such browsers. The application/xhtml+xml content type is implied by a wildcard value instead. Meanwhile, not all HTML-supporting user agents have “text/html” in the header. Internet Explorer, for example, doesn't mention this content type. Like Safari and Konqueror, it implies this support by using a wildcard. Even among those user agents that support XHTML and mention application/xhtml+xml in the header, it may have a lower q value than text/html (or a matching wildcard), which implies that the user agent actually prefers text/html (in other words, its XHTML support may be experimental or broken).

The second method (the correct, 100% standards-compliant one) doesn't work because most major browsers have inaccurate Accept headers:

Firefox 2 and below have application/xhtml+xml listed with a higher q value than text/html, even though Mozilla has posted an official recommendation on its site saying that websites should use text/html for these versions if they can, for reasons described below.

Internet Explorer doesn't list either text/html or application/xhtml+xml in its Accept header. Instead, both content types are covered by a single wildcard value (which implies that every content type in existence is supported equally well, which is obviously untrue). So Internet Explorer is saying that it supports both text/html and application/xhtml+xml equally, even though it actually doesn't support application/xhtml+xml at all. In the case that a user agent claims to support both equally, the site is supposed to use its own preference. A possible workaround is for the site to “prefer” sending text/html or, in a toss-up situation, only send application/xhtml+xml if it's actually mentioned explicitly in the header. However...

Safari and Konqueror, which support XHTML, also give text/html and application/xhtml+xml the same q value (in fact, like Internet Explorer, they also claim to support everything in existence equally well). But they don't mention application/xhtml+xml explicitly; it's implied by a wildcard. So if you use the above workaround, Safari and Konqueror will receive text/html even though they really do support application/xhtml+xml.

As disappointing as it may be, content negotiation simply isn't a reliable approach to this problem.
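The two detection methods and the workaround discussed in this section can be sketched as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: it ignores media type parameters other than q, and the function names are hypothetical. The Firefox header string is quoted from memory as an example of the pattern described above (application/xhtml+xml at a higher q value than text/html), and the bare wildcard header stands in for browsers that advertise everything through `*/*`.

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_range = pieces[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def quality(header, media_type):
    """q value for media_type, taken from the most specific matching range."""
    main_type = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(header):
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def naive_supports_xhtml(header):
    """The incorrect method: a plain substring search."""
    return "application/xhtml+xml" in header.lower()


def choose_content_type(header):
    """The correct method plus the 'explicit mention breaks ties' workaround."""
    xhtml_q = quality(header, "application/xhtml+xml")
    html_q = quality(header, "text/html")
    if xhtml_q > html_q:
        return "application/xhtml+xml"
    if xhtml_q == html_q and xhtml_q > 0 and naive_supports_xhtml(header):
        return "application/xhtml+xml"
    return "text/html"


# Illustrative header values (assumptions, see the note above).
FIREFOX_2 = ("text/xml,application/xml,application/xhtml+xml,"
             "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5")
WILDCARD_ONLY = "*/*"

print(choose_content_type(FIREFOX_2))
print(choose_content_type(WILDCARD_ONLY))
```

With these inputs, the Firefox-style header is sent application/xhtml+xml, against Mozilla's own recommendation, while the wildcard-only header is sent text/html whether it came from a browser that cannot handle XHTML at all or from one that handles it well. The code follows the header faithfully; the headers themselves are the problem.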

Null End Tags (NET)

In XHTML, all elements are required to be closed, either by an end tag or by adding a slash to the start tag to make it self-closing. Since giving empty elements like img or br an end tag would confuse browsers treating the page like HTML, self-closing tags tend to be promoted. However, XML self-closing tags directly conflict with a little-known and poorly supported HTML/SGML feature: Null End Tags.

A Null End Tag is a special shorthand form of a tag that allows you to save a few characters in the document. Instead of writing <title>My page</title>, you could simply write <title/My page/ to accomplish the same thing. Due to the rules of Null End Tags, a single slash in an empty element's start tag would close the tag right then and there, meaning <br/ is a complete and valid tag in HTML. As a result, if you have <br/> or <br />, a browser supporting Null End Tags would see that as a br element immediately followed by a simple > character. Therefore, an XHTML page treated as HTML could be littered with unwanted > characters.

This problem is often overlooked because most popular browsers today lack support for Null End Tags, as well as some other SGML shorthand features. However, there are still some smaller user agents that properly support Null End Tags. One of the more well-known user agents that support it is the W3C validator. If you send it a page that uses XHTML self-closing tags, but force it to parse the page as HTML/SGML like most user agents do for text/html pages, you can see the results in the outline: immediately after each of the self-closing elements, there is an unwanted > character that will be displayed on the page itself. (It should be noted that the W3C Validator is unusual in that it generally determines the parsing mode from the doctype, rather than from the content type as most other user agents do. Therefore, an HTML doctype was used in the above example just so the validator would attempt to parse the page using the HTML syntax as all major browsers will for text/html pages regardless of the doctype. The Null End Tag rules are actually set in the SGML syntax definition, not the DTD, so this example is accurate to what you should expect in a fully compliant SGML user agent even with an XHTML doctype.)

Technically, a restricted and altered form of Null End Tags exists in XML and is frequently used: the self-closing portion of the start tag. While Null End Tags are defined as / ... / in HTML's syntax, they are specially defined as / ... > in XML with the added restriction that the tag must close immediately after it is opened, meaning the element must have no content. This was designed to look similar to a regular start tag for web developers who are unfamiliar with typical Null End Tags. However, in the process it creates inherent incompatibility with HTML's syntax for all empty elements.

In summary, although this issue doesn't show in most popular web browsers, a user agent that more fully supports SGML would see unwanted > characters all over XHTML pages that are sent with the text/html content type. If the goal of using XHTML is to help promote standards, then it's quite counterproductive to cause unnecessary problems for user agents that comply more correctly with the SGML standard.
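To see where the stray > characters come from, here is a toy tokenizer that applies only the Null End Tag rule described in this section. It is a deliberately tiny model and nowhere near a real SGML parser: it knows a handful of empty elements, ignores end tags and entities, and does not cope with slashes inside attribute values. All names in it are hypothetical.

```python
import re

# Elements declared EMPTY in the HTML 4.01 DTD (a subset, enough for the demo).
EMPTY_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

# "<", an element name, then attribute text up to the next "/", ">" or "<".
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>/]*)")


def tokenize_with_net(markup):
    """Tokenize markup the way a NET-aware HTML/SGML parser would (toy model)."""
    tokens = []
    pos = 0
    while pos < len(markup):
        match = START_TAG.match(markup, pos)
        if match is None:
            # Character data runs up to the next "<" (or the end of input).
            end = markup.find("<", pos + 1)
            end = len(markup) if end == -1 else end
            tokens.append(("text", markup[pos:end]))
            pos = end
            continue
        name = match.group(1).lower()
        pos = match.end()
        if markup.startswith("/", pos):
            # Null End Tag: the slash finishes the start tag on the spot.
            pos += 1
            if name in EMPTY_ELEMENTS:
                # An empty element is complete as soon as its start tag ends.
                tokens.append(("element", name, ""))
            else:
                # Content runs up to the next slash, which acts as the end tag.
                end = markup.find("/", pos)
                end = len(markup) if end == -1 else end
                tokens.append(("element", name, markup[pos:end]))
                pos = end + 1
        else:
            # Ordinary start tag: skip the closing ">" if it is there.
            if markup.startswith(">", pos):
                pos += 1
            tokens.append(("start", name))
    return tokens


print(tokenize_with_net("<title/My page/"))
print(tokenize_with_net("line one<br />line two"))
```

In the second call the br element ends at the slash, so the > that XHTML authors think of as part of the tag comes back as the first character of the following text.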

Firefox and other problems

Although Firefox supports the parsing of XHTML documents as XML when sent with the application/xhtml+xml content type, its performance in versions 2.0 and below is actually worse than with HTML. When parsing a page as HTML, Firefox will begin displaying the page while the content is being downloaded. This is called incremental rendering. However, when it's parsing XML content, Firefox 2.0 and below will wait until the entire page is downloaded and checked for well-formedness before any of the content is displayed. This means that, although in theory XML is supposed to be faster to parse than HTML, in reality these versions of Firefox usually display HTML content to the user much faster than XHTML/XML content. Thankfully, this issue is expected to be resolved in Firefox 3.0.

However, there are also issues in other browsers, such as certain HTML-specific provisions in the CSS and DOM standards being mistakenly applied to XHTML content parsed as XML. For example, if there is a background set on the body element and none on the html element, Opera will apply the background to the html element as it would in HTML. So even when dealing exclusively with XHTML parsed as XML, you still run into a number of the same problems that you do when trying to serve XHTML either way.

All in all, true XHTML support in major user agents is still very weak. Because a key user agent, namely Internet Explorer, has made no visible effort to support XHTML, other major user agents have continued to see it as a relatively low priority and so these bugs have lingered. HTML is recommended over XHTML by both Mozilla and Safari and is generally better supported than XHTML by all major browsers.
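The tension between incremental rendering and XML's error rules can be illustrated with an incremental parser. The sketch below uses Python's XMLPullParser, which has nothing to do with Firefox's implementation; it only shows that a well-formedness error can sit in the last chunk of a download, after earlier content has already been parsed successfully. The chunk contents and the `parse_incrementally` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_incrementally(chunks):
    """Feed chunks to a pull parser; return (tags closed so far, error or None)."""
    parser = ET.XMLPullParser(events=("end",))
    closed = []
    try:
        for chunk in chunks:
            parser.feed(chunk)
            # Elements completed by this chunk are available right away.
            for _event, element in parser.read_events():
                closed.append(element.tag)
        parser.close()  # reports input that ends before the document does
    except ET.ParseError as error:
        return closed, str(error)
    return closed, None


# The second paragraph is never closed, but that only becomes
# apparent when the final chunk arrives.
chunks = [
    "<html><body><p>first paragraph</p><p>sec",
    "ond paragraph</body></html>",
]

closed, error = parse_incrementally(chunks)
print("parsed before the error:", closed)
print("error:", error)
```

A user agent that displays content as it arrives has to decide what to do with the first paragraph once the error shows up; one that waits for the whole document avoids that question at the cost of a slower first display.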

Conclusion

XHTML is a very good thing, and I certainly hope to see it gain widespread acceptance in the future. However, it simply isn't widely supported in its proper form. XHTML is an XML format, and to force a web browser to treat it like HTML is going against the whole purpose of XHTML and also inevitably causes other complications. Assuming you don't want to dramatically limit access to your information, XHTML can only be used incorrectly, be interpreted as invalid markup by most user agents, cause unwanted results in others, and offer no added benefit over HTML. HTML 4.01 Strict is still what most user agents and search engines are most accustomed to, and there's absolutely nothing wrong with using it if you don't need the added benefits of XML. HTML 4.01 is still a W3C Recommendation, and the W3C has even announced plans to further develop HTML alongside XHTML in the future.