This is a guest post from Henri Sivonen, who has been working on Firefox’s new HTML5 parser. The HTML parser is one of the most complicated and sensitive pieces of a browser. It controls how your HTML source is turned into web pages and as such changes to it are rare and need to be well-tested. While most of Gecko has been rebuilt since its initial inception in the late 90s, the parser was one of the stand-outs as being “original.” This replaces that code with a new parser that’s faster, compliant with the new HTML5 standard and enables a lot of new functionality as well.

A project to replace Gecko’s old HTML parser, dating from 1998, has been ongoing for some time now. The parser was just turned on by default on the trunk, so you can now try it out by simply downloading a nightly build without having to flip any configuration switch.

There are four main things that improve with the new HTML5 parser:

You can now use SVG and MathML inline in HTML5 pages, without XML namespaces.

Parsing is now done off Firefox’s main UI thread, improving overall browser responsiveness.

It’s improved the speed of innerHTML calls by about 20%.

calls by about 20%. With the landing of the new parser we’ve fixed dozens of long-standing parser related bugs.

Try the demo with a Firefox Nightly or another HTML5-ready browser. It should look like this:

What Is It?

The HTML5 parser in Gecko turns a stream of bytes into a DOM tree according to the HTML5 parsing algorithm.

HTML5 is the first specification that tells implementors, in detail, how parse HTML. Before HTML5, HTML specifications didn’t say how to turn a stream of bytes into a DOM tree. In theory, HTML before HTML5 was supposed to be defined in terms of SGML. This implied a certain relationship between the source of valid HTML documents and the DOM. However, parsing wasn’t well-defined for invalid documents (and Web content most often isn’t valid HTML4) and there are SGML constructs that were in theory part of HTML but that in reality popular browsers didn’t implement.

The lack of a proper specification led to browser developers filling in the blanks on their own and reverse engineering the browser with the largest market share (first Mosaic, then Netscape, then IE) when in doubt about how to get compatible behavior. This led to a lot of unwritten common rules but also to different behavior across browsers.

The HTML5 parsing algorithm standardizes well-defined behavior that browsers and other applications that consume HTML can converge on. By design, the HTML5 parsing algorithm is suitable for processing existing HTML content, so applications don’t need to continue maintaining their legacy parsers for legacy content. Concretely, in the trunk nightlies, the HTML5 parser is used for all text/html content.

How Is It Different?

The HTML5 parsing algorithm has two major parts: tokenization and tree building. Tokenization is the process of splitting the source stream into tags, text, comments and attributes inside tags. The tree building phase takes the tags and the interleaving text and comments and builds the DOM tree. The tokenization part of the HTML5 parsing algorithm is closer to what Internet Explorer does than what Gecko used to do. Internet Explorer has had the majority market share for a while, so sites have generally been tested not to break when subjected to IE’s tokenizer. The tree building part is close to what WebKit does already. Of the major browser engines WebKit had the most reasonable tree building solution prior to HTML5.

Furthermore, the new HTML5 parser parses network streams off the main thread. Traditionally, browsers have performed most tasks on the main thread. Radical changes like off-the-main-thread parsing are made possible by the more maintainable code base of the HTML5 parser compared to Gecko’s old HTML parser.

What’s In It for Web Developers?

The changes mentioned above are mainly of interest to browser developers. A key feature of the HTML5 parser is that you don’t notice that anything has changed.

However, there is one big new Web developer-facing feature, too: inline MathML and SVG. HTML5 parsing liberates MathML and SVG from XML and makes them available in the main file format of the Web.

This means that you can include typographically sophisticated math in your HTML document without having to recast the entire document as XHTML or, more importantly, without having to retrofit the software that powers your site to output well-formed XHTML. For example, you can now include the solution for quadratic equations inline in HTML:

x = − b ± b 2 − 4 ⁢ a ⁢ c 2 ⁢ a

Likewise, you can include scalable inline art as SVG without having to recast your HTML as XHTML. As screen sized and pixel densities become more varied, making graphics look crisp at all zoom levels becomes more important. Although it has previously been possible to use SVG graphics in HTML documents by reference (using the object element), putting SVG inline is more convenient in some cases. For example, an icon such as a warning sign can now be included inline instead of including it from an external file.

Make yourself a page that starts with <!DOCTYPE html> and put these two pieces of code in it and it should work with a new nightly.

In general, if you have a piece of MathML or SVG as XML, you can just copy and paste the XML markup inline into HTML (omitting the XML declaration and the doctype if any). There are two caveats: The markup must not use namespace prefixes for elements (i.e. no svg:svg or math:math ) and the namespace prefix for the XLink namespace has to be xlink .

In the MathML and SVG snippits above you’ll see that the inline MathML and SVG pieces above are more HTML-like and less crufty than merely XML pasted inline. There are no namespace declarations and unnecessary quotes around attribute values have been omitted. The quote omission works, because the tags are tokenized by the HTML5 tokenizer—not by an XML tokenizer. The namespace declaration omission works, because the HTML5 tree builder doesn’t use attributes looking like namespace declarations to assign MathMLness or SVGness to elements. Instead, <svg> establishes a scope of elements that get assigned to the SVG namespace in the DOM and <math> establishes a scope of elements that get assigned to the MathML namespace in the DOM. You’ll also notice that the MathML example uses named character references that previously haven’t been supported in HTML.

Here’s a quick summary of inline MathML and SVG parsing for Web authors: