Hi, there!

As you might have guessed from the title, today's topic is HTML parsing.

Preface

Once I got an X idea, but its implementation required a calculated DOM with all its styles and goodies. Googling retrieved nothing useful. There are all kinds of bindings for WebKit, but they work not on all platforms, being very crippled to boot. In some projects, WebKit is wrapped in a frontend you work with via JavaScript. Something was tried out — with a poor outcome though. Resource consumption alone required a lot.

Wants and Wishes

What was wanted didn't seemed much:

HTML renderer without super dependencies. Only a renderer, without the network layer. In other words, complete HTML computation till drawing in a window

Ability to fit the envelop to a JavaScript engine

Ability to easily make bindings for other programming languages

And so I joined the unequal battle!

I studied existing HTML and CSS parsers. They all fell into 3 conventional categories:

Those parsing at random, with their own approach to HTML tokenization

Those parsing somehow under specs

Those parsing strictly under specs

Given No. 3, the subject seems to be dropped, doesn't it?! Nope, and here's why: all existing parsers are made on the principle 'Parse and Die'. It's when you give the program complete HTML, the program returns a result, but all subsequent manipulations except reading are impossible. This fact limits the operation of parsers. Remarkably, some push operating DOM off a level up. Here's the principle: we parse with a C parser and then — via bindings — try to work with DOM on, say, Python, which is a bit absurd.

Further on, nobody allowed for wedging into the thread (HTML meant here) during parsing. This is critical for fitting a JavaScript engine. It's a long story — I'd better show:

< script > document . write ( " <div cl " );</ script > ass="future"></ div >

< div class = " future " ></ div >

HTML document fragment:

So, a fully featured DIV element will come out. By the way, SCRIPT tag tokenization is a hell of an effort. I had to draw a graph.

After all that had been seen, the decision was to code from scratch on C. And requirements to code appeared at once:

C99 support

Capability to separate the HTML parser from the renderer, to be used standalone

No external dependencies

Why so tough — on C?! The solution had to be embedded so as to enable relatively easy framing for an external programming language.

Here's what was managed to be drafted hit and miss:

HTML parser

CSS parser

Selectors

Renderer of 'inline', 'inline-block', 'block', 'table'...

The renderer may deserve a long description, as the short phrase 'Renderer of inline elements' conceals a lot: handling fonts under specs, calculating text size, computing 'vertical-align', building an auxiliary tree to draw text and a whole lot more.

As a result of 2–3 years of unhurried development, I started rewriting the draft copy into a production version. The first was — quite logically — the HTML parser.

Now it has the following capabilities:

Parsing HTML asynchronously, processing tokens, building the tree

Full HTML 5 support under specs html.spec.whatwg.org/multipage

Having 2 APIs: high and low-level. The former is a public API having a description and everything it should, but unable to see structures. The latter is using sources directly

Ability to manipulate elements: addition, removal, modification

Ability to manipulate element attributes: addition, removal, modification

Support of 34 input encodings. Output and all internal work is in UTF-8 only

Ability to define text encoding Unicode now available: UTF-8, UTF-16LE, UTF-16BE (+ definition by BOM), and Russian ones: windows-1251, koi8-r, iso-8859-5, x-mac-Cyrillic, ibm866

Ability to run in single mode — without threads

Parsing HTML fragments

Parsing chunks. Parsing HTML cuts (broken in arbitrary places) without prebuffering

No external dependencies

C99 support

Passes all tree construction tests from html5lib-tests

Advanced memory management. Memory is cashed, allocated in chunks and for objects. For example, removing 10 elements and then adding other 10 won't eat away memory for the new ones



+ a whole lot more small but useful features to be described for long.

Next in turn are the CSS parser and Renderer. I'm writing them all by myself, still full of energy.

Any help is very welcome!

Thanks for attention! Hope you'll enjoy it!

The parser itself

See next article: Benchmark. Analyzing and Testing Current HTML Parsers