Edward O’Connor: Fortunately, Node already has an excellent implementation of the HTML5 parser (by Aria Stewart)

I find it rather amusing that the first thing I encountered was a bug. This bug was quickly addressed, and I’ve verified the fix.

Actually, that was the second problem. The first was that when I installed node.js from git, npm wouldn’t install. The symptoms were that the npm installer would download, install to a temporary directory, attempt to install for real, remove the temporary directory, and then report success. I downloaded the script, removed the code that deletes the temporary directory, ran it again, went into that temporary directory, and ran make manually. That produced a failure (simply a return code of 1 with no other information), which apparently didn’t cause the installation to be reported as a failure.
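
The failure mode described above, a step that exits with status 1 while the installer still reports success, is what an explicit exit-status check guards against. The following is a minimal sketch, assuming a current version of Node (spawnSync did not exist in 0.2.6); runStep is a hypothetical helper and not part of npm or its install script.

```javascript
var spawnSync = require('child_process').spawnSync;

// Hypothetical helper: run one build step and refuse to continue
// unless it exits with status 0.
function runStep(command, args) {
  var result = spawnSync(command, args, { encoding: 'utf8' });
  if (result.error) {
    // The command could not be started at all (for example, not found).
    throw result.error;
  }
  if (result.status !== 0) {
    // A non-zero status (such as the bare 1 from make) becomes a loud failure.
    throw new Error(command + ' exited with status ' + result.status);
  }
  return result.stdout;
}
```

With a wrapper like this, something like runStep('make', []) would stop the script at the failing step instead of letting it clean up and declare success.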

Here are the installation instructions that actually worked for me (backing up to the stable version):

    sudo apt-get install g++ curl libssl-dev apache2-utils
    wget http://nodejs.org/dist/node-v0.2.6.tar.gz
    tar xzf node-v0.2.6.tar.gz
    cd node-v0.2.6
    ./configure --prefix=$HOME
    make
    make install
    curl http://npmjs.org/install.sh | sh

With that fix in place, I was able to proceed to run the test I wanted:

    var http = require('http'),
        html5 = require('html5'),
        jsdom = require('jsdom'),
        window = jsdom.jsdom().createWindow(null, null, {parser: html5});

    var rubix = http.createClient(80, 'intertwingly.net');
    var request = rubix.request('GET', '/blog/', {'host': 'intertwingly.net'});
    request.end();

    request.on('response', function (response) {
      var parser = new html5.Parser({document: window.document});
      parser.parse(response);
      jsdom.jQueryify(window, 'jquery-1.4.4.min.js', function(window, $) {
        $('h3').each(function() {
          console.log($(this).text());
        });
      });
    });

Observations:

The html5 readme is incorrect, in that there needs to be a call to jsdom.jsdom() ahead of the call to createWindow. Failing to do this causes the script to complete without error. I was able to figure out what needed to be done by looking at the jQueryify example in the jsdom readme.

The default HTTP client that is provided with node.js doesn’t provide any headers on GET requests, not even the very much required Host header. Overall this is a good thing, as it increases the visibility of headers. What it does mean is that features such as deflate and gzip will not be provided automatically. These, along with niceties such as ETag handling, can be provided by higher level frameworks.
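
To make the header point concrete, here is a minimal sketch of spelling out every header and undoing the compression by hand. It assumes a current version of Node, where http.request takes an options object, fills in Host itself if you leave it out, and zlib is a core module; none of that applies to 0.2.6. buildOptions and decodeBody are hypothetical helper names.

```javascript
var zlib = require('zlib');

// Options for http.request with every header written out explicitly.
function buildOptions(host, path) {
  return {
    host: host,
    port: 80,
    method: 'GET',
    path: path,
    headers: {
      'Host': host,
      // Asking for compression means we must undo it ourselves.
      'Accept-Encoding': 'gzip, deflate'
    }
  };
}

// Decode a fully buffered body according to its Content-Encoding header.
function decodeBody(body, contentEncoding) {
  if (contentEncoding === 'gzip') return zlib.gunzipSync(body);
  if (contentEncoding === 'deflate') return zlib.inflateSync(body);
  return body; // no encoding, or one this sketch does not handle
}
```

In use, the options would go to http.request(buildOptions('intertwingly.net', '/blog/'), callback), and the callback would collect the chunks, join them with Buffer.concat, and pass the result to decodeBody along with response.headers['content-encoding'].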

Apparently by default my front page is currently sent as three chunks. As handling things asynchronously is ingrained into everything that is a part of node.js, the parsing of the page can begin as soon as the first bytes are delivered. Even in single user scripts this can create a perceptible improvement in the responsiveness of handling large documents, as the parsing can overlap the fetching.
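
The overlap of parsing and fetching can be illustrated without a real parser. The sketch below is a deliberately crude regular expression scanner, not an HTML parser and not how html5 works internally; createHeadingScanner is a hypothetical helper. It reports each h3 as soon as the chunk containing its closing tag arrives, even when a tag is split across chunks.

```javascript
// Sketch only: a regex scanner standing in for an incremental parser.
function createHeadingScanner(onHeading) {
  var buffer = '';
  var heading = /<h3\b[^>]*>([\s\S]*?)<\/h3>/i;

  return {
    // Call once per chunk, e.g. from response.on('data', ...).
    write: function (chunk) {
      buffer += String(chunk);
      var match;
      // Report every heading that is now complete.
      while ((match = heading.exec(buffer)) !== null) {
        onHeading(match[1].replace(/<[^>]*>/g, '').trim());
        buffer = buffer.slice(match.index + match[0].length);
      }
      // Keep only what could still become a heading: an unclosed <h3 ...,
      // or the last two characters in case "<h3" is split across chunks.
      var open = buffer.search(/<h3\b/i);
      buffer = open >= 0 ? buffer.slice(open) : buffer.slice(-2);
    }
  };
}
```

Wired to a response with response.setEncoding('utf8') and response.on('data', scanner.write), the first headings would print while later chunks are still in flight.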

Speaking of asynchronicity, when I run the above script I get a variable number of headings returned. I presume what is happening is that if the html5 parser is passed an EventEmitter, it returns immediately and provides an event once it is complete. I’ll verify that using the documentation... once I find the documentation, that is.
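
If that presumption holds, the fix is to query the document only after the completion event fires. The sketch below shows the pattern with a stand-in parser built on Node's EventEmitter. SketchParser, parseThen, and the 'end' event name are all assumptions for illustration; I have not confirmed what event, if any, html5.Parser emits.

```javascript
var EventEmitter = require('events').EventEmitter;
var util = require('util');

// Hypothetical stand-in for a streaming parser: parse() returns at once
// and 'end' is emitted only when the source has delivered everything.
function SketchParser() {
  EventEmitter.call(this);
  this.text = '';
}
util.inherits(SketchParser, EventEmitter);

SketchParser.prototype.parse = function (source) {
  var self = this;
  source.on('data', function (chunk) { self.text += chunk; });
  source.on('end', function () { self.emit('end'); });
};

// Run the callback only once the parser reports completion, rather than
// immediately after parse() returns.
function parseThen(parser, source, callback) {
  parser.once('end', function () { callback(parser.text); });
  parser.parse(source);
}
```

In the script above, the equivalent change would be to move the jQueryify call into the completion handler instead of running it straight after parser.parse(response).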

Being able to run jQuery on the server is a big win. What I plan to explore is the idea of replacing templates with pure HTML in some of my scripts. Instead of littering my code with expressions and code to be evaluated, I would like to do the equivalent of Unobtrusive JavaScript, and have a prototype document which is updated (in parallel!) by a number of scripts before the results are returned.

Speaking of running scripts, when I run this script I get the following output:

    ENOENT, No such file or directory '/js/jquery-1.3.2.min.js'
    ENOENT, No such file or directory '/js/jq_localize_dates.js'

What this indicates to me is that the parser is attempting to execute the scripts, which makes sense as the parser is in JavaScript already after all. All I would need to do is provide the base URI to use, which again I presume is in the documentation.

This is exciting in every sense of the word. It is very powerful, and at times could be useful (in this case it would convert all of the dates in the page from GMT to local time). It is also potentially very dangerous. If parsing a remote page could make a script run that obtains access to the full power of node.js, it could access your file system and run commands. I hope that there is an option to turn this off.
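
On the base URI point: the ENOENT messages suggest the script paths were looked up on the local file system as written. The sketch below shows what resolving them against the page's address would give instead. It uses the URL class built into current versions of Node, which is not something available in 0.2.6, and it says nothing about how html5 or jsdom accept a base URI; resolveScript is a hypothetical helper.

```javascript
// Resolve a script's src attribute against the URI of the page it came from.
function resolveScript(src, baseUri) {
  return new URL(src, baseUri).href;
}

var base = 'http://intertwingly.net/blog/';
console.log(resolveScript('/js/jquery-1.3.2.min.js', base));
// prints "http://intertwingly.net/js/jquery-1.3.2.min.js"
console.log(resolveScript('/js/jq_localize_dates.js', base));
// prints "http://intertwingly.net/js/jq_localize_dates.js"
```

Resolving the address is also the natural place to decide whether a script should be fetched at all, for instance by refusing anything outside a list of trusted hosts.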

Next time I pick this up I’ll have to try something larger.