
Why clean markup?

Client-side optimization is getting a lot of attention lately, but some of its basic aspects seem to go unnoticed. If you look carefully at pages on the web (even those that are supposed to be highly optimized), it’s easy to spot plenty of redundancy, along with inefficient or archaic structures in their markup. All this baggage adds extra weight to pages that are supposed to be as light as possible.

The reason to keep documents clean is not so much faster load times as having a solid and robust foundation to build upon. Clean markup means better accessibility, easier maintenance, and good search engine visibility. Smaller size is just a property of clean documents, and another reason to keep them this way.

In this post, we’ll take a look at HTML optimization: removing some of the common markup smells, reducing document size by getting rid of redundant structures, and employing minification techniques. We’ll look at currently available minification tools, and analyze what they do right and wrong. We’ll also talk about what can be done in the future.

Markup smells

So what are the most common offenders?

1. HTML comments in scripts

One of the gross redundancies nowadays is the inclusion of HTML comments ( <!-- --> ) in script blocks. There’s not much to say here, except that browsers that actually need this error-prevention measure (such as ‘95 Netscape 1.0) are pretty much extinct. Comments in scripts are just unnecessary baggage and should be removed ferociously.

2. <![CDATA[ … ]]> sections

Another often needless error-prevention measure is the inclusion of CDATA blocks in SCRIPT elements:

<script type="text/javascript">
//<![CDATA[
...
//]]>
</script>

It’s a noble goal that falls short in reality. While CDATA blocks are a perfectly good way to prevent an XML processor from recognizing < and & as the start of markup, that only applies to true XHTML documents: those served with the “application/xhtml+xml” content type. The majority of the web is still served as “text/html” (since, for example, IE doesn’t understand XHTML to this date), and so is parsed by browsers as HTML, not as XML.

Unless you’re serving documents as “application/xhtml+xml”, there’s little reason to have CDATA sections hanging around. Even if you’re planning to use XHTML in the future, it might make sense to remove the unnecessary weight from the document now, and only add it back later, when it’s actually needed.
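Both wrappers (HTML comments and CDATA markers) can be stripped mechanically. Here’s a rough sketch of how that might look; stripScriptWrappers is a hypothetical helper name, and the sketch assumes the markers sit on the first and last lines of the script and use the bare or “//”-prefixed forms shown above. Anything else (a “/* */”-wrapped marker, for example) is left untouched.

```javascript
// Hypothetical helper: removes the legacy "<!--" ... "//-->" and
// "//<![CDATA[" ... "//]]>" wrappers from the text of an inline script.
// Only appropriate for documents served as text/html.
function stripScriptWrappers(scriptText) {
  return scriptText
    // Opening marker plus the rest of its line (browsers ignore that line too).
    .replace(/^\s*(?:\/\/\s*)?(?:<!--|<!\[CDATA\[)[^\n]*\n?/, "")
    // Closing marker at the very end of the script.
    .replace(/(?:\/\/\s*)?(?:-->|\]\]>)\s*$/, "")
    .trim();
}

console.log(stripScriptWrappers("\n//<![CDATA[\nvar a = 1;\n//]]>\n"));
console.log(stripScriptWrappers("<!--\nalert(1);\n//-->"));
```

A script without any wrapper passes through unchanged (apart from trimming), so the helper is safe to run over every inline script in a document.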

And, of course, the ultimate solution here is to avoid inline scripts altogether (to take advantage of external script caching).

3. onclick="…", onmouseover="", etc.

There are some valid use cases for intrinsic event attributes, such as performance reasons or targeting ancient browsers (although I’m not aware of any environment that understands event attributes, such as onclick="...", but not property-based assignments, such as element.onclick = ...). Besides the well-known reasons to avoid them, such as separation of concerns and reusability, there’s the matter of markup pollution. By moving event logic to an external script, we can take advantage of that script’s caching. Event handler logic doesn’t need to be transferred to the client every time the document is requested.

4. onclick="javascript:…"

An interesting confusion of the javascript: pseudo-protocol with intrinsic event handlers results in this redundant mix (with 106,000 (!) occurrences). The truth is that the entire contents of an event handler attribute become the body of a function. That function then serves as the event handler (usually after having its scope augmented to include some or all of the ancestors and the element itself). The “javascript:” prefix merely becomes an unnecessary label and rarely serves any purpose.
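The “attribute value becomes a function body” idea can be modelled in a few lines. This is only an approximation: compileHandler is a hypothetical name, and real browsers also augment the handler’s scope chain, which this sketch ignores. It does show that a “javascript:” prefix parses as an ordinary statement label and changes nothing.

```javascript
// Rough model of how an event attribute's value is treated: the whole string
// becomes the body of a function that receives `event`.
function compileHandler(attributeValue) {
  return new Function("event", attributeValue);
}

const plain = compileHandler("return event.type + '!';");
// Here "javascript:" is just a label attached to the return statement.
const prefixed = compileHandler("javascript: return event.type + '!';");

console.log(plain({ type: "click" }));
console.log(prefixed({ type: "click" }));
```

Both handlers return the same value, so the prefix is dead weight in every attribute that carries it.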

5. href="javascript:void(0)"

Continuing with the “javascript:” pseudo-protocol, there’s the infamous href="javascript:void(0)" snippet, used as a way to prevent default anchor behavior. This terrible practice of course makes the anchor completely inaccessible when Javascript is disabled, not available, or errors out. It should go without saying that the ideal solution is to include a proper URL in href, and stop the default anchor behavior in an event handler. If, on the other hand, the anchor element is created dynamically and then inserted into the document (or is hidden initially, then shown via Javascript), a plain href="#" is a leaner and faster alternative to the “javascript:” version.
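A minimal sketch of the “proper URL plus event handler” approach might look like this. The helper name enhanceLink, the loadArticles callback and the URL in the usage comment are all made up for illustration.

```javascript
// Hypothetical helper: the anchor keeps a real URL in href, and navigation
// is cancelled only when the script actually runs.
function enhanceLink(link, action) {
  link.addEventListener("click", function (event) {
    event.preventDefault(); // stop the default anchor behavior
    action(link.href);      // do the scripted work instead
  });
}

// In a page (markup and callback are assumptions):
//   <a id="more" href="/articles/page/2">More articles</a>
//   enhanceLink(document.getElementById("more"), loadArticles);
```

If the script fails to load or throws before enhanceLink runs, the link simply behaves like a normal link.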

6. style="…"

There’s nothing inherently wrong with the style attribute, except that by moving its contents to an external stylesheet, we can take advantage of resource caching. This is similar to avoiding event attributes, mentioned earlier. Even if you only need to style one particular element and are not planning to reuse its styles, remember that style information has to be transferred every time the document is requested. Moving styles to an external resource prevents this, as the stylesheet is transferred once and then cached on the client.

7. <script language="Javascript" … >

Probably one of the most misunderstood attributes is SCRIPT’s “language”. This attribute is so archaic that it was already deprecated in 1999 (!), 10 years ago, when HTML 4.01 became an official recommendation. There’s absolutely no reason to use this attribute, except for the rare cases when a language version needs to be specified (and even that is somewhat unreliable and should probably be avoided if possible).

8. <script charset="…" … >

Another misunderstanding about the SCRIPT element concerns the charset attribute. Sometimes I see documents that include this kind of markup:

<script type="text/javascript" charset="UTF-8"> ... </script>

The thing is that the charset attribute only really makes sense on “external” SCRIPT elements, those that have a “src” attribute. HTML 4.01 even says:

Note that the charset attribute refers to the character encoding of the script designated by the src attribute; it does not concern the content of the SCRIPT element.

Testing shows that actual browser behavior also matches the specs in this regard.

Searching for this pattern reveals about 2000 occurrences. Not surprising, given that even popular apps like Textmate include wrong usage of charset.
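Spotting this pattern in your own pages is easy to automate. The following is a rough, regex-based sketch (findInlineScriptsWithCharset is a hypothetical name); it assumes attribute values don’t contain a “>” character, which a real parser wouldn’t need to assume.

```javascript
// Returns the opening tags of SCRIPT elements that carry a charset attribute
// but no src attribute, i.e. inline scripts where charset has no effect.
function findInlineScriptsWithCharset(html) {
  const openingTags = html.match(/<script\b[^>]*>/gi) || [];
  return openingTags.filter(function (tag) {
    const hasCharset = /\scharset\s*=/i.test(tag);
    const hasSrc = /\ssrc\s*=/i.test(tag);
    return hasCharset && !hasSrc;
  });
}

const page =
  '<script type="text/javascript" charset="UTF-8">var a;</script>' +
  '<script src="a.js" charset="UTF-8"></script>';

console.log(findInlineScriptsWithCharset(page));
```

Only the first SCRIPT tag is reported; the second has a src attribute, so its charset is meaningful.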

Additional optimizations

We’ve covered some of the bad practices that should almost always be avoided. But there’s still more ahead, and that “more” is removing redundant parts. The optimizations explained below are often questionable, as they trade clarity for size. I therefore include them here not as a recommendation, but merely as an option. Employ with careful consideration.

1. <style media="all" …>

HTML 4.01 defines the media attribute on STYLE elements as a way of targeting a specific medium: screen, print, handheld, and so on. One of the possible values for media is “all”, which also happens to be the de facto default among modern (and not so modern) browsers. If you find yourself using media="all", it should be safe to just omit it and let the browser set the value implicitly.

Interestingly, HTML 4.01 states that default value for media is “screen”. However, none of the browsers I tested [1] implement it as per specs, and default to “all” instead. This is probably why HTML 5 draft specifies default value as “all” — to match actual browsers’ behavior.

2. <form method="get" …>

GET, the default value of the FORM element’s “method” attribute, is also often specified explicitly. There’s no harm in dropping it, apart from some loss of clarity. Note that the HTML 5 draft leaves this behavior untouched.

3. <input type="text" …>

The INPUT element’s “type” defaults to “text” in both HTML 4.01 and the HTML 5 draft. Dropping this attribute can result in substantial size savings on pages with lots of text fields.
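The three default-value removals above can be sketched as a small rule table. Everything here (DEFAULT_ATTRIBUTES, dropDefaultAttributes) is a hypothetical name, and the matching is regex-based rather than a real HTML parse, so treat it as an illustration under those assumptions.

```javascript
// Each rule names a tag, an attribute, and the value browsers assume anyway.
const DEFAULT_ATTRIBUTES = [
  { tag: "style", attr: "media", value: "all" },
  { tag: "form", attr: "method", value: "get" },
  { tag: "input", attr: "type", value: "text" },
];

function dropDefaultAttributes(html) {
  return DEFAULT_ATTRIBUTES.reduce(function (result, rule) {
    const openingTag = new RegExp("<" + rule.tag + "\\b[^>]*>", "gi");
    // Matches attr="value", attr='value' or attr=value, and nothing longer.
    const attribute = new RegExp(
      "\\s+" + rule.attr + "\\s*=\\s*([\"']?)" + rule.value + "\\1(?=[\\s/>])",
      "i"
    );
    return result.replace(openingTag, function (tag) {
      return tag.replace(attribute, "");
    });
  }, html);
}

console.log(dropDefaultAttributes(
  '<form method="get" action="/s"><input type="text" name="q"></form>'
));
```

One side effect to watch for: a CSS selector such as input[type="text"] no longer matches once the attribute is gone, so stylesheets and scripts that rely on it need to change as well.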

4. <meta http-equiv="Content-type" …>

Specifying a document’s character encoding has always been a source of great confusion. Contrary to common belief, a META element that specifies Content-type does not have higher priority than the “Content-type” HTTP header that the document is served with. When both the header and the META element are specified, the header takes precedence.

If you control the server response and can set up the Content-type header properly, it’s safe to omit the META element. The only reason to keep it is to specify the encoding when the document is viewed offline.
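As an illustration of setting the header on the server side, here’s a sketch for a Node.js-style response object. The sendHtml name, the UTF-8 choice and the port in the usage comment are assumptions; any server or framework that lets you set response headers works the same way.

```javascript
// Declares the encoding in the HTTP header, so the document itself
// doesn't need a META element for it.
function sendHtml(response, html) {
  response.setHeader("Content-Type", "text/html; charset=utf-8");
  response.end(html);
}

// Usage with Node's built-in http module (port 8080 is arbitrary):
//   require("http")
//     .createServer((request, response) => sendHtml(response, "<!doctype html><title>Hi</title><p>Hello"))
//     .listen(8080);
```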

5. <a id="…" name="…" …>

The main reason the “name” attribute is still used together with “id” is compatibility with ancient browsers (e.g. Netscape 4). Those couldn’t link to anchors by “id”, so “name” had to be used. If you have elements with paired name/id attributes, and don’t care about ancient browsers, feel free to get rid of this archaic pattern.

Watch out for any side effects. If you’re referencing elements by name in scripts ( document.getElementsByName , document.evaluate , document.querySelectorAll , etc.), replacing name’s with id’s might break things. Also remember that document.anchors only returns elements with name attributes.

6. <!doctype html>

A little more than a year ago, Dustin Diaz proposed using the HTML 5 doctype as a way to cut down on document size. This is not a major optimization, but if you don’t care about validation and need to squeeze every single byte out of the page, using <!doctype html> is a viable option. Tests revealed that this fancy doctype triggers standards mode in a large variety of browsers.

Aggressive optimizations

If you’re still craving more, here are a few extreme ideas. Some of these (e.g. omitting optional tags) have been circulating for a while. Others I haven’t heard mentioned. Even though these might seem way too obtrusive, note that none of them really invalidates a document. That is, if the document is in HTML, not XHTML. But you’re serving documents as HTML anyway, aren’t you? ;)

- Remove HTML comments
- Remove/collapse whitespace
- Remove optional closing tags ( <p>foo</p> → <p>foo )
- Remove quotes around attribute values, when allowed ( <p class="foo"> → <p class=foo> )
- Remove optional values from boolean attributes ( <option selected="selected"> → <option selected> )
- Munge inline styles, inline scripts and event attributes (if it’s not possible to remove them)
- Munge classes and ids (needs to be in sync with scripts and style declarations)
- Strip scheme names off of URLs ( http://example.com → //example.com )
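To give a feel for what such a tool does, here’s a rough sketch covering four of the ideas above: comments, attribute quotes, boolean attributes, and the optional closing tag of LI. All names are hypothetical, the boolean attribute list is partial, and the whole thing is regex-based rather than a real parser, so it will also touch tag-like text inside inline scripts. It unquotes only values made of letters, digits, hyphens, periods, underscores and colons, which is what HTML 4.01 allows without quotes.

```javascript
// Attributes whose mere presence means "true" (partial, illustrative list).
const BOOLEAN_ATTRIBUTES = ["checked", "selected", "disabled", "readonly", "multiple"];
const SAFE_UNQUOTED = /^[A-Za-z0-9._:-]+$/;

// An opening tag: name, any number of attributes, optional trailing slash.
const OPENING_TAG = /<([a-z][a-z0-9]*)((?:\s+[^\s"'=<>\/]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?)*)\s*(\/?)>/gi;
// One attribute: name, then optionally a double-quoted, single-quoted or bare value.
const ATTRIBUTE = /\s+([^\s"'=<>\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))?/g;

function minifyAttribute(match, name, doubleQuoted, singleQuoted, unquoted) {
  const value = [doubleQuoted, singleQuoted, unquoted].find((v) => v !== undefined);
  if (value === undefined || BOOLEAN_ATTRIBUTES.includes(name.toLowerCase())) {
    return " " + name; // selected="selected" -> selected
  }
  if (unquoted !== undefined || SAFE_UNQUOTED.test(value)) {
    return " " + name + "=" + value; // class="foo" -> class=foo
  }
  const quote = singleQuoted !== undefined ? "'" : '"';
  return " " + name + "=" + quote + value + quote; // quotes are required here
}

function minifyHtml(html) {
  return html
    // Remove HTML comments (IE conditional comments would be lost as well).
    .replace(/<!--[\s\S]*?-->/g, "")
    // Rebuild every opening tag from its minified attributes.
    .replace(OPENING_TAG, (whole, tagName, attributes, slash) =>
      "<" + tagName + attributes.replace(ATTRIBUTE, minifyAttribute) + (slash ? " /" : "") + ">")
    // </li> is optional when another <li> or the end of the list follows.
    .replace(/<\/li>(?=\s*(?:<li\b|<\/ul>|<\/ol>))/gi, "");
}

console.log(minifyHtml(
  '<!-- nav --><ul class="menu"><li><a href="/a" title="Go to A">A</a></li><li>B</li></ul>'
));
```

Values that need their quotes (URLs with slashes, anything with spaces) are left quoted, which is why the href and title in the example survive as written.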

But we have compression!

Do all of these optimizations even matter when the document is compressed? Doesn’t gzip eliminate most of the markup overhead? After all, it’s a textual format we’re talking about!

It still matters.

First of all, it’s good to remember that not everyone is getting gzip. This is very sad, but the good thing is that in such cases HTML optimization plays an even more significant role.

Second, even if the document is served compressed, there are still savings of 5-10KB after compression (on an average document). Savings are even bigger with large documents. This might not seem like a lot, but in reality every byte counts.

As an example of compressing a large document, I munged the unofficial HTML version of the ECMA-262, 3rd edition specs, which originally weighed about 750KB (131KB gzipped), down to 606KB (115KB gzipped). That’s a saving of 16KB after gzipping, simply by removing whitespace, comments, attribute quotes and optional tags. The optimized version looks the same as the old one.
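If you want to run this kind of comparison on your own documents, here’s a small sketch using Node’s built-in zlib module. The function names are made up, and gzip is run at its default compression level, which may differ from what your server uses.

```javascript
const zlib = require("zlib");

// Size in bytes of a string once gzipped (default compression level).
function gzippedSize(text) {
  return zlib.gzipSync(Buffer.from(text, "utf8")).length;
}

// Amount saved, and the saving as a percentage of the original (one decimal).
function savings(before, after) {
  const saved = before - after;
  return { saved: saved, percent: Math.round((saved / before) * 1000) / 10 };
}

console.log(savings(750, 606)); // raw sizes quoted above, in KB
console.log(savings(131, 115)); // gzipped sizes quoted above, in KB
```

For the numbers quoted above this works out to 144KB (19.2%) before compression and 16KB (12.2%) after it, so a good share of the gain survives gzip.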

Finally, optimizations like stripping whitespace and comments actually make the resulting document tree lighter, potentially improving page rendering performance.

When things go wrong

As with any optimization, it’s very easy to get carried away. HTML Compact is a good example of HTML compression taken too far. This wonderful Windows app takes a “unique” approach to compressing HTML… by writing it into the document via Javascript.

Turning this perfectly clean document:

<html>
  <head>
    <title></title>
  </head>
  <body>
    <div>
      <ul>
        <li>foo</li><li>bar</li><li>baz</li>
        <!-- few more dozens of list elements ... -->
      </ul>
    </div>
  </body>
</html>

into this mess:

<!--hcpage status="compressed"-->
<html>
<head>
<SCRIPT LANGUAGE="JavaScript" SRC="hc_decoder.js"></SCRIPT>
<title></title>
</head>
<BODY>
<NOSCRIPT>To display this page you need a browser with JavaScript support.</NOSCRIPT>
<SCRIPT LANGUAGE="JavaScript">
<!--
hc_d0("Mv#d|\x3C:,&c@w4YFAtD1 [... and so on, another couple hundreds of characters ...]");
//-->
</SCRIPT>
</BODY>
</html>

Needless to say, this kind of “optimization” should never be performed on the public web, unless the intention is to make documents inaccessible to users and search engines. It also hurts to see those NOSCRIPT elements, which fall short for clients behind Javascript-blocking firewalls. Bad idea, bad execution.

Antipatterns

The previous snippet was a good example of an optimization anti-pattern. There are, however, a few more you should be aware of:

1. Removing doctype

HTML Compressor has an option, on by default, to strip the doctype. I can’t think of a case where stripping it would be beneficial. On the contrary, a missing doctype triggers quirks mode and, as a result, wreaks havoc on page layout and behavior. Doctypes should be left alone or, instead, replaced with the shorter HTML 5 version.
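Replacing a long doctype with the short one, rather than dropping it, takes a single substitution. A minimal sketch (shortenDoctype is a hypothetical name), which deliberately leaves documents without a doctype untouched instead of guessing:

```javascript
// Swaps whatever doctype opens the document for the short HTML 5 one.
// A document with no doctype at all is returned unchanged.
function shortenDoctype(html) {
  return html.replace(/^\s*<!doctype\b[^>]*>/i, "<!doctype html>");
}

console.log(shortenDoctype(
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>'
));
```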

2. Replacing STRONG with B and EM with I

Another harmful option in the same HTML Compressor is to replace elements with their shorter “alternatives”. The problem is that B is not really an alternative to STRONG, and neither is I a replacement for EM. STRONG and EM have semantic meaning (emphasis), whereas B and I are simply font-style elements: they affect text rendering, but carry no semantic meaning.

Even though browsers usually display these elements identically, screen readers and search engines very much understand the difference.

3. Removing title, alt attributes, and LABEL elements.

A good rule of thumb is to never optimize at the expense of accessibility. You might be tempted to remove that optional “alt” attribute on IMG elements, or “title” on anchors, but saving a few dozen bytes is really not worth the often-critical loss of accessibility.

It’s more or less trivial to automate most of the tweaks from the “additional optimizations” section. There already exist tools that strip comments and whitespace, and remove quotes around attribute values. But these are still in their infancy and perform a very limited set of optimizations. We can definitely do better.

A couple of months ago, hakunin and I started working on a similar, Ruby-based compressor, but never had a chance to finish it.

So what do we have so far?

- Absolute HTML Compressor (desktop, Windows): does a great job, but only after turning off options like stripping the doctype and replacing STRONG with B.
- HTML Compact (desktop, Windows): makes the document inaccessible. Avoid.
- HTML Compressor (desktop, Windows): only removes whitespace, and does so even in whitespace-sensitive elements, such as PRE. Not very useful.
- Pretty Diff (web-based): doesn’t have an option to completely remove whitespace (only collapses it). Doesn’t perform any optimizations except collapsing whitespace and removing newlines. Doesn’t respect whitespace-sensitive elements. Not very useful.
- htmlcompressor (Java-based): performs most of the optimizations described here (but doesn’t remove optional tags or shorten boolean attributes). Respects whitespace-sensitive elements. It is more or less the best option at the moment.
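Respecting whitespace-sensitive elements, the point where several of these tools fall down, doesn’t take much. Here’s a rough sketch that collapses runs of whitespace everywhere except inside PRE, TEXTAREA, SCRIPT and STYLE. The names are hypothetical, and as a regex-based approach it assumes those elements aren’t nested in unusual ways; it also collapses whitespace inside attribute values.

```javascript
// Either a whole whitespace-sensitive element (kept verbatim),
// or a run of whitespace (collapsed to a single space).
const SENSITIVE_OR_WHITESPACE =
  /(<(pre|textarea|script|style)\b[^>]*>[\s\S]*?<\/\2\s*>)|\s+/gi;

function collapseWhitespace(html) {
  return html.replace(SENSITIVE_OR_WHITESPACE, (match, sensitiveBlock) =>
    sensitiveBlock !== undefined ? sensitiveBlock : " ");
}

console.log(collapseWhitespace("<p>a   b</p>\n\n<pre>  x\n  y</pre>\n<p>c</p>"));
```

The sketch collapses rather than removes: dropping whitespace between inline elements entirely can change how text renders, so removal needs more knowledge of the document than a regex has.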

As you can see, the current state of affairs is pretty disappointing. There seem to be no compression tools for Mac/Linux, and those for Windows are hardly useful.

Future considerations

Whereas munging and stripping can (and should) be done during production, markup smells are something that should never happen in the first place, neither in production nor in development. Not unless, for whatever reason, they are absolutely necessary.

Unsurprisingly, the best optimization one can do is often a manual one: changing document structure to avoid repeating classes on multiple elements (and instead moving them to a parent element), or eliminating chunks that are not immediately needed and loading them dynamically instead. Replacing the myriad <br>’s or &nbsp;’s used inefficiently for presentational purposes, or that old table-based layout, are other good examples of manual cleaning.

As for all the other little tweaks, I expect more compression tools to appear in the near future, pushing size-reduction boundaries even further.

If you know more ways to optimize HTML, please share. I’d be glad to hear any questions, suggestions or corrections.