In the beginning the Web had ASCII. And that was good. But then, not really. The Europeans and their strange accents were a bit of a problem.

So then the Web had iso-latin1. And HTML could be assumed to be using that, by default (RFC2854, section 4). And that was good. But then, not really. There was a whole world out there, with a lot of writing systems, tons of different characters. Many different character encodings…

Today we have Unicode, at long last well adopted in most modern computing systems, and a basic building block of a lot of web technologies. And although there are still a lot of different character encodings available for documents on the web, this is not an issue, as there are mechanisms, both in HTTP and within HTML for instance, to declare the encoding used, and help tools determine how to decode the content.

All is not always rosy, however. The first issue is that there are quite a lot of mechanisms to declare encoding, and that they don’t necessarily agree. The second issue is that not everyone can configure a Web server to declare encoding of HTML documents at the HTTP level.

Many sources, One encoding

If the box says “dangerous, do not open”, don’t peek inside the box…

A long (web) time ago, there was a very serious discussion to try and determine whether a Web resource was supposed to know its encoding best, or whether the Web server should be the authoritative source.

In the “resource” camp, some were pushing the rather logical argument that a specific document surely knew more about its own metadata than a misconfigured Web server. Who cares if the server thinks that all HTML documents it serves are iso-8859-1, when I, as document author, know full well that I am authoring this particular resource as utf-8?

The other camp had two killer arguments.

The first, and perhaps the simplest, argument was: what’s the point of having user agents sniff garbage in the hope of finding content, and perhaps a character encoding declaration, when the transport protocol has a way of declaring it? This is the basis for the authoritative metadata principle. This principle is also sometimes summarized as: If I want to show an HTML document as plain text source, rather than have it interpreted by browsers, I should be able to do so. I should be able to serve any document as text/plain if that is my choice.

The second killer argument was transcoding. A lot of proxies, they said, transform the content they proxy, sometimes from one character encoding to another. So even though a document might say “I am encoded in iso-2022-jp”, the proxy should be able to say “actually, trust me, the content I am delivering to you is in utf-8”.

In the end, the apparent consensus was that the “server knows best” camp had the sound architectural arguments behind them, and so, for everything on the web served via the HTTP protocol, HTTP has precedence over any other method in determining the encoding (and content type, etc.) of resources.

This means that regardless of what is in an (x)html document, if the server says “this is a text/html document encoded as utf-8”, user agents should follow that information. Second-guessing is likely to cause more harm than good.
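As a minimal sketch of what "following the server" looks like in practice, the snippet below pulls the charset parameter out of a Content-Type header value using Python's standard email.message module. The function name charset_from_content_type is hypothetical, chosen for this illustration only.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None.

    Hypothetical helper: it only reads the parameter, it does not check
    that the charset name is one the user agent actually supports.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() handles quoting and lowercases the result.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=utf-8"))
    print(charset_from_content_type("text/html"))
```

When this returns a value, the recipe stops there; only a None result sends the user agent looking inside the document.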

Unlabeled boxes can be full of treasures, or full of trouble

But what if there is no character encoding declared at the HTTP level? This is where it gets tricky.

“Old school” HTML introduced a specific meta tag for the declaration of the encoding within the document:

<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5">

Over the years, we have seen that this method was plagued by two serious issues:

Its syntax. Nobody seems to get it right (it is just… too complicated!) and the Web is littered with approximate, sometimes comical, variants of this syntax. This is no laughing matter for user agents, however, which can’t even expect to find this encoding declaration properly marked up!

The meta element has to be within the head of a document, but there is no guarantee that it will be anywhere near the top of the document. The head of a document can have lots of other metadata, title, description, scripts and stylesheets, before declaring the encoding. This means a lot of sniffing and pseudo-parsing of undecoded garbage. In some cases, it can have dreadful consequences, such as security flaws in the approximate sniffing code.

It is worth noting that current work on html5 tries to work around these issues by providing a simpler alternate syntax, and making sure that the declaration of encoding should be present at the very beginning of the head.
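To make the "sniffing of undecoded garbage" concrete, here is a rough sketch of a byte-level scan for a meta encoding declaration. It is not the html5 algorithm, only an approximation: the pattern, the 1024-byte limit and the name sniff_meta_charset are all assumptions made for this example, and it can only work for ASCII-compatible encodings.

```python
import re

# Tolerant pattern: accepts both the "old school" http-equiv form and the
# shorter <meta charset="..."> form, in any letter case.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Scan the first `limit` undecoded bytes for a meta charset declaration.

    Returns the lowercased encoding name, or None if nothing was found.
    The limit is an arbitrary choice for this sketch.
    """
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


if __name__ == "__main__":
    page = b'<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5">'
    print(sniff_meta_charset(page))
```

Note how the scan has to guess at the markup before any decoding has happened, which is exactly the fragility described above.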

XML, on the other hand, had a way to declare encoding at the document level in the XML declaration. The good thing about that being that this declaration MUST be at the very beginning of the document, which alleviates the pain of having to sniff the content.

<?xml version="1.0" encoding="UTF-8"?>

The XML specification also defines, in its Appendix F, a recommended algorithm for the encoding detection.
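Because the XML declaration sits at a fixed position, reading it is much simpler than hunting for a meta element. The sketch below covers only the easy case of ASCII-compatible encodings with no byte order mark; the full Appendix F algorithm also handles UTF-16, UTF-32 and EBCDIC families, which this example deliberately ignores. The name xml_declared_encoding is hypothetical.

```python
import re

# Anchored at the very first byte: the XML declaration must start the document.
XML_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(raw):
    """Return the lowercased encoding named in the XML declaration, or None.

    Simplification: assumes the bytes are ASCII-compatible and that no
    byte order mark precedes the declaration.
    """
    match = XML_DECL.match(raw)
    return match.group(1).decode("ascii").lower() if match else None


if __name__ == "__main__":
    print(xml_declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?>\n<html/>'))
```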

The Recipe

Given all these potential sources for the declaration (or automatic detection) of the document character encoding, all potentially contradicting the others, what should be the recipe to reliably figure out which encoding to use?

1. The charset info in the HTTP Content-Type header should have precedence. Always.

2. Next in line is the charset information in the XML declaration, which may or may not be there. For XHTML documents, and in particular for XHTML documents served as text/html, it is recommended to avoid using an XML declaration. But let’s remember: XHTML is XML, and XML requires an XML declaration or some other method of declaration for XML documents using encodings other than UTF-8 or UTF-16 (or ASCII, which is a convenient subset…). As a result, anything served as application/xhtml+xml (or as text/html and looking a lot like XHTML), with no encoding declaration either at the HTTP level or in an XML declaration, is quite likely to be UTF-8 or UTF-16.

3. Then there is the BOM, a signature for Unicode character encodings.

4. Then comes the search for the meta information that might, just might, provide a character encoding declaration.

5. Beyond that point, it’s the land of defaults and heuristics. You may choose to default to iso-8859-1 for text/html resources, utf-8 for application/xhtml+xml. The rest is heuristics. You could venture towards fallback encodings such as windows-1252, which many consider a safe bet, but a bet nonetheless. There are quite a few algorithms to determine the likelihood of one specific encoding based on matching at byte level. Martin Dürst wrote a regexp to check whether a document will “fit” as utf-8. If you know other reliable algorithms, feel free to mention them in the comments, I will list them here.
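The order of precedence above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not a reference implementation: every function name here is hypothetical, the patterns only handle ASCII-compatible bytes, the meta scan is limited to an arbitrary 1024 bytes, and the utf-8 "fit" test is a plain trial decode standing in for the regexp mentioned above.

```python
import codecs
import re
from email.message import Message

# Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
XML_DECL = re.compile(rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")
META = re.compile(rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""", re.I)


def http_charset(content_type):
    """Charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def first_group(pattern, raw):
    """Lowercased first capture of `pattern` in `raw`, or None."""
    match = pattern.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


def bom_encoding(raw):
    """Encoding implied by a leading byte order mark, or None."""
    for signature, name in BOMS:
        if raw.startswith(signature):
            return name
    return None


def fits_utf8(raw):
    """Trial decode: a stand-in heuristic, not the regexp from the text."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(raw, content_type=None):
    """Return (encoding, source), trying each source in the order listed above."""
    candidates = [
        (http_charset(content_type), "http"),
        (first_group(XML_DECL, raw), "xml declaration"),
        (bom_encoding(raw), "bom"),
        (first_group(META, raw[:1024]), "meta"),
    ]
    for encoding, source in candidates:
        if encoding:
            return encoding, source
    # Land of defaults and heuristics from here on.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/xhtml+xml":
        return "utf-8", "default"
    if media_type == "text/html":
        return "iso-8859-1", "default"
    return ("utf-8" if fits_utf8(raw) else "windows-1252"), "heuristic"
```

Returning the source alongside the encoding is a design choice for this sketch: a checker can then warn when it had to fall back on a default or a heuristic rather than a real declaration.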

Does this seem really ugly and complicated to you? You will love the excellent Encoding Divination Flow Chart by Philip Semanchuk, the developer of the Web quality checker “Nikita the spider”.

Or, if this is still horribly fuzzy after looking at the flow chart, why not let a tool do that for you? The HTML::Encoding Perl module by Björn Höhrmann does just that.

Last word… for HTML authors

If you create content on the Web and never have to read and parse content on the web, and if you have read this far, you are probably considering yourself very lucky right now. But you can make a difference by making sure the content you put on the web uses consistent character encodings, and declares them properly. Your job is actually much easier than the tricky winding road to determining a document’s encoding. In the proverbial three steps: