Good is the enemy of Great

Latin-1 is the enemy of UTF-8

You write web apps. You understand the web is global, and want to support internationalization. You want UTF-8.

UTF-8 is extremely sane. Well, as sane as an encoding can be that features backwards-compatibility with ASCII.

Everything you care about supports UTF-8. Trust me: you want it everywhere.

Problem is, every last part of the web-application stack will fight you on your quest towards UTF-8 purity. What follows is a playbook to win your pervasive-UTF-8 battle.

First, you’re going to need diagnostic tools. There are two main weapons:

A hex editor and traffic dumper. The programs you use to view text, be it dynamic output from a tool (Console.app) or a static file like a database dump (TextEdit, BBEdit, TextMate), have encoding logic. They will attempt to auto-detect the encoding and paint you a pretty picture. Avoid them. When debugging, you don’t want a pretty picture, you want The Truth. You need to be able to see raw byte-streams to debug this stuff. A common problem is mixed encodings: a file or stream that says it’s UTF-8 but has a chunk of Latin-1 in it. This is invisible corruption, since most software won’t alert you when it hits mixed encodings (BBEdit is a notable exception). Using a hex editor or viewing raw hex streams lets you spot when a character that should take up two or three bytes (UTF-8) is only taking one (Latin-1).

A Unicode Canary-in-a-Coal-Mine. You need a chunk of data that exercises the Unicode system: a sentinel value that you can push through your stack to make sure it survives a round-trip intact. Initially I went with something like “tésting”, but it turns out that’s not enough: it will losslessly survive undesired transcoding to Latin-1 and back again. No, you need something hard-core: “Iñtërnâtiônàlizætiøn” (complete with curly quotes). (If you can’t read that word in your browser, it looks like the word “Internationalization” that’s had an umlaut omelet thrown in its face, and you’ve discovered yet another encoding error somewhere between where I’m typing this and where you’re reading it.) “Iñtërnâtiônàlizætiøn” is a great word to push through your systems because, thanks to those curly quotes, it can’t be represented in Latin-1 and will catch all sorts of hidden failure scenarios. Coupled with viewing raw hex, there’s no place for encoding bugs to hide. (For the record, “Iñtërnâtiônàlizætiøn” looks like E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D in UTF-8 in hex.)
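If you don't have a hex editor handy, both weapons can be roughed out in a few lines of Python. This is only a sketch: the helper names (hex_dump, fits_in_latin1, is_valid_utf8) are made up for illustration and don't come from any library, and the canary is spelled with escapes so the source file's own encoding can't get in the way.

```python
# A minimal sketch of the two diagnostic "weapons". The helper names are
# hypothetical; only the standard str.encode / bytes.decode calls are real.

# The canary word, curly quotes included: “Iñtërnâtiônàlizætiøn”
CANARY = "\u201cI\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n\u201d"


def hex_dump(data: bytes) -> str:
    """Return raw bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def fits_in_latin1(text: str) -> bool:
    """True if the text could be transcoded to Latin-1 without any error."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte stream decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Weapon 1: look at the raw bytes, not a pretty picture.
    print(hex_dump(CANARY.encode("utf-8")))

    # Weapon 2: a weak canary slips through Latin-1 unnoticed, a strong one does not.
    print(fits_in_latin1("t\u00e9sting"))  # the weak canary "tésting"
    print(fits_in_latin1(CANARY))

    # A stream that claims to be UTF-8 but has a Latin-1 chunk spliced in.
    mixed = "t\u00e9sting ".encode("utf-8") + "t\u00e9sting".encode("latin-1")
    print(hex_dump(mixed))
    print(is_valid_utf8(mixed))
```

In the mixed stream's hex dump, the first é shows up as C3 A9 (UTF-8) and the second as a lone E9 (Latin-1), which is exactly the kind of thing a text editor will happily paper over.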

※ ※ ※

OK, those are your weapons. Now for some concrete tips, starting from the bottom and working up: