Repeat after me: Unicode is not UTF-\d{1,2}



June 22, 2009, 5:23 am · Filed under: Uncategorized

Stop. Unicode is not UTF-16, Unicode is not UTF-32, and Unicode is not UTF-8. Stop it.

This is just plain wrong. Unicode is a standard, and UTF-(8|16|32) are character encodings that encode Unicode data into byte strings. UTF-16 is capable of encoding the whole Unicode repertoire, but UTF-16 is not Unicode, and Unicode is not UTF-16.

And no, it’s not almost the same. It’s nowhere close. Think about it: you take some beginners and introduce them to programming. After a while they get the pleasure of being introduced to “Ãœ”, “Ã¶”, “?” and their friends, and it’s time for them to learn about the huge Unicode monster. You point them to a great article, wait a bit, and hold a little Q&A session.

Inevitably, you will get strange formulations such as “I see, I will encode everything to Unicode”, or “but how can I tell if this string is Unicode-encoded?”, or better yet, “all this time I’ve been using UTF-8 without knowing that it was in fact Unicode magic?”. Well, if you think that these approximate formulations are fine, and that you should not correct your beginner programmers, you’re just doing it wrong.

A byte string cannot be Unicode-encoded. It cannot. You either work with encoded byte strings, or with Unicode data (a sequence of code points). But you can’t “Unicode-encode”; it’s nonsense.

Similarly you cannot encode a string to Unicode. You cannot. You decode a byte string from a source character set to Unicode data: code-point sequences.

Repeat after me:

when transforming data from Unicode to a specific character set, you are encoding Unicode code points into bytes, according to a specific character encoding that suits your target character set.

when transforming from (byte) strings to Unicode, you are decoding your data: if you fail to provide the right character encoding to decode from, you will end up with borked Unicode data.
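In Python 3 terms, the two directions look like this (a minimal sketch; the string and the encodings are arbitrary examples):

```python
# Unicode data: a sequence of code points, no bytes involved yet.
text = "héllo"

# Encoding: Unicode code points -> bytes, per a chosen character encoding.
as_utf8 = text.encode("utf-8")      # b'h\xc3\xa9llo'
as_latin1 = text.encode("latin-1")  # b'h\xe9llo'

# Decoding: bytes -> Unicode code points. The encoding must match the source.
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("latin-1") == text

# Decoding with the wrong encoding yields borked Unicode data (mojibake).
assert as_utf8.decode("latin-1") == "hÃ©llo"
```

Note that the last line is exactly where “Ãœ” and “Ã¶” come from: UTF-8 bytes decoded as Latin-1.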

It’s not “okay” to let beginners work with an approximate understanding of the Unicode buzzword. They will eventually get confused, and you will end up losing time re-explaining the same things over and over. Approximate formulations reflect approximate knowledge, and you should not let that stand. Approximate Unicode knowledge is the blocker, the main reason why everything we do is not (yet) in Unicode.

Because of these kinds of approximations, we had broken Unicode support in Python until Python 3.0: before that, Unicode data and byte strings derived from a common base class and could be mixed freely. Because of these kinds of approximations, we have hundreds of beginners who do not understand the difference between UTF-8 and Unicode, and who do not understand why string.encode('utf-8') can throw a decoding error: you see, you just said that it was okay to “Unicode-encode”, and that UTF-8 is Unicode, so basically they are trying to “encode” strings as… Unicode, and the fact that it fails just puzzles them, because Unicode was supposed to be the magical remedy to all their encoding errors.
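Python 3 makes the distinction enforceable: str holds Unicode code points, bytes holds encoded data, and neither pretends to be the other (a quick sketch):

```python
text = "é"                    # Unicode data (str): code points
data = text.encode("utf-8")   # encoded byte string (bytes)

# The types are disjoint: you can encode str and decode bytes, nothing else.
assert isinstance(text, str) and isinstance(data, bytes)
assert not hasattr(data, "encode")  # bytes cannot be "encoded" again
assert not hasattr(text, "decode")  # str cannot be "decoded" to anything
```

In Python 2, by contrast, calling .encode on a byte string would first *implicitly decode* it as ASCII, which is exactly the confusing decoding error described above.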

Because of these approximations, the .NET property Encoding.Unicode is the shortcut for what should be Encoding.UTF16 . There are Encoding.UTF8 , Encoding.UTF32 , and Encoding.ASCII , and in the middle of those… Encoding.Unicode . How can developers write such wrong things? Unicode is not an encoding; UTF-16 is not Unicode. Just look at the wonderful C# construct Encoding u16LE = Encoding.Unicode; taken directly from the documentation: congratulations, you are assigning a “Unicode” value to an “Encoding” type. Crystal clear.

A good image, perhaps, to explain the fundamental type difference between Unicode and, let’s say, UTF-16, would be to think of Unicode as an interface, and of UTF-16 as a concrete class implementing that interface.

Unicode itself does not define any implementation: it defines no data representation, only an international, unequivocal way to associate a character with a code point. You could store a list of code points to represent Unicode data, yes, but doing so forces you to store 4 bytes per character because of the large code-point range. This is rather inefficient, and this is why the UTF-* encodings appeared. The whole idea is to map Unicode data to byte strings, choosing the mapping function yourself, so that the resulting representation fits your needs. In a way, you have many different strategies implementing the same interface, depending on your focus:
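One way to see the “interface” side of this, sketched in Python: code points exist independently of any byte representation.

```python
# A code point is just a number the Unicode standard assigns to a character.
assert ord("A") == 0x41
assert ord("é") == 0xE9
assert chr(0x2603) == "☃"  # U+2603 SNOWMAN

# A str is a sequence of code points, not of bytes: its length is the same
# no matter which encoding you might later choose for serialization.
assert len("A☃é") == 3
```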

UTF-8 focuses on minimizing the byte size for representing characters from the ASCII set (variable-length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:

“Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings

UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping the Unicode code point directly onto 4 bytes. Obviously, it’s not very size-efficient.

UTF-16 is a compromise: it uses 2 bytes most of the time, but expands to 2 × 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).
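These tradeoffs can be observed directly (a sketch; the explicit -le codec variants are used so that no BOM bytes are counted):

```python
def sizes(ch):
    """Byte length of one character under each UTF-* encoding."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

assert sizes("a") == (1, 2, 4)  # ASCII: UTF-8 wins
assert sizes("é") == (2, 2, 4)  # non-ASCII, still in the BMP
assert sizes("☃") == (3, 2, 4)  # BMP: UTF-16 can even beat UTF-8
assert sizes("𝄞") == (4, 4, 4)  # outside the BMP: UTF-16 needs a surrogate pair
```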

I’m not saying that every developer should know all of these details; I’m just saying that you should know about that very basic interface/implementation difference: Unicode is just an intermediate representation, not an actual encoding. I find it awful that .NET is so misleading here. Also, UTF-(8|16|32) are the most famous “implementations”, but there are several other standards, such as UTF-EBCDIC.

If you’re an experienced programmer, please do not allow those approximations. If a code “guru” writes those sorts of shortcuts, how will habits around Unicode ever change? Unicode is not black magic; you just have to be precise, and people will understand. Please use the right words. Over-simplifying concepts is not helping. Thanks.