Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

Unicode is a beautiful thing. Where in the past, we had to be content with text sequences of merely 128 different characters, we can now use over 110,000 characters from almost every script in use today or throughout history. But there’s a problem: despite the fact that Unicode has been around and relatively unchanged for the past 16 years, most modern programming languages do not give programmers a proper abstraction over Unicode strings, which means that we must carefully deal with a range of encoding issues or suffer a plethora of bugs. And there are still languages coming out today that fall into this trap. This blog post is a call to language designers: your fundamental string data type should abstract over Unicode encoding schemes, such that your programmer sees only a sequence of Unicode characters, thus preventing most Unicode bugs.

Over the past few years, I have encountered a surprising amount of resistance to this idea, with counter-arguments including performance reasons, compatibility reasons and issues of low-level control. In this post, I will argue that having a consistent, abstract Unicode string type is more important, and that programmers can be given more control in the special cases that require it.

To give a quick example of why this is a problem, consider the Hello World program on the front page of the Go programming language website. This demo celebrates Go’s Unicode support by using the Chinese word for “world”, “世界”:

fmt.Println("Hello, 世界")

But while Go ostensibly has Unicode strings, they are actually just byte strings, with a strong mandate that they be treated as UTF-8-encoded Unicode text. The difference is apparent if we wrap the string in the built-in len function:

fmt.Println(len("Hello, 世界"))

This code, perhaps surprisingly, prints 13, where one might have expected it to print 9. This is because while there are 9 characters in the string, there are 13 bytes if the string is encoded as UTF-8: in Go, the string literal “Hello, 世界” is equivalent to “Hello, \xe4\xb8\x96\xe7\x95\x8c”. Similarly, indexing into the string uses byte offsets, and retrieves single bytes instead of whole characters. This is a dangerous leaky abstraction: it appears to be a simple sequence of characters until certain operations are performed. Worse, English-speaking programmers have a nasty habit of only testing with ASCII input, which means that any bugs where the programmer has presumed to be dealing with characters (but is actually dealing with bytes) will go undetected.

In a language that properly abstracts over Unicode strings, the above string should have a length of 9. Programmers should only be exposed to encoding details when they explicitly ask for it.

The folly of leaky abstraction

The fundamental problem lies in the difference between characters and code units. Characters are the abstract entities that correspond to code points. Code units are the pieces that represent a character in the underlying encoding, and they vary depending on the encoding. For example, consider the character OSMANYA LETTER BA or ‘𐒁’ (with code point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we would say this character is represented by four code units in UTF-8. In UTF-16, the same character is represented by two code units: D801 DC81; code units are 16-bit numbers. The problem with many programming languages is that their string length, indexing and, often, iteration operations are in code units rather than in characters (or code points).

(An aside: I find it interesting that the Unicode Standard itself defines a “Unicode string” as “an ordered sequence of code units.” That’s code units, not code points, suggesting that the standard itself disagrees with my thesis that a string should be thought of as merely a sequence of code points. This section of the standard seems concerned with how the strings should be represented, not the programming interface for accessing such strings, so I feel that it doesn’t directly contradict this post. This post is about the string interface, not the underlying representation.)

There are three main types of culprit languages that exhibit this issue:

Languages with byte-oriented string operations and no specified encoding. These leave Unicode support up to programmers and third-party library authors, resulting in a general mish-mash of encoding across different code bases. These include C, Python 2, Ruby, PHP, Perl, Lua, and almost all pre-Unicode (1992) programming languages. Languages with code-unit-oriented string operations and no specified encoding. These are probably the worst offenders, as the language manages the encoding, but leaks abstraction details that the programmer has no control over (for example, different builds of the compiler may have different underlying encoding schemes, affecting the behaviour of programs). These include certain builds of Python 3.2 and earlier, and Mercury. Languages with code-unit-oriented string operations, but a specific encoding scheme. These languages at least have well-defined behaviour, but still expose the programmer to implementation details, which can result in bugs. Most modern languages fit into this category, including Go (UTF-8), JVM-based languages such as Java and Scala (UTF-16), .NET-based languages such as C# (UTF-16) and JavaScript (UTF-16).

Edit: A number of comments indicate that some of the other languages have optional Unicode support. In Perl 5, you can write “use utf8” to turn on proper Unicode strings. In Ruby 1.9, you can attach encodings to strings to make them behave better. In Python 2, as I’ll get to later, you have a separate Unicode string type.

Above, I talked about the leaky abstraction of UTF-8 in Go, but it is equally important to recognise the leaky abstraction of UTF-16 in many modern language frameworks; in particular, the Java platform (1996) and the .NET platform (2001). Both treat strings as a sequence of UTF-16 code units (and the corresponding “char” data type as a single UTF-16 code unit). This is disappointingly close to the ideal. In Java, for instance, you can essentially treat a string to be a sequence of characters. ‘e’ (U+0065) is a character, and ‘汉’ (U+6C49) is a character — it feels like the language is abstracting over Unicode characters, until you realise that it isn’t. ‘𐒁’ (U+10481) is not a valid Java character. It is an astral character (one that cannot be represented in a single UTF-16 code unit) — it needs to be represented by the two-code-unit sequence D801 DC81. So it is possible to store this character in a Java string, but not without understanding the underlying encoding details. For example, the Java string “𐒁” has length 2, and worse, if you ask for “charAt(0)” you will get the character U+D801; “charAt(1)” gives U+DC81. Any code which directly manipulates the characters of the string is likely to fail in the presence of astral characters. Java has the excuse of history: Java was first released in January 1996; Unicode 2.0 was released in July 1996, and before that, there were no astral characters (so for six months, Java had it right!). The .NET languages, such as C#, behave exactly the same, and don’t have the excuse of time. JavaScript is the same again. This is perhaps worse than the UTF-8 languages, because even non-English-speaking programmers are likely to forget about astral characters: there are no writing systems in use today that use astral characters. But astral characters include historical scripts of interest to historians, emoticons which may be used by software, and large numbers of mathematical symbols which are commonly used by mathematicians, including myself, so it is important that software gets them right.

I must stress how harmful the second class of languages are in terms of string encoding. In 2011, the Mercury language switched from a type 1 language (strings are just byte sequences) to a type 2 language — the language now specifies all of the basic string operations in terms of code units, but does not specify the encoding scheme. Worse, Mercury has several back-ends — if compiling to C or Erlang, it uses UTF-8, whereas compiling to Java or C# results in UTF-16-encoded strings. Now programmers must contend with the fact that string.length(“Hello, 世界”) might be 13 or 9 depending on the back-end. (Fortunately, character-oriented alternatives, such as count_codepoints, are provided.) This is a major blow to code portability, which I would recommend language designers avoid at all costs.

Python too has traditionally had this problem: the interpreter can be compiled in “UCS-2” or “UCS-4” modes, which represent strings as UTF-16 or UTF-32 respectively, and expose those details to the programmer (the UCS-2 build behaves much like the other UTF-16 languages, while the UCS-4 build behaves exactly as I want, with one character per code unit). Fortunately, Python is about to correct this little wart entirely with the introduction of PEP 393 in upcoming version 3.3. This version will remove UCS-2/UCS-4 build, so all future versions will behave as the UCS-4 build did, but with a nice optimisation: all strings with only Latin-1 characters are encoded in Latin-1; all strings with only basic multilingual plane (BMP) characters are encoded in UCS-2 (UTF-16); all strings with astral characters are encoded in UTF-32. (It’s more complicated than that, but that’s the gist of it.) This ensures the correct semantics, but allows for compact string representation in the overwhelmingly common case of BMP-only strings.

Somewhat surprisingly, Bash (1989) appears to support character-oriented Unicode strings. The only other languages I have found which properly abstract over Unicode strings are from the functional world: Haskell 98 (1998) and Scheme R6RS (2007). Haskell 98 specifies that “the character type Char is an enumeration whose values represent Unicode characters” (with a link to the Unicode 5.0 specification). Scheme R6RS was the first version of Scheme to mention Unicode, and got it right on the first go. It specifies that “characters are objects that represent Unicode scalar values,” and goes into details explicitly stating that a character value is in the range [0, D7FF 16 ] ∪ [E000 16 , 10FFFF 16 ], then defines strings as sequences of characters.

With only four languages that I know of properly providing character-oriented string operations by default (are there any more?), this is a pretty poor track record. Sadly, many modern languages are being built on top of either the JVM or .NET framework, and so naturally absorb the poor character handling of those platforms. Still, it would be nice to see some more languages that behave correctly.

Some real-world problems

So far, the discussion has been fairly academic. What are some of the actual problems that have come about as a result of the leaky abstraction?

Edit: I mention a heap of technologies here, in a way that could be interpreted to be derisive. I don’t mean any offense towards the creators of these technologies, but rather, I intend to point out how difficult it can be to work with Unicode on a language that doesn’t provide the right abstractions.

A memorable example for me was a program I worked on which sent streams of text from a Python 2 server (UTF-8 byte strings) to a JavaScript client (UTF-16 strings). We had chosen the arbitrary message size of 512 bytes, and were chopping up text arbitrarily on those boundaries, and sending them off encoded in UTF-8 to the client, where they were concatenated back together. This almost worked, but we discovered a strange problem: some non-ASCII characters were being messed up some of the time. It occurred if you happened to have a multi-byte character split across a 512-byte boundary. For example, “ü” (U+00FC) is encoded as hex C3 BC. If byte 511 of a message was C3 and byte 512 was BC, the first packet would be sent ending in the byte C3 (which JavaScript wouldn’t know how to convert to UTF-16), and the second packet would be sent beginning with the byte BC (which JavaScript also wouldn’t know how to convert to UTF-16). So the result is “��”. In general, if a language exposes its underlying code units, then strings may not be split, converted to a different encoding, and re-concatenated.

An obvious example of an operation that goes wrong in these situations is the length operator. If you use UTF-8, Chinese characters are going to each report a length of 3, while in UTF-16, astral characters will each report a length of 2. Think this doesn’t matter? What about Twitter? On Twitter, you have to type messages into 140 characters. That’s 140 characters, not 140 bytes. Imagine if Twitter ate up three “characters” every time you hit a key (if you spoke Chinese, for example). Fortunately, Twitter does it correctly, even for astral characters. I just tweeted the following:

𝐓𝐰𝐢𝐭𝐭𝐞𝐫 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐚𝐬𝐭𝐫𝐚𝐥 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐬 — 𝐞𝐚𝐜𝐡 𝐨𝐧𝐞 𝐜𝐨𝐮𝐧𝐭𝐬 𝐚𝐬 𝐚 𝐬𝐢𝐧𝐠𝐥𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫. 𝐈𝐭’𝐬 𝐧𝐨𝐭 𝐭𝐡𝐞 𝐬𝐢𝐳𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐭𝐰𝐞𝐞𝐭 𝐭𝐡𝐚𝐭 𝐜𝐨𝐮𝐧𝐭𝐬, 𝐢𝐭’𝐬 𝐡𝐨𝐰 𝐲𝐨𝐮 𝐮𝐬𝐞 𝐢𝐭.

Notice that the text is unusually bold — these are no ordinary letters. They are MATHEMATICAL BOLD letters, which are astral characters. For example, the ‘𝐓’ is U+1D413 MATHEMATICAL BOLD CAPITAL T. The tweet contains exactly 140 characters — 28 ASCII, 3 in the basic multilingual plane, and 109 astral. In UTF-16, it comprises 249 code units, while in UTF-8, it comprises 473 bytes. Yet it was accepted in Twitter’s 140 character limit, because they count characters. I would guess that Twitter’s counting routines are written in both Scala and JavaScript, which means that in both versions, the engineers would need to have specially considered surrogate pairs as a single character. Had Twitter been written in a more Unicode-aware language, they could just have called the length method and not given it another thought.

Let’s quickly look at the way Tomboy (a note taking app written in C# on the Mono/.NET platform) loads notes from XML files. I discovered a bug in Tomboy where the entire document would be messed up if you type an astral character, save, quit, and load. It turns out that one function is keeping an int “offset” which counts the number of “characters” (actually code units) read, and passes the offset to another function, which “returns the location at a particular character offset.” This one really means character. Needless to say, one astral character means that the character offset is too high, resulting in a complete mess. Why is it the programmer’s responsibility to deal with these concepts?

Other software I’ve found that doesn’t handle astral characters correctly includes Bugzilla (ironically, I discovered this when it failed to save my report of the above Tomboy bug) and FileFormat.info (an otherwise-fantastic resource for Unicode character information). The latter shows what can happen when a UTF-16 string is encoded to UTF-8 without a great deal of care, presumably due to an underlying language exposing a UTF-16 string representation.

A common theme here is programmer confusion. While these problems can be solved with enough effort, the leaky abstraction forces the programmer to expend effort that could be spent elsewhere. Importantly, it makes documentation more difficult. Any language with byte- or code-unit-oriented strings forces programmers to clarify what they mean every time they say “length” or “n characters”. Only when the language completely hides the encoding details are programmers free to use the phrase “n characters” unambiguously.

What about performance?

The most common objection to character-oriented strings is that it impacts the performance of strings. If you have an abstract Unicode string that uses UTF-8 or UTF-16, your indexing operation (charAt) goes from O(1) to O(n), as you must iterate over all of the variable-length code points. It is easier to let the programmer suffer an O(1) charAt that returns code units rather than characters. Alternatively, if you have an abstract Unicode string that uses UTF-32, you’ll be fine from a performance standpoint, but you’ve made strings take up two or four times as much space.

But let’s think about what this really means. The performance argument really says, “it is better to be fast than correct.” That isn’t something I’ve read in any software engineering book. Conventional wisdom is the opposite: 1. Be correct — in all cases, not just the common ones, 2. Be fast (here it is OK to optimise for the common cases). While performance is an important consideration, is unbelievable to me that designers of high-level languages would prioritise speed over correctness. (Again, you could argue that a code unit interface is not “incorrect,” merely “low level,” but hopefully this post shows that it frequently results in incorrect code.)

I don’t know exactly what the best implementation is, and I’m not here to tell you how to implement it. But for what it’s worth, I don’t consider UTF-32 to be a tremendous burden in this day and age (in moving to 64-bit platforms, we doubled the size of all pointers — why can’t we also double the size of strings?) Alternatively, Python’s PEP 393 shows that we can have correct semantics and optimise for common cases as well. In any case, we are talking about software correctness here — can we get the semantics right first, then worry about optimisation?

But it isn’t supposed to be abstract — that’s what libraries are for

The second most common objection to character-oriented strings is “but the language just provides the basic primitives, and the programmer can build whatever they like on top of that.” The problem with this argument is that it ignores the ecosystem of the language. If the language provides, for example, a low-level byte-string data type, and programmers can choose to use it as a UTF-8 string (or any other encoding of their choice), then there is going to be a problem at the seams between code written by different programmers. If you see a library function that takes a byte-string, what should you pass? A UTF-8 string? A Latin-1 string? Perhaps the library hasn’t been written with Unicode awareness at all, and any non-ASCII string will break. At best, the programmer will have written a comment explaining how you should encode Unicode characters (and this is what you need to do in C if you plan to support Unicode), but most programmers will forget most of the time. It is far better to have everybody on the same page by having the main string type be Unicode-aware right out of the box.

The ecosystem argument can be seen very clearly if we compare Python 2 and 3. Python 3 is well known for better Unicode support than its predecessor, but fundamentally, it is a rather simple difference: Python 2’s byte-string type is called “str”, while its Unicode string type is called “unicode”. By contrast, Python 3’s byte-string type is called “bytes”, while its Unicode string type is called “str”. They behave pretty much the same, just with different names. But in practice, the difference is immense. Whenever you call the str() function, it produces a Unicode string. Quoted literals are Unicode strings. All Python 3 functions, regardless of who wrote them, deal with Unicode, unless the programmer had a good reason to deal with bytes, whereas in Python 2, most functions dealt with bytes and often broke when given a Unicode string. The lesson of Python 3 is: give programmers a Unicode string type, make it the default, and encoding issues will mostly go away.

What about binary strings and different encodings?

Of course, programming languages, even high-level ones, still need to manipulate data that is actually a sequence of 8-bit bytes, that doesn’t represent text at all. I’m not denying that — I just see it as a totally different data structure. There is no reason a language can’t have a separate byte array type. UTF-16 languages generally do. Java has separate types “byte” and “char”, with separate semantics (for example, byte is a numeric type, whereas char is not). In Java, if you want a byte array, just say “byte[]”. Python 3, similarly, has separate “bytes” and “str” types. JavaScript recently added a Uint8Array type for this purpose (since JavaScript previously only had UTF-16 strings). Having a byte array type is not an excuse for not having a separate abstract string type — in fact, it is better to have both.

Similarly, a common argument is that text might come into the program in all manner of different encodings, and so abstracting away the string representation means programmers can’t handle the wide variety of encodings out there. That’s nonsense — of course even high-level languages need encoding functions. Python 3 handles this ideally — the string (str) data type has an ‘encode’ method that can encode an abstract Unicode string into a byte string using an encoding of your choice, while the byte string (bytes) data type has a ‘decode’ method that can interpret a byte string using a specified encoding, returning an abstract Unicode string. If you need to deal with input in, say, Latin-1, you should not allow the text to be stored in Latin-1 inside your program — then it will be incompatible with any other non-Latin-1 strings. The correct approach is to take input as a byte string, then immediately decode it into an abstract Unicode string, so that it can be mixed with other strings in your program (which may have non-Latin characters).

What about non-Unicode characters?

There’s one last thorny issue: it’s all well and good to say “strings should just be Unicode,” but what about characters that cannot be represented in Unicode? “Like what,” you may ask. “Surely all of the characters from other encodings have made their way into Unicode by now.” Well, unfortunately, there is a controversial topic called Han unification, which I won’t go into details about here. Essentially, some languages (notably Japanese) have borrowed characters from Chinese and changed their appearance over thousands of years. The Unicode consortium officially considers these borrowed characters as being the same as the original Chinese, but with a different appearance (much like a serif versus a sans-serif font). But many Japanese speakers consider them to be separate to the Chinese characters, and Japanese character encodings reflect this. As a result, Japanese speakers may want to use a different character set than Unicode.

This is unfortunate, because it means that a programming language designed for Japanese users needs to support multiple encodings and expose them to the programmer. As Ruby has a large Japanese community, this explains why Ruby 1.9 added explicit encodings to strings, instead of going to all-Unicode as Python did. I personally find that the advantages of a single high-level character sequence interface outweigh the need to change character sets, but I wanted to mention this issue anyway, by way of justifying Ruby’s choice of string representation.

What about combining characters?

I’ve faced the argument that my proposed solution doesn’t go far enough in abstracting over a sequence of characters. I am suggesting that strings be an abstract sequence of characters, but the argument has been put to me that characters (as Unicode defines them) are not the most logical unit of text — combining character sequences (CCSes) are.

For example, consider the string “𐒁̸”, which consists of two characters: ‘𐒁’ (U+10481 OSMANYA LETTER BA) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY). What is the length of this string?

Is it 6? That’s the number of UTF-8 bytes in the string (4 + 2).

Is it 3? That’s the number of UTF-16 code units in the string (2 + 1).

Is it 2? That’s the number of characters (code points) in the string.

Is it 1? That’s the number of combining character sequences in the string.

My argument is that the first two answers are unacceptable, because they expose implementation details to the programmer (who is then likely to pass them on to the user) — I am in favour of the third answer (2). The argument put to me is that the third answer is also unacceptable, because most people are going to think of “𐒁̸” as a single “character”, regardless of what the Unicode consortium thinks. I think that either the third or fourth answers are acceptable, but the third is a better compromise, because we can count characters in constant time (just use UTF-32), whereas it is not possible to count CCSes in constant time.

I know, I know, I said above that performance arguments don’t count. But I see a much less compelling case for counting CCSes over characters than for counting characters over code units.

Firstly, users are already exposed to combining characters, but they aren’t exposed to code units. We typically expect users to type the combining characters separately on a keyboard. For instance, to type the CCS “e̸”, the user would type ‘e’ and then type the code for ‘◌̸’. Conversely, we never expect users to type individual UTF-8 or UTF-16 code units — the representation of characters is entirely abstract from the user. Therefore, the user can understand when Twitter, for example, counts “e̸” as two characters. Most users would not understand if Twitter were to count “汉” as three characters.

Secondly, dealing with characters is sufficient to abstract over the encoding implementation details. If a language deals with code units, a program written for a UTF-8 build may behave differently if compiled with a UTF-16 string representation. A language that deals with characters is independent of the encoding. Dealing with CCSes doesn’t add any further implementation independence. Similarly, dealing with characters is enough to allow strings to be spliced, re-encoded, and re-combined on arbitrary boundaries; dealing with CCSes isn’t required for this either.

Thirdly, since Unicode allows arbitrarily long combining character sequences, dealing with CCSes could introduce subtle exploits into a lot of code. If Twitter counted CCSes, a user could write a multi-megabyte tweet by adding hundreds of combining characters onto a single base character.

In any case, I’m happy to have a further debate about whether we should be counting characters or CCSes — my goal here is to convince you that we should not be counting code units. Lastly, I’ll point out that Python 3 (in UCS-4 build, or from version 3.3 onwards) and Haskell both consider the above string to have a length of 2, as does Twitter. I don’t know of any programming languages that count CCSes. Twitter’s page on counting characters goes into some detail about this.

Conclusion

My point is this: the Unicode Consortium has provided us with a wonderful standard for representing and communicating characters from all of the world’s scripts, but most modern languages needlessly expose the details of how the characters are encoded. This means that all programmers need to become Unicode experts to make high-quality internationalised software, which means it is far less likely that programs will work when non-ASCII characters, or even worse, astral characters, are used. Since most English programmers don’t test with non-ASCII characters, and almost nobody tests with astral characters, this makes it pretty unlikely that software will ever see such inputs before it is released.

Programming language designers need to become Unicode experts, so that regular programmers don’t have to. The next generation of languages should provide character-oriented string operations only (except where the programmer explicitly asks for text to be encoded). Then the rest of us can get back to programming, instead of worrying about encoding issues.