It’s Not Wrong that "🤦🏼‍♂️".length == 7 But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5

From time to time, someone shows that in JavaScript the .length of a string containing an emoji results in a number greater than 1 (typically 2) and then proceeds to the conclusion that haha JavaScript is so broken—and is rewarded with many likes. In this post, I will try to convince you that ridiculing JavaScript for this is less insightful than it first appears and that Swift’s approach to string length isn’t unambiguously the best one. Python 3’s approach is unambiguously the worst one, though.

What’s Going on with the Title?

"🤦🏼‍♂️".length == 7 evaluates to true as JavaScript. Let’s try JavaScript console in Firefox:

"🤦🏼‍♂️".length == 7 true

Haha, right? Well, you’ve been told that the Python community suffered the Python 2 vs. Python 3 split, among other things, to Get Unicode Right. Let’s try Python 3:

$ python3 Python 3.6.8 (default, Jan 14 2019, 11:02:34) [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux Type "help", "copyright", "credits" or "license" for more information. >>> len("🤦🏼‍♂️") == 5 True >>>

OK, then. Now, Rust has the benefit of learning from languages that came before it. Let’s try Rust:

$ cargo new -q length $ cd length $ echo 'fn main() { println!("{}", "🤦🏼‍♂️".len() == 17); }' > src/main.rs $ cargo run -q true

That’s better!

What?

The string contains a single emoji consisting of five Unicode scalar values:

Unicode scalar UTF-32 code units UTF-16 code units UTF-8 code units UTF-32 bytes UTF-16 bytes UTF-8 bytes U+1F926 FACE PALM 1 2 4 4 4 4 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 1 2 4 4 4 4 U+200D ZERO WIDTH JOINER 1 1 3 4 2 3 U+2642 MALE SIGN 1 1 3 4 2 3 U+FE0F VARIATION SELECTOR-16 1 1 3 4 2 3 Total 5 7 17 20 14 17

The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skintone modifier the changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.

Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings store Unicode code points each of which is stored as one code unit by CPython 3, so the string occupies 5 code units. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. We’ll come to back to the actual storage as opposed to semantics later.

Note about Python 3 added on 2019-09-09: Originally this article claimed that Python 3 guaranteed UTF-32 validity. This was in error. Python 3 guarantees that the units of the string stay within the Unicode code point range but does not guarantee the absence of surrogates. It not only allows unpaired surrogates, which might be explained by wishing to be compatible with the value space of potentially-invalid UTF-16, but Python 3 allows materializing even surrogate pairs, which is a truly bizarre design. The previous conclusions stand with the added conclusion that Python 3 is even more messed up than I thought! With the way the example string was constructed in Python 3, the Python 3 string happens to match the valid UTF-32 representation of the string, so it is still illustrative of UTF-32, but the rest of the article has been slightly edited to avoid claiming that Python 3 used UTF-32.

But I Want the Length to Be 1!

There’s a language for that. The following used Swift 4.2.3, which was the latest release when I was researching this, on Ubuntu 18.04:

$ mkdir swiftlen $ cd swiftlen/ $ swift package init -q --type executable $ swift package init --type executable Creating executable package: swiftlen Creating Package.swift Creating README.md Creating .gitignore Creating Sources/ Creating Sources/swiftlen/main.swift Creating Tests/ Creating Tests/LinuxMain.swift Creating Tests/swiftlenTests/ Creating Tests/swiftlenTests/swiftlenTests.swift Creating Tests/swiftlenTests/XCTestManifests.swift $ echo 'print("🤦🏼‍♂️".count == 1)' > Sources/swiftlen/main.swift $ swift run swiftlen 2>/dev/null true

(Not using the Swift REPL for the example, because it does not appear to accept non-ASCII input on Ubuntu! Swift 5.0.3 prints the same and the REPL is still broken.)

OK, so we’ve found a language that thinks the string contains one countable unit. But what is that countable unit? It’s an extended grapheme cluster. (“Extended” to distinguish from the older attempt at defining grapheme clusters now called legacy grapheme clusters.) The definition is in Unicode Standard Annex #29 (UAX #29).

The Lengths Seen So Far

We’ve seen four different lengths so far:

Number of UTF-8 code units (17 in this case)

Number of UTF-16 code units (7 in this case)

Number of UTF-32 code units or Unicode scalar values (5 in this case)

Number of extended grapheme clusters (1 in this case)

Given a valid Unicode string and a version of Unicode, all of the above are well-defined and it holds that each item higher on the list is greater or equal than the items lower on the list.

One of these is not like the others, though: The first three numbers have an unchanging definition for any valid Unicode string whether it contains currently assigned scalar values or whether it is from the future and contains unassigned scalar values as far as software written today is aware. Also, computing the first three lengths does not involve lookups from the Unicode database. However, the last item depends on the Unicode version and involves lookups from the Unicode database. If a string contains scalar values that are unassigned as far as the copy of the Unicode database that the program is using is aware, the program will potentially overcount extended grapheme clusters in the string compared to a program whose copy of the Unicode database is newer and has assignments for those scalar values (and some of those assignments turn out to be combining characters).

More Than One Length per Programming Language

It is not the case that a given programming language has to choose only one of the above. If we run this Swift program:

var s = "🤦🏼‍♂️" print(s.count) print(s.unicodeScalars.count) print(s.utf16.count) print(s.utf8.count)

it prints:

1 5 7 17

Let’s try Rust with unicode-segmentation = "1.3.0" in Cargo.toml :

use unicode_segmentation::UnicodeSegmentation; fn main() { let s = "🤦🏼‍♂️"; println!("{}", s.graphemes(true).count()); println!("{}", s.chars().count()); println!("{}", s.encode_utf16().count()); println!("{}", s.len()); }

The above program prints:

2 5 7 17

That’s unexpected! It turns out that unicode-segmentation does not implement the latest version of the Unicode segmentation rules, so it gives the ZERO WIDTH JOINER generic treatment (break right after ZWJ) instead of the newer refinement in the emoji context.

Let’s try again, but this time with unic-segment = "0.9.0" in Cargo.toml :

use unic_segment::Graphemes; fn main() { let s = "🤦🏼‍♂️"; println!("{}", Graphemes::new(s).count()); println!("{}", s.chars().count()); println!("{}", s.encode_utf16().count()); println!("{}", s.len()); }

1 5 7 17

In the Rust case, strings (here mere string slices) know the number of UTF-8 code units they contain. The len() method call just returns this number that has been stored since the creation of the string (in this case, compile time). In the other cases, what happens is the creation of an iterator and then instead of actually examining the values (string slices correspoding to extended grapheme clusters, Unicode scalar values or UTF-16 code units) that the iterator would yield, the count() method just consumes the iterator and returns the number of items that were yielded by the iteration. The count isn’t stored anywhere on the string (slice) afterwards. If we wanted to later know the counts again, we’d have to iterate over the string again.

Know in Advance or Compute When Needed?

This introduces a notable question in the design space: Should a given type of length quantity be eagerly computed when the string is created? Or should the length be computed when someone asks for it? Or should it be computed when someone asks for it and then automatically stored on the string object so that it’s available immediately if someone asks for it again?

The answer Rust has is that the length in the code units of the Unicode Encoding Form of the language is stored upon string creation, and the rest are computed when someone asks for them (and then forgotten and not stored on the string).

Swift is a higher-level language and doesn’t document the exact nature of its string internals as part of the API contract. In fact, the internal representation of Swift strings changed substantially between Swift 4.2 and Swift 5.0. It’s not documented if different views to the string are held onto once created, for example. The documentation does say that strings are copy-on-write, so the first mutation may involve copying the string’s storage.

Notably, the design space includes not remembering anything. The C programming language is a prominent example of this case. C strings don’t even remember their number of code units. To find out the number of code units, you have to iterate over the string until a sentinel value. In the case of C, the sentinel is the code unit for U+0000, so it excludes one Unicode scalar value from the possible string contents. However, that’s not a strictly necessary property of a sentinel-based design that doesn’t remember any lengths. 0xFF does not occur as a code unit in any valid UTF-8 string and 0xFFFFFFFF does not occur in any valid UTF-32 string, so they could be used as sentinels for UTF-8 and UTF-32 storage, respectively, without excluding a scalar value from the Unicode value space. There is no 16-bit value that never occurs in a valid UTF-16 string. However, a valid UTF-16 string does not contain unpaired surrogates, so an unpaired low surrogate could, in principle, be used as a sentinel in a design that wanted to use guaranteed-valid UTF-16 strings that don’t remember their code unit length.

Knowing the Storage-Native Code Unit Length is Extremely Reasonable

The length of the string as counted in code units of its storage-native Unicode Encoding Form (i.e. whichever of UTF-8, UTF-16, and UTF 32 the programming language has chosen for its string semantics) is not like the other lengths. It is the length that the implementation cannot avoid having to know at the time of creating a new string, because it is the length that is required to be known in order to be able to allocate storage for a string. Even C, which promptly forgets about the code unit length in the storage-native Unicode Encoding Form after string has been created, has to know this length when allocating storage for a new string.

That is, the design decision is about whether to remember this length. It is not about whether to compute it eagerly. You just have to have it at string creation time—i.e. eagerly.

Considering that remembering this quantity makes string concatenation, which is a common operation, substantially faster to implement compared to not remembering this quantity, remembering this quantity is fundamentally reasonable. Also, it means that you don’t need to maintain a sentinel value, which means that a substring operation can yield results that share the buffer with the original string instead of having to copy in order to be able to insert sentinel. (Note that you can easily foil this benefit if you wish to eagerly maintain zero-termination for the sake of C string compatibility.)

What About Knowing the Other Lengths?

Even if we’ve established that it makes sense for string implementation to remember the storage length of the string in code units all the storage-native Unicode encoding form, it doesn’t answer whether a string implementation should also remember other lengths or which kind of length should be offered in the most ergonomic API. (As we see above, Swift makes the number of extended grapheme clusters more ergonomic to obtain that the code unit or scalar value length.)

Also, if any other length is to be remembered, there is the question of whether it should be eagerly computed as string creation time or lazily computed the first time someone asks for it. It is easy to see why at least the latter does not make sense for multi-threaded systems-programming language like Rust. If some properties of an object are lazily initialized, in a multi-threaded case you also need to solve synchronization of these computations. Furthermore, you need to allocate space at least for a pointer to auxiliary information if you want to be able to add auxiliary information later or you need to have a hashtable of auxiliary information where the string the information is about is the key, so auxiliary information, even when not present, has storage implications or implications of having to have global state in a run-time system. Finally, for systems programming, it may be more desirable to know the time complexity of a given operation clearly even if it means “always O(n)” instead of “possibly O(n) but sometimes O(1)”. Even if the latter looks strictly better, it is less predictable.

For a higher-level language, arguments from space requirements or synchronization issues might not be decisive. It’s more relevant to consider what a given length quantity is used for. This is often forgotten in Internet debates that revolve around what length is the most “correct” or “logical” one. So for the lengths that don’t map to the size of storage allocation, what are they good for?

It turns out that in the Firefox code base there are two places where someone wants to know the number of Unicode scalar values in a string that is not being stored as UTF-32 and attention is not paid to what the scalar values actually are. The IETF specification for Session Traversal Utilities for NAT (STUN) used for WebRTC has the curious property that it places length limits on certain protocol strings such that the limits are expressed as number of Unicode scalar values but the strings are transmitted in UTF-8. Firefox validates these limits. (The limit looks like an arbitrary power-of-two (128 scalar values). The spec has remarks about the possible resulting byte length, which was wrong according to the IETF UTF-8 RFC that was current and already nearly five years old at the time of publication of the STUN RFC. Specifically, the STUN RFC repeatedly says that 128 characters as UTF-8 may be as long as 763 bytes. To arrive at that number, you have to assume that a UTF-8 character can be up to six bytes long, as opposed to up to 4 bytes long as in the prevailing UTF-8 RFC and in the Unicode Standard, and that the last character of the 128 is a zero terminator and, therefore, known to take just one byte.) In this case, the reason for wishing to know a non-storage length is to impose a limit. The other case is reporting the column number for the source location of JavaScript errors.

Length limits, which we’ll come back to, probably aren’t a frequent enough a use case to justify making strings know a particular kind of length as opposed to such length being possible to compute when asked for. Neither are error messages.

Another use case for asking for a length is iterating by index and using the length as the loop termination condition 1990s Java style. Like this:

for (int i = 0; i < s.length(); i++) { // Do something with s.charAt(i) }

In this case, it’s actually important for the length to be precomputed number on the string object. This use case is coupled with the requirement that indexing into the string to find the nth unit corresponding to the count of units that the “length” represents should be a fast operation.

The above pattern is a lot less conclusive in terms of what lengths should be precomputed (and what the indexing unit should be) than it first appears. The above loop doesn’t do random access by index. It sequentially uses every index from zero up to, but not including, length . Indeed, especially when iterating over a string by Unicode scalar value, typically when you examine the contents of a string, you iterate over the string in order. Programming languages these days provide an iterator facility for this, and e.g. to iterate over a UTF-8 string by scalar value, the iterator does not need to know the number of scalar values up front. E.g. in Rust, you can do this in O(n) time despite string slices not knowing their number of Unicode scalar values:

for (c in s.chars()) { // Do something with c }

(Note that char is an 8-bit code unit (possibly UTF-8 code unit) in C and C++, char is a UTF-16 code unit in Java, char is a Unicode scalar value in Rust, and Character is an extended grapheme cluster in Swift.)

A programming language together with its library ecosystem should provide iteration over a string by Unicode scalar value and by extended grapheme cluster, but it does not follow that strings would need to know the scalar value length or the extended grapheme cluster length up front. Unlike the code unit storage length, those quantities aren’t useful for accelerating operations like concatenation that don’t care about the exact content of the string.

Which Unicode Encoding Form Should a Programming Language Choose?

The observation that having strings know their code unit length in their storage-native Unicode encoding form is extremely reasonable does not answer how many bits wide the code units should be.

The usual way to approach this question is to argue that UTF-32 is the best, because it provides O(1) indexing by “character” in the sense of a character meaning a Unicode scalar value, or the argument focuses on whether UTF-8 is unfair to some languages relative to UTF-16. I think these are bad ways to approach this question.

First of all, the argument that the answer should be UTF-32 is bad on two counts. First, it assumes that random access scalar value is important, but in practice it isn’t. It’s reasonable to want to have a capability to iterate over a string by scalar value, but random access by scalar value is in the YAGNI department. Second, arguments in favor of UTF-32 typically come at a point where the person making the argument has learned about surrogate pairs in UTF-16 but has not yet learned about extended grapheme clusters being even larger things that the user perceives as unit. That is, if you escape the variable-width nature of UTF-16 to UTF-32, you pay by doubling the memory requirements and extended grapheme clusters are still variable-width.

I’ll come back to the length fairness issue later, but I think a different argument is much more relevant in practice for the choice of in-memory Unicode encoding form. The more relevant argument is this: Implementations that choose UTF-8 actually accept the UTF-8 storage requirements. When wider-unit semantics are chosen for a language that doesn’t provide raw memory access and, therefore, has the opportunity to tweak string storage, the implementations try to come up with ways to avoid actually paying the cost of the wider units in some situations.

JavaScript and Java strings have the semantics of potentially-invalid UTF-16. SpiderMonkey and V8 implement an optimization for omitting the leading zeros of each code unit in a string, i.e. storing the string as ISO-8859-1 (the actual ISO-8859-1, not the Web notion of “ISO-8859-1” as a label of windows-1252), when all code units in the string have zeros in the most-significant half. The HotSpot JVM also implements this optimization, though enabling it is optional. Swift 4.2 implements a slightly different variant of the same idea, where ASCII-only strings are stored as 8-bit units and everything else is stored as UTF-16. CPython since 3.3 makes the same idea three-level with code point semantics: Strings are stored with 32-bit code units if at least one code point has a non-zero bit above the low 16 bits. Else if a string has a non-zero bits above the low 8 bits for at least one code point, the string is stored as 16-bit units. Otherwise, the string is stored as 8-bit units (Latin1).

I think the unwillingness of implementations of languages that have chosen UTF-16 or UTF-32 (or UTF-32-ish as in the case of Python 3) string semantics to actually use UTF-16 or UTF-32 storage when they can get away with not using actual UTF-16 or UTF-32 storage is the clearest indictment against UTF-16 or UTF-32 (and other wide-unit semantics like what Python 3 uses).

Languages that choose UTF-8, on the other hand, stick to actual UTF-8 for the purpose of storing Unicode scalar values. When languages that choose UTF-8 deviate from UTF-8, they do so in order to represent values that are not Unicode scalar values for compatibility with external constraints. Rust uses a representation called WTF-8 for file system paths on Windows. All UTF-8 strings are WTF-8 strings, but WTF-8 can also represent unpaired surrogates for compatibility with Windows file paths being sequences of 16-bit units that can contain unpaired surrogates. Perl 6 uses an internal representation called UTF-8 Clean-8 (or UTF8-C8), which represents strings that consist of Unicode scalar values in Unicode Normalization Form C the same way as UTF-8 but represents non-NFC content differently and can represent sequences of bytes that are not valid UTF-8.

UTF-8 is the only one of the Unicode encoding forms that is also a Unicode encoding scheme, and of the Unicode encoding schemes, UTF-8 has clearly won for interchange. (Unicode encoding forms are what you have in RAM, so UTF-16 consists of native-endian, two-byte-aligned 16-bit code units. Unicode encoding schemes are what can be used for byte-oriented interchange, so e.g. UTF-16LE consist of 8-bit code units every pair of which form a potentially-unaligned little-endian 16-bit number, which in turn may form a surrogate pair.) When UTF-8 is used as the in-RAM representation, input and output operations are less expensive than with UTF-16 or UTF-32. UTF-16 or UTF-32 in RAM requires conversion from UTF-8 when reading input and conversion to UTF-8 when writing output. A system that guarantees UTF-8 validity internally, such as Rust, needs only to validate UTF-8 upon reading input and no conversion is needed when writing output. (Go takes a garbage in, garbage out approach to UTF-8: input is not validated at input time and output is written without conversion. However, iteration by scalar value can yield REPLACEMENT CHARACTERs when iterating over invalid UTF-8. That is, the input step is less expensive than in Rust, but iterating by scalar value is marginally more expensive. The output step is less correct.)

Finally, in terms of nudging developers to write correct code, UTF-8 has the benefit of being blatantly variable-width, so even with languages such as English, Somali, and Swahili, as soon as you have a dash or a smart quote, the variable-width nature of UTF-8 shows up. In this context, extended grapheme clusters are just extending the variable-width nature. Meanwhile, UTF-16 allows programmers to get too far while pretending to be working with something where the units they need to care about are fixed-width. Reacting to surrogate pairs by wishing to use UTF-32 instead is a bad idea, because if you want to write correct software, you still need to deal with variable-width extended grapheme clusters.

The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing. The choice of UTF-16 is a matter of early-adopter legacy from the time when Unicode was expected to be capped to 16 bits of code space and, once UTF-16 has been committed to, not breaking compatibility with already-written programs is important and justified the continued use of UTF-16, but if you aren’t bound by that legacy and are designing a new language, you should go with UTF-8. Occasionally even systems that appear to be bound by the UTF-16 legacy can break free. Even though Swift is committed to interoperability with Cocoa, which uses UTF-16 strings, Swift 5 switched to UTF-8 for Swift-native strings. Similarly, PyPy has gone UTF-8 despite Python 3 having code point semantics.

Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?

Even if we accept that the storage should be UTF-8 and that the string implementation should maintain knowledge of the string length in UTF-8 code units, if the blatant variable-widthness of UTF-8 is argued to be a nudge toward dealing with the variable-widthness of extended grapheme clusters, shouldn’t the Swift approach of making extended grapheme cluster access and count the view that takes the least ceremony to use be the thing that every language should do?

Swift is still too young to draw definitive conclusions from. It’s easy to believe that the Swift approach nudges programmers to write more extended grapheme cluster-correct code and that the design makes sense for a language meant primarily for UI programming on a largely evergreen platform (iOS). It isn’t clear, though, that the Swift approach is the best for everyone.

Earlier, I said that the example used “Swift 4.2.3 on Ubuntu 18.04”. The “18.04” part is important! Swift.org ships binaries for Ubuntu 14.04, 16.04, and 18.04. Running the program

var s = "🤦🏼‍♂️" print(s.count) print(s.unicodeScalars.count) print(s.utf16.count) print(s.utf8.count)

in Swift 4.2.3 on Ubuntu 14.04 prints:

3 5 7 17

So Swift 4.2.3 on Ubuntu 18.04 as well as the unic_segment 0.9.0 Rust crate counted one extended grapheme cluster, the unicode-segmentation 1.3.0 Rust crate counted two extended grapheme clusters, and the same version of Swift, 4.2.3, but on a different operating system version counted three extended grapheme clusters!

Swift 4 delegates Unicode segmentation to operating system-provided ICU, and “Long-Term Support” in the Ubuntu case means security patches but does not mean rolling forward the Unicode version that the system copy of ICU knows about. In the case of iOS, delegating to system ICU is probably OK and will not lead to too high probability of the text being from the future from the point of view of the OS copy of ICU, since the iOS ecosystem stays exceptionally well up-to-date. However, delegating to system ICU is not such a great match for the idea of using Swift on the server side if the server side means running an old LTS distro.

(Swift 5 appears to no longer use system ICU for this. That is, Swift 5.0.3 on Ubuntu 14.04 sees one extended grapheme cluster in the string. I haven’t investigated what Swift 5 uses, but I assume that the switch to UTF-8 string representation necessitated using something other than ICU, which is heavily UTF-16-oriented. However, the result with Swift 4.2.3 nicely illustrates the issue related to using extended grapheme clusters.)

If you are doing things that have to be extended grapheme cluster-aware, there just is no way around the issue of not being able to correctly segment text that comes from the future relative to the Unicode segmentation implementation that your program is using. This is not a reason to avoid extended grapheme clusters for tasks that require awareness of extended grapheme clusters.

However, pushing extended grapheme clusters onto tasks that do not really require the use of extended grapheme cluster introduces failure modes arising from the Unicode version dependency where such a dependency isn’t strictly necessary. For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates.

Let’s consider other languages a bit.

C++ is often deployed such that the application developer doesn’t ship the standard library with the program. Most obviously, relying on GNU libstdc++ provided by an LTS Linux distribution presents similar problems as Swift 4 relying on ICU provided by an LTS Linux distribution. This isn’t a Linux-specific issue. Old supported branches of Windows generally don’t get new system-level Unicode data, either. Even though there is some movement towards individual applications shipping their own copy of LLVM libc++ with the application and the increased pace of C++ standard development starting with C++11 has made using a system-provided C++ standard library more problematic even ignoring Unicode considerations, it doesn’t seem like a good idea for C++ to develop a tight coupling with extended grapheme clusters for operations that don’t strictly necessitate it as longs as stuck-in-the-past system libraries (whether the C++ standard library itself or another library that it delegates to) are a significant part of the C++ standard library distribution practice.

There’s a proposal to expose extended grapheme cluster segmentation to JavaScript programs. The main problem with this proposal is the implication on APK sizes on Android and the effect of APK sizes on browser competition on Android. But if we ignore that for the moment and imagine this was part of the Web Platform, it would still be problematic to build this dependency into operations for which working on extended grapheme clusters isn’t strictly necessary. While the most popular browsers are evergreen, there’s still a long tail of browser instances that aren’t on the latest engine versions. When JavaScript executes on such browsers, there’d be effects similar to running Swift 4 on Ubuntu 14.04.

In contrast to C++ or JavaScript, the current Rust approach is to statically link all Rust library code, including the standard library, into the executable program. This means that the application distributor is in control of library versions and doesn’t need to worry about the program executing in the context of out-of-date Rust libraries. The flip side is concerns about the size of the executable. People already (rightly or wrongly) complain about the sizes of Rust executables. Pulling in a lot of Unicode data due to baking extended grapheme cluster processing into programs whose problem domain doesn’t strictly require working with extended grapheme clusters would be problematic in embedded contexts where the executable size is a real problem and not just a perceived problem—and would obviously make the perceived problem worse, too. Furthermore, in order to avoid problems similar to those involved in relying on system libraries, baking tight coupling with Unicode data into the standard library necessitates the organizational capability of keeping up with new Unicode versions in this area where not only data in the tables keeps changing but the format of the tables and, therefore, the associated algorithms have still been changing recently. Right now of the two extended grapheme cluster crates outside the Rust standard library, the one that’s organizationally closer to the standard library is the one that’s out of date.

Why Do We Want to Know Anyway?

“String length is about as meaningful a measurement as string height” – @qntm

Being able to allocate memory for strings gives a legitimate use case for knowing the storage length. However, in cases of Unicode scalar values or extended grapheme clusters, you typically want to iterate over them and look at each one instead of just knowing the count. So why do people want to know the count? As far as I can tell, there are two broad categories: Placing a quota limit that is fuzzy enough that it doesn’t need to be strictly tied to storage and trying to estimate how much text fits for display. Let’s look at the issue of estimating how much display space text takes, because it involves introducing yet another measurement of string length.

Display Space

Simply looking at the Latin letters i and m should make it clear that the display size of a string depends on the font and on the specific characters in the string. From this observation, the whole notion of estimating display space by counting characters seems folly. Indeed, if you want to know exactly how much text fits into a given space, you need to run a typesetting algorithm with a specific font, which may have a complex relationship between scalar values and glyphs, to actually see where the overflow starts. Yet, even in the case of the Latin script that has letters such as i and m, e.g. magazine editors can find character counts useful enough for estimating how many print pages an article of a given character count length is going to fill.

As for computer user interfaces, character terminal user interfaces use a monospaced font where both i and m take up one character cell on a grid. In the context of a monospaced font, the extended grapheme cluster count in the context of the Latin script corresponds directly to display space taken. The same obviously applies to the Greek and Cyrillic scripts, which are so close to the Latin script that fonts even intend to reuse glyphs across these scripts. In contrast, CJK ideographs, Japanese kana, and Hangul syllables take two cells of a terminal grid. From the CJK perspective, these are full-width characters and the ASCII characters are half-width characters. There exist also half-width katakana characters which fit into an 8-bit encoding with ASCII and take one cell on the terminal grid and, therefore, are technically easier to fit to Latin script-oriented terminal systems. The display width on a terminal also has a correspondence to byte with the legacy CJK encodings: ASCII takes one byte, a CJK ideograph, a full-width kana or a Hangul syllable takes two bytes. In the case of Shift_JIS, half-width katakana takes one byte per character.

This brings us to the concept of East Asian Width. ASCII and half-width katakana characters are narrow. CJK ideographs, full-width kana, and Hangul syllables are wide. However, even in the worldview that is split to Latin, Greek, and Cyrillic on one hand and Chinese, Japanese, and Korean on the other hand, there are ambiguities. From the perspective of European legacy encodings, Greek and Cyrillic (as well as accented Latin) is equally wide as ASCII. However, in legacy CJK encodings, Greek and Cyrillic characters take two bytes. This means that in terms of East Asian Width, a string can have a general-purpose width, which resolves these ambiguous characters as narrow, or legacy CJK-context width, which resolves these ambiguous characters as wide.

So is the general-purpose variant (that resolves Greek and Cyrillic characters as narrow) of East Asian Width the one true string length measure? Well, no.

First of all, the concept ignores all scripts that are geographically and in Unicode order between Latin, Greek, and Cyrillic on one hand and CJK on the other (even though some other scripts that are structurally similar to the Latin, Greek, and Cyrillic scripts and make sense for a monospaced font, such as Armenian and the Georgian scripts, fit this concept, too, despite not having a history in pre-Unicode CJK context). As it happens, though, emoji do fit into the concept, except for weird errors in the Unicode database. After all, emoji originate from Japan and were two bytes each when represented using the private use area of Shift_JIS.

Second, the concept assumes that there is one-to-one correspondence between scalar values and extended grapheme clusters. If we run this Rust program:

use unicode_width::UnicodeWidthStr; fn main() { println!("{}", "🤦🏼‍♂️".width()); }

it prints:

5

This is because the base emoji is wide (2), the combining skin tone modifier is also wide (2), the male sign is counted as narrow (1), and the zero-width joiner and the variation selector are treated as control characters that don’t count towards width. Obviously, this is not the answer that we want. The answer we want is 2. Ideas that come to mind immediately, such as only counting the width of the first character in an extended grapheme cluster or taking the width of the widest character in an extended grapheme cluster, don’t work, because flag emoji consist of two regional indicator symbol letter characters both of which have East Asian Width of Neutral (i.e. they are counted as narrow but are not marked as narrow, because they are considered to exist outside the domain of East Asian typography). I’m not aware of any official Unicode definiton that would reliably return 2 as the width of every kind of emoji. 😭

If you really must estimate display size without running text layout with a font, whether the extended grapheme cluster count or the East Asian Width of the string works better depends on context.

Arbitrary but Fair Quotas

In some cases there is a desire to impose a length limit that doesn’t arise from a strict storage limitation. For example, in the STUN protocol given earlier, presumably there is a desire to make it so that human-readable error messages cannot make protocol messages arbitrarily long. For example, in the case of Twitter, tweets being short is a core part of the type of expression that Twitter is about, so some definition of “short” is needed. In the case of string-based browser localStorage , there is a need to have some limit, but the limit is necessarily arbitrary and does not need to strictly map to bytes on disk.

In cases like this, there seems to be some concern that the limit should be internationally fair. Observations that UTF-8 and UTF-16 take a different amount of storage per character depending on the character superficially suggests that the UTF-8 length or the UTF-16 length might be unfair internationally.

What’s fair, though? The usual concern goes that UTF-8 favors English, because English takes one byte per character, and disfavors CJK, because Chinese, Japanese, and Korean take three bytes per character, so UTF-8 in unfair to CJK. This kind of analysis ignores how much information is conveyed per character. To assess what lengths we get for different languages when the amount of information conveyed is kept constant, I looked at the counts for the translations of the Universal Declaration of Human Rights. This is a document for which translation of the same content is available in particularly many languages, which is why I used it as the measurement corpus.

Unfortunately, not all translations contain the same text, so one needs to be careful when preparing the data for comparison. Some translations are incomplete, in some cases, very incomplete. For this reason, I included only translations in stage 4 or stage 5 along the 5-stage scale. Some translations carry the preamble with the recitals, but some do not. Some also carry historical notes. To make the length comparable, the preamble, notes, and whitespace-only text nodes were omitted. The rest of the XML text nodes were concatenated and normalized to Unicode Normalition Form C before counting. (Source code is available.)

Let’s look at the result. The table at the end of this document is sortable and is initially sorted by UTF-8 length. Each Δ% column shows how much the count in the column to its left deviates from the median count for that. (A note about color-coding. Coloring longer than median as red should not be taken to imply that those languages are somehow bad. It’s meant to imply that a length quota treats those languages badly.) In the table, the name of each language links to the translation in that language hosted on the site of the Unicode Consortium. The linked HTML versions may include the preamble and/or notes.

The CJK concern is alleviated when considering information conveyed. When measuring UTF-8 length, Mandarin using traditional characters is the shortest of the languages that have global name recognition! This should be expected, since the Han script pretty obviously packs more information per character than e.g. alphabetic scripts. (The globally less-known languages whose UTF-8 length is shorter than Mandarin’s (using traditional characters) are African and American Latin-script languages with a relatively small native speaker population for each—only one with a native speaker population exceeding a million and many whose native speaker population is smaller than 100 000, which explains why you might not recognize their names.)

Korean is also shorter than median in UTF-8 length. This also makes sense, since Hangul syllables pack three or two alphabetic jamo into one three-byte character. The UTF-8 length of Japanese is over median but only by 4.1%. The Japanese version of the text is 48% kanji and 52% hiragana. Japanese Wikipedia has almost the same kana to kanji ratio, though different kana: 46% kanji and the rest almost evenly split between hiragana and katakana, so we may assume the Universal Declaration of Human Rights to be representative of Japanese text in terms of kana to kanji ratio.

When sorting by UTF-16 code unit count, UTF-32 / scalar value count, or extended grapheme cluster count, CJK are the shortest. While it’s true that UTF-8 takes more bytes for CJK than UTF-16, the notion of UTF-8 being particularly disfavorable to CJK is not true relative to other languages. Rather, UTF-16 is particularly favorable to CJK. In particular, the Han script is so information-dense that even when sorting by East Asian Width, which effectively doubles the length of CJK but not other languages, Han-script languages stay clustered at the start of the table. Korean and Japanese move further but remain below median.

The language with the longest UTF-8 length is Shan, which uses the Burmese script. The Burmese language, also using the Burmese script, is the second-longest in UTF-8 length. There are a number of other Brahmic-script languages among the ones with the longest UTF-8 length. They use three bytes per character but don’t have CJK-like information per character density. These languages are below median in extended grapheme cluster count. In scalar value count, they intermingle with alphabetic languages.

It’s not clear if the concepts of median and mean (average) are meaningful. Does it make sense for a language with tens of millions of native speakers to count as an equal data point as a language with tens of thousands native speakers? Since this is about writing, should the numbers of writers be considered instead? (I.e. should literacy rates be taken into account?) In the hope that with a large number of languages in the table, median hand-wavily sorts out this kind of issue, I chose to compare with median. At least the Han-script languages have comparable numbers of native speakers as the Bhramic-script languages and provide a counter-weight at the other end of the spectrum of UTF-8 length. In any case, for measures other than UTF-8 length, median and mean are very close to each other.

Saying that Brahmic-script languages intermingle with alphabetic languages in character count is rather meaningless, though. In character count, after CJK (and Han-script Vietnamese and Yi-script Nousu), the language with the smallest character count is a Latin-script language (Waama). Also, the language with the largest character count is a Latin-script language (Ashéninka, Pichis). ( I find it odd that in UTF-8 length Ashéninka Perené is the second-shortest but Ashéninka, Pichis is long enough to reach the Brahmic cluster. I don’t know what the relation of these two languages is and what explains two languages whose name suggests close relation ending up in opposite extremes in length. Update: It has been pointed out to me that the supposed Ashéninka Perené translation is a mislabeled duplicate of the Cashinahua translation.)

One might hypothesize that the Latin script has just been put to so many uses that some of the uses have to be far from what it has been optimized for. Yet, when considering language-specific alphabets, the character counts for Greek and Georgian are above median. It just is the case that languages are different. In that sense, the whole notion of trying to find a simple length measure that is fair across languages seems folly.

Let’s look at the the factor between the minimum and maximum of each measure, i.e. the factor with which the minimum needs to be multiplied to get the maximum. Let’s even ignore the outlier for maximum for each measure and use the second largest value instead of the largest value for each count. (Otherwise, Ashéninka, Pichis alone would skew the numbers a lot.) We get these factors:

UTF-8 8.6 UTF-16 7.9 UTF-32 7.9 EGC 7.9 EAW 4.3

UTF-16, UTF-32, and extended grapheme clusters aren’t distinguished by this measure, because the languages at the extremes use characters from the Basic Multilingual Plane with one character per grapheme cluster. Considering that there are supplementary-plane scripts, arguably the UTF-32 count would be fairer than the UTF-16 count even though this factor doesn’t show the difference. It’s not clear that counting extended grapheme clusters would be particularly fair compared to counting characters: It favors scripts that are visually joining over scripts that aren’t visually joining even if there’s no logical difference. While looking at just the factor, East Asian Width makes the gap the smallest, but it’s a rather imprecise fairness solution. It just counts CJK as double. Even after this, the Han-script languages are still among the ones with the smallest counts. On the other hand, it seems unfair to recognize Hangul syllables and kana as carrying more information than an alphabetic character while not giving the same treatment to other syllabaries, such as the Ethiopic script, Ge’ez.

Twitter counts each CJK character (including three-jamo Hangul syllables; i.e. it is not decomposing Hangul and treating it as alphabetic) as consuming 2 units of the quota (as when counting East Asian Width), counts emoji as consuming two units (even when East Asian Width of the cluster would be more), and, unlike East Asian Width, counts each Ethiopic syllable as consuming two units of the quota. What Twitter does seems fairer than just applying East Asian Width, but the result is still that the amount of information that can be packed in a tweet can vary four-fold depending on language. That still doesn’t seem exactly fair across languages.

In closing:

There is no simple measure of string length that would be fair in terms of how much information can be conveyed within a length quota regardless of language.

Of solutions that don’t depend on the Unicode database and, therefore, the Unicode version and that don’t ad-hoc hard-code character ranges according to a particular version of Unicode, counting characters aka. scalar values i.e. UTF-32 length is the best that can be done. It’s still wildly unfair leading to almost eight-fold differences in how much information can be conveyed. This is not a flaw of Unicode but arises from differences in languages and writing systems.

While counting scalar values is fairer than just counting UTF-8 or UTF-16 code units, the factor between minimum and maximum UTF-8 length is so close to the factor between minimum and maximum UTF-32 length, both of which are pretty large, that instead of putting thought into using the scalar value length instead of the UTF-8 length or the UTF-16 length, it’s probably better to put the thought into reconsidering if you really need to impose such a limit.

need to impose such a limit. Unicode doesn’t provide a good database-based definition that would improve upon the character count in terms of normalizing the amount of information conveyed. While East Asian Width brings minimum and maximum closer, it unfairly singles out Hangul syllables and kana without considering other syllabaries, because normalizing length for information conveyed is not the purpose of East Asian Width.

Even if per-script (possibly non-integer) weights assigned to characters could make things fairer, it wouldn’t work well for the Latin script, which is all over the place in terms of language-dependent length.