Unicode Character Classes in ECMAScript Regular Expressions

April 27, 2009

Technical note: This post uses Unicode characters, especially at the end. If your browser or operating system does not have full Unicode coverage you may see boxes or rectangles in place of the correct glyphs.

The Problem

ECMAScript, the standardized version of the language JavaScript, defines string values as sequences of UTF-16 code units, not as sequences of characters. This language misfeature complicates Unicode handling considerably. For characters in the Basic Multilingual Plane (BMP) a single UTF-16 code unit (one 16-bit word) suffices. For characters outside this range, two code units are necessary. As an example, the Latin letter A is both one character and one code unit: "A".length === 1 , but the Unicode character U+1D400 MATHEMATICAL BOLD CAPITAL A is one character but two code units: "𝐀".length === 2 . A better language would hide this ugly implementation detail from users, and string attributes such as length would be in terms of characters, not code units. Unfortunately, for historical reasons, ECMAScript forces programmers who want proper Unicode support to deal with raw UTF-16 directly.
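The difference between code units and characters can be checked directly. The following is a minimal sketch; countCodePoints is a hypothetical helper written for this illustration, not part of the language or of the code presented later in this post.

```javascript
// Count characters (code points) instead of UTF-16 code units.
// countCodePoints is a hypothetical helper for illustration only.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one character.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate, it belongs to the same character
      }
    }
    count++;
  }
  return count;
}

var boldA = "\uD835\uDC00"; // U+1D400 MATHEMATICAL BOLD CAPITAL A

console.log("A".length, countCodePoints("A"));     // 1 1
console.log(boldA.length, countCodePoints(boldA)); // 2 1
```

The string is written with escapes so that the example does not depend on the encoding of the source file.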

One of the features broken by this kludge is regular expression character classes. In a character class, you can easily use ranges of characters, so long as those ranges fall within the BMP, e.g. [a-z] will match any lowercase letter of the basic Latin alphabet. If you want to match character ranges outside the BMP, things are a little more complicated. To match, say, the range of tetragrams, U+1D306 through U+1D356, we might naïvely use them directly: [𝌆-𝍖] . To understand why this doesn't work, we must consider the UTF-16 representation of the characters "𝌆" and "𝍖".

In UTF-16, characters outside of the BMP are represented by a surrogate pair composed of two code units: a high surrogate in the range 0xD800‒0xDBFF followed by a low surrogate in the range 0xDC00‒0xDFFF. Each of these 16-bit surrogates contains six bits that identify the code unit as part of a surrogate pair, followed by 10 bits of the represented code point less 0x10000. A more complete introduction can be found in the Wikipedia UTF-16 article.

The character U+1D306, "𝌆", is represented by the surrogate pair D834 DF06. Likewise U+1D356, "𝍖", is represented by D834 DF56.
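The arithmetic behind these pairs is short enough to sketch. toSurrogatePair and hex below are hypothetical helper names chosen for this illustration.

```javascript
// Split a code point outside the BMP into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper for illustration only.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a code point outside the BMP: " + codePoint);
  }
  var offset = codePoint - 0x10000;     // 20 significant bits remain
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Format a number as uppercase hexadecimal.
function hex(n) {
  return n.toString(16).toUpperCase();
}

console.log(toSurrogatePair(0x1D306).map(hex).join(" ")); // D834 DF06
console.log(toSurrogatePair(0x1D356).map(hex).join(" ")); // D834 DF56
```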

The string "[𝌆-𝍖]", then, contains five characters but is represented in ECMAScript by seven code units:

Character   Code Units
[           005B
𝌆           D834 DF06
-           002D
𝍖           D834 DF56
]           005D

The interpretation of ECMAScript regular expression character classes is according to code units, not characters. Despite the fact that "[𝌆-𝍖]" contains 5 characters, since "[𝌆-𝍖]".length === 7 , the meaning when used as a character class is surprising. "[𝌆-𝍖]" is equivalent to [\uD834\uDF06-\uD834\uDF56] and is read as "match either D834, or something in DF06‒D834, or DF56," just as if we had written "[am-qz]" to match an "a", an "m"‒"q", or a "z". Obviously this is not what was intended. Worse, because DF06 is greater than D834, the middle range is out of order, which the specification treats as a syntax error.

In character classes, then, we cannot use characters outside the BMP. Even a single character outside the BMP, if appearing in a character class, will have an undesired interpretation: the character class will match either of the two surrogate code units (but not both), which is clearly not the intention.
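Both effects can be observed with a short experiment. This is a sketch that assumes the code-unit-based matching described in this post and an engine that follows the specification; the ^ and $ anchors are added only to make the results unambiguous.

```javascript
// A class containing the single character U+1D306, written as its two code units.
var single = /^[\uD834\uDF06]$/;

console.log(single.test("\uD834"));       // true: the lone high surrogate matches
console.log(single.test("\uDF06"));       // true: the lone low surrogate matches
console.log(single.test("\uD834\uDF06")); // false: the class consumes only one code unit

// The attempted range contains the code units D834, DF06, "-", D834, DF56.
// DF06-D834 is an out-of-order range, so building the expression fails.
var rangeError = null;
try {
  new RegExp("[\uD834\uDF06-\uD834\uDF56]");
} catch (e) {
  rangeError = e;
}
console.log(rangeError instanceof SyntaxError); // true
```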

The Solution

Fortunately there is a way to match what we want, though not by using a character class alone. In matching the range 𝌆-𝍖, we want to match two consecutive code units. The first will be D834, and the second will be DF06, DF56, or anything between. So, using escape sequences to represent the code units directly, we can write:

\uD834[\uDF06-\uDF56]

The first escape will match the high surrogate, and the second range will match any of the low surrogates which may follow it to complete a character in the desired range.
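A quick check of this expression against the ends of the range and its neighbours, as a sketch; the ^ and $ anchors are an addition for the test, so that each string must match in full.

```javascript
// The tetragram pattern from above, anchored for testing.
var tetragram = /^\uD834[\uDF06-\uDF56]$/;

console.log(tetragram.test("\uD834\uDF06")); // true:  U+1D306, first in the range
console.log(tetragram.test("\uD834\uDF56")); // true:  U+1D356, last in the range
console.log(tetragram.test("\uD834\uDF57")); // false: U+1D357, just past the range
console.log(tetragram.test("\uD834"));       // false: a lone high surrogate
console.log(tetragram.test("A"));            // false: a BMP character
```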

Now consider a longer range, between U+1D306 "𝌆" and U+1F004 MAHJONG TILE RED DRAGON "🀄". The lowest surrogate pair to match, as before, is D834 DF06. The highest pair is D83C DC04. Additionally, anything "between" those two pairs should match. What does "between" mean here? In contrast to the previous example, not only the low surrogate but also the high surrogate now varies over the range we want to match. Any code point between these two will be represented as either (a) a D834 high surrogate followed by a low surrogate between DF06 and the top of the low surrogate range, DFFF, or (b) any high surrogate between D835 and D83B followed by any low surrogate whatsoever, or (c) a D83C high surrogate followed by a low surrogate between DC00 (the bottom of the low surrogate range) and DC04. This gives the following regular expression:

\uD834[\uDF06-\uDFFF]|[\uD835-\uD83B][\uDC00-\uDFFF]|\uD83C[\uDC00-\uDC04]

Three alternatives are necessary, each of which, if it matches, consumes two consecutive code units.
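The three alternatives can be exercised one by one. In this sketch the expression is wrapped in a non-capturing group and anchored, which is an addition for testing only; the test strings are written as escapes.

```javascript
// The three-alternative pattern from above, grouped and anchored for testing.
var longRange = /^(?:\uD834[\uDF06-\uDFFF]|[\uD835-\uD83B][\uDC00-\uDFFF]|\uD83C[\uDC00-\uDC04])$/;

console.log(longRange.test("\uD834\uDF06")); // true:  U+1D306, alternative (a)
console.log(longRange.test("\uD838\uDC00")); // true:  U+1E000, alternative (b)
console.log(longRange.test("\uD83C\uDC04")); // true:  U+1F004, alternative (c)
console.log(longRange.test("\uD834\uDF05")); // false: U+1D305, just below the range
console.log(longRange.test("\uD83C\uDC05")); // false: U+1F005, just above the range
```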

The Code

Writing regular expressions like those above by hand is tedious and error-prone. Instead we can write a program to generate them.

First we need an efficient representation of sets of code points. For this we provide a set datatype which represents a set of code points (i.e. integers) in the Unicode range (0 - 0x10FFFF). Sets may be constructed and manipulated with the following functions:

Constructors

universe
nil
    These two are actually constants, not constructors.

fromCharRange(from,to)
fromChar(char)
    Where from and to are any Unicode characters.

fromString(string)
    Returns a cset containing every unique character in string, which may include Unicode characters outside the BMP.

Sets can be constructed from strings, individual characters, or character ranges, among other ways. (There are also Unicode properties and categories, but that's for another post.)
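To give a feel for what such a constructor has to do, here is a minimal sketch that builds a set from a string. It assumes a hypothetical representation, a sorted array of inclusive [lo, hi] code point ranges, and the names codePointsOf and rangesFromString are made up for this illustration; the actual CSET code may work quite differently.

```javascript
// Decode a string into an array of code points, joining surrogate pairs.
function codePointsOf(str) {
  var points = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        i++; // the low surrogate has been consumed
      }
    }
    points.push(unit);
  }
  return points;
}

// Assumed representation: sorted array of inclusive [lo, hi] ranges.
function rangesFromString(str) {
  var points = codePointsOf(str).sort(function (a, b) { return a - b; });
  var ranges = [];
  for (var i = 0; i < points.length; i++) {
    var last = ranges[ranges.length - 1];
    if (last && points[i] <= last[1] + 1) {
      // Duplicate or adjacent code point: extend the current range.
      last[1] = Math.max(last[1], points[i]);
    } else {
      ranges.push([points[i], points[i]]);
    }
  }
  return ranges;
}

console.log(JSON.stringify(rangesFromString("cab\uD834\uDF06a")));
// [[97,99],[119558,119558]]
```

Storing ranges rather than individual code points keeps large sets such as the whole BMP down to a handful of entries.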

Set Operations

complement(cset)

difference(a,b)

union(a,b)

intersection(a,b)
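These operations are straightforward once sets are stored as ranges. The sketch below assumes a hypothetical representation, a sorted array of inclusive [lo, hi] code point ranges, and reuses the operation names on plain arrays for readability; it is not the CSET implementation.

```javascript
var MAX = 0x10FFFF; // highest Unicode code point

// Merge overlapping or adjacent ranges into a sorted, minimal list.
function normalize(ranges) {
  var sorted = ranges.slice().sort(function (a, b) { return a[0] - b[0]; });
  var out = [];
  for (var i = 0; i < sorted.length; i++) {
    var last = out[out.length - 1];
    if (last && sorted[i][0] <= last[1] + 1) {
      last[1] = Math.max(last[1], sorted[i][1]);
    } else {
      out.push([sorted[i][0], sorted[i][1]]);
    }
  }
  return out;
}

function union(a, b) {
  return normalize(a.concat(b));
}

// Everything in 0..MAX that is not covered by the given ranges.
function complement(a) {
  var out = [];
  var next = 0; // lowest code point not yet accounted for
  var ranges = normalize(a);
  for (var i = 0; i < ranges.length; i++) {
    if (ranges[i][0] > next) out.push([next, ranges[i][0] - 1]);
    next = ranges[i][1] + 1;
  }
  if (next <= MAX) out.push([next, MAX]);
  return out;
}

// De Morgan: the intersection is the complement of the union of complements.
function intersection(a, b) {
  return complement(union(complement(a), complement(b)));
}

function difference(a, b) {
  return intersection(a, complement(b));
}

var letters = [[0x61, 0x7A]]; // a-z
var digits = [[0x30, 0x39]];  // 0-9
var both = union(letters, digits);

console.log(JSON.stringify(both));
// [[48,57],[97,122]]
console.log(JSON.stringify(difference([[0, 0xFFFF]], both)));
// [[0,47],[58,96],[123,65535]]
```

Defining intersection and difference in terms of union and complement keeps the sketch short; a direct merge of the two range lists would avoid the intermediate arrays.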

Once a character set is constructed, we can output an ECMAScript regular expression which will match any character from that set using toRegex().

Output

toRegex(cset)
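The three-way split worked through by hand in the previous section can be generated mechanically. The following is a minimal sketch for a single range lying entirely outside the BMP; astralRangeToRegex and its helpers are hypothetical names, and the real toRegex handles arbitrary sets and may produce different output.

```javascript
// Format a code unit as a \uXXXX escape.
function esc(unit) {
  return "\\u" + ("0000" + unit.toString(16).toUpperCase()).slice(-4);
}

// A class for the code units from..to, or a bare escape for a single unit.
function cls(from, to) {
  return from === to ? esc(from) : "[" + esc(from) + "-" + esc(to) + "]";
}

// Split a code point outside the BMP into its surrogates.
function split(codePoint) {
  var offset = codePoint - 0x10000;
  return { high: 0xD800 + (offset >> 10), low: 0xDC00 + (offset & 0x3FF) };
}

// Build a regular expression source matching every code point in lo..hi,
// assuming 0x10000 <= lo <= hi <= 0x10FFFF.
function astralRangeToRegex(lo, hi) {
  if (lo < 0x10000 || hi > 0x10FFFF || lo > hi) {
    throw new RangeError("expected 0x10000 <= lo <= hi <= 0x10FFFF");
  }
  var a = split(lo), b = split(hi);
  if (a.high === b.high) {
    // Same high surrogate: only the low surrogate varies.
    return esc(a.high) + cls(a.low, b.low);
  }
  var parts = [];
  // (a) first high surrogate, from the starting low surrogate to the top
  parts.push(esc(a.high) + cls(a.low, 0xDFFF));
  // (b) any high surrogates strictly between, with any low surrogate
  if (a.high + 1 <= b.high - 1) {
    parts.push(cls(a.high + 1, b.high - 1) + cls(0xDC00, 0xDFFF));
  }
  // (c) last high surrogate, from the bottom to the ending low surrogate
  parts.push(esc(b.high) + cls(0xDC00, b.low));
  return parts.join("|");
}

console.log(astralRangeToRegex(0x1D306, 0x1D356));
// \uD834[\uDF06-\uDF56]
console.log(astralRangeToRegex(0x1D306, 0x1F004));
// \uD834[\uDF06-\uDFFF]|[\uD835-\uD83B][\uDC00-\uDFFF]|\uD83C[\uDC00-\uDC04]
```

The returned string is regular expression source, so it can be passed to new RegExp, wrapped in a group if it is to be combined with other patterns.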

Examples

All of these are live, so you can edit the code and watch the corresponding output update in real time right here on the page.

CSET.import()
toRegex(universe)

The CSET.import() call simply makes the functions from the CSET module available locally. I've written a separate post about modules.

A regular expression to match the tetragrams:

toRegex(fromCharRange("𝌆","𝍖"))

Here is the longer range that was explained above:

toRegex(fromCharRange("𝌆","🀄"))

Of course, if the range covers characters in the BMP, there is no need to use the "\u" Unicode escapes, and we can use the characters directly to represent themselves, making the regular expression a little more readable and saving a few bytes:

latinLetters = fromCharRange('a','z')
digits = fromCharRange('0','9')
both = union(latinLetters,digits)
toRegex(both)

var BMP = fromIntRange(0,0xFFFF)
toRegex(difference(BMP,both)) // everything in the BMP except a‒z and 0‒9.

Here's a regex to match any character that appears in the first sentence of this post:

toRegex(fromString("ECMAScript, the standardized version of the language JavaScript, defines string values as sequences of UTF-16 code units, not as sequences of characters."))

Finally, here's a regex to match any single character in the Unicode category Ll (lowercase letters in any language). The other Unicode General Categories (except Cn) are also supported: Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No Pc Pd Ps Pe Pi Pf Po Sm Sc Sk So Zs Zl Zp Cc Cf Cs and Co. The Unicode Character Database has an explanation of these.

toRegex(fromUnicodeGeneralCategory('Ll'))

Download

The CSET code used on this page is in cset_production.js, which is generated from cset_source.js, which contains detailed comments. The code is released under the MIT license.