Most programming languages evolved awkwardly during the transition from ASCII to 16-bit UCS-2 to full Unicode. They contain internationalization features that often aren’t portable or don’t suffice.

Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.

Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.

This article illustrates text processing ideas with example programs. We’ll use the International Components for Unicode (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.

ICU began at IBM and is now maintained under the Unicode Consortium. It officially supports C, C++ and Java APIs. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.

Concepts

Before getting into the example code, it’s important to learn the terminology. Let’s start at the most basic question.

What is a “character?”

“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.

Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.

You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.

In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways:

A: U+006f (o) + U+0302 (◌̂) + U+0323 (◌̣)

B: U+006f (o) + U+0323 (◌̣) + U+0302 (◌̂)

C: U+00f4 (ô) + U+0323 (◌̣)

D: U+1ecd (ọ) + U+0302 (◌̂)

E: U+1ed9 (ộ)

The numbers (written U+xxxx) for each abstract character and each combining symbol are called “codepoints.” Every Unicode string is expressed as a list of codepoints. As illustrated above, multiple strings of codepoints may render into the same sequence of graphemes.

To meaningfully compare strings codepoint by codepoint for equality, both strings should be represented in a consistent way. A standardized choice of codepoint decomposition for graphemes is called a “normal form.”

One choice is to decompose a string into as many codepoints as possible like examples A and B. The combining marks are then sorted by their canonical combining class, which decides which mark comes first; for ộ the dot below sorts before the circumflex, so example B is the canonical order. That is called Normalization Form Canonical Decomposition (NFD). Another choice is to do the opposite and use the fewest codepoints possible like example E. This is called Normalization Form Canonical Composition (NFC).

A core concept to remember is that, although codepoints are the building blocks of text, they don’t match up 1-1 with user-perceived characters (graphemes). Operations such as taking the length of an array of codepoints, or accessing arbitrary array positions are typically not useful for Unicode programs. Programs must also be mindful of the combining characters, like diacritical marks, when inserting or deleting codepoints. Inserting U+0061 into the asterisk position U+006f U+0302 (*) U+0323 changes the string “ộ” into “ôạ” rather than “ộa”.
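The insertion pitfall above can be avoided by nudging an insertion index forward past any combining marks. Here is a minimal sketch in plain C, without ICU. It assumes the codepoints sit in a uint32_t array, and it only recognizes the Combining Diacritical Marks block (U+0300 through U+036F). The function names are made up for illustration; a real program would ask a Unicode library for grapheme boundaries instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplifying assumption: only U+0300..U+036F count as combining
 * marks. Real text has many more, spread across other blocks. */
int is_basic_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Move an insertion index forward until it no longer sits between
 * a base letter and one of its combining marks. */
size_t snap_insert_pos(const uint32_t *s, size_t len, size_t pos)
{
    while (pos < len && is_basic_combining_mark(s[pos]))
        pos++;
    return pos;
}
```

For the sequence U+006F U+0302 U+0323, asking to insert at index 2 yields index 3, so a new letter lands after the whole grapheme rather than stealing its dot.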

Glyphs vs graphemes

It’s not just fonts that cause graphemes to be rendered into varying glyphs. The rules of some languages cause glyphs to change through contextual shaping. For instance the Arabic letter “heh” has four forms, depending on which sides are flanked by letters. When isolated it appears as ﻩ and in the final/initial/medial position in a word it appears as ﻪ/ﻫ/ﻬ respectively. Similarly, Greek displays lower-case sigma differently at the end of the word (final form) than elsewhere. Some glyphs change based on visual order. In a right-to-left language the starting parenthesis “(” mirrors to display as “)”.

Not only do individual graphemes’ glyphs vary, graphemes can combine to form single glyphs. One way is through ligatures. The Latin letters “fi” often join the dot of the i with the curve of the f (presentation form U+FB01 ﬁ). Another way is language irregularity. The Arabic ا and ل, when contiguous, must form ﻻ.

Conversely, a single grapheme can split into multiple glyphs. For instance in some Indic languages, vowels can split and surround preceding consonants. In Bengali, U+09CC ৌ surrounds U+09AE ম to become মৌ.

How are codepoints encoded?

In 1990, Unicode codepoints were 16 bits wide. That choice turned out to be too small for the symbols and languages people wanted to represent, so the committee extended the standard to 21 bits. That’s fine in the abstract, but how the 21 bits are stored in memory or communicated between computers depends on practical factors.

It’s an unusual memory size. Computer hardware doesn’t typically access memory in 21-bit chunks. Networking protocols, too, are better geared toward transmitting eight bits at a time. Thus, codepoints are broken into sequences of more conventionally sized blocks called code units for persistence on disk, transmission over networks, and manipulation in memory.

The Unicode Transformation Formats (UTF) describe different ways to map between codepoints and code units. The transformation formats are named after the bit width of their code units (7, 8, 16, or 32), as well as the endianness (BE or LE). For instance: UTF-8, or UTF-16BE. In addition to the UTFs, there’s another – more complex – encoding called Punycode. It is designed to conform with the limited ASCII character subset used for Internet host names.
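To make the codepoint-to-code-unit mapping concrete, here is a small sketch of the UTF-16 transformation written by hand in plain C. The function name utf16_encode is an illustration only, not an ICU call; in real code you would let the library do this.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-16 code units. Returns the number of
 * units written (1 or 2), or 0 if cp cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0; /* out of range, or a lone surrogate value */
    if (cp < 0x10000)
    {
        out[0] = (uint16_t)cp; /* BMP: stored as-is */
        return 1;
    }
    cp -= 0x10000; /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* lead: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* trail: low 10 bits */
    return 2;
}
```

A BMP codepoint such as U+00E9 becomes the single unit 0x00E9, while U+1D7D8 becomes the two units 0xD835 0xDFD8.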

A final bit of terminology. A “plane” is a contiguous group of 65,536 code points. There are 17 planes, identified by the numbers 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes (1 through 16) are called “supplementary planes.”
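Because each plane holds exactly 65,536 (0x10000) codepoints, the plane number is simply the codepoint shifted right by 16 bits. A tiny sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Returns the plane number (0 through 16), or -1 if cp lies
 * beyond the last Unicode codepoint, U+10FFFF. */
int plane_of(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return -1;
    return (int)(cp >> 16);
}

/* True when cp belongs to the Basic Multilingual Plane. */
int is_bmp(uint32_t cp)
{
    return plane_of(cp) == 0;
}
```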

Which encoding should you choose?

For transmission and storage, use UTF-8. Programs which move ASCII data can handle it without modification. Machine endianness does not affect UTF-8, and the byte-sized units work well in networks and filesystems.

Some sites, like UTF-8 Everywhere, go even further and recommend using UTF-8 for internal manipulation of text in program memory. However, I would suggest you use whatever encoding your Unicode library favors for this. You’ll be performing operations through the library API, not directly on code units. As we’re seeing, there is too much complexity between glyphs, graphemes, codepoints and code units to be manipulating the units directly. Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.

It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint. However, the relationship between codepoints and glyphs isn’t straightforward, so there isn’t a programmatic advantage to storing the string this way.

UTF-32 also wastes at minimum 11 (32 - 21) bits per codepoint, and typically more. For instance, UTF-16 requires only one 16-bit code unit to encode points in the Basic Multilingual Plane (the most commonly encountered points). Thus UTF-32 typically doubles the space required for BMP text.
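The space trade-off is easy to check by counting bytes. The sketch below is plain C with made-up function names, and it assumes every input is a valid codepoint. It adds up the bytes one string of codepoints would occupy in each transformation format.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one valid codepoint in each format. */
size_t utf8_bytes(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4; /* one unit, or a surrogate pair */
}

size_t utf32_bytes(uint32_t cp)
{
    (void)cp; /* always one 32-bit unit */
    return 4;
}

/* Sum the per-codepoint cost over an array of n codepoints. */
size_t total_bytes(const uint32_t *s, size_t n,
                   size_t (*per_cp)(uint32_t))
{
    size_t i, total = 0;
    for (i = 0; i < n; ++i)
        total += per_cp(s[i]);
    return total;
}
```

For the six codepoints of “résumé” in NFC this gives 8 bytes in UTF-8, 12 in UTF-16 and 24 in UTF-32.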

There are times to manipulate UTF-32, such as when examining a single codepoint. We’ll see examples below.

ICU example programs

The programs in this article are ready to compile and run. They require the ICU C library called ICU4C, which is available on most platforms through the operating system package manager.

ICU provides five libraries for linking (we need the first two):

| Package  | Contents                                 |
|----------|------------------------------------------|
| icu-uc   | Common (uc) and Data (dt/data) libraries |
| icu-io   | Ustdio/iostream library (icuio)          |
| icu-i18n | Internationalization (in/i18n) library   |
| icu-le   | Layout Engine                            |
| icu-lx   | Paragraph Layout                         |

To use ICU4C, set the compiler and linker flags with pkg-config in your Makefile. (Pkg-config may also need to be installed on your computer.)

```
CFLAGS = -std=c99 -pedantic -Wall -Wextra \
         `pkg-config --cflags icu-uc icu-io`
LDFLAGS = `pkg-config --libs icu-uc icu-io`
```

The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style ( // ) comments.

Generating random codepoints

To start getting a feel for ICU’s I/O and codepoint manipulation, let’s make a program to output completely random (but valid) codepoints. You could use this program as a basic fuzz tester, to see whether its output confuses other programs. A real fuzz tester ought to have the ability to take an explicit seed for repeatable output, but we will omit that functionality from our simple demo.

This program has limited portability because it gets entropy from /dev/urandom , a Unix device. To generate good random numbers using only the C standard library, see my other article. Also POSIX provides pseudo-random number functions.

```c
/* for constants like EXIT_FAILURE */
#include <stdlib.h>
/* we'll be using standard C I/O to read random bytes */
#include <stdio.h>
/* to determine codepoint categories */
#include <unicode/uchar.h>
/* to output UTF-32 codepoints in proper encoding for terminal */
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
    long i = 0, linelen;
    /* somewhat non-portable: /dev/urandom is unix specific */
    FILE *f = fopen("/dev/urandom", "rb");
    UFILE *out;
    /* random bits are read as unsigned so that the modulo
     * below can never produce a negative value */
    uint32_t r;
    /* UTF-32 code unit can hold an entire codepoint */
    UChar32 c;
    /* to learn about c */
    UCharCategory cat;

    if (!f)
    {
        fputs("Unable to open /dev/urandom\n", stderr);
        return EXIT_FAILURE;
    }

    /* optional length to insert line breaks */
    linelen = argc > 1 ? strtol(argv[1], NULL, 10) : 0;

    /* have to obtain a Unicode-aware file handle. This function
     * has no failure return code, it always works. */
    out = u_get_stdout();

    /* read a random 32 bits, presumably forever */
    while (fread(&r, sizeof r, 1, f))
    {
        /* Scale 32-bit value to a number within code planes
         * zero through fourteen. (Planes 15-16 are private-use)
         *
         * The modulo bias is insignificant. The first 65536
         * codepoints are minutely favored, being generated by
         * 4370 different 32-bit numbers each. The remaining
         * 917504 codepoints are generated by 4369 numbers each. */
        c = (UChar32)(r % 0xF0000);
        cat = u_charType(c);

        /* U_UNASSIGNED covers codepoints with no assigned
         * meaning for interchange, including the
         * "non-characters." U_PRIVATE_USE_CHAR are
         * reserved for use within organizations, and
         * U_SURROGATE are designed for UTF-16 code units in
         * particular. Don't print any of those. */
        if (cat != U_UNASSIGNED &&
            cat != U_PRIVATE_USE_CHAR &&
            cat != U_SURROGATE)
        {
            u_fputc(c, out);
            if (linelen && ++i >= linelen)
            {
                i = 0;
                /* there are a number of Unicode
                 * linebreaks, but the standard ASCII
                 * \n is valid, and will interact well
                 * with a shell */
                u_fputc('\n', out);
            }
        }
    }

    /* should never get here */
    fclose(f);
    return EXIT_SUCCESS;
}
```

A note about the mysterious U_UNASSIGNED category. It covers code points that have no character assigned to them yet, and also the “non-characters.” Non-characters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. The Unicode Standard sets aside 66 non-character code points. The last two code points of each plane are noncharacters (U+FFFE and U+FFFF on the BMP). In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0…U+FDEF.

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. They are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.
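The definition of the 66 noncharacters is regular enough to test with a few comparisons. This is a sketch in plain C with hypothetical function names, handy for checking the count; ICU offers its own property lookups for real use.

```c
#include <stdint.h>

/* True for the 66 noncharacters: U+FDD0..U+FDEF, plus the last
 * two codepoints (xxFFFE and xxFFFF) of each of the 17 planes. */
int is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Walk the whole codespace and count them. */
int count_noncharacters(void)
{
    uint32_t cp;
    int n = 0;
    for (cp = 0; cp <= 0x10FFFF; ++cp)
        n += is_noncharacter(cp);
    return n;
}
```

The count comes to 17 × 2 + 32 = 66, matching the figure in the standard.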

Manipulating codepoints

We discussed non-characters in the previous section, but there are also Private Use codepoints. Unlike non-characters, those for private use are designated for interchange between systems. However, the precise meaning and glyphs for these characters are specific to the organization using them. The same codepoints can be used for different things by different people.

Unicode provides a large area for private use: a small code block in the BMP, as well as two entire planes, 15 and 16. Because browsers and text editors typically render PUA codepoints as nothing more than empty boxes, we can exploit plane 15 to make a visually confusing code. Ultimately it’s a cheesy substitution cipher, but it’s kind of fun.

Below is a program to shift characters in the BMP to/from plane 15, the Private Use Area A. Example output of an encoded string: 󰁂󰁥󰀠󰁳󰁵󰁲󰁥󰀠󰁴󰁯󰀠󰁤󰁲󰁩󰁮󰁫󰀠󰁹󰁯󰁵󰁲󰀠󰁏󰁶󰁡󰁬󰁴󰁩󰁮󰁥󰀡󰀊

```c
#include <stdio.h>
#include <stdlib.h>
/* for strcmp in argument parsing */
#include <string.h>
#include <unicode/ustdio.h>

void usage(const char *prog)
{
    puts("Shift base multilingual plane to/from PUA-A\n");
    printf("Usage: %s [-d]\n\n", prog);
    puts("Encodes stdin (or decode with -d)");
    exit(EXIT_SUCCESS);
}

int main(int argc, char **argv)
{
    UChar32 c;
    UFILE *in, *out;
    enum { MODE_ENCODE, MODE_DECODE } mode = MODE_ENCODE;

    if (argc > 2)
        usage(argv[0]);
    else if (argc > 1)
    {
        if (strcmp(argv[1], "-d") == 0)
            mode = MODE_DECODE;
        else
            usage(argv[0]);
    }

    out = u_get_stdout();

    in = u_finit(stdin, NULL, NULL);
    if (!in)
    {
        fputs("Error opening stdin as UFILE\n", stderr);
        return EXIT_FAILURE;
    }

    /* u_fgetcx returns UTF-32. U_EOF happens to be 0xFFFF,
     * not -1 like EOF typically is in stdio.h */
    while ((c = u_fgetcx(in)) != U_EOF)
    {
        /* -1 for UChar32 actually signifies invalid character */
        if (c == (UChar32)0xFFFFFFFF)
        {
            fputs("Invalid character.\n", stderr);
            continue;
        }
        if (mode == MODE_ENCODE)
        {
            /* Move the BMP into the Supplementary
             * Private Use Area-A, which begins
             * at codepoint 0xf0000 */
            if (0 < c && c < 0xe000)
                c += 0xf0000;
        }
        else
        {
            /* Move the Supplementary Private Use
             * Plane down into the BMP */
            if (0xf0000 < c && c < 0xfe000)
                c -= 0xf0000;
        }
        u_fputc(c, out);
    }

    /* if you u_finit it, then u_fclose it */
    u_fclose(in);
    return EXIT_SUCCESS;
}
```

Examining UTF-8 code units

So far we’ve been working entirely with complete codepoints. This next example gets into their representation as code units in a transformation format, namely UTF-8. We will read a codepoint as a hexadecimal program argument, convert it into one to four bytes of UTF-8, and print the hex values of those bytes.

```c
/*** utf8.c ***/
#include <stdio.h>
#include <stdlib.h>
#include <unicode/utf8.h>

int main(int argc, char **argv)
{
    UChar32 c;
    /* ICU defines its own bool type to be used
     * with their macro. Plain 0 is used here because
     * some newer ICU releases no longer define FALSE */
    UBool err = 0;
    /* ICU uses C99 types like uint8_t */
    uint8_t bytes[4] = {0};
    /* probably should be size_t not int32_t, but
     * just matching what their macro expects */
    int32_t written = 0, i;
    char *parsed;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s codepoint\n", *argv);
        exit(EXIT_FAILURE);
    }
    c = strtol(argv[1], &parsed, 16);
    if (!*argv[1] || *parsed)
    {
        fprintf(stderr,
            "Cannot parse codepoint: U+%s\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    /* this is a macro, and updates the variables
     * directly. No need to pass addresses.
     * We're saying: write to "bytes", tell us how
     * many were "written", limit it to four */
    U8_APPEND(bytes, written, 4, c, err);
    if (err)
    {
        fprintf(stderr, "Invalid codepoint: U+%s\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    /* print in format 'xxd -r' can read; %02x keeps the
     * leading zero of any byte below 0x10 */
    printf("0: ");
    for (i = 0; i < written; ++i)
        printf("%02x", bytes[i]);
    puts("");
    return EXIT_SUCCESS;
}
```

Suppose you compile this to a program named utf8 . Here are some examples:

```
# ascii characters are unchanged
$ ./utf8 61
0: 61

# other codepoints require more bytes
$ ./utf8 1F41A
0: f09f909a

# format is compatible with "xxd"
$ ./utf8 1F41A | xxd -r
🐚

# surrogates (used in UTF-16) are not valid codepoints
$ ./utf8 DC00
Invalid codepoint: U+DC00
```
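To see roughly what the conversion does with the bits, here is a hand-written sketch of the same UTF-8 encoding in plain C. The function name utf8_encode is hypothetical, and this is an illustration of the bit layout rather than a replacement for the library macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-8. Returns the number of bytes
 * written (1 to 4), or 0 if cp is a surrogate or out of range. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80)
    {
        out[0] = (uint8_t)cp;                 /* 0xxxxxxx */
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (uint8_t)(0xC0 | (cp >> 6)); /* 110xxxxx */
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (uint8_t)(0xE0 | (cp >> 12)); /* 1110xxxx */
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));     /* 11110xxx */
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));    /* 10xxxxxx */
    return 4;
}
```

Each continuation byte carries six bits of the codepoint, and the lead byte’s high bits announce how many bytes follow. U+1F41A comes out as f0 9f 90 9a, the same bytes the ICU program prints.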

Reading lines into internal UTF-16 representation

Unlimited line length

Here’s a useful helper function named u_wholeline() which reads a line of any length into a dynamically allocated buffer. It reads as UChar*, which is ICU’s standard UTF-16 code unit array.

```c
#include <stdlib.h>
#include <unicode/ustdio.h>

/* line feed, vertical tab, form feed, carriage return,
 * next line, line separator, paragraph separator */
#define NEWLINE(c) ( \
    ((c) >= 0xa && (c) <= 0xd) || \
    (c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

/* allocates buffer, caller must free */
UChar *u_wholeline(UFILE *f)
{
    /* assume most lines are shorter
     * than 128 UTF-16 code units */
    size_t i, sz = 128;
    UChar c, *s = malloc(sz * sizeof(*s)), *s_new;

    if (!s)
        return NULL;

    /* u_fgetc returns UTF-16, unlike u_fgetcx */
    for (i = 0;
         (s[i] = u_fgetc(f)) != U_EOF && !NEWLINE(s[i]);
         ++i)
        /* grow before the next write, s[i+1], can overflow */
        if (i + 1 >= sz)
        {
            /* double the buffer when it runs out */
            sz *= 2;
            s_new = realloc(s, sz * sizeof(*s));
            if (!s_new)
            {
                /* on failure realloc leaves the old
                 * block allocated, so release it */
                free(s);
                return NULL;
            }
            s = s_new;
        }

    /* if terminated by CR, eat LF */
    if (s[i] == 0xd && (c = u_fgetc(f)) != 0xa && c != U_EOF)
        u_fungetc(c, f);
    /* s[i] will either be U_EOF or a newline; wipe it */
    s[i] = '\0';
    return s;
}
```

Limited line length

The previous example reads an entire line. However, reading a limited number of code units from UTF-16 lines is trickier. Truncating a Unicode string is always a little dangerous because it can split a word and break contextual shaping.

UTF-16 also has surrogate pairs, which are how that transformation format expresses codepoints outside the BMP. Ending a UTF-16 string early can split a surrogate pair unless the program takes precautions.
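One such precaution is to back up by one unit whenever a cut would land right after a lead surrogate. The sketch below shows the idea in plain C on a raw uint16_t array. The function names are hypothetical, and the bit test mirrors what a macro like U16_IS_LEAD checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Lead (high) surrogates occupy 0xD800..0xDBFF. */
int is_lead_surrogate(uint16_t u)
{
    return (u & 0xFC00) == 0xD800;
}

/* Given len code units in s, return the longest prefix length
 * that is at most max and does not cut a surrogate pair in two. */
size_t utf16_safe_prefix(const uint16_t *s, size_t len, size_t max)
{
    if (max >= len)
        return len; /* nothing is being cut off */
    if (max > 0 && is_lead_surrogate(s[max - 1]))
        return max - 1; /* leave the lead for the next chunk */
    return max;
}
```

With the units d835 dfd8 d835 dfd9, a requested limit of three is reduced to two, so both pairs stay whole.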

The following example reads lines in chunks of at most three UTF-16 code units at a time. If it reads two consecutive codepoints from supplementary planes it will fail. The program accepts a “fix” argument to make it push a final unpaired surrogate back onto the stream for a future read.

```c
/*** codeunit.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utf16.h>

/* BUFSZ set to be very small so that lines must be read in
 * many chunks. Helps illustrate split surrogate pairs */
#define BUFSZ 4

void printHex(const UChar *s)
{
    while (*s)
        printf("%x ", (unsigned)*s++);
    putchar('\n');
}

/* yeah, slightly annoying duplication */
void printHex32(const UChar32 *s)
{
    while (*s)
        printf("%x ", (unsigned)*s++);
    putchar('\n');
}

int main(int argc, char **argv)
{
    UFILE *in;
    /* read line into ICU's default UTF-16 representation */
    UChar line[BUFSZ];
    /* A buffer to hold codepoints of "line" as UTF-32 code
     * units. The length is sufficient because it requires
     * fewer (or at least no greater) code units in UTF-32 to
     * encode the string */
    UChar32 codepoints[BUFSZ] = {0};
    UChar *final;
    UErrorCode err = U_ZERO_ERROR;

    if (!(in = u_finit(stdin, NULL, NULL)))
    {
        fputs("Error opening stdin as UFILE\n", stderr);
        return EXIT_FAILURE;
    }

    /* read lines one small BUFSZ chunk at a time */
    while (u_fgets(line, BUFSZ, in))
    {
        /* correct for split surrogate pairs only
         * if the "fix" argument is present */
        if (argc > 1 && strcmp(argv[1], "fix") == 0)
        {
            final = line + u_strlen(line);
            /* want to consider the character before \0
             * if such exists */
            if (final > line)
                final--;
            /* if it is the lead unit of a surrogate pair */
            if (U16_IS_LEAD(*final))
            {
                /* push it back for a future read, and
                 * truncate the string */
                u_fungetc(*final, in);
                *final = '\0';
            }
        }

        printf("UTF-16    : ");
        printHex(line);
        u_strToUTF32(
            codepoints, BUFSZ, NULL,
            line, -1, &err);
        printf("Error?    : %s\n", u_errorName(err));
        printf("Codepoints: ");
        printHex32(codepoints);

        /* reset potential errors and go for another chunk */
        err = U_ZERO_ERROR;
        *codepoints = '\0';
    }

    u_fclose(in);
    return EXIT_SUCCESS;
}
```

If the program reads two weird numerals 𝟘𝟙 (different from 01), neither of which is in the BMP, it finds one codepoint but chokes on the broken pair:

```
$ echo -n 𝟘𝟙 | ./codeunit
UTF-16    : d835 dfd8 d835
Error?    : U_INVALID_CHAR_FOUND
Codepoints: 1d7d8
UTF-16    : dfd9
Error?    : U_INVALID_CHAR_FOUND
Codepoints:
```

However if we pass the “fix” argument, the program will read two complete codepoints:

```
$ echo -n 𝟘𝟙 | ./codeunit fix
UTF-16    : d835 dfd8
Error?    : U_ZERO_ERROR
Codepoints: 1d7d8
UTF-16    : d835 dfd9
Error?    : U_ZERO_ERROR
Codepoints: 1d7d9
```

Perhaps a better way to read a line with limited length is to use a “break iterator” to stop on a word boundary. We’ll see more about that later.

Extracting, iterating codepoints in UTF-16 string

Our next example will rather laboriously remove diacritical marks from a string. There’s an easier way to do this called “transformation,” but doing it manually provides an opportunity to decompose characters and iterate over them with the U16_NEXT macro.

```c
/*** nomarks.c ***/
#include <stdlib.h>
#include <unicode/uchar.h>
#include <unicode/unorm2.h>
#include <unicode/ustdio.h>
#include <unicode/utf16.h>

/* Limit to how many decomposed UTF-16 units a single
 * codepoint will become in NFD. I don't know the
 * correct value here so I chose a value that seems
 * to be overkill */
#define MAX_DECOMP_LEN 16

int main(void)
{
    long i, n;
    UChar32 c;
    UFILE *in, *out;
    UChar decomp[MAX_DECOMP_LEN];
    UErrorCode status = U_ZERO_ERROR;
    /* the instance is owned by ICU, hence const */
    const UNormalizer2 *norm;

    out = u_get_stdout();

    in = u_finit(stdin, NULL, NULL);
    if (!in)
    {
        /* using stdio functions with stderr and ustdio
         * with stdout. Mixing the two on a single file
         * handle would probably be bad. */
        fputs("Error opening stdin as UFILE\n", stderr);
        return EXIT_FAILURE;
    }

    /* create a normalizer, in this case one going to NFD */
    norm = unorm2_getNFDInstance(&status);
    if (U_FAILURE(status))
    {
        fprintf(stderr,
            "unorm2_getNFDInstance(): %s\n",
            u_errorName(status));
        return EXIT_FAILURE;
    }

    /* consume input as UTF-32 units one by one */
    while ((c = u_fgetcx(in)) != U_EOF)
    {
        /* Decompose c to isolate its n combining character
         * codepoints. Saves them as UTF-16 code units. FYI,
         * this function ignores the type of "norm" and always
         * denormalizes */
        n = unorm2_getDecomposition(
            norm, c, decomp, MAX_DECOMP_LEN, &status
        );

        if (U_FAILURE(status))
        {
            fprintf(stderr,
                "unorm2_getDecomposition(): %s\n",
                u_errorName(status));
            u_fclose(in);
            return EXIT_FAILURE;
        }

        /* if c does not decompose and is not itself
         * a diacritical mark */
        if (n < 0 && ublock_getCode(c) !=
            UBLOCK_COMBINING_DIACRITICAL_MARKS)
            u_fputc(c, out);

        /* walk canonical decomposition, reuse c variable */
        for (i = 0; i < n; )
        {
            /* the U16_NEXT macro iterates over UChar (aka
             * UTF-16), advancing by one or two elements as
             * needed to get a codepoint. It saves the result
             * in UTF-32. The macro updates i and c. */
            U16_NEXT(decomp, i, n, c);
            /* output only if not combining diacritical */
            if (ublock_getCode(c) !=
                UBLOCK_COMBINING_DIACRITICAL_MARKS)
                u_fputc(c, out);
        }
    }

    u_fclose(in);
    /* u_get_stdout() doesn't need to be u_fclose'd */
    return EXIT_SUCCESS;
}
```

Here’s an example of running the program:

```
$ echo "résumé façade" | ./nomarks
resume facade
```

Transformation

ICU provides a rich domain specific language for transforming strings. For example, our entire program in the previous section can be replaced by the transformation NFD; [:Nonspacing Mark:] Remove; NFC . This means to perform a canonical decomposition, remove nonspacing marks, and then canonically compose again. (In fact our program above didn’t re-compose.)

The program below echoes stdin to stdout, but passes the output through a transformation.

```c
/*** trans-stream.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

int main(int argc, char **argv)
{
    UChar32 c;
    UParseError pe;
    UFILE *in, *out;
    UTransliterator *t;
    UErrorCode status = U_ZERO_ERROR;
    UChar *xform_id;
    size_t n;

    if (argc != 2)
    {
        fprintf(stderr,
            "Usage: %s \"translation rules\"\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* the UTF-16 string should never be longer than the UTF-8
     * argv[1], so this should be safe */
    n = strlen(argv[1]) + 1;
    xform_id = malloc(n * sizeof(UChar));
    if (!xform_id)
    {
        fputs("Out of memory\n", stderr);
        return EXIT_FAILURE;
    }
    u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

    /* create transliterator by identifier */
    t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
                     NULL, -1, &pe, &status);
    /* don't need the identifier any more */
    free(xform_id);
    if (U_FAILURE(status))
    {
        fprintf(stderr, "utrans_openU(%s): %s\n",
                argv[1], u_errorName(status));
        return EXIT_FAILURE;
    }

    out = u_get_stdout();
    if (!(in = u_finit(stdin, NULL, NULL)))
    {
        fputs("Error opening stdin as UFILE\n", stderr);
        return EXIT_FAILURE;
    }

    /* transparently transliterate stdout */
    u_fsettransliterator(out, U_WRITE, t, &status);
    if (U_FAILURE(status))
    {
        fprintf(stderr,
            "Failed to set transliterator on stdout: %s\n",
            u_errorName(status));
        u_fclose(in);
        return EXIT_FAILURE;
    }

    /* what looks like a simple echo loop actually
     * transliterates characters */
    while ((c = u_fgetcx(in)) != U_EOF)
        u_fputc(c, out);

    utrans_close(t);
    u_fclose(in);
    return EXIT_SUCCESS;
}
```

As mentioned, it can emulate our earlier “nomarks” program:

```
$ echo "résumé façade" | ./trans "NFD; [:Nonspacing Mark:] Remove; NFC"
resume facade
```

It can also transliterate between scripts like this:

```
$ echo "miirekkaḍiki veḷutunnaaru?" | ./trans "Telugu"
మీరెక్కడికి వెళుతున్నఅరు?
```

Applying the transformation to a stream with u_fsettransliterator is a simple way to do things. However, I did discover and file an ICU bug which will be fixed in version 65.1.

A more robust way to apply transformations is by manipulating UChar strings directly. The technique is also probably more applicable in real applications.

Here’s a rewrite of trans-stream that operates on strings directly:

```c
/*** trans-string.c ***/
#include <stdlib.h>
#include <string.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
#include <unicode/utrans.h>

/* max number of UTF-16 code units to accumulate while looking
 * for an unambiguous transliteration. Has to be fairly long to
 * handle names in Name-Any transliteration like
 * \N{LATIN CAPITAL LETTER O WITH OGONEK AND MACRON} */
#define CONTEXT 100

int main(int argc, char **argv)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar c, *end;
    UChar input[CONTEXT] = {0}, *buf, *enlarged;
    UFILE *in, *out;
    UTransPosition pos;
    int32_t width, sizeNeeded, bufLen;
    size_t n;
    UChar *xform_id;
    UTransliterator *t;

    if (argc != 2)
    {
        fprintf(stderr,
            "Usage: %s \"translation rules\"\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* bufLen must be able to hold at least CONTEXT, and
     * will be increased as needed for transliteration */
    bufLen = CONTEXT;
    buf = malloc(sizeof(UChar) * bufLen);

    /* allocate and read identifier, like earlier example */
    n = strlen(argv[1]) + 1;
    xform_id = malloc(n * sizeof(UChar));
    if (!buf || !xform_id)
    {
        fputs("Out of memory\n", stderr);
        return EXIT_FAILURE;
    }
    u_strFromUTF8(xform_id, n, NULL, argv[1], -1, &status);

    t = utrans_openU(xform_id, -1, UTRANS_FORWARD,
                     NULL, -1, NULL, &status);
    free(xform_id);
    if (U_FAILURE(status))
    {
        fprintf(stderr, "utrans_openU(%s): %s\n",
                argv[1], u_errorName(status));
        return EXIT_FAILURE;
    }

    out = u_get_stdout();
    if (!(in = u_finit(stdin, NULL, NULL)))
    {
        fputs("Error opening stdin as UFILE\n", stderr);
        return EXIT_FAILURE;
    }

    end = input;
    /* append UTF-16 code units one at a time for incremental
     * transliteration */
    while ((c = u_fgetc(in)) != U_EOF)
    {
        /* we consider at most CONTEXT consecutive code units
         * for transliteration (minus one for \0) */
        if (end - input >= CONTEXT-1)
        {
            fprintf(stderr,
                "Exceeded max (%i) code units "
                "for context.\n",
                CONTEXT);
            break;
        }
        *end++ = c;
        *end = '\0';

        /* copy string so far to buf to operate on */
        u_strcpy(buf, input);
        pos.start = pos.contextStart = 0;
        pos.limit = pos.contextLimit = end - input;
        sizeNeeded = -1;
        utrans_transIncrementalUChars(
            t, buf, &sizeNeeded, bufLen, &pos, &status
        );
        /* if buf not big enough for transliterated result */
        if (status == U_BUFFER_OVERFLOW_ERROR)
        {
            /* utrans_transIncrementalUChars sets sizeNeeded,
             * so resize the buffer, leaving room for the
             * terminating \0 that u_strcpy writes */
            if ((enlarged = realloc(buf,
                 sizeof(UChar) * (sizeNeeded + 1))) == NULL)
            {
                fprintf(stderr,
                    "Unable to grow buffer.\n");
                /* fail gracefully and display
                 * what we can */
                break;
            }
            buf = enlarged;
            bufLen = sizeNeeded + 1;

            u_strcpy(buf, input);
            pos.start = pos.contextStart = 0;
            pos.limit = pos.contextLimit = end - input;
            sizeNeeded = -1;

            /* one more time, but with sufficient space */
            status = U_ZERO_ERROR;
            utrans_transIncrementalUChars(
                t, buf, &sizeNeeded, bufLen, &pos, &status
            );
        }
        /* handle errors other than U_BUFFER_OVERFLOW_ERROR */
        if (U_FAILURE(status))
        {
            fprintf(stderr,
                "utrans_transIncrementalUChars(): %s\n",
                u_errorName(status));
            break;
        }

        /* print buf[0 .. pos.start - 1] */
        u_printf("%.*S", pos.start, buf);

        /* Remove the code units which were processed,
         * shifting back the remaining ones which could
         * not be unambiguously transliterated. Then hit
         * the loop to get another code unit and try again. */
        u_strcpy(input, buf+pos.start);
        end = input + (pos.limit - pos.start);
    }

    /* if any leftovers from incremental transliteration */
    if (end > input)
    {
        /* transliterate input array in place, do our best */
        width = end - input;
        utrans_transUChars(
            t, input, NULL, CONTEXT, 0, &width, &status);
        u_printf("%S", input);
    }

    utrans_close(t);
    u_fclose(in);
    free(buf);
    return U_SUCCESS(status) ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Punycode

Punycode is a representation of Unicode within the limited ASCII character subset used for internet host names. If you enter a non-ASCII URL into a web browser navigation bar, the browser translates to Punycode before making the actual DNS lookup.

The encoding is part of the more general process of Internationalizing Domain Names in Applications (IDNA), which also normalizes the string.

Note that not all Unicode strings can be successfully encoded. For instance codepoints like “⒈” include a period in the glyph and are used for numbered lists. Converting that dot to the ASCII hostname would inadvertently specify a subdomain. ICU turns the offending character into U+FFFD (the “replacement character”) in the output and returns an error.

The following program uses uidna_nameToASCII or uidna_nameToUnicode as needed to translate between Unicode and Punycode.

```c
/*** puny.c ***/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* uidna stands for Internationalizing Domain Names in
 * Applications and contains punycode routines */
#include <unicode/uidna.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>

void chomp(UChar *s)
{
    /* unicode characters that split lines */
    UChar splits[] =
        {0xa, 0xb, 0xc, 0xd, 0x85, 0x2028, 0x2029, '\0'};
    if (s)
        s[u_strcspn(s, splits)] = '\0';
}

int main(int argc, char **argv)
{
    UFILE *in;
    UChar input[1024], output[1024];
    UIDNAInfo info = UIDNA_INFO_INITIALIZER;
    UErrorCode status = U_ZERO_ERROR;
    UIDNA *idna = uidna_openUTS46(UIDNA_DEFAULT, &status);

    /* default action is performing punycode */
    int32_t (*action)(
        const UIDNA*, const UChar*, int32_t, UChar*, int32_t,
        UIDNAInfo*, UErrorCode*
    ) = uidna_nameToASCII;

    if (!(in = u_finit(stdin, NULL, NULL)))
    {
        fputs("Error opening stdin as UFILE\n", stderr);
        return EXIT_FAILURE;
    }

    /* the "decode" option reverses our action */
    if (argc > 1 && strcmp(argv[1], "decode") == 0)
        action = uidna_nameToUnicode;

    /* u_fgets includes the newline, so we chomp it */
    u_fgets(input, sizeof(input)/sizeof(*input), in);
    chomp(input);

    action(idna, input, -1, output,
           sizeof(output)/sizeof(*output),
           &info, &status);

    if (U_SUCCESS(status) && info.errors != 0)
        fputs("Bad input.\n", stderr);

    u_printf("%S\n", output);

    uidna_close(idna);
    u_fclose(in);
    return 0;
}
```

Example of using the program:

    $ echo "façade.com" | ./puny
    xn--faade-zra.com

    # not every string is allowed

    $ echo "a⒈.com" | ./puny
    Bad input.
    a�.com

Changing case

The C standard library has functions like toupper which operate on a single character at a time. ICU has equivalents like u_toupper , but working on single codepoints isn’t sufficient for proper casing. Let’s examine the program and see why.

    /*** pointcase.c ***/

    #include <stdlib.h>
    #include <string.h>
    #include <unicode/uchar.h>
    #include <unicode/ustdio.h>

    int main(int argc, char **argv)
    {
        UChar32 c;
        UFILE *in, *out;
        UChar32 (*op)(UChar32) = NULL;

        /* set op to one of the casing operations
         * in uchar.h */
        if (argc < 2 || strcmp(argv[1], "upper") == 0)
            op = u_toupper;
        else if (strcmp(argv[1], "lower") == 0)
            op = u_tolower;
        else if (strcmp(argv[1], "title") == 0)
            op = u_totitle;
        else
        {
            fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
            return EXIT_FAILURE;
        }

        out = u_get_stdout();
        if (!(in = u_finit(stdin, NULL, NULL)))
        {
            fputs("Error opening stdin as UFILE\n", stderr);
            return EXIT_FAILURE;
        }

        /* operates on UTF-32 */
        while ((c = u_fgetcx(in)) != U_EOF)
            u_fputc(op(c), out);

        u_fclose(in);
        return EXIT_SUCCESS;
    }

    # not quite right, ß should become SS:

    $ echo "Die große Stille" | ./pointcase upper
    DIE GROßE STILLE

    # also wrong, final sigma should be ς:

    $ echo "ΣΊΣΥΦΟΣ" | ./pointcase lower
    σίσυφοσ

As you can see, some characters need to “expand” into more than one codepoint when their case changes, and others are position-sensitive. To do this properly, we have to operate on entire strings rather than individual characters. Here is a program to do it right:

    /*** strcase.c ***/

    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unicode/ustdio.h>
    #include <unicode/ustring.h>

    #define BUFSZ 1024

    /* wrapper function for u_strToTitle with signature
     * matching the other casing functions */
    int32_t title(UChar *dest, int32_t destCapacity,
                  const UChar *src, int32_t srcLength,
                  const char *locale, UErrorCode *pErrorCode)
    {
        return u_strToTitle(dest, destCapacity, src, srcLength,
                            NULL, locale, pErrorCode);
    }

    int main(int argc, char **argv)
    {
        UFILE *in;
        char *locale;
        UChar line[BUFSZ], cased[BUFSZ];
        UErrorCode status = U_ZERO_ERROR;
        int32_t (*op)(
            UChar*, int32_t, const UChar*, int32_t,
            const char*, UErrorCode*
        ) = NULL;

        /* casing is locale-dependent */
        if (!(locale = setlocale(LC_CTYPE, "")))
        {
            fputs("Cannot determine system locale\n", stderr);
            return EXIT_FAILURE;
        }

        if (argc < 2 || strcmp(argv[1], "upper") == 0)
            op = u_strToUpper;
        else if (strcmp(argv[1], "lower") == 0)
            op = u_strToLower;
        else if (strcmp(argv[1], "title") == 0)
            op = title;
        else
        {
            fprintf(stderr, "Unrecognized case: %s\n", argv[1]);
            return EXIT_FAILURE;
        }

        if (!(in = u_finit(stdin, NULL, NULL)))
        {
            fputs("Error opening stdin as UFILE\n", stderr);
            return EXIT_FAILURE;
        }

        /* Ideally we should change case up to the last word
         * break and push the remaining characters back for
         * a future read if the line was longer than BUFSZ.
         * Currently, if the string is truncated, the final
         * character would incorrectly be considered
         * terminal, which affects casing rules in Greek. */
        while (u_fgets(line, BUFSZ, in))
        {
            op(cased, BUFSZ, line, -1, locale, &status);

            /* if casing increases string length, and goes
             * beyond buffer size like the german ß -> SS */
            if (status == U_BUFFER_OVERFLOW_ERROR)
            {
                /* Just issue a warning and read another line.
                 * Don't treat it as severely as other errors. */
                fputs("Line too long\n", stderr);
                status = U_ZERO_ERROR;
            }
            else if (U_FAILURE(status))
            {
                fputs(u_errorName(status), stderr);
                break;
            }
            else
                u_printf("%S", cased);
        }

        u_fclose(in);
        return U_SUCCESS(status) ? EXIT_SUCCESS : EXIT_FAILURE;
    }

This works better.

    $ echo "Die große Stille" | ./strcase upper
    DIE GROSSE STILLE

    $ echo "ΣΊΣΥΦΟΣ" | ./strcase lower
    σίσυφος

Counting words and graphemes

Let’s make a version of wc (the Unix word count program) that knows more about Unicode. Our version will properly count grapheme clusters and word boundaries.

For example, regular wc gets confused by the ancient Ogham script. This was a series of notches scratched into fence posts, and has a space character which is nonblank.

    $ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | wc
    1 1 37

One word, you say? Puh-leaze, if your program can’t handle Medieval Irish carvings then I want nothing to do with it. Here’s one that can:

    /*** uwc.c ***/

    #include <locale.h>
    #include <stdlib.h>
    #include <unicode/ubrk.h>
    #include <unicode/ustdio.h>
    #include <unicode/ustring.h>

    #define BUFSZ 512

    /* line feed, vertical tab, form feed, carriage return,
     * next line, line separator, paragraph separator */
    #define NEWLINE(c) ( \
        ((c) >= 0xa && (c) <= 0xd) || \
        (c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )

    int main(void)
    {
        UFILE *in;
        char *locale;
        UChar line[BUFSZ];
        UBreakIterator *brk_g, *brk_w;
        UErrorCode status = U_ZERO_ERROR;
        long ngraph = 0, nword = 0, nline = 0;
        size_t len;

        /* word breaks are locale-specific, so we'll obtain
         * LC_CTYPE from the environment */
        if (!(locale = setlocale(LC_CTYPE, "")))
        {
            fputs("Cannot determine system locale\n", stderr);
            return EXIT_FAILURE;
        }

        if (!(in = u_finit(stdin, NULL, NULL)))
        {
            fputs("Error opening stdin as UFILE\n", stderr);
            return EXIT_FAILURE;
        }

        /* create an iterator for graphemes */
        brk_g = ubrk_open(
            UBRK_CHARACTER, locale, NULL, -1, &status);
        /* and another for the edges of words */
        brk_w = ubrk_open(
            UBRK_WORD, locale, NULL, -1, &status);

        /* yes, this is sensitive to splitting end of line
         * surrogate pairs and can be improved by our previous
         * function for reading bounded lines */
        while (u_fgets(line, BUFSZ, in))
        {
            len = u_strlen(line);

            ubrk_setText(brk_g, line, len, &status);
            ubrk_setText(brk_w, line, len, &status);

            /* Start at beginning of string, count breaks.
             * Could have been a for loop, but this looks
             * simpler to me. */
            ubrk_first(brk_g);
            while (ubrk_next(brk_g) != UBRK_DONE)
                ngraph++;

            ubrk_first(brk_w);
            while (ubrk_next(brk_w) != UBRK_DONE)
                if (ubrk_getRuleStatus(brk_w) == UBRK_WORD_LETTER)
                    nword++;

            /* count the newline if it exists */
            if (len > 0 && NEWLINE(line[len-1]))
                nline++;
        }

        /* the counters are long, so print them with %ld */
        printf("locale  : %s\n"
               "Grapheme: %ld\n"
               "Word    : %ld\n"
               "Line    : %ld\n",
               locale, ngraph, nword, nline);

        /* clean up iterators after use */
        ubrk_close(brk_g);
        ubrk_close(brk_w);
        u_fclose(in);
    }

Much better:

    $ echo "ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ " | ./uwc
    locale  : en_US.UTF-8
    Grapheme: 14
    Word    : 4
    Line    : 1

String search

When comparing strings, we can be more or less strict. A familiar example is case sensitivity, but Unicode provides other options. Testing strings for equality is a degenerate case of sorting, which must not only decide whether two strings are equal but also put unequal ones in order. Sorting is called “collation” and the Unicode collation algorithm supports multiple levels of increasing strictness.

    Level        Description
    Primary      base characters
    Secondary    accents
    Tertiary     case/variant
    Quaternary   punctuation

Each level acts as a tie-breaker when strings match in previous levels. When searching we can choose how deep to check before declaring strings equal. To illustrate, consider a text file called words.txt containing these words:

    Cooperate
    coöperate
    COÖPERATE
    co-operate
    final
    ﬁdes

We will write a program called ugrep , where we can specify a comparison level and search string. If we search for “cooperate” and allow comparisons up to the tertiary level it matches nothing:

    $ ./ugrep 3 cooperate < words.txt
    # it's an exact match, no results

It is possible to shift certain “ignorable” characters (like ‘-’) down to the quaternary level while conducting the original level 3 search:

    $ ./ugrep 3i cooperate < words.txt
    4: co-operate

Doing the same search at the secondary level disregards case, but is still sensitive to accents.

    $ ./ugrep 2 cooperate < words.txt
    1: Cooperate

Once again, we can allow ignorables at this level.

    $ ./ugrep 2i cooperate < words.txt
    1: Cooperate
    4: co-operate

Finally, going only to the primary level, we match words with the same base letters, modulo case and accents.

    $ ./ugrep 1 cooperate < words.txt
    1: Cooperate
    2: coöperate
    3: COÖPERATE

Note that the idea of a “base character” is dependent on locale. In Swedish, the letters o and ö are quite distinct, and not minor variants as in English. Setting the locale prior to search restricts the results even at the primary level.

    $ LC_COLLATE=sv_SE ./ugrep 1 cooperate < words.txt
    1: Cooperate

One note about the tertiary level. It distinguishes not just case, but ligature presentation forms.

    $ ./ugrep 3 ﬁ < words.txt
    6: ﬁdes

    # vs

    $ ./ugrep 2 ﬁ < words.txt
    5: final
    6: ﬁdes

Pretty flexible, right? Let’s see the code.

    /*** ugrep.c ***/

    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unicode/ucol.h>
    #include <unicode/usearch.h>
    #include <unicode/ustdio.h>
    #include <unicode/ustring.h>

    #define BUFSZ 1024

    int main(int argc, char **argv)
    {
        char *locale;
        UFILE *in;
        UCollator *col;
        UStringSearch *srch = NULL;
        UErrorCode status = U_ZERO_ERROR;
        UChar *needle, line[BUFSZ];
        UColAttributeValue strength;
        int ignoreInsignificant = 0, asymmetric = 0;
        size_t n;
        long i;

        if (argc != 3)
        {
            fprintf(stderr,
                "Usage: %s {1,2,@,3}[i] pattern\n", argv[0]);
            return EXIT_FAILURE;
        }

        /* cryptic parsing for our cryptic options */
        switch (*argv[1])
        {
            case '1':
                strength = UCOL_PRIMARY;
                break;
            case '2':
                strength = UCOL_SECONDARY;
                break;
            case '@':
                strength = UCOL_SECONDARY, asymmetric = 1;
                break;
            case '3':
                strength = UCOL_TERTIARY;
                break;
            default:
                fprintf(stderr,
                    "Unknown strength: %s\n", argv[1]);
                return EXIT_FAILURE;
        }
        /* length of argv[1] is >0 or we would have died */
        ignoreInsignificant =
            argv[1][strlen(argv[1])-1] == 'i';

        n = strlen(argv[2]) + 1;
        /* if UTF-8 could encode it in n, then UTF-16
         * should be able to as well */
        needle = malloc(n * sizeof(*needle));
        u_strFromUTF8(needle, n, NULL, argv[2], -1, &status);

        /* searching is a degenerate case of collation,
         * so we read the LC_COLLATE locale */
        if (!(locale = setlocale(LC_COLLATE, "")))
        {
            fputs("Cannot determine system collation locale\n",
                  stderr);
            return EXIT_FAILURE;
        }

        if (!(in = u_finit(stdin, NULL, NULL)))
        {
            fputs("Error opening stdin as UFILE\n", stderr);
            return EXIT_FAILURE;
        }

        col = ucol_open(locale, &status);
        ucol_setStrength(col, strength);

        if (ignoreInsignificant)
            /* shift ignorable characters down to
             * quaternary level */
            ucol_setAttribute(col, UCOL_ALTERNATE_HANDLING,
                              UCOL_SHIFTED, &status);

        /* Assumes all lines fit in BUFSZ. Should
         * fix this in real code and not increment i */
        for (i = 1; u_fgets(line, BUFSZ, in); ++i)
        {
            /* first time through, set up all options */
            if (!srch)
            {
                srch = usearch_openFromCollator(
                    needle, -1, line, -1,
                    col, NULL, &status
                );
                if (asymmetric)
                    usearch_setAttribute(
                        srch, USEARCH_ELEMENT_COMPARISON,
                        USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD,
                        &status
                    );
            }
            /* afterward just switch text */
            else
                usearch_setText(srch, line, -1, &status);

            /* check if keyword appears in line */
            if (usearch_first(srch, &status) != USEARCH_DONE)
                u_printf("%ld: %S", i, line);
        }

        usearch_close(srch);
        ucol_close(col);
        u_fclose(in);
        free(needle);

        return EXIT_SUCCESS;
    }

Comparing strings modulo normalization

In the concepts section, we saw a single grapheme can be constructed with different combinations of codepoints. In many cases when comparing strings for equality, we’re most interested in the strings being perceived by the user in the same way rather than a simple byte-for-byte match.

The ICU library provides a unorm_compare function which returns a value similar to strcmp, and acts in a normalization-independent way. It normalizes both strings incrementally while comparing them, so it can stop early if it finds a difference.

Here is code to check that the five ways of representing ộ are equivalent:

    #include <stdio.h>
    #include <unicode/unorm2.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UChar s[][4] = {
            {0x006f, 0x0302, 0x0323, 0},
            {0x006f, 0x0323, 0x0302, 0},
            {0x00f4, 0x0323, 0, 0},
            {0x1ecd, 0x0302, 0, 0},
            {0x1ed9, 0, 0, 0}
        };

        const size_t n = sizeof(s)/sizeof(s[0]);
        size_t i;

        for (i = 0; i < n; ++i)
            printf("%zu == %zu: %d\n", i, (i+1)%n,
                unorm_compare(
                    s[i], -1, s[(i+1)%n], -1, 0, &status));
    }

Output:

    0 == 1: 0
    1 == 2: 0
    2 == 3: 0
    3 == 4: 0
    4 == 0: 0

A return value of 0 means the strings are equal.

Confusable strings

Because Unicode introduces so many graphemes, there are more possibilities for scammers to confuse people using lookalike glyphs. For instance, domains like adoḅe.com or pаypal.com (with Cyrillic а) can direct unwary visitors to phishing sites. ICU contains an entire module for detecting “confusables,” those strings which are known to look too similar when rendered in common fonts. Each string is assigned a “skeleton” such that confusable strings get the same skeleton.

For an example, see my utility utofu. It has a little extra complexity with sqlite access code, so I am not reproducing it here. It’s designed to check Unicode strings to detect changes over time that might be spoofing.

The method of operation is this:

1. Read line as UTF-8
2. Convert to Normalization Form C for consistency
3. Calculate skeleton string
4. Insert UTF-8 version of normalized input and its skeleton into a database if the skeleton doesn’t already exist
5. Compare the normalized input string to the string in the database having corresponding skeleton. If not an exact match die with an error.

Further reading

Unicode and internationalization is a huge topic. I could only scratch the surface in this article. I read and enjoyed sections from these books and reference materials, and would recommend them: