Handling Text Correctly

Software53 Jun 01, 2013 – Filed under:

A remarkable number of programs do not handle text in a reasonable fashion, which causes those programs to break when confronted with non-English characters and symbols. Even the standard libraries of most programming languages are not immune.1

This chapter describes the foundation of how text handling works in software and gives examples of common pitfalls when working with text in real code. Upon completion you should be able to evaluate your favorite programming environment’s built-in support for handling text and be able to write programs that handle text correctly and consistently across operating systems.

Characters, Sets, and Encodings

Text is made up of multiple characters (or codepoints2), each of which represents a different letter, symbol or punctuation mark. The word “Hello”, for example is made of the characters “H”, “e”, “l”, “l”, and “o”. This collection of characters is called a string. Each character is assigned a number using a character set (sometimes called a code page or a charset).

For example the ASCII character set assigns numbers in the range (0-127) for characters in most Western European languages.

H → 72

e → 101

l → 76

l → 76

o → 111

ASCII is one of the oldest character sets. Most other sets use the same mappings as ASCII, while defining additional mappings of their own for numbers above 127.

After mapping a character to a number, that number is converted to an actual byte sequence for storage. The entire process of converting a character to a byte sequence is defined by a character encoding.

Since many character sets provide mappings to numbers in the range 0-255 (which fit in 8-bit bytes), you can output the character number as a single byte. For example, to output “Hello” in ASCII, you would emit the byte sequence:

H e l l o 72 101 76 76 111

Many English-speakers, being only familiar with the Western European character sets which are always encoded with individual 8-bits bytes, often use the terms character set and character encoding interchangably.

However the East Asian languages such as Chinese, Japanese, and Korean (CJK) have many more than 256 characters, making it impossible to fit them in only 8-bit bytes. Therefore each character must be encoded using multiple bytes, possibly a variable number of bytes. Thus either a fixed-width encoding (often 16-bits per character) or a variable width encoding (with differing numbers of bytes per character) may be used.

Important Character Sets

The following are the most prevalent non-Unicode character sets you are likely to encounter. All of these character sets are encoded with a single byte per character.

Character Set Character Encoding ASCII single byte (7-bit) Windows-1252 (Windows Latin 1) single byte ISO 8859-1 (ISO Latin 1) single byte Mac OS Roman single byte

Informally (and in practice), these may also be described as character encodings. So a text file “in ASCII encoding” refers to a file in the ASCII character set with its standard single-byte encoding.

ASCII 7-bit character set for representing characters common to most Western European languages. ⚠ Many programs and APIs that claim to input or output “ASCII” (or “ANSI” 3 ) are actually unaware of character sets and will accept strings in whatever the operating system’s default character set happens to be (often Windows-1252).

Windows-1252 (AKA Windows Latin 1 ) This is the default character set on English Windows systems.

(AKA ) ISO 8859-1 (AKA ISO Latin 1 ) The default character set for web pages and other HTTP traffic. ⚠ In practice this is very frequently confused with Windows-1252 , which differs in only a handful of characters. In fact this confusion is to such a degree that the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed as Windows-1252. 4

(AKA ) Mac OS Roman This is the default character set on most Macintosh systems prior to Mac OS X.



Not every character set can represent every character in the world. For example Windows-1252 cannot represent any Chinese or Japanese characters, although GBK and Shift JIS respectively can. This means that you cannot in general mix text from two sources that uses a different character sets without converting to a new character set (that can represent all characters in both sets). Typically the Unicode character set (discussed below) is used for this purpose.

Character Set Character Encoding Windows-932 (“Shift JIS”) variable width, 1-2 bytes Windows-936 (“GBK”) variable width, 1-2 bytes

Unicode

Unicode is a character set just like ASCII or Windows-1252: it maps characters to numbers.5 However it is sufficiently important to justify special mention.

The Unicode character set is designed to support all characters in all character sets that came before it, plus many more. If you can think of a character, it’s almost certainly in Unicode. (And if it isn’t, it’s likely not in any character set.) In this sense, Unicode can be viewed as the universal character set.

Any program written today that wishes to represent characters correctly should be using the Unicode character set and one of its associated encodings.

Originally, Unicode mapped characters to numbers in the range 0x0000-0xFFFF, requiring only 16 bits to represent each character. At that time it was possible to encode each Unicode character using a 2-byte fixed width encoding, known as UCS-2. Many early Unicode-aware systems, such as Java, were designed around this original specification.

In 1996, Unicode was extended to map characters to numbers in the larger range of 0x00000-0x10FFFF, requiring up to 20 bits per character. Thus UCS-2, being limited to 16-bits, was no longer capable of representing all Unicode characters. In its place arose the UTF-16 encoding which, like UCS-2, uses a single 16-bit value to represent characters in range 0x0000-0xFFFF (the basic multilingual plane) and two 16-bit values to represent characters in range 0x10000-0x10FFFF (the supplemental planes). And so many UCS-2 systems were retroactively upgraded to use UTF-16 in place of UCS-2.

Today, UTF-16 is the most common in-memory representation for Unicode strings. However, many programs incorrectly treat individual 16-bit values from UTF-16 directly as characters, due to ignorance of UTF-16’s variable-width nature. In many cases this causes no problems, since most programs operate on strings and substrings opaquely, as opposed to working with individual characters. However problems will arise if unaware programs attempt to manipulate characters directly, such as by counting the number of characters in a string or by filtering individual characters.

Unicode Character Encodings

Unlike the other character sets discussed previously, the Unicode character set has multiple different encodings.

Character Set Character Encoding Character Encoding Scheme Unicode UTF-8 variable width, 1-4 bytes UCS-2 fixed width, 2 bytes UTF-16 variable width, 2 or 4 bytes UTF-32 / UCS-4 fixed width, 4 bytes

UTF-8 This is the default on-disk encoding for most modern Unicode-aware programs. Many text editors save in this format by default. Python 2.5+ source files are assumed to be UTF-8 by default. It can represent any character in the Unicode character set, and thus any character in the world. It is compact, representing all ASCII characters as single bytes and most other characters as two bytes. ⚠ Programs may optionally prepend a byte-order-mark (BOM) at the beginning of a file to mark it as UTF-8. Most Windows programs do this, for example. Programs that input UTF-8 files should be prepared to handle BOMs.

UCS-2 This is the native encoding used by early Unicode-aware systems, such as Java 1.4 and below. It can only represent Unicode characters in the basic plane (0x0000-0xFFFF), but no higher.

In particular supplementary characters in the supplementary planes (0x10000-0x10FFFF) cannot be represented. These are sometimes called astral characters . Many APIs that originally only supported UCS-2 were retroactively upgraded to use UTF-16.

UTF-16 This is the default in-memory encoding for most modern Unicode-aware systems, including Windows, Mac OS X, the Java 1.5+ runtime 6 , the .NET runtime (including C#), and Python 2.x 7 . It can represent any character in the Unicode character set. Characters in the basic plane (0x0000-0xFFFF) are encoded as a single 16-bit value.

Supplementary characters (0x10000-0x10FFFF) are encoded as two 16-bit surrogate values. ⚠ It is common for code to incorrectly manipulate UTF-16 data as if it were fixed-width UCS-2 instead. ⚠ Programs may optionally prepend a byte-order-mark (BOM) at the beginning of a file to mark it as UTF-16 or to specify a byte-ordering other than big-endian. Programs that input UTF-16 files should be prepared to handle BOMs. ⚠ Some outdated documentation and APIs may refer to the UTF-16 encoding as the “Unicode encoding”. Notably C#’s UnicodeEncoding, Mac OS X’s NSUnicodeStringEncoding or Python 2.2-3.2’s unicode type on “narrow” builds, which are the default.

UTF-32 / UCS-4 This is a fixed width encoding capable of representing any Unicode character directly as a 32-bit value. Because it is space inefficient, this encoding is rarely seen in practice. Python 2.2-3.2 offers “wide” builds that use UTF-32. ⚠ Programs may optionally prepend a byte-order-mark (BOM) at the beginning of a file to mark it as UTF-32 or to specify a byte-ordering other than big-endian. Programs that input UTF-32 files should be prepared to handle BOMs.



Common Mistakes & Practical Tips

You cannot interpret a byte array as a string without knowing its encoding.

Reading Text Files

You cannot read a text file correctly without knowing its encoding.

If you do not specify an encoding explicitly when opening a text file, your language’s standard library or operating system will usually pick a default encoding, which depends on the spoken language it is running in, among other factors.

Unfortunately most filesystems do not store the encoding of a text file.8 So there are a few options for determining an encoding:

Define a particular encoding as the expected input. For example, a receipt-processing program may explicitly document UTF-8 as the encoding for its input files.

Look at the contents of the input file to auto-detect the encoding: You could detect UTF byte-order-marks at the beginning of a file to automatically assume one of the UTF encodings. You could define a special syntax at the beginning of the input files to indicate the encoding. XML uses a prelude at the top of a file to indicate what encoding it uses. For example <?xml version="1.0" encoding="windows-1252"?> specifies that the document is in Window-1252 encoding. (Of course to even read this initial text, you have to make the working assumption that the top of the file is some encoding that is a superset of ASCII.) Python source files use a line such as # -*- coding: utf-8 -*- to indicate the encoding is other than the default. (Python 2.0-2.4 uses Windows Latin 1 as the default encoding; Python 2.5-2.7 uses ASCII, Python 3.x uses UTF-8.) Failing these options, you could fall back to a defined encoding (such as ASCII for Python source files) or to the operating system’s default encoding (which can vary).

Always use the operating system’s default encoding. This is often the behavior if you use your favorite programming language’s default mechanism for reading a file, such as Java’s FileReader class. (Interestingly, C#’s StreamReader and StreamWriter classes always use UTF-8 instead of the operating system default.)



Converting Between Bytes and Strings

You cannot correctly convert a byte array to a string without specifying the encoding to use.

Unfortunately many languages allow you to omit the encoding, and then will try to guess the encoding (usually incorrectly) if you fail to specify it.

Consider the following Java program:

byte[] footBytes = {'f', 'u', (byte)0xC3, (byte)0x9F}; String footString = new String(footBytes); // WRONG: OS-dependent

This program will decode different strings on different operating systems! On Mac OS X and Linux where the platform’s default encoding is UTF-8, the correct result (“fuß”) will be obtained since the original bytes were encoded in UTF-8. However on Windows the bogus result “fuÃŸ” will be decoded because the default encoding is Windows-1252.

Here’s the fixed program, which specifies the UTF-8 encoding explicitly:

byte[] footBytes = {'f', 'u', (byte)0xC3, (byte)0x9F}; String footString = new String(footBytes, "UTF-8"); // CORRECT

As another example, consider the Java InputStreamReader and FileReader classes, both of which convert from byte streams to character streams.

byte[] footBytes = {'f', 'u', (byte)0xC3, (byte)0x9F}; InputStream footStream = new ByteArrayInputStream(footBytes); Reader footReader = new InputStreamReader(footStream); // WRONG: OS-dependent

Or the even more innocent-looking:

Reader footReader = new FileReader("foot.txt"); // WRONG: OS-dependent

Both of these examples are wrong for the same reason: they don’t specify the encoding.

The former example can be fixed by adding "UTF-8" as the second constructor argument:

byte[] footBytes = {'f', 'u', (byte)0xC3, (byte)0x9F}; InputStream footStream = new ByteArrayInputStream(footBytes); Reader footReader = new InputStreamReader(footStream, "UTF-8"); // CORRECT

Fixing the second example unfortunately requires using an entirely different class since FileReader has no constructer with an encoding parameter.

Reader footReader = new InputStreamReader( new FileInputStream("foot.txt"), "UTF-8"); // CORRECT

Of course the same problems happen when encoding a string to a byte stream:

String footString = "fu\u00DF"; byte[] footBytes = footString.getBytes(); // WRONG: OS-dependent

And when encoding a character stream to a byte stream:

ByteArrayOutputStream footStream = new ByteArrayOutputStream(); Writer footWriter = new OutputStreamWriter(footStream); // WRONG: OS-dependent footWriter.write("fu\u00DF"); byte[] footBytes = footStream.toByteArray();

And when writing to text files:

Writer footWriter = new FileWriter("foot.txt"); // WRONG: OS-dependent footWriter.write("fu\u00DF");

A “char” (in your favorite language) is probably not a character.

Many programming languages have a “char” datatype that is intended for representing a character. Usually this “char” datatype could do this effectively at the time the language was written but not in the present day, as the notion of a character has been extended over time.

8-bit chars (C/C++)

In C/C++, a “char” holds one byte. When C was first invented, 8-bit fixed-width character encodings were the norm. Therefore a single “char” was able to represent a single character precisely. However with the advent of CJK languages and multi-byte encodings, this no longer worked. Therefore a C string by itself today can only be safely interpreted as a raw byte stream. As mentioned above, you can only process it properly if you know what encoding it is in.

Without any further information, C string is often assumed to be in the operating system’s default encoding, although you cannot be sure. The correct encoding to use depends on where the string was input from.

A program can work with strings in a few ways:

Choose a particular in-memory encoding that all functions should use. All foreign strings will be converted to this encoding at the time of input (regardless of source). And upon output, strings will be converted to the appropriate proper output encoding. UTF-8 and UTF-16 are both good candidates for such an in-memory encoding since they can both represent the full repertoire of Unicode characters. Therefore you won’t lose any data by converting to/from them. UTF-8 is compact and a superset of ASCII, so you can pass UTF-8 strings to brain-dead functions that are encoding unaware and get correct behavior as long as only ASCII characters are being used. UTF-16 is convenient because functions that are unaware of supplementary characters will still get correct behavior as long as basic-plane Unicode characters are used, which are the most common.

Pass around the text encoding around with the underlying byte array, possibly with a custom string datatype. No encoding conversion overhead with reading the input stream. Cannot mix text with different encodings. Ruby takes this approach with its built-in String type.

Blithely ignore encoding issues altogether and get unexpected results when working with international characters. Many programs written in languages where the default string type is not Unicode take this option out of ignorance. In particular this includes many programs in C, PHP, and Python 2.x .



Gotchas:

Beware of methods that take exactly one char or return exactly one char . They almost certainly aren’t aware of non-ASCII or Unicode characters.

or return exactly one . Beware of any libraries that don’t document what text encoding it assumes. They almost certainly aren’t aware of non-ASCII or Unicode characters.



16-bit chars (Java, C#, Objective-C, Python 2.2-3.2, C/C++’s wchar )

Java and C#’s char are 16-bits wide. So are C/C++’s wchar and Mac OS X’s unichar . And so are the elements of a Python 2.x string when it is compiled in the default “narrow” mode.

16-bits is sufficient to hold a Unicode character in the basic plane (0x0000-0xFFFF) but not an supplementary character in a supplemental plane (0x10000-0x10FFFF). In the case of these languages, a char represents a single UTF-16 code unit (i.e. either a character in the basic plane or a surrogate) as opposed to an actual character.

Therefore text-aware programs in these languages need to be particularly careful to deal with supplementary characters correctly, since those characters cannot fit into a single char variable.

Here is a typical Java program that is unaware of supplementary characters:

String str = "Hello"; for (int i=0, n=str.length(); i<n; i++) { // WRONG: Does not handle characters outside the basic plane (0x0000-0xFFFF) char c = str.charAt(i); // ... Do something with the character, like filtering out invalid characters. }

And here is a much-longer but correct version that correctly identifies surrogates and decodes them to supplementary characters correctly:

String str = "Hello"; for (int i=0, n=str.length(); i<n; i++) { char c1 = str.charAt(i); // CORRECT. Handles all Unicode characters. int codepoint; if (Character.isHighSurrogate(c1)) { if (i+1 < n) { char c2 = str.charAt(i+1); if (Character.isLowSurrogate(c2)) { // Surrogate pair codepoint = Character.toCodePoint(c1, c2); i++; } else { // High-surrogate alone codepoint = (int) c1; } } else { // High-surrogate alone at end of string codepoint = (int) c1; } } else { // Not a surrogate pair codepoint = (int) c1; } // ... Do something with the character, like filtering out invalid characters. }

Gotchas:

Beware of methods that take exactly one char or return exactly one char . They almost certainly aren’t aware of supplementary characters.

or return exactly one . You can’t iterate over the characters in a “string” by iterating over the char s. Instead you have to use a loop like the above to iterate over the true characters.

This example stores the true character in the codepoint variable.

s. You can’t get the i th character of a “string” by getting the i th char . No general workaround.

character of a “string” by getting the i .

32-bit chars and variable-bit chars (Python 3.3+, Haskell)

If you’re fortunate enough to work in an environment with 32-bit or variable-bit chars then your char is in fact a character. Horray!

The only popular environment I know of with real characters is Python 3.3+, or Python 2.2-3.2 when configured to be in “wide” mode (which is not the default).

End of Line Sequences

Improper handling of end-of-line (EOL) sequences is not uncommon.

There are three common ways to end a line:

Linux:

(line feed alone)

(line feed alone) Windows: \r

(carriage return + line feed)

(carriage return + line feed) Mac OS 9: \r (carriage return alone)

It is possible for multiple styles to occur in the same file or string.

It should also be noted that the last line in a file or string might or might not be followed by an EOL sequence. Therefore you can’t assume that every line ends with an EOL sequence.

As the following examples demonstrate, you need to read your language’s documentation carefully if you want to process lines in a consistent fashion.

Java Line Reader Example

Consider the following Java program:

// Prints the specified file to standard output. public static void main(String[] args) { String filePath = args[0]; BufferedReader lineReader = new BufferedReader(new FileReader(filePath, "UTF-8")); try { String nextLine; while ((nextLine = lineReader.readLine()) != null) { System.out.println(nextLine); } } finally { lineReader.close(); } }

The BufferedReader class can deal with all end-of-line sequences. Therefore this program is resilient against mixed input.

However println (in the PrintWriter class) emits the OS-specific end-of-line sequence, which means that this program will have different output on different operating systems. Not necessarily what you’d expect.

Python 2.x Line Reader Example

Consider the following Python 2.x program:

import codecs file_path = sys.argv[1] with codecs.open(file_path, 'rb', 'utf-8') as stream: for line_with_terminator in stream: line = line_with_terminator.rstrip(u'\r

') # remove any trailing '\r' and '

's print line

Notice that in the Python version it is necessary to explicitly remove the \r and

characters, since Python’s line iteration behavior is to return the entire line plus the end-of-line sequence (if available).

Python’s print statement always uses

as the end-of-line sequence, regardless of what OS it is running on. It’s nice that this is a consistent behavior, but it might not be what you expect if you are developing on Windows.

Here is perhaps a more typical program that fails to handle the last line correctly if it doesn’t end with an EOL sequence:

import codecs file_path = sys.argv[1] with open(file_path, 'rU') as stream: # the U mode converts all line endings to '

' for line_with_terminator in stream: # WRONG: If last line lack an EOL this will chop off its trailing character improperly line = line_with_terminator[:-1] # remove trailing '

' # WRONG: Treating a bytestring as if it were a Unicode string print line

Errors like this explain why lots of Unix programs warn about or get confused by files that don’t end with a final EOL.

And here is another typical variation that does not handle end-of-line sequences properly:

import codecs file_path = sys.argv[1] with open(file_path, 'rb') as stream: for line_with_terminator in stream: # WRONG: Assumes EOL is one byte long, which is incorrect on Windows line = line_with_terminator[:-1] # remove trailing '

' # WRONG: Treating a bytestring as if it were a Unicode string print line

This kind of program will get extra \r characters on the end of each line when run on Windows. It also fails to handle final lines that lack an EOL.

Summary

Working with text is tricky. Your programming language probably has default handling that isn’t quite what you want (or expect) so always read the documentation carefully. And if your program is intended to be usable in multiple languages, you actually should write tests that check for proper handling of Unicode characters.

Series This article is part of the Programming for Perfectionists series.

Updates: