Processing Text Files in Python 3¶

A recent discussion on the python-ideas mailing list made it clear that we (i.e. the core Python developers) need to provide some clearer guidance on how to handle text processing tasks that trigger exceptions by default in Python 3, but were previously swept under the rug by Python 2’s blithe assumption that all files are encoded in “latin-1”.

While we’ll have something in the official docs before too long, this is my own preliminary attempt at summarising the options for processing text files, and the various trade-offs between them.

What changed in Python 3?¶

The obvious question to ask is what changed in Python 3 so that the common approaches developers used for text processing in Python 2 now throw UnicodeDecodeError and UnicodeEncodeError in Python 3.

The key difference is that the default text processing behaviour in Python 3 aims to detect text encoding problems as early as possible - either when reading improperly encoded text (indicated by UnicodeDecodeError) or when being asked to write out a text sequence that cannot be correctly represented in the target encoding (indicated by UnicodeEncodeError). This contrasts with the Python 2 approach, which allowed data corruption by default and required strict correctness checks to be requested explicitly. That could certainly be convenient when the data being processed was predominantly ASCII text, and the occasional bit of data corruption was unlikely to be even detected, let alone cause problems, but it’s hardly a solid foundation for building robust multilingual applications (as anyone that has ever had to track down an errant UnicodeError in Python 2 will know).

However, Python 3 does provide a number of mechanisms for relaxing the default strict checks in order to handle various text processing use cases (in particular, use cases where “best effort” processing is acceptable, and strict correctness is not required). This article aims to explain some of them by looking at cases where it would be appropriate to use them.

Note that many of the features I discuss below are available in Python 2 as well, but you have to explicitly access them via the unicode type and the codecs module. In Python 3, they’re part of the behaviour of the str type and the open builtin.
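This fail-fast behaviour is easy to see at the interactive prompt. A minimal sketch (the byte value and sample text below are arbitrary illustrations, not from any particular application):

```python
# 0xFF can never appear in well-formed UTF-8 data, so decoding fails fast.
data = b"caf\xff"
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

# ASCII cannot represent U+00E9 ("é"), so encoding also fails fast.
try:
    "caf\u00e9".encode("ascii")
except UnicodeEncodeError as exc:
    print("encode failed:", exc)
```

In Python 2, the equivalent `str`/`unicode` operations would often silently succeed (or corrupt data) until a non-ASCII value finally surfaced somewhere unexpected.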

Unicode Basics¶

To process text effectively in Python 3, it’s necessary to learn at least a tiny amount about Unicode and text encodings:

- Python 3 always stores text strings as sequences of Unicode code points. These are values in the range 0-0x10FFFF. They don’t always correspond directly to the characters you read on your screen, but that distinction doesn’t matter for most text manipulation tasks.
- To store text as binary data, you must specify an encoding for that text. The process of converting from a sequence of bytes (i.e. binary data) to a sequence of code points (i.e. text data) is decoding, while the reverse process is encoding.
- For historical reasons, the most widely used encoding is ascii, which can only handle Unicode code points in the range 0-0x7F (i.e. ASCII is a 7-bit encoding).
- There are a wide variety of ASCII compatible encodings, which ensure that any appearance of a valid ASCII value in the binary data refers to the corresponding ASCII character.
- “utf-8” is becoming the preferred encoding for many applications, as it is an ASCII-compatible encoding that can encode any valid Unicode code point.
- “latin-1” is another significant ASCII-compatible encoding, as it maps byte values directly to the first 256 Unicode code points. (Note that Windows has its own “latin-1” variant called cp1252, but, unlike the ISO “latin-1” implemented by the Python codec with that name, the Windows-specific variant doesn’t map all 256 possible byte values.)
- There are also many ASCII incompatible encodings in widespread use, particularly in Asian countries (which had to devise their own solutions before the rise of Unicode) and on platforms such as Windows, Java and the .NET CLR, where many APIs accept text as UTF-16 encoded data.
- The locale.getpreferredencoding() call reports the encoding that Python will use by default for most operations that require an encoding (e.g. reading in a text file without a specified encoding). This is designed to aid interoperability between Python and the host operating system, but can cause problems with interoperability between systems (if encoding issues are not managed consistently).
- The sys.getfilesystemencoding() call reports the encoding that Python will use by default for most operations that both require an encoding and involve textual metadata in the filesystem (e.g. determining the results of os.listdir()).

If you’re a native English speaker residing in an English speaking country (like me!) it’s tempting to think “but Python 2 works fine, why are you bothering me with all this Unicode malarkey?”. It’s worth trying to remember that we’re actually a minority on this planet and, for most people on Earth, ASCII and latin-1 can’t even handle their name, let alone any other text they might want to write or process in their native language.
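The encode/decode relationship and the reported defaults can be checked directly (the sample string here is arbitrary, and the printed default encodings are platform dependent):

```python
import locale
import sys

# Encoding converts text (code points) to bytes; decoding reverses it.
text = "Grüße"                  # contains the non-ASCII code points ü and ß
encoded = text.encode("utf-8")  # str -> bytes
assert encoded.decode("utf-8") == text

# latin-1 maps byte values 0-255 directly to the first 256 code points,
# so *any* byte sequence decodes without error (even when the result is wrong).
assert bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))

# The defaults Python falls back on when no encoding is specified:
print(locale.getpreferredencoding())  # platform dependent
print(sys.getfilesystemencoding())    # platform dependent
```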

Unicode Error Handlers¶

To help standardise various techniques for dealing with Unicode encoding and decoding errors, Python includes a concept of Unicode error handlers that are automatically invoked whenever a problem is encountered in the process of encoding or decoding text. I’m not going to cover all of them in this article, but three are of particular significance:

- strict: this is the default error handler, which just raises UnicodeDecodeError for decoding problems and UnicodeEncodeError for encoding problems.
- surrogateescape: this is the error handler that Python uses for most OS facing APIs to gracefully cope with encoding problems in the data supplied by the OS. It handles decoding errors by squirreling the data away in a little used part of the Unicode code point space (for those interested in more detail, see PEP 383). When encoding, it translates those hidden away values back into the exact original byte sequence that failed to decode correctly. Just as this is useful for OS APIs, it can make it easier to gracefully handle encoding problems in other contexts.
- backslashreplace: this is an encoding error handler that converts code points that can’t be represented in the target encoding to the equivalent Python string numeric escape sequence. It makes it easy to ensure that UnicodeEncodeError will never be thrown, but doesn’t lose much information while doing so (since we don’t want encoding problems hiding error output, this error handler is enabled on sys.stderr by default).
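The behaviour of these three handlers can be compared on a small example (the byte sequence below is just latin-1 encoded sample text, chosen because it is invalid as UTF-8):

```python
data = b"L\xf6wis"   # "Löwis" encoded as latin-1; not valid UTF-8

# strict (the default): detect the problem immediately
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("strict: raised UnicodeDecodeError")

# surrogateescape: smuggle the undecodable byte through as a lone surrogate...
text = data.decode("utf-8", errors="surrogateescape")
assert text == "L\udcf6wis"   # the 0xF6 byte became the code point U+DCF6

# ...and recover the exact original byte sequence when encoding back out
assert text.encode("utf-8", errors="surrogateescape") == data

# backslashreplace: replace unencodable code points with escape sequences
print("L\u00f6wis".encode("ascii", errors="backslashreplace"))  # b'L\\xf6wis'
```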
The Binary Option¶

One alternative that is always available is to open files in binary mode and process them as bytes rather than as text. This can work in many cases, especially those where the ASCII markers are embedded in genuinely arbitrary binary data. However, for both “text data with unknown encoding” and “text data with known encoding, but potentially containing encoding errors”, it is often preferable to get them into a form that can be handled as text strings. In particular, some APIs that accept both bytes and text may be very strict about the encoding of the bytes they accept (for example, the urllib.parse module accepts only pure ASCII data for processing as bytes, but will happily process text strings containing non-ASCII code points).
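A sketch of the binary approach (the file name and marker below are hypothetical): opening with mode "rb" yields bytes, so no decoding step ever runs and no UnicodeDecodeError can occur, while ASCII markers still match reliably as byte literals:

```python
import os
import tempfile

# Create a sample file mixing an ASCII marker with arbitrary binary data
# (file name and contents are hypothetical).
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"HEADER\xff\xfe\x00 arbitrary payload bytes\n")

# Binary mode: read() returns bytes, not str, so nothing is decoded.
with open(path, "rb") as f:
    raw = f.read()

assert raw.startswith(b"HEADER")   # ASCII markers are safe to match as bytes
print(raw.split(b" ")[1:])         # splitting/slicing also works on bytes
```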