Notes for using Unicode with Python 2.x

Python is very Unicode friendly, but there are still a few quirks that people new to the language (or not so new!) need to assimilate in order to use Unicode effectively.

To avoid going over old ground, for a primer please see this excellent article on using Unicode with Python. Here, I want to talk about some of the corner cases remaining after you've absorbed the great advice in that article.

This is not, of course, to say that Unicode support in Python is in any way buggy. Nay, Python's Unicode support is a unique snowflake, perfect in its own special way. It's just us flawed humans who have trouble appreciating fully its snowy beauty, especially if we're not Dutch.

And of course, all strings are Unicode in Python 3.0. That and the new syntax for extended iterable unpacking are the two main reasons I'm looking forward to Python 3.0. But alas, we'll have to enjoy the unique aspects of Unicode in Python 2 for a while yet.

Input

I like to keep my programs as bastions of sanity, where all text is handled as Unicode. I thus try to put gatekeepers on all code accepting input, passing it on to the rest of the program logic as Unicode.

Programs that fail to do this often break when dealing with text input that they were sure would be fine as "ascii." One example of this is file paths. Programmers generally expect paths to be in nice, ASCII characters, and that's why their scripts often break when I run them on my Japanese system. For example, on my system the Desktop folder contains Japanese characters:

C:\Documents and Settings\Ryan Ginstrom\デスクトップ\

When a random Python script breaks when run from my Desktop folder, I peek inside, and it's invariably because the programmer never expected the path to contain characters that couldn't be expressed as ASCII.

Put it into Unicode as soon as you get it.
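As a small illustration of that gatekeeper habit (my own sketch, not from the article above): in Python 2, os.listdir hands back unicode file names when you pass it a unicode path, so non-ASCII path components survive instead of coming back as mystery bytes.

```python
import os

# Pass a unicode path at the boundary and you get unicode names back,
# so a folder like the Japanese Desktop above arrives intact.
names = os.listdir(u".")
assert all(isinstance(name, type(u"")) for name in names)
```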

As mentioned in the article above, the codecs module makes reading text files as Unicode very simple:

import codecs

unitext = codecs.open("/data.txt", encoding="utf-8").read()

There are just a couple of twists to watch out for when using the codecs module.

First, it obviously can't guess the encoding; you've got to figure that out yourself. Second, open() decodes the UTF-8 byte-order mark (BOM, '\xef\xbb\xbf') into the Unicode BOM character (u'\ufeff') rather than stripping it, while the "utf-16" codec does strip the BOM it finds. This might not be what you expected.
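A quick demonstration of that BOM behavior, using the standard BOM constants from the codecs module (a sketch of my own, not code from the article):

```python
import codecs

# The utf-8 codec decodes the BOM bytes to the character u'\ufeff'...
data = codecs.BOM_UTF8 + u"hello".encode("utf-8")
assert data.decode("utf-8") == u"\ufeffhello"

# ...while utf-8-sig strips it, and utf-16 consumes the BOM it finds.
assert data.decode("utf-8-sig") == u"hello"
utf16 = codecs.BOM_UTF16_LE + u"hello".encode("utf-16-le")
assert utf16.decode("utf-16") == u"hello"
```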

Because of these unique aspects of the codecs module, I normally use the chardet module in a custom function to read a random (i.e. user-supplied) text file as Unicode:

import codecs
import chardet

def bytes2unicode(bytes, errors='replace'):
    """Convert a byte string into Unicode.

    Have to chop off the BOM by hand.

    Usage:
    text = bytes2unicode(open("somefile.txt", "rb").read())
    """
    encodings = ((codecs.BOM_UTF8, "utf-8"),
                 (codecs.BOM_UTF16_LE, "utf-16-le"),
                 (codecs.BOM_UTF16_BE, "utf-16-be"))
    for bom, enc in encodings:
        if bytes.startswith(bom):
            return unicode(bytes[len(bom):], enc, errors=errors)
    # No BOM found, so use chardet
    encoding = chardet.detect(bytes).get('encoding') or 'ascii'
    return unicode(bytes, encoding, errors=errors)

Output

As I mentioned, I like to get my text into Unicode as early as possible, and keep it as Unicode as late as possible. Ideally, I'd like to just output my text as Unicode, and let the output stream take care of the encoding (if any).

That's why when I need to output Unicode as a stream of bytes, I use the codecs module for files, and wrap the output stream otherwise. This is needed, for example, when using cStringIO, which chokes on Unicode.

#coding: UTF8
import cStringIO

myval = u"日本語"
out = cStringIO.StringIO()
print >> out, myval

Error message:

Traceback (most recent call last):

File "C:\workspace\SpamTest\uni2.py", line 8, in <module>

print >> out, myval

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I can fix this by wrapping out with a class that intercepts the write() method, and converts Unicode strings to the specified encoding just before writing.

class OutStreamEncoder(object):
    """Wraps a stream with an encoder

    Usage:
    out = OutStreamEncoder(out, "utf-8")
    """
    def __init__(self, outstream, encoding):
        self.out = outstream
        self.encoding = encoding

    def write(self, obj):
        """Wraps the output stream, encoding Unicode
        strings with the specified encoding"""
        if isinstance(obj, unicode):
            self.out.write(obj.encode(self.encoding))
        else:
            self.out.write(obj)

    def __getattr__(self, attr):
        """Delegate everything but 'write' to the stream"""
        return getattr(self.out, attr)

Now the example above works:

myval = u"日本語"
out = cStringIO.StringIO()
out = OutStreamEncoder(out, "utf-8")
print >> out, myval
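For what it's worth, the standard library can do this wrapping for you: codecs.getwriter builds a stream wrapper that encodes unicode on write, much like OutStreamEncoder. A minimal sketch (using io.BytesIO in place of cStringIO so the snippet is self-contained):

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"日本語")   # encoded to UTF-8 bytes on the way out
assert raw.getvalue() == u"日本語".encode("utf-8")
```

The main difference is that OutStreamEncoder passes byte strings through untouched, while a codecs StreamWriter expects to be given unicode.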

IDLE

IDLE has its own peculiarities regarding Unicode. It actually handles Unicode like a champ, but it assumes that everything you type at the command prompt is in the file-system encoding. Since I'm on a Japanese system, this is "mbcs." You can thus get into some odd states:

>>> # A unicode string of multibyte chars as bytes…
>>> u"日本語"
u'\x93\xfa\x96{\x8c\xea'
>>> # This is what it should be
>>> unicode("日本語", "mbcs")
u'\u65e5\u672c\u8a9e'

The general way to avoid these problems in IDLE is to use sys.getfilesystemencoding().

>>> import sys

>>> print unicode("日本語", sys.getfilesystemencoding())

日本語
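You can verify what happened in that session for yourself. On a Japanese Windows system, "mbcs" means cp932 (the Windows flavor of Shift-JIS), and the garbage string above is exactly the cp932 bytes of the text, misread one byte per "character" (a sketch of my own):

```python
#coding: UTF8
# u'\x93\xfa\x96{\x8c\xea' from the IDLE session is just the cp932
# encoding of the three characters (note that '{' is '\x7b').
raw = u"日本語".encode("cp932")
assert raw == b"\x93\xfa\x96\x7b\x8c\xea"
assert raw.decode("cp932") == u"\u65e5\u672c\u8a9e"
```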

Doctests

doctest is so full of snow-flaky uniqueness, I could put cherry syrup on it and call it a snow cone. Note in the example below that my "is_asian" function's doctests contain a Japanese character (日).

#coding: UTF8

# 0x3000 is ideographic space (i.e. double-byte space)
IDEOGRAPHIC_SPACE = 0x3000

def is_asian(char):
    """
    Is the character Asian?

    >>> is_asian(u'a')
    False
    >>> is_asian(u'日')
    True
    """
    return ord(char) > IDEOGRAPHIC_SPACE

Running doctest on this gives a rather cryptic error:

Failed example:

is_asian(u'日')

Exception raised:

Traceback (most recent call last):

File "C:\Python25\lib\doctest.py", line 1228, in __run

compileflags, 1) in test.globs

File "<doctest __main__.is_asian[1]>", line 1, in <module>

is_asian(u'日')

File "C:\workspace\SpamTest\uni1.py", line 15, in is_asian

return ord(char) > IDEOGRAPHIC_SPACE

TypeError: ord() expected a character, but string of length 3 found

It turns out that doctests can't handle Unicode characters. It's making the same "string of utf-8 bytes as Unicode characters" error as IDLE, and thus interpreting one character ("日") as three.
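The "string of length 3" in the traceback falls right out of the arithmetic: the source file is UTF-8, and doctest reads the docstring as bytes, so the single character arrives as its three UTF-8 bytes (a quick check of my own):

```python
# One character, three UTF-8 bytes -- which is exactly what doctest
# handed to ord() in the failing example.
char = u"\u65e5"   # 日
assert len(char) == 1
assert len(char.encode("utf-8")) == 3
```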

So we have to trick doctest by taking the repr value of the Unicode text (I usually stick the actual characters in a comment above it). Here's a repaired version, which runs without errors:

def is_asian(char):
    """
    Repaired version of the doctests

    >>> is_asian(u'a')
    False
    >>> # u'日'
    >>> is_asian(u'\u65e5')
    True
    """
    return ord(char) > IDEOGRAPHIC_SPACE

To see the silver lining in this, at least it encourages you to keep your complicated tests in unit tests, and save doctests for simple, illustrative purposes.
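As a self-contained check of the repaired pattern (my own sketch, using doctest's finder/runner API so we can assert on the result; the raw docstring keeps the \u escape literal in the source):

```python
import doctest

IDEOGRAPHIC_SPACE = 0x3000   # ideographic space, as above

def is_asian(char):
    r"""
    >>> is_asian(u'a')
    False
    >>> is_asian(u'\u65e5')
    True
    """
    return ord(char) > IDEOGRAPHIC_SPACE

# Run just this function's doctests and confirm they all pass.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(is_asian, module=False, globs={"is_asian": is_asian}):
    runner.run(test)
assert runner.failures == 0
```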

Conclusion

Unicode support in Python is actually quite good — much better than most languages. And it will get even better with Python 3.0. In the meantime, however, there are a few gotchas to look out for when using Unicode in Python.