UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Have you seen this exception? If you already know why it happens, you can skip this introduction. There are several great articles and talks about Unicode and encoding in Python, but I’ll try to summarize the basic concepts.

When working with strings in Python, you must know which type of string you are working with. In Python 3 things change a bit, so we’ll focus on Python 2 first.

In short, you may have strings as BYTES or UNICODE objects, and you should take that into consideration.

Unicode is an international standard that aims to cover the characters of every writing system, assigning each one a unique “code point”. Here’s a table.

For example, the code point of π (pi) is U+03C0. In Python 2 we create Unicode objects by prefixing a string literal with u, and we can write a code point directly inside the literal by prefixing its four-digit hexadecimal number with \u.

>>> print u'\u03c0 is pi!'

π is pi!
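To make the link between a character and its code point concrete, here is a minimal sketch. It only assumes the standard `ord` built-in and the `unicodedata` module, the variable names are illustrative, and it is written so that it should run unchanged on Python 2.7 and Python 3.

```python
import unicodedata

pi = u'\u03c0'                   # one character, written by its code point

code_point = ord(pi)             # the code point as an integer
name = unicodedata.name(pi)      # the character's official Unicode name

print(hex(code_point))           # 0x3c0
print(name)                      # GREEK SMALL LETTER PI
```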

Unicode strings can be encoded to byte strings using different encodings, like ASCII or preferably UTF-8.

Likewise, byte strings can be decoded into Unicode objects.

Unicode objects are encoded into bytes, and bytes are decoded into Unicode objects. Paste it on your refrigerator!!

If you have a UTF-8 encoded byte string, you should use UTF-8 to decode it into a Unicode object. With the wrong codec you will get either an error or silently garbled text.
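The sketch below illustrates both failure modes, assuming nothing beyond the built-in `encode` and `decode` methods (the variable names are made up for the example). It uses `u''` literals and escapes so that it should behave the same on Python 2.7 and Python 3.

```python
pi = u'\u03c0'

# Encoding: UTF-8 can represent pi (as two bytes), ASCII cannot.
utf8_bytes = pi.encode('utf-8')
try:
    pi.encode('ascii')
    ascii_encode_ok = True
except UnicodeEncodeError:
    ascii_encode_ok = False

# Decoding with the right codec gives the original character back.
round_trip = utf8_bytes.decode('utf-8')

# Decoding with ASCII fails loudly: 0xcf is outside range(128).
try:
    utf8_bytes.decode('ascii')
    ascii_decode_ok = True
except UnicodeDecodeError:
    ascii_decode_ok = False

# Decoding with Latin-1 "works" but yields two wrong characters.
garbled = utf8_bytes.decode('latin-1')

print(repr(utf8_bytes), ascii_encode_ok, ascii_decode_ok, len(garbled))
```

The Latin-1 case is arguably the more dangerous one, because no exception tells you that the text is now wrong.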

Let’s consider this example:

>>> u'Hello ' + 'world'  # Python 2 actually does: u'Hello ' + 'world'.decode('ascii')

u'Hello world'

>>> u'Hello ' + 'π'  # Python 2 actually does: u'Hello ' + 'π'.decode('ascii')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)

Boom! We’ve tried to concatenate a Unicode object with a byte string. Python 2 implicitly decodes the byte string in order to build a single Unicode object, but its default codec is ASCII (you can run sys.getdefaultencoding() to check). So in the first example ‘world’ wasn’t a problem, because it contains only ASCII bytes, but ‘π’ cannot be decoded using ASCII.

“Codec” is short for coder/decoder.
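One way to avoid the implicit ASCII decoding is to decode byte strings explicitly before mixing them with Unicode objects. The following is a minimal sketch under that assumption. `to_unicode` is a hypothetical helper written for this example, not a standard library function, and the code is meant to run on both Python 2.7 and Python 3.

```python
def to_unicode(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass Unicode through."""
    if isinstance(value, bytes):     # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value

raw = b'\xcf\x80'                    # UTF-8 bytes for pi, e.g. read from a file

# Decode explicitly with the codec we know the bytes were written in,
# instead of letting Python 2 guess ASCII.
greeting = u'Hello ' + raw.decode('utf-8')

# The helper lets us accept either kind of string safely.
same_greeting = u'Hello ' + to_unicode(raw)

print(repr(greeting))
```

On Python 3 the implicit decoding is gone altogether: adding a str and a bytes object raises a TypeError, so the explicit decode is required anyway.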

>>> content = '\xcf\x80-zza'.decode('utf-8')  # π-zza

>>> type(content)

<type 'unicode'>

>>> print content

π-zza

>>> output_string = content.encode('utf-8')

>>> type(output_string)

<type 'str'>

>>> output_string

'\xcf\x80-zza'  # bytes!

In conclusion, files store byte strings. We decode those bytes into Unicode objects using the proper codec (e.g. UTF-8, ASCII, etc.) as soon as we read them, work with Unicode objects throughout the program, and only at the very end, e.g. right before writing to a file, encode them again with our desired codec.
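The whole flow can be sketched as follows. This is an illustration under stated assumptions rather than a recipe: `upper_file` and the temporary file names are invented for the example, the files are opened in binary mode so that the decode and encode steps stay visible, and the code should run on both Python 2.7 and Python 3.

```python
import io
import os
import shutil
import tempfile

def upper_file(src, dst, encoding='utf-8'):
    """Hypothetical example: read bytes, work in Unicode, write bytes."""
    with io.open(src, 'rb') as f:
        raw = f.read()                    # bytes, exactly as stored on disk
    text = raw.decode(encoding)           # decode at the input boundary
    result = text.upper()                 # all processing happens on Unicode
    with io.open(dst, 'wb') as f:
        f.write(result.encode(encoding))  # encode at the output boundary
    return result

tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')

with io.open(src, 'wb') as f:
    f.write(b'\xcf\x80-zza')              # UTF-8 bytes for "π-zza"

result = upper_file(src, dst)

with io.open(dst, 'rb') as f:
    written = f.read()

shutil.rmtree(tmp_dir)                    # clean up the temporary files
print(repr(result), repr(written))
```

In everyday code, `io.open(path, encoding='utf-8')` in text mode performs the same decode and encode steps for you.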

—

So that’s all for now! In a following post I might explain the string representation differences in Python 3. In the meantime, if you want to dig deeper, I strongly recommend these two talks on the topic.

Ned Batchelder: Pragmatic Unicode

Entendiendo Unicode — Facundo Batista — PyConAr 2012