Unicode In Python, Completely Demystified
Author: Kumar McMillan
Location: PyCon 2008, Chicago
URL: http://farmdev.com/talks/unicode/
Source: https://github.com/kumar303/unicode-in-python

What does this mean?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

Never seen this exception?

Seen it and sort of fixed it?

This is a confusing error. If you've never seen this before but want to write Python code, this talk is for you

If you've seen this before and have no idea how to solve it, this talk is for you

This is a really confusing error if you don't know what Python is trying to do for you; this talk aims to clarify

Overview The truth about strings in Python

The magic of Unicode

How to work with Unicode in Python 2 fundamental concept example code

Glimpse at Unicode in Python 3

Ask lots of questions

Corrections?

Why use Unicode in Python? handle non-English languages

use 3rd party modules

accept arbitrary text input

you will love Unicode

you will hate Unicode

Web application [form input] => [Python] => [HTML] accepts input as text

writes text to an html file

Interacting with a database [read from DB] => [Python] => [write to DB] accepts input as text

writes text to the database

Command line script [text files] => [Python] => [stdout] accepts input as text

writes text to stdout or other files

Let's open a UTF-8 file
Ivan Krstić

>>> f = open('/tmp/ivan_utf8.txt', 'r')
>>> ivan_utf8 = f.read()
>>> ivan_utf8
'Ivan Krsti\xc4\x87'

Ivan Krstić is the director of security architecture at OLPC

pretend you opened this in a desktop text editor (nothing fancy like vi) and you saved it in UTF-8 format. This might not have been the default.

now you are opening the file in Python

What is it?

>>> ivan_utf8
'Ivan Krsti\xc4\x87'
>>> type(ivan_utf8)
<type 'str'>

a string of bytes!

1 byte = 8 bits

A bit is either "0" or "1"

Text is encoded
Ivan Krstić
'Ivan Krsti\xc4\x87'
This string is encoded in UTF-8 format

An encoding is a set of rules that assign numeric values to each text character

Notice that the c with a háček (ć) takes up 2 bytes

Other encodings might represent ć differently

Python stdlib supports over 100 encodings; the c with a háček is part of the Croatian language

each encoding has its own byte representation of text
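As a quick illustration (a sketch in Python 3 syntax, so it runs today; in Python 2 you would start from the u'\u0107' literal), here is the same ć rendered as bytes under three different encodings:

```python
# The same character maps to different bytes under different encodings.
c = '\u0107'  # ć, LATIN SMALL LETTER C WITH ACUTE

print(c.encode('utf-8'))      # b'\xc4\x87' -- two bytes
print(c.encode('iso8859-2'))  # b'\xe6'     -- one byte (Latin-2, Central European)
print(c.encode('utf-16-be'))  # b'\x01\x07' -- two bytes, no BOM
```

Same character, three different byte representations: this is why you must know the encoding before you can interpret the bytes.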

ASCII
char      I     v     a     n
hex     \x49  \x76  \x61  \x6e
decimal   73   118    97   110

UTF-8 is an extension of ASCII

created in 1963 as the American Standard Code for Information Interchange

each character is 1 byte

128 possible characters

ASCII
char      K     r     s     t     i     ć
hex     \x4b  \x72  \x73  \x74  \x69  nope
decimal   75   114   115   116   105  sorry

ć cannot be encoded as ASCII

d'oh!
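That failure is exactly what Python reports at runtime. A minimal sketch (Python 3 syntax) showing the encode-direction twin of the error from the first slide:

```python
# ć has no ASCII code point, so encoding it as ASCII raises
# UnicodeEncodeError (the mirror image of UnicodeDecodeError).
try:
    '\u0107'.encode('ascii')
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\u0107' ...
```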

built-in string types (Python 2)

<type 'basestring'>
   |
   +--<type 'str'>
   |
   +--<type 'unicode'>

Important methods

s.decode(encoding)  --  <type 'str'> to <type 'unicode'>
u.encode(encoding)  --  <type 'unicode'> to <type 'str'>
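The two methods are inverses: decode() turns encoded bytes into text, encode() turns text back into bytes. A round-trip sketch (shown with Python 3's bytes/str pair; Python 2's str/unicode behave the same way):

```python
# bytes --decode--> text --encode--> the same bytes
raw = b'Ivan Krsti\xc4\x87'          # UTF-8 bytes, e.g. read from disk
text = raw.decode('utf-8')           # bytes -> text
assert text == 'Ivan Krsti\u0107'
assert text.encode('utf-8') == raw   # text -> the original bytes
```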

The problem
Can't my Python text remain encoded?
Ivan Krstić

>>> ivan_utf8
'Ivan Krsti\xc4\x87'
>>> len(ivan_utf8)
12
>>> ivan_utf8[-1]
'\x87'

isn't encoded text good enough? No decoding errors anywhere

is the length of Ivan Krstić really 12? what happens if the text were encoded differently?

is the last character really hexadecimal 87? Is that what I wanted?

Unicode is more accurate
Ivan Krstić

>>> ivan_utf8
'Ivan Krsti\xc4\x87'
>>> ivan_uni = ivan_utf8.decode('utf-8')
>>> ivan_uni
u'Ivan Krsti\u0107'
>>> type(ivan_uni)
<type 'unicode'>

Unicode is more accurate
Ivan Krstić

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> len(ivan_uni)
11
>>> ivan_uni[-1]
u'\u0107'

Unicode, what is it?
u'Ivan Krsti\u0107'
a way to represent text without bytes

unique number (code point) for each character of every language

supports all major languages written today

defines over 1 million code points

supports...
  European alphabets
  Middle Eastern right-to-left scripts
  scripts of Asia
  technical math symbols
  et cetera
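In Python 3 (and for Python 2 unicode strings via the same built-ins) you can inspect code points directly; ord() and chr() convert between a character and its code point:

```python
# Every character has a unique code point.
assert ord('A') == 65               # ASCII and Unicode agree here
assert ord('\u0107') == 0x107       # ć is code point U+0107 (263)
assert chr(0x107) == '\u0107'

# Code points run from U+0000 through U+10FFFF:
assert 0x110000 == 1114112          # 1,114,112 possible code points
```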



Unicode, the ideal If ASCII, UTF-8, and other byte strings are "text" ... ...then Unicode is "text-ness"; it is the abstract form of text http://en.wikipedia.org/wiki/Platonic_idealism

Unicode is a concept

letter   Unicode Code Point
ć        \u0107

to save Unicode to disk you have to encode it

Byte Encodings
letter   UTF-8      UTF-16     Shift-JIS
ć        \xc4\x87   \x07\x01   \x85\xc9

Unicode Transformation Format
UTF-8

>>> ab = unicode('AB')
>>> ab.encode('utf-8')
'AB'

variable byte representation

first 128 characters encoded just like ASCII

1 byte (8 bits) to 4 bytes per code point
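The width grows with the code point. A quick sketch (Python 3 syntax; the four sample characters are just illustrative picks from the 1, 2, 3, and 4 byte ranges):

```python
# UTF-8 spends more bytes on higher code points:
# 'A' (U+0041), 'ć' (U+0107), '€' (U+20AC), '𝄞' (U+1D11E)
for ch in ('A', '\u0107', '\u20ac', '\U0001d11e'):
    print(hex(ord(ch)), '->', len(ch.encode('utf-8')), 'byte(s)')
```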

Unicode Transformation Format
UTF-16

>>> ab = unicode('AB')
>>> ab.encode('utf-16')
'\xff\xfeA\x00B\x00'

variable byte representation

2 bytes (16 bits) to 4 bytes per code point

optimized for languages residing in the 2 byte character range
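Those extra leading bytes in the example above are the byte-order mark. A sketch (Python 3 syntax) of where they come from and how to avoid them:

```python
import codecs

# 'utf-16' prepends a BOM in the platform's native byte order,
# then spends 2 bytes per character (4 for code points above U+FFFF).
ab = 'AB'.encode('utf-16')
assert ab.startswith(codecs.BOM_UTF16)   # the BOM comes first
assert len(ab) == 2 + 2 * len('AB')      # BOM + 2 bytes per char

# Naming the byte order explicitly skips the BOM:
assert 'AB'.encode('utf-16-be') == b'\x00A\x00B'
```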

Unicode Transformation Format UTF-32 fixed width byte representation, fastest

4 bytes (32 bits) per code point

not supported in Python at the time of this talk (a utf-32 codec was added in Python 2.6)

Unicode chart Ian Albert decided to print the entire Unicode chart

1,114,112 code points

6 feet by 12 feet

22,017 × 42,807 pixels

Unicode chart Ian Albert says it only cost him $20 at Kinko's, but he was pretty sure they rang him up wrong.


Decoding text into Unicode It's mostly automatic

this happens a lot in 3rd party modules

Python will try to decode it for you

Python magic

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> f = open('/tmp/ivan.txt', 'w')
>>> f.write(ivan_uni)
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0107' in position 10: ordinal not in range(128)

Python magic, revealed

>>> ivan_uni
u'Ivan Krsti\u0107'
>>> f = open('/tmp/ivan.txt', 'w')
>>> import sys
>>> f.write(ivan_uni.encode(
...     sys.getdefaultencoding()))
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0107' in position 10: ordinal not in range(128)

Gasp! THE DEFAULT ENCODING FOR PYTHON 2 IS ASCII

Just reset it?!

sys.setdefaultencoding('utf-8')

can't I just put this in sitecustomize.py?

? No!

your code will not work on other Python installations

more trouble than it's worth

Solution Decode early Unicode everywhere Encode late

1. Decode early
Decode to <type 'unicode'> ASAP

>>> def to_unicode_or_bust(
...         obj, encoding='utf-8'):
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj
...

detects whether the object is a string and, if so, converts it to unicode (unless it already is)

2. Unicode everywhere

>>> to_unicode_or_bust(ivan_uni)
u'Ivan Krsti\u0107'
>>> to_unicode_or_bust(ivan_utf8)
u'Ivan Krsti\u0107'
>>> to_unicode_or_bust(1234)
1234

3. Encode late
Encode to <type 'str'> when you write to disk or print

>>> f = open('/tmp/ivan_out.txt', 'w')
>>> f.write(ivan_uni.encode('utf-8'))
>>> f.close()

Shortcuts
codecs.open()

>>> import codecs
>>> f = codecs.open('/tmp/ivan_utf8.txt', 'r',
...                 encoding='utf-8')
>>> f.read()
u'Ivan Krsti\u0107'
>>> f.close()

Shortcuts
codecs.open()

>>> import codecs
>>> f = codecs.open('/tmp/ivan_utf8.txt', 'w',
...                 encoding='utf-8')
>>> f.write(ivan_uni)
>>> f.close()

Python 2 Unicode incompatibility some 3rd party modules are incompatible; submit bugs!

some builtin modules are incompatible csv



Python 2 Unicode workarounds momentarily encode as UTF-8, then decode immediately

csv documentation shows you how to do this

>>> ivan_bytes = ivan_uni.encode('utf-8')
>>> # do stuff
>>> ivan_bytes.decode('utf-8')
u'Ivan Krsti\u0107'

The BOM sometimes at beginning of files

Byte Order Mark

essential for UTF-16, UTF-32 files Big Endian (MSB first) Little Endian (LSB first)

UTF-8 BOM just says "I am UTF-8" popular in Windows



Detecting the BOM

>>> f = open('/tmp/ivan_utf16.txt', 'r')
>>> sample = f.read(4)
>>> sample
'\xff\xfeI\x00'

the BOM can be 2, 3, or 4 bytes long

Detecting the BOM

>>> import codecs
>>> (sample.startswith(codecs.BOM_UTF16_LE) or
...  sample.startswith(codecs.BOM_UTF16_BE))
True
>>> sample.startswith(codecs.BOM_UTF8)
False

Do I have to remove the BOM? maybe

decoding UTF-16 removes the BOM automatically

but not UTF-8, unless you say s.decode('utf-8-sig') (available since Python 2.5)
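A sketch of the difference (Python 3 syntax; the same 'utf-8-sig' codec exists in Python 2.5+):

```python
import codecs

# A UTF-8 BOM survives a plain 'utf-8' decode as the character
# U+FEFF, but the 'utf-8-sig' codec strips it for you.
raw = codecs.BOM_UTF8 + 'Ivan Krsti\u0107'.encode('utf-8')
assert raw.decode('utf-8').startswith('\ufeff')       # BOM left in place
assert raw.decode('utf-8-sig') == 'Ivan Krsti\u0107'  # BOM removed
```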



How do you guess an encoding? There is no reliable way to guess an encoding

BOM gives you a clue

Content-type header usually contains charset=...

the chardet module tries to guess for you: http://chardet.feedparser.org/ (a port of Mozilla's encoding detection)

UTF-8 is your best guess

Summary of problems default Python 2 encoding is 'ascii'

files might contain a BOM

not all Python 2 internals support Unicode

You can't reliably guess an encoding

Summary of solutions Decode early, Unicode everywhere, encode late

write wrappers for modules that don't like Unicode

Always put Unicode in unit tests

UTF-8 is the best guess for an encoding use the BOM to guess encodings or use chardet.detect()



Unicode in Python 3 they fixed Unicode!

<type 'str'> is a Unicode object

separate <type 'bytes'> type

all builtin modules support Unicode

no more u'text' syntax (Python 3.0 dropped it; the redundant u'' prefix was later re-allowed in 3.3)

Unicode in Python 3 open() takes an encoding argument, like codecs.open()

default string encoding is UTF-8, not ASCII. Woo!

when no encoding is given, open() falls back to your locale's preferred encoding rather than inspecting the file, so you will still need to declare encodings explicitly
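A closing sketch of the Python 3 way (the temp-file path is just an example): decode early and encode late still apply, but open() does the codec work for you.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'ivan.txt')

# writing text: open() encodes to UTF-8 for you
with open(path, 'w', encoding='utf-8') as f:
    f.write('Ivan Krsti\u0107')

# the raw bytes on disk are the familiar UTF-8 sequence
with open(path, 'rb') as f:
    assert f.read() == b'Ivan Krsti\xc4\x87'

# reading text: open() decodes back to str for you
with open(path, encoding='utf-8') as f:
    assert f.read() == 'Ivan Krsti\u0107'
```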

