The Hassle of Unicode (and Getting On With It in Python)

Let’s face facts.

Unicode is a hassle [1]. Not using Unicode is a hassle, especially if you have one of those "weirdo" languages and, *gasp*, you want to read text *in your own language*.

I was faced with a simple task. Take some text, process it, and print out some results (in JSON). This should be trivial, and in a world where programming was invented by a multinational consortium, and designed from the first day to be compatible with all text, maybe it would be. Instead we have a world with a rich history of mutual incompatibility.

Text vs. Bytestreams

We English speakers are used to thinking about text as a series of bytes that maps one-to-one onto a set of glyphs (a-zA-Z0-9 and various control characters… i.e., the stuff on a US keyboard). One byte = one character = one glyph. Simple, right? But limiting… we only get 256 choices! A major driving idea of Unicode is to reframe our thinking in terms of code points.
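To make that concrete, here is a small sketch in Python 3 syntax (not the Python 2 used in the rest of this post), showing one code point turning into different bytes under different encodings:

```python
# One code point, several byte representations (Python 3 syntax).
ch = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE, code point U+00F1

print(ord(ch))                 # 241 -- the code point, independent of any encoding
print(ch.encode("latin1"))     # b'\xf1'     -- one byte in Latin-1
print(ch.encode("utf8"))       # b'\xc3\xb1' -- two bytes in UTF-8
print(ch.encode("utf-16-le"))  # b'\xf1\x00' -- two bytes in little-endian UTF-16
```

The code point (241) is the abstract identity of the character; the bytes are just one encoding's spelling of it.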


So, how do I deal with it? My solution is to get all bytestreams I encounter into Unicode, and then use Python to deal with it. Then, when it comes time to do I/O again, encode the Unicode back into bytestreams. Python has tons of support for Unicode, but it can be confusing to use (especially for me).

Memory aid:

DECODE: bytes -> Unicode Object

ENCODE: Unicode Object -> bytes
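The memory aid holds unchanged in Python 3, where the types are just named `bytes` and `str`. A minimal round-trip sketch (Python 3 syntax, not the post's Python 2):

```python
raw = b"Ense\xf1anza T\xe9cnica"  # a Latin-1 bytestream

# DECODE: bytes -> text (Unicode)
text = raw.decode("latin1")

# ENCODE: text (Unicode) -> bytes, here into UTF-8
utf8_bytes = text.encode("utf8")

print(text)        # Enseñanza Técnica
print(utf8_bytes)  # b'Ense\xc3\xb1anza T\xc3\xa9cnica'
```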

[sourcecode language="python"]
# our bytestream, which should contain a "small latin n with tilde" glyph
b = 'Ense\xf1anza T\xe9cnica'
assert unicode(b, 'latin1') == b.decode('latin1')

for o in b:
    print (o, ord(o))

## the original problem...
import sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json

json.dumps(b)
'''
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib64/python2.4/site-packages/simplejson/__init__.py", line 225, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python2.4/site-packages/simplejson/encoder.py", line 188, in encode
    return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7: invalid data
'''

# understanding the error
try:
    json.dumps(b)
except Exception, E1:
    pass

try:
    unicode(b, 'utf8')
except Exception, E2:
    pass

# the errors are the same enough
assert E1.__dict__ == E2.__dict__
[/sourcecode]

So, what's the problem here? We see the character "\xf1" has an ordinal value of 241, meaning it can't be understood as ASCII. Implicitly, simplejson is trying to decode the string as UTF-8 before dumping. Is "\xf1" valid in UTF-8? To learn more about this letter, we can do some experiments.

[sourcecode language="python"]
# our tilde n
n = '\xf1'
assert type(n) == type('')  # it's a string
assert ord(n) == 241
assert n.decode('latin1').encode('utf8') == '\xc3\xb1'
assert '\xc3\xb1'.decode('utf8') == u'\xf1'
[/sourcecode]

So, it's "\xf1" sometimes, and "\xc3\xb1" sometimes? What the heck is going on? If we look at the "<a href="http://www.fileformat.info/info/unicode/char/00f1/index.htm">latin small letter n with tilde</a>" page, we see that the <strong>"code point"</strong> associated with this <strong>*glyph*</strong> is represented different ways in different encodings, as listed in the "encodings" section. In Latin-1, it's "\xf1" and in UTF-8, it's "\xc3\xb1".
Simplejson thought it was seeing a UTF-8 bytestream, when it was really seeing a bytestream intended to be viewed through the lens of Latin-1.

<a href="https://writeonly.files.wordpress.com/2008/12/tilde-n.png"><img class="size-medium wp-image-135" title="tilde-n" src="https://writeonly.files.wordpress.com/2008/12/tilde-n.png?w=188" alt="//www.fileformat.info/info/unicode/char/00f1/index.htm" width="188" height="300" /></a>

So what do we actually have? Encoding the unicode object as UTF-16 gives us a good view of what's actually inside the unicode object (essentially, displaying the code points).

[sourcecode language="python"]
def _twobite(uni):
    '''
    take a unicode object and make it into nice two-byte sequences
    to show better what's going on inside the unicode object
    '''
    if not type(uni) == type(u''):
        raise ValueError, "must be unicode type"
    enc = uni.encode('utf16')
    return [enc[ii:ii + 2] for ii in xrange(len(enc)) if not ii % 2]

print _twobite(n.decode('latin1'))
# ['\xff\xfe', '\xf1\x00']
# (the first pair is the byte-order mark)
# if the endian order on your machine is different, your tuples may be reversed
# use the unichr function to verify this
print _twobite(unichr(241))
# 241 -> "f1"
[/sourcecode]
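For readers on Python 3 (a sketch, not the post's Python 2 helper): no UTF-16 trick is needed, because the code points can be read straight off the string.

```python
# Decode a Latin-1 bytestream, then inspect the code points directly.
text = b"Ense\xf1anza T\xe9cnica".decode("latin1")

print([hex(ord(c)) for c in text[:5]])  # ['0x45', '0x6e', '0x73', '0x65', '0xf1']
print(hex(ord("\u00f1")))               # '0xf1' -- code point 241, same value as in Latin-1
```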

Putting it all together

Let’s see if we can simply code our way out of it, using simplejson, and ignore the issue altogether:

[sourcecode language="python"]
json.dumps(b, encoding='latin1')   # works fine, but assumes we know what
                                   # encoding we already have for the bytestream
json.dumps(b, ensure_ascii=False)  # ignores the problem, passing through
                                   # the string untouched
[/sourcecode]

Both of those work fine, for what they do. Ideally though, I don’t want to have to think about what the strange text coming in is. I just want a guarantee that I have valid unicode (eventually encodable into UTF8), to use as I see fit [2].
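As an aside, Python 3's json module takes the opposite stance: it refuses raw bytes outright rather than guessing an encoding, which forces the decode question on us up front. A quick sketch:

```python
import json

# json.dumps will not guess an encoding for bytes; it raises instead.
try:
    json.dumps(b"Ense\xf1anza T\xe9cnica")
except TypeError as err:
    print(type(err).__name__)  # TypeError

# once decoded to text, everything is fine
print(json.dumps(b"Ense\xf1anza T\xe9cnica".decode("latin1")))
```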

[sourcecode language="python"]
def _to_unicode(str, verbose=False):
    '''attempt to fix non utf-8 string into utf-8, using a limited set of encodings'''
    # fuller list of encodings at http://docs.python.org/library/codecs.html#standard-encodings
    if not str:
        return u''
    u = None
    # we could add more encodings here, as warranted.
    encodings = ('ascii', 'utf8', 'latin1')
    for enc in encodings:
        if u:
            break
        try:
            u = unicode(str, enc)
        except UnicodeDecodeError:
            if verbose:
                print "error for %s into encoding %s" % (str, enc)
    if not u:
        u = unicode(str, errors='replace')
        if verbose:
            print "using replacement character for %s" % str
    return u

assert json.dumps(_to_unicode(b)) == '"Ense\u00f1anza T\u00e9cnica"'
assert type(json.loads(json.dumps(_to_unicode(b)))) == type(u'')
print json.loads(json.dumps(_to_unicode(b)))
[/sourcecode]
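If you are on Python 3, a sketch of the same try-each-encoding idea might look like this (the function name `to_text` is mine, not from any library):

```python
import json

def to_text(raw, encodings=("ascii", "utf8", "latin1"), verbose=False):
    """Try a short list of encodings; fall back to replacement characters."""
    if not raw:
        return ""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            if verbose:
                print("error for %r in encoding %s" % (raw, enc))
    # note: latin1 maps every byte, so with the default list this line is
    # unreachable; it matters only for other encoding lists
    return raw.decode("utf8", errors="replace")

print(json.dumps(to_text(b"Ense\xf1anza T\xe9cnica")))
# "Ense\u00f1anza T\u00e9cnica"
```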

Our world of Babel [3] makes things complicated enough already. So, go forth into this land, waving the banner of UTF-8. Maybe some world builder will finally tell us the secrets of the universe, and we’ll be able to read them, instead of seeing a row of empty boxes.

Notes:

[1] At least until Python 3 gets to be the usual state of affairs.

[2] For example, I might want to take the text and replace all Latin-1 with “nearest ASCII equivalents”. cf: The Unicode Hammer, ignoring the main text, and focusing on comment 3 and its use of the “unicodedata” module.

[3] Genesis 11:1-9, cf: http://en.wikipedia.org/wiki/Tower_of_Babel