From: (Anonymous)

2008-09-10 02:34 pm (UTC)

Don't change the default encoding

It will come back to bite you...



What you need to do is to declare the source encoding at the top of the file (which will then cause Python to decode string literals in your source correctly).

From: lonely_squirrel

2008-09-10 02:59 pm (UTC)

Re: Don't change the default encoding

You mean the

# -*- coding: utf-8 -*-



at the top of the file? This doesn't work for me. It still fails the ("%s" % u'ü') expression test, with the same "'ascii' codec can't decode byte 0xc3 in position 43: ordinal not in range(128)" error.



I think what's happening is this: the left string "%s" is of basestring class, while the u'ü' is unicode string class. The coding line at the top of the file tells python that the left string is encoded in utf-8, but it still is of basestring class. Then, when the % operator gets evaluated, it goes into the python source, where there is no coding statement. There it tries to stick a unicode string into the basestring, and tries to re-encode the unicode string in ascii.





From: (Anonymous)

2008-09-10 08:04 pm (UTC)

Re: Don't change the default encoding

> This doesn't work for me. It still fails the ("%s" % u'ü') expression test

No it doesn't. That works just fine when you state the source encoding. What doesn't work is printing the result, and it doesn't work because it thinks the terminal is ASCII.

> I think what's happening is this

Sorry, your guesswork is way off. How about reading the documents you are ridiculing before resorting to speculation?

> the left string "%s" is of basestring class

No. It's a str, not a basestring.

> Then, when the % operator gets evaluated, it goes into the python source, where there is no coding statement.

This is simply fantasy. Do you have any basis for suspecting this, or are you just making things up?

The representation of strings in your source and the way those strings are treated during execution are two entirely different things. Encoding declarations in the Python source are simply not necessary or relevant unless that source contains non-ASCII literals itself. Such encodings won't make a blind bit of difference to your own strings in your own source.
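To make that concrete, here's a minimal sketch (Python 2, utf-8 source assumed):

# -*- coding: utf-8 -*-
import sys

result = "%s" % u'ü'        # this works: the str format is pure ascii
print(repr(result))         # u'\xfc'

print(sys.stdout.encoding)  # None when stdout is a pipe; print then
                            # falls back to ascii and u'ü' won't print

try:
    'Schröder %s' % u'x'    # non-ascii *bytes* meeting a unicode object
except UnicodeDecodeError as e:
    print(e)                # 'ascii' codec can't decode byte 0xc3 ...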

From: elfz

2008-09-10 08:26 pm (UTC)

While ascii as a default encoding in python is asswards, setting the source encoding to utf-8 is the correct way to make your problems go away. If the input data for some reason is a unicode string, not a str, you need to encode it to utf-8 before output.



#!/usr/bin/python
# -*- coding: utf-8 -*-

unicode_str = u'pītōn'  # <type 'unicode'>
plain_str = 'pītōn'     # <type 'str'>

print('unicode: %s' % unicode_str.encode('utf-8'))
print('plain: %s' % plain_str)



From: lonely_squirrel

2008-09-11 09:08 am (UTC)

Hmm, this is getting more interesting, because this example works, but

# -*- coding: utf-8 -*-

is not enough to make things work in my program.



Could it be that my problem is that the ultimate source of utf-8 isn't a python source file, but the database? What I actually have is a Django model object, whose __unicode__ function is being called. I am assuming that the coding needs to be in the file where I am doing the "%s" % ... operation.



Not only that, but

tmp = "object %s" % obj # this works

tmp = "object %s other %s" % (obj, other) # this produces a ascii encoding error

tmp = "object %s other %s" % ( "My object name is Schröder", other) # this again works



Very strange. Seems like something buggy with Python, honestly.

From: metallian

2008-09-11 02:02 pm (UTC)

A surprising number of the truly frustrating problems I've encountered over the years have been caused either by character encoding issues or newline character issues.

From: lonely_squirrel

2008-09-11 02:25 pm (UTC)

I'm starting to understand that there are two separate things that are commonly confused:

* byte streams
* character streams



Character streams require an encoding in order to be interpreted. A character stream with no encoding is ambiguous at best. The key facet of a character stream is that the size of one character may not be fixed, and the stream may have a particular byte order.



Byte streams have no encoding, because the key facet of a bytestream is that the size of one byte is fixed. There is no big/little endian, and reading X bytes always gives the same result everywhere.



ASCII streams lie in the intersection of the two. In fact, I think that ASCII is the ONLY format that is both an unambiguous character stream AND a bytestream (excepting EBCDIC and historical encodings). And though ASCII text is valid UTF-8 text, UTF-8 text cannot be treated as a plain bytestream the way ASCII can; for instance, strncpy(char *, char *, len) isn't going to do what you think it will with utf-8 text, since it can cut a multi-byte character in half.
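The same footgun is easy to reproduce in Python by slicing utf-8 bytes blindly (a small sketch, utf-8 source assumed):

# -*- coding: utf-8 -*-
s = u'pītōn'.encode('utf-8')             # 5 characters, 7 bytes
print(len(s))                            # 7
print(repr(s[:2]))                       # 'p\xc4' -- the two-byte 'ī' cut in half
print(repr(s[:2].decode('utf-8', 'replace')))  # u'p\ufffd' -- data mangled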



I think, though I'm not sure, that this is where Bad Things will happen if you make python bytestrings utf8 encoded. Simple things like len(str) for a utf8 string depend on whether you're writing it to a socket, to a terminal, or to memory.



From: eichin

2008-09-11 03:26 pm (UTC)

(here via reddit/programming, hi) actually, len on a utf8 string is well defined - it's just defined in terms of bytes, not characters. It might help to realize that an encoded string doesn't actually carry the encoding with it, so len couldn't possibly know how to make characters out of it... Where you may be seeing ambiguity is the difference between a unicode string and a utf-8 string.
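A quick illustration of that point (utf-8 source assumed):

# -*- coding: utf-8 -*-
u = u'ö'
print(len(u))                    # 1 -- characters
print(len(u.encode('utf-8')))    # 2 -- bytes, under utf-8
print(len(u.encode('latin-1')))  # 1 -- same character, different byte count
# both encode() results are plain strs; nothing in them records which
# encoding produced the bytes, so len() can only count bytes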



There was a good talk on this entire issue at pycon2008; the "sane" way to think about it appears to be to

* work in unicode (*not* utf-8, unicode) everywhere

* the OS hands you bytes; convert to characters (unicode) as soon as you can

* the OS wants bytes; convert back from characters (unicode) as late as you can



(And of course, "if you find library bugs in this regard, report them" :-)
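A minimal sketch of that shape (assuming utf-8 at both edges, which real code would have to verify):

# -*- coding: utf-8 -*-
import sys

raw = sys.stdin.read()                  # the OS hands you bytes
text = raw.decode('utf-8')              # to characters as soon as possible
text = text.upper()                     # do all the real work on unicode
sys.stdout.write(text.encode('utf-8'))  # back to bytes as late as possible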



To help with this model, python 2.6/3k introduce an explicit bytes type, which (aside from other features) helps the developer keep track.

From: lonely_squirrel

2008-09-11 05:21 pm (UTC)

(welcome reddit!)



At this point I would guess that the length of 'ö' is 2, and the length of u'ö' is 1, because the first is considered to be a bytestream and the second a character stream. Indeed, the first can't know that the data is utf-8, not until I explicitly decode it to a unicode object and specify the encoding.
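That guess is easy to check (utf-8 source assumed):

# -*- coding: utf-8 -*-
s = 'ö'                  # a str holding two utf-8 bytes
u = s.decode('utf-8')    # one character, with the encoding stated explicitly
print(len(s), len(u))    # prints the tuple (2, 1)
# unicode(s) with no encoding argument would fall back to the ascii
# default and raise UnicodeDecodeError instead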



But see, if you change the default encoding to utf-8, then python kind of does know that 'ö' is utf-8. So I wonder if the length of the bytestream changes if you change the default encoding? I'm trying to explore why changing the default encoding is a bad idea.
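For what it's worth, a quick experiment suggests the answer is no; the reload(sys) trick below is used purely to test the question, not as a recommendation:

# -*- coding: utf-8 -*-
import sys
reload(sys)                       # restores setdefaultencoding,
sys.setdefaultencoding('utf-8')   # which site.py normally deletes

print(len('ö'))    # still 2 -- a str is bytes regardless of the default
print(u'x' + 'ö')  # but implicit str -> unicode conversion now uses utf-8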



The difference between "unicode" as a python object and utf-8 as a unicode encoding was precisely where my understanding was falling short. I did notice that my problem seemed to go away by changing a string literal into a unicode literal. It's just, from a practical standpoint, I find this to be a bad solution. It's tedious, and yet if I miss a string somewhere, it will still work most of the time but occasionally throw exceptions. My server's default encoding is utf-8, as are my terminal, the database tables, and my source files. Ideally, on this system, if there is a bytestring which needs to be converted to a unicode string, python would use utf-8 by default. Frustrating that I cannot safely change that.



Such a headache, I can see why just having a unicode instead is recommended! Thanks for your comment.

From: lonely_squirrel

2008-09-11 04:52 pm (UTC)

Well, I did eventually figure out what was going on, with some help from the Django Users group. The issue was that my objects' __unicode__ functions were returning utf-8 encoded data, but in the bytestring class. Even though the data is the utf-8 data I intended, python believes the data is in the default encoding -- ascii. Hence, changing the default encoding fixed the problem.



The correct fix is to return a unicode-typed string from the __unicode__ method.
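A stripped-down sketch of the bug and the fix (the model classes are made-up stand-ins, not the actual Django code):

# -*- coding: utf-8 -*-

class BrokenModel(object):
    def __unicode__(self):
        return 'Schröder'        # utf-8 bytes in a str: the wrong container

class FixedModel(object):
    def __unicode__(self):
        return u'Schröder'       # a real unicode object: the right container

print(u'object %s' % FixedModel())   # works
# u'object %s' % BrokenModel() calls __unicode__, gets bytes back, and
# decodes them with the ascii default -> UnicodeDecodeError on byte 0xc3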



Indeed, the ("%s" % u'ü') example *does* work with the coding cookie. What doesn't work is when a method (all methods I think, but perhaps just the __unicode__ method) returns 'ö' instead of u'ö'. A very important difference, not to the underlying data, but to the container the data is in.



My default encoding is back to normal. Mojibake Averted!