There are several options for __str__ and __repr__ under Python 2.x.

sys.stdout.encoding

It is tempting to encode the values of __str__ and __repr__ to sys.stdout.encoding, and maybe set the system default encoding to sys.stdout.encoding. This way the REPL and print would work as expected, and the result would be readable when possible.

I think this approach is too magical and has serious drawbacks: sys.stdout.encoding may be None (e.g. when output is redirected to a pipe or a file), it varies between environments, and changing the system default encoding can break unrelated code.
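A minimal sketch of the encode-to-console idea (the helper name to_console_bytes is hypothetical, and the example is written in Python 3 syntax so it is runnable today):

```python
import sys

def to_console_bytes(text):
    # Encode for whatever console is attached; sys.stdout.encoding can be
    # None (e.g. when output is piped), so fall back to ASCII.
    encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    # 'replace' keeps this from raising when the console encoding cannot
    # represent a character -- at the cost of silently losing data.
    return text.encode(encoding, 'replace')

print(to_console_bytes(u'ciao'))
```

Note how the result depends on the environment the program happens to run in; that environment-dependence is exactly what makes the approach feel magical.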

7bit ASCII

The second option is to make __str__ and __repr__ return 7bit ASCII, using escaping and/or transliteration.

Escaping is often used to make an arbitrary string 7bit. Python 2.x does it itself:

repr(unicode_string) returns an escaped 7bit-safe ASCII string;

str([unicode_string1, unicode_string2]) returns 7bit ASCII because the __repr__ of the elements is used to build the string representation of standard Python container types (see http://www.python.org/dev/peps/pep-3140/).
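In Python 3 terms, the escaping that Python 2's repr() applied to unicode strings automatically can be reproduced explicitly with the unicode_escape codec; a small illustration:

```python
# unicode_escape backslash-escapes everything outside printable ASCII,
# which is what Python 2's repr() did for unicode strings.
text = u'ciào привет'
escaped = text.encode('unicode_escape').decode('ascii')
print(escaped)  # ci\xe0o \u043f\u0440\u0438\u0432\u0435\u0442
```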

String escaping is also an "escape" from the problem itself: non-English Python users are all too used to the unhelpful Python 2.x output for non-ASCII data.

I think that limiting __str__ and __repr__ to 7bit ASCII in user code is not popular because, while it is the most robust way to deal with the issue under Python 2.x, it often makes the output unreadable.

In order to improve readability transliteration may be considered.

For example, we may decide to have obj.__str__() return a transliterated 7bit ASCII version of obj.__unicode__(), and __repr__ return the "full" representation encoded to 7bit ASCII with escaping:

>>> obj = MyCls(unicode_data=u'ciào привет')
>>> print obj
ciao privet
>>> print unicode(obj)
ciào привет
>>> obj
<MyCls(u'ci\xe0o \u043f\u0440\u0438\u0432\u0435\u0442')>

It may look like the transliteration in obj.__str__() is not needed (because the print unicode(obj) output is even nicer). The advantages of transliterating obj.__str__() are:

consistent interface - print obj becomes useful for inspection;

__str__ is used in string formatting, so a transliterated obj.__str__() could make the representation of other objects more readable.

The main drawback of the transliteration is that a user might think that obj holds the data "ciao privet", not "ciào привет" (based on print obj). I think this is a serious issue, but I convinced myself that it may be OK for __str__ to behave like this because (according to the Python docs) __str__ is only an "informal" representation of an object. In my opinion, the following conditions are met with a transliterated __str__ and an escaped __repr__:

__repr__ is "information-rich and unambiguous";

__str__ provides a "convenient or concise representation".

According to the Python docs, we shouldn't transliterate __repr__ because transliteration is a lossy process.
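The two policies together can be sketched as follows. This is only an illustration of the pattern above, written so it also runs under Python 3: MyCls is the hypothetical class from the example, ascii_escape is a hypothetical helper, and the crude NFKD-based transliterate (discussed further below) stands in for Unidecode, so non-Latin input like Cyrillic would be lost:

```python
import unicodedata

def ascii_escape(text):
    # Backslash-escape everything outside ASCII (what Python 2's repr()
    # did for unicode strings).
    return text.encode('unicode_escape').decode('ascii')

def transliterate(text):
    # Crude fallback: strip accents via NFKD decomposition. A real
    # implementation would prefer Unidecode; this version returns ''
    # for e.g. Cyrillic input.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

class MyCls(object):
    def __init__(self, unicode_data):
        self.unicode_data = unicode_data

    def __unicode__(self):              # Python 2 hook; unused under Python 3
        return self.unicode_data

    def __str__(self):                  # "informal": readable 7bit ASCII
        return transliterate(self.unicode_data)

    def __repr__(self):                 # "unambiguous": escaped 7bit ASCII
        return "<MyCls(u'%s')>" % ascii_escape(self.unicode_data)

obj = MyCls(u'ciào')
print(str(obj))   # ciao
print(repr(obj))  # <MyCls(u'ci\xe0o')>
```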

Proper transliteration is complex: transliteration rules depend on the language used, and they are not limited to 1-to-1 mappings between characters. There is a registry of machine-readable rules at http://site.icu-project.org/, and I've even seen a Python package for transliteration that uses these rules (I can't remember the name). But in order to use these "proper" transliteration methods, the language of the text must be known in advance. A general solution using ICU rules would be quite complex: the language of the text should be guessed before transliterating, and if the text has parts written in different languages, it should somehow be split into monolingual parts. A nice project for studying statistics and machine learning, by the way :)

The popular option for transliteration is Unidecode. Unidecode supports many languages; it is small and fast because it uses a simple mapping between Unicode codepoints and their ASCII representations.

Unidecode works quite well in practice, but it is plagued by licensing issues: it used to be dual-licensed under the (quite obscure) Perl Artistic License and the GPL; the license was later changed to GPL-only, and the GPL may be a show-stopper in many cases.

Unidecode is a port of the Perl Text::Unidecode library; I've made another (very basic, 10 lines of Perl + 20 lines of Python) port named text-unidecode, which is licensed under the Perl Artistic License (thanks to Steven Bird for the idea of re-porting).

For Western languages, removing diacritic marks is often enough to make text 7bit ASCII. In this case external libraries are not needed (thanks to Álvaro Justen for the suggestion).

Example implementation of the (non-GPL) transliteration method selection:

try:
    # Older versions of unidecode are licensed under the Artistic License;
    # assume an older version is installed.
    from unidecode import unidecode

    def transliterate(txt):
        return unidecode(txt).encode('ascii')

except ImportError:
    try:
        # The text-unidecode implementation is worse than unidecode's,
        # so unidecode is preferred.
        from text_unidecode import unidecode

        def transliterate(txt):
            return unidecode(txt).encode('ascii')

    except ImportError:
        # I'm not sure about this part. The version below only
        # handles accents; this may be OK for many European languages
        # but will produce empty strings e.g. for Cyrillic.
        # Maybe try yet another method if this returns an empty string?
        import unicodedata

        def transliterate(text):
            normalized_text = unicodedata.normalize('NFKD', text)
            return normalized_text.encode('ascii', 'ignore')

If GPL is OK, just use Unidecode.