I briefly expressed my disagreement with the Python 3 development decisions a while ago, so I want to elaborate on that a bit. While I previously addressed some of the problems I have with Python 3 only in passing, I have since taken the time to compile a list of things that were solved in a way other than I expected or hoped.

Let's start with the biggest grievance of mine...

Unicode Support

If you look at any of the pocoo libraries, they all use unicode. In fact, Jinja2 and Werkzeug even enforce unicode, so you can't use them with byte strings internally unless you do the encode dance. Why is that? Because I believe in unicode, and that is not too surprising: German, the official language of Austria, uses some non-ASCII letters, and yet tons of deployed systems force me to substitute umlauts with Latin letters just because someone had a limited horizon.

But the unicode world is complex, and Python does not care about unicode too much, neither in Python 2 nor in Python 3. So what does not work about unicode in Python? In German, for example, there are words like "Fuß" (which means foot). The last letter there is a so-called "scharfes S" or "Eszett". The former name means "sharp s", and the letter is usually represented as an "ſs" ligature; the latter name refers to the "ſz" ligature used in blackletter fonts (the letter "ſ", the long s, is no longer in use but worked similarly to a Latin "s"). Because this letter never occurs at the beginning of a word, there was never an uppercase character for it (one was introduced recently, but nobody uses it). However, it is pretty common to use title case or uppercase for emphasis, so there is a need for that letter to exist in uppercase. The common replacement you see is a doubled "s" or "sz": "Fuß" becomes "FUSS", "Maße" becomes "MASZE", and so on. There are some variations, but basically it means that one letter becomes two.

However, that does not work in Python. The Python unicode implementation cannot do two things: it can neither replace one letter with two when changing case, nor does it accept locale information for character mappings. The latter is necessary for languages like Turkish, where an uppercase "I" is lowercased to "ı" and not "i". (And I will not even complain about the shared state of the locale library, which of course stayed in Python 3.)
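The Turkish problem is easy to demonstrate: the built-in case methods take no locale argument at all, so you end up spelling out the language-specific mapping by hand. A minimal sketch (the `turkish_lower` helper is my own illustration, not a standard function):

```python
# Python's built-in case mapping is locale-independent: there is no way
# to tell .lower()/.upper() which language's rules to apply.
text = "ISTANBUL"

# For Turkish, the uppercase "I" should lowercase to a dotless "ı"
# (U+0131), but Python always applies the locale-neutral mapping:
assert text.lower() == "istanbul"  # dotted "i", wrong for Turkish

# To get the Turkish result you have to write the mapping yourself
# (hypothetical helper for illustration):
def turkish_lower(s):
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

assert turkish_lower(text) == "\u0131stanbul"
```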

Another problem is that Python uses UCS2 or UCS4 internally, and that shines through. On a UCS2 build, len() called on a string does not give you the number of characters in the string, but the number of UCS2 code units, which might not be the same. In fact, every character outside the Basic Multilingual Plane will be counted twice. Because UTF-16 (UCS2 with support for surrogate pairs, which allow you to use characters outside the basic plane) is a variable-length encoding, it has the same problems as UTF-8 as an internal encoding: namely, making slicing a non-trivial operation. Yet another problem arises: binary extensions have to be compiled separately for UCS2 and UCS4 Pythons. And last time I checked, setuptools did not allow you to publish both builds on PyPI and pull the correct one. In fact, by default there is no such information in the filename which would make it possible to provide both extensions.
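The surrogate-pair issue can be made visible on any build by encoding explicitly; on a narrow (UCS2) build, len() on the string itself reports the code-unit count, which is what the second assertion shows:

```python
# A character outside the Basic Multilingual Plane needs a surrogate
# pair in UTF-16, i.e. two 16-bit code units for a single character.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

# One character (UTF-32 uses a fixed 4 bytes per character)...
assert len(clef.encode("utf-32-be")) // 4 == 1

# ...but two UTF-16 code units, which is exactly the count len()
# reported for this string on a narrow build.
assert len(clef.encode("utf-16-be")) // 2 == 2
```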

So what they did "improve" in Python 3 was making unicode strings the default. And they did that in a very backwards-incompatible and IMO problematic way: they degraded bytestrings from strings to glorified integer arrays and enforced unicode on non-unicode protocols.

Just to give you an example: when you iterate over a bytestring in Python 2, the iterator yields a bytestring of length 1 for each character, containing that character. While I was always a harsh opponent of strings being iterable, it was something everybody relied on. In Python 3, bytestrings are bytes objects, which are basically arrays of integers that look like strings in the repr but yield the bytes as integers on iteration. So if you had code that relied on iteration returning characters, your code will break. And yes, Python 3 breaks backwards compatibility, but this is something that 2to3 does not pick up, and most likely you will not either. At least in my situation it took me a long time to track down the problem, because some other implicit conversion was happening at another place.
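Under Python 3 the changed semantics look like this (a small sketch; the length-1 result requires an explicit slice now):

```python
data = b"abc"

# Python 2: iterating b"abc" yielded "a", "b", "c" (length-1 bytestrings).
# Python 3: the same loop yields the integers 97, 98, 99.
assert list(data) == [97, 98, 99]

# Indexing changed the same way:
assert data[0] == 97  # not b"a"

# Text strings, meanwhile, still yield length-1 strings:
assert list("abc") == ["a", "b", "c"]

# Code that relied on the old behaviour needs an explicit slice:
assert data[0:1] == b"a"
```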

Now at the same time, unicode strings continue to yield unicode strings of length 1 on iteration. That means bytes and unicode objects suddenly have different semantics, making it impossible to provide one interface for both bytestrings and unicode. There were tons of places where libraries accepted both unicode and bytes in Python 2 because it made sense.

A good example is URL handling. URLs are encoding-less. Some schemes hint at a default charset, but in reality no such thing exists. However, applications themselves knew what encoding their URLs would use, so they happily passed unicode strings to the URL encoder, which would use the application's URL encoding to ensure the URL was properly quoted. In Django, Werkzeug, and probably many more libraries, if you passed unicode to the URL encoder, it would by default encode to UTF-8. However, if the URL came from another source with an unknown encoding, it was possible to pass the URL on transparently. The decoding of URLs from elsewhere usually happened encoding-less as well. Many applications, for example, check the referrer of the page to see if the user came from a search engine and, if so, grab the search keywords from the referrer and highlight them on the current page. In that situation you can keep a list of known referrer encodings and decode the referrer URL accordingly. In Python 3, the URL module in many situations uses a UTF-8 default encoding, requires the URL to be UTF-8 encoded, or provides a completely different and limited interface for byte URLs.
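For instance, Python 3's urllib.parse quoting functions assume UTF-8 unless you pass an explicit encoding, so a referrer URL in another charset has to be special-cased at every call site:

```python
from urllib.parse import quote, unquote

# quote() defaults to encoding unicode input as UTF-8...
assert quote("fu\u00df") == "fu%C3%9F"

# ...and only an explicit encoding argument gets you anything else,
# e.g. for a URL known to come from a latin-1 source:
assert quote("fu\u00df", encoding="latin-1") == "fu%DF"

# Decoding defaults to UTF-8 as well; a referrer URL in another
# encoding needs the parameter spelled out:
assert unquote("fu%DF", encoding="latin-1") == "fu\u00df"
```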

Sure, it might be sufficient for 98% of all users, but there are non-obvious implications: a library that wraps urllib/urlparse and whatnot cannot reuse the same code for Python 2 and 3. When I started supporting IRIs in Werkzeug (basically the successor to URLs, with proper encoding, already somewhat used by browsers) I chose to abandon the urllib module altogether and write my own simple decoder to make it easier to later port that thing to Python 3 without changed semantics.

There are other examples as well: filesystem access. Python 3 assumes your filesystem has an encoding, but many Linux systems do not. In fact, not even OS X enforces an encoding: you can happily use fopen to create a file whose name does not look like UTF-8 at all. And even there the situation is more complex, because on OS X different unicode normalization rules apply to the filesystem than to the applications themselves. So even on Python 3 you still have to manually normalize filenames to a common normalization form when you want to compare filenames on the filesystem.
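The normalization pitfall looks roughly like this: OS X's filesystem stores filenames in a (variant of the) decomposed form, so a name created with a single composed "ü" comes back as "u" plus a combining mark, and a naive string comparison fails until you normalize both sides:

```python
import unicodedata

composed = "\u00fcber.txt"    # "ü" as one code point (NFC)
decomposed = "u\u0308ber.txt" # "u" + COMBINING DIAERESIS (NFD-ish)

# Both render identically, yet naive comparison fails:
assert composed != decomposed

# Normalizing both sides to the same form fixes the comparison:
assert unicodedata.normalize("NFC", decomposed) == composed
```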

When I looked at the unicode handling in Python 3, I did not see much value over nicely written libraries in Python 2 that enforced unicode usage. In fact, the change makes it especially hard to convert such libraries (the ones that already required unicode) to Python 3, because 2to3 assumes you are using byte strings and not unicode.