WSGI on Python 3

Yesterday after my talk about WSGI on Python 3 I announced an OpenSpace about WSGI. However only two people showed up there which was quite disappointing. On the bright side however: it was in parallel to some interesting lighting talks and I did not explain to well what the purpose of this OpenSpace was.

In order to do better this time around, I want to summarize the current situation of WSGI on Python 3, what the options are and why I'm at the moment thinking of going back to an earlier proposal that was dismissed already.

So here we go again:

Language Changes There are a couple of changes in the Python language that are relevant to WSGI because they make certain things harder to implement and others easier. In Python 2.x bytestrings and unicode strings shared many methods and Python would do a lot to make it easy for you to implicitly switch between the two types. The root cause of the unicode decode and unicode encode errors everybody knows in Python are often caused by the implicit conversion going on. Now in Python 3 the whole thing looks a lot different. There are only unicode strings now and the bytestrings got replaced by things that are more like arrays than strings. Take this Python 2 example: >>> 'foo' + u 'bar' u'foobar' >>> 'foo %s ' % 42 'foo 42' >>> print 'foo' foo >>> list ( 'foo' ) ['f', 'o', 'o'] Now compare that to the very same example on Python 3, just with syntax adjusted to the new rules: >>> b 'foo' + 'bar' Traceback (most recent call last): File "", line 1, in TypeError : can't concat bytes to str >>> b 'foo %s ' % 42 Traceback (most recent call last): File "", line 1, in TypeError : unsupported operand type(s) for %: 'bytes' and 'int' >>> print ( b 'foo' ) b'foo' >>> list ( b 'foo' ) [102, 111, 111] There are ways to convert these bytes to unicode strings and the other way round, there are also string methods like title() and upper() and everything you know from a string, but it still does not behave like a string. Keep this in mind when reading the rest of this article, because that explains why the straightforward approach does not work out too well at the moment.

Something about Protocols WSGI like HTTP or URIs are all based on ASCII or an encoding like latin1 or even different encodings. But all those are not based on a single encoding that represents unicode. In Python 2 the unicode situation for web applications was fixed pretty quickly by all frameworks in the same way: you as the framework/application know the encoding, so decode incoming request data from the given charset and operate on unicode internally. If you go to the database, back to HTTP or something else that does not operate on unicode, encode to the target encoding which you know. This is painless some libraries like Django make it even less painful by having special helpers that can convert between utf-8 encoded strings and actual unicode objects at any point. Here a list of web related libraries operating on unicode (just a small pick): Django, Pylons, TurboGears 2, WebOb, Werkzeug, Jinja, SQLAlchemy, Genshi, simplejson, feedparser and the list goes on. What these libraries can have, what a protocol like WSGI does not, is having the knowledge of the encoding used. Why? Because in practice (not on the paper) encodings on the web are very simple and driven by the application: the encoding the application sends out is the encoding that comes back. It's as simple as that. However WSGI does not have that knowledge because how would you tell WSGI what encoding to assume? There is no configuration for WSGI so the only thing we could do is forcing a specific charset for WSGI applications on Python 3 if we want to get unicode onto that layer. Like utf-8 for everything except headers which should be latin1 for RFC compliance.

Byte Based WSGI On Python 2 WSGI is based on bytes. If we would go with bytes on Python 3 as well, the specification for Python 3 would look like this: WSGI environ keys are unicode WSGI environ values that contain incoming request data are bytes headers, chunks in the response iterable as well as status code are bytes as well If we ignore everything else that makes this approach hard on Python 3 and only look at the bytes object which just does not behave like a standard string any more, a WSGI library based on the standard libraries functions and the bytes type is quite complex compared to the Python 2 counterpart. Take the very simple code commonly used to reproduce a URL from the WSGI environment on Python 2: def get_host ( environ ): if 'HTTP_HOST' in environ : return environ [ 'HTTP_HOST' ] result = environ [ 'SERVER_NAME' ] if ( environ [ 'wsgi.url_scheme' ], environ [ 'SERVER_PORT' ]) not \ in (( 'https' , '443' ), ( 'http' , '80' )): result += ':' + environ [ 'SERVER_PORT' ] return result def get_current_url ( environ ): rv = ' %s :// %s / %s%s ' % ( environ [ 'wsgi.url_scheme' ], get_host ( environ ), urllib . quote ( environ . get ( 'SCRIPT_NAME' , '' ) . strip ( '/' )), urllib . quote ( '/' + environ . get ( 'PATH_INFO' , '' ) . lstrip ( '/' )) ) qs = environ . get ( 'QUERY_STRING' ) if qs : rv += '?' + qs return rv This depends on many string operations and is entirely based on bytes (like URLs are). So what has to be changed to make this code work on Python 3? Here an untested version of the same code adapted to theoretically run on a byte based WSGI implementation for Python 3. The get_host() function is easy to port because it only concatenates bytes. This works exactly the same on Python 3, but we could even improve that theoretically by switching to bytearrays which are mutable bytes objects which in theory give us better memory management. But here the straightforward port: def get_host ( environ ): if 'HTTP_HOST' in environ : return environ [ 'HTTP_HOST' ] result = environ [ 'SERVER_NAME' ] if ( environ [ 'wsgi.url_scheme' ], environ [ 'SERVER_PORT' ]) not \ in (( b 'https' , b '443' ), ( b 'http' , b '80' )): result += b ':' + environ [ 'SERVER_PORT' ] return result The port of the actual get_current_url() function is a little different because the string formatting feature used for the Python 2 implementation are no longer available: def get_current_url ( environ ): rv = ( environ [ 'wsgi.url_scheme' ] + b '://' get_host ( environ ) + b '/' urllib . quote ( environ . get ( 'SCRIPT_NAME' , b '' ) . strip ( b '/' )) + urllib . quote ( b '/' + environ . get ( 'PATH_INFO' , b '' ) . lstrip ( b '/' )) ) qs = environ . get ( 'QUERY_STRING' ) if qs : rv += b '?' + qs return rv The example did not become necessarily harder, but it became a little bit more low level. When the developers of the standard library ported over some of the functions and classes related to web development they decided to introduce unicode in places where it's does not really belong. It's an understandable decision based on how byte strings work on Python 3, but it does cause some problems. Here a list of places where we have unicode, where we previously did not have it. Not judging here on if the decision was right or wrong to introduce unicode there, just that it happened: All the HTTP functions and servers in the standard library are now operating on latin1 encoded headers. The header parsing functions will assume latin1 there and pass unicode to you. Unfortunately right now, Python 3 does not support non ASCII headers at all which I think is a bug in the implementation.

The FieldStorage object is assuming an utf-8 encoded input stream as far as I understand which currently breaks binary file uploads. This apparently is also an issue with the email package which internally is based on a common mime parsing library.

object is assuming an utf-8 encoded input stream as far as I understand which currently breaks binary file uploads. This apparently is also an issue with the email package which internally is based on a common mime parsing library. urllib also got unicode forcely integrated. It is assuming utf-8 encoded string in many places and does not support other encodings for some functions which is something that can be fixed. Ideally it would also support operations on bytes which is currently only the case for unquoting but none of the more complex operations.

The about-to-be Spec There are some other places as well where unicode appeared, but these are the ones causing the most troubles besides the bytes not being a string thing. Now what later most of WEB-SIG agreed with and what Graham implemented for mod_wsgi ultimately is a fake unicode approach. What does this mean? Make sure that all the information is stored as unicode but not with the proper encoding (which WSGI would not know) but just assume latin1. If latin1 is not what the application expected, the application can encode back to latin1 and decode from utf-8. (As far as I know, this is loss-less). Here what the current specification looks like that is about to be crafted into a PEP: The application is passed an instance of a Python dictionary containing what is referred to as the WSGI environment. All keys in this dictionary are native strings. For CGI variables, all names are going to be ISO-8859-1 and so where native strings are unicode strings, that encoding is used for the names of CGI variables. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI environment, the value of the variable should be a native string. For the CGI variables contained in the WSGI environment, the values of the variables are native strings. Where native strings are unicode strings, ISO-8859-1 encoding would be used such that the original character data is preserved and as necessary the unicode string can be converted back to bytes and thence decoded to unicode again using a different encoding. The WSGI input stream 'wsgi.input' contained in the WSGI environment and from which request content is read, should yield byte strings. The status line specified by the WSGI application should be a byte string. Where native strings are unicode strings, the native string type can also be returned in which case it would be encoded as ISO-8859-1. The list of response headers specified by the WSGI application should contain tuples consisting of two values, where each value is a byte string. Where native strings are unicode strings, the native string type can also be returned in which case it would be encoded as ISO-8859-1. The iterable returned by the application and from which response content is derived, should yield byte strings. Where native strings are unicode strings, the native string type can also be returned in which case it would be encoded as ISO-8859-1. The value passed to the 'write()' callback returned by 'start_response()' should be a byte string. Where native strings are unicode strings, a native string type can also be supplied, in which case it would be encoded as ISO-8859-1.