Python Programming, news on the Voidspace Python Projects and all things techie.

Python 3: An end to Unicode Problems?

One of the major changes in Python 3 is that there are now only Unicode strings. There is a bytes type for when you are dealing with binary data, but when you are dealing with text it is always Unicode. There is a corresponding change to the Python I/O system so that files opened in text mode return decoded strings and in binary mode they return bytes.

Compared to many languages, the support for Unicode in Python 2 is pretty good. So long as you obey the straightforward rules that you always decode data to Unicode when you receive it, and encode when you need to write it (always specifying the encoding, of course), Unicode strings in Python 2 work very well. The problem is that by having the default string type (str) as a byte string, the language is virtually begging you to write code that falls over as soon as it has to cope with non-ASCII characters or multi-byte encodings. Even if you are careful it is very easy for byte strings to slip through (particularly when dealing with third-party libraries that aren't as fastidious as you). As soon as you have a mix of Unicode and byte strings you risk an implicit decoding operation, which will use the default encoding (probably ASCII) and may well fail loudly.

What Python 3 brings to Python is internal consistency. You can no longer cause encoding-related errors by adding Unicode and byte strings. Whenever you have text you have strings that are represented internally as Unicode. No need to worry about that any more. What it doesn't mean is that you don't have to worry about encodings, or even cross-platform issues. In fact the new I/O behaviour is likely to be a cause of some fun cross-platform issues.
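A minimal sketch of that consistency in Python 3 (the variable names are purely illustrative): mixing text and bytes raises a TypeError instead of silently decoding, so the conversion has to be spelled out.

```python
text = 'caf\u00e9'                 # a str: always Unicode text in Python 3
data = text.encode('utf-8')        # bytes: one encoded form of the same text

try:
    combined = text + data         # no implicit decoding is attempted
except TypeError:
    combined = None                # the mix is rejected outright

# The concatenation only works once the bytes are decoded explicitly.
explicit = text + data.decode('utf-8')
print(len(text), len(data), len(explicit))
```

Note that the encoded form is longer than the text: the accented character takes two bytes in UTF-8.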

What Python 3 does by default is read and write using the platform default encoding. On a Mac that will be 'mac-roman' and on Windows in the UK it will be something like CP1252. Very different encodings. This means that if you have program files stored as text that were created on one system and then read them from another system, by default they will be decoded with a completely different encoding.

Take a file like this created and then read on the Mac:

>>> d = '\u20ac'
>>> open('test.txt', 'w').write(d)

>>> d = open('test.txt').read()
>>> [ord(c) for c in d]
[8364]

If we take the same file and then read it with Python 3 on Windows:

>>> d = open('test.txt').read()
>>> [ord(c) for c in d]
[219]

And this is because:

Mac

>>> h = open('test.txt')
>>> h.encoding
'mac-roman'

Windows

>>> h = open('test.txt')
>>> h.encoding
'cp1252'

The upshot is that, now that we always decode when we read a file in as text (and encode when we write text out to a file), it is even more important to be aware of the encoding being used. If we only ever use the file on the current system then maybe we don't need to worry, but if we ever need to read data that might have been created on another system then we had better know what encoding was used.

The solution: the open function takes an optional encoding parameter:

>>> h = open('test.txt', encoding='utf-8')
>>> h = open('test.txt', 'w', encoding='utf-8')
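As a rough sketch of why the explicit parameter matters, the following writes the euro sign with one encoding and reads it back with another. The helper name roundtrip and the temporary file are assumptions made for illustration. The 'mac-roman' and 'cp1252' codecs are passed explicitly, so the result does not depend on the platform default.

```python
import os
import tempfile


def roundtrip(text, write_encoding, read_encoding):
    """Write text with one encoding and read it back with another."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        with open(path, 'w', encoding=write_encoding) as handle:
            handle.write(text)
        with open(path, encoding=read_encoding) as handle:
            return handle.read()
    finally:
        os.remove(path)


euro = '\u20ac'
same = roundtrip(euro, 'utf-8', 'utf-8')            # encodings agree
mismatched = roundtrip(euro, 'mac-roman', 'cp1252')  # encodings differ
print([ord(c) for c in same])
print([ord(c) for c in mismatched])
```

When both sides agree on the encoding the character survives. When they differ, the single byte written is silently reinterpreted as a different character and no error is raised.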

Private Members in Python and C#: You Don't Really Need Them and You Can't Really Have Them

A while ago I wrote a mildly controversial blog entry about the way the Python community responds to certain questions: You Don't Really Need It

One of the issues I talked about was the fact that Python has no concept of private members. It does provide name mangling when your member names start with a double underscore. This is intended to be used for avoiding name collisions in subclasses. In Python the strong convention is that member names that start with a single underscore are private and shouldn't be used outside the class.
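A small sketch of name mangling doing the job it was designed for. The class and attribute names here are invented for the example. Each class stores its double-underscore attribute under a name prefixed with the class name, so the subclass does not clobber the base class's data.

```python
class Base:
    def __init__(self):
        self.__token = 'base'      # stored as _Base__token

    def base_token(self):
        return self.__token


class Derived(Base):
    def __init__(self):
        super().__init__()
        self.__token = 'derived'   # stored as _Derived__token, so no clash

    def derived_token(self):
        return self.__token


obj = Derived()
print(obj.base_token(), obj.derived_token())
print(sorted(vars(obj)))
```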

This worries developers who come from a background in statically typed languages, where the default is usually for everything to be private. As well as the nervousness that naturally comes from having your privates exposed, some feel that this violates encapsulation and means that Python isn't truly object-oriented. Of course Python encapsulates data and methods as objects; it is merely a translucent encapsulation. In practice this just doesn't seem to be a problem.

In Python you can (ab)use name mangling (or other tricks like data access through closures) to make it inconvenient for developers to access private data and methods. However, Python's introspection capabilities are so simple to use that these tricks are almost always trivial to get around.
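To illustrate how little protection these tricks offer, here is a hedged sketch with a hypothetical Account class and make_counter function. Both the mangled attribute and the closure variable can be reached with ordinary introspection.

```python
class Account:
    def __init__(self, balance):
        self.__balance = balance           # name-mangled "private" attribute


def make_counter():
    count = 0                              # state hidden inside a closure

    def increment():
        nonlocal count
        count += 1
        return count
    return increment


account = Account(100)
# The mangled name is plainly visible in the instance dictionary...
mangled_names = [name for name in vars(account) if name.endswith('__balance')]
# ...and can simply be read (or rebound) from outside the class.
balance = getattr(account, mangled_names[0])

counter = make_counter()
counter()
counter()
# Closure variables live in cells that are reachable through __closure__.
hidden_count = counter.__closure__[0].cell_contents
print(mangled_names, balance, hidden_count)
```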

However, most static languages also provide introspection. In C# this is called reflection, and though it is damn inconvenient to use, it is there and can be used to access / modify private members. In fact, not only is it available, it is also the recommended way to overcome certain difficulties.

Take the following situation that Christian and Orestis encountered at work the other day. By default, making a web request with WebClient.DownloadString raises an exception if the server sends non-RFC-compliant headers. Unfortunately such rarely-encountered sites as the Yahoo finance sites do this. You can specify in your app.config file that you are happy to accept non-compliant headers (useUnsafeHeaderParsing), but there is no direct API to set this. If you need to change this setting at runtime you use something similar to this wonderfully horrible piece of reflection (shown as IronPython code):

import System
from System import Array
from System.Reflection import Assembly, BindingFlags

typeName = "System.Net.Configuration.SettingsSectionInternal"

def setUnsafeHeaderParsing():
    settingsAsm = Assembly.GetAssembly(System.Net.Configuration.SettingsSection)
    if settingsAsm is not None:
        settingsType = settingsAsm.GetType(typeName)
        if settingsType is not None:
            instance = settingsType.InvokeMember("Section",
                BindingFlags.Static | BindingFlags.GetProperty | BindingFlags.NonPublic,
                None, None, Array[object]([]))

            if instance is not None:
                useUnsafeHeaderParsing = settingsType.GetField("useUnsafeHeaderParsing",
                    BindingFlags.NonPublic | BindingFlags.Instance)
                if useUnsafeHeaderParsing is not None:
                    useUnsafeHeaderParsing.SetValue(instance, True)
                    return True
    return False

This modifies the same hidden setting that would be set from app.config.

Note: This is mainly an example of why private isn't really private in C#. Miguel de Icaza doesn't like it as an example though:

"Bad advice. That hack will for starters not work with mono, silverlight or iphone and will easily break on upgrades."

"It won't work with a stronger security setting (sl). For individual cases like your sample you can c/p mono code."

So in C# your privates are also exposed; all you do is make life more painful for the developer who really does need access to them. Here's to not making life painful for us poor beleaguered developers.

To end with, here is a quote from an excellent article (not at all about Python) on how a strong focus on testing (essential for any developer with a genuine commitment to quality) challenges your conventions:

"3. Private makes less sense than it used to... Get used to living in a more public world."

Integers Can't Handle Floats

This surprised Glenn and me at work the other day:

Python 2.6 (trunk:66714:66715M, Oct  1 2008, 18:36:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 6
>>> a.__sub__(2.0)
NotImplemented

Python integers don't know how to subtract floats from themselves! This is because the result needs to be a float. Instead, when you subtract a float from an integer, the integer's __sub__ returns NotImplemented, which causes the Python interpreter to call the __rsub__ method on the float. That method then does the subtraction and returns a new float.

>>> a = 6
>>> b = 2.0
>>> b.__rsub__(a)
4.0

The reason that it caught us out is that in Resolver One we have some spreadsheet objects that need to behave like numeric objects under certain circumstances (for compatibility with Excel you need to be able to treat cell ranges that only cover a single cell as if they were the value of the cell they contain).

We were taking a shortcut to achieve this. Rather than implementing all the numeric protocol methods by hand, we attached a bunch of functions that delegate to the corresponding bound method of the value the object holds. In IronPython 1 this worked fine, but IronPython 2 has become more compatible with CPython: when we call the bound subtraction method of an integer with a float it returns NotImplemented, so the interpreter falls back to calling the float's __rsub__ with our cell range. Unsurprisingly, floats don't know how to do arithmetic with cell ranges...

The nice and simple (and cleaner) solution was to stop delegating to bound methods and instead use the numeric functions from the operator module that provide full Python semantics for numerical operations.
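The difference can be sketched in plain Python. The two classes below are hypothetical stand-ins, not Resolver One code. One delegates to the bound method of the held value, the other to operator.sub.

```python
import operator


class BoundMethodRange:
    """Hypothetical one-cell range that delegates to the value's bound method."""

    def __init__(self, value):
        self.value = value

    def __sub__(self, other):
        # For int - float this returns NotImplemented, so Python falls back
        # to the float's __rsub__ with the range object itself as the operand.
        return self.value.__sub__(other)


class OperatorRange:
    """Hypothetical one-cell range that delegates to operator.sub instead."""

    def __init__(self, value):
        self.value = value

    def __sub__(self, other):
        # operator.sub applies the full protocol, including the __rsub__ fallback.
        return operator.sub(self.value, other)

    def __rsub__(self, other):
        return operator.sub(other, self.value)


try:
    broken = BoundMethodRange(6) - 2.0
except TypeError:
    broken = None                  # float cannot subtract from a range object

fixed = OperatorRange(6) - 2.0
reflected = 10.0 - OperatorRange(6)
print(broken, fixed, reflected)
```

Because operator.sub is handed the unwrapped value, the interpreter's fallback to the float's __rsub__ sees a plain integer and never the wrapper object.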

Python 2.6 and Executable Zipfiles

A new feature that was quietly sneaked into Python 2.6, without the fanfare it deserves, is the ability to distribute Python applications as executable zipfiles. Python has long had support for importing modules and packages from zipfiles - through the oh-so-badly-needed-in-IronPython zipimport.

What is new is the ability to make zip archives executable. If you call the Python 2.6+ (or 3.0+) interpreter passing in a zip file instead of a Python file - the interpreter looks inside the zip file for a Python file named __main__.py (at the top-level) and executes it. The zip file can also contain all the (pure-Python only) modules and packages your app depends on.

This is a great way of distributing applications as a single file. The nice thing is that the Python interpreter doesn't depend on the extension to recognise zipfiles, instead recognising them automagically. This means that on Windoze you can give these archives a new extension (perhaps '.pyz') and associate them with Python 2.6 - allowing Windows to execute them automatically when you double click on them in explorer. I think, but am not 100% certain, that the zipfile specification is flexible enough that you could also prepend a pound-bang ('#!') interpreter line to make them executable under Mac OS X and Lunix type platforms.

michael$ echo > __main__.py "print 'Hello world'"
michael$ python __main__.py
Hello world
michael$ zip test.zip __main__.py
  adding: __main__.py (stored 0%)
michael$ python test.zip
Hello world
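The same experiment can be scripted from Python itself. This is only a sketch: the helper name build_and_run and the app.pyz file name are made up, and it uses Python 3 syntax and runs the archive with whichever interpreter executes the script.

```python
import os
import subprocess
import sys
import tempfile
import zipfile


def build_and_run(source):
    """Zip source up as __main__.py and run the archive with this interpreter."""
    with tempfile.TemporaryDirectory() as workdir:
        archive = os.path.join(workdir, 'app.pyz')
        with zipfile.ZipFile(archive, 'w') as bundle:
            # The interpreter looks for __main__.py at the top level of the zip.
            bundle.writestr('__main__.py', source)
        result = subprocess.run([sys.executable, archive],
                                capture_output=True, text=True, check=True)
        return result.stdout


output = build_and_run("print('Hello world')\n")
print(output, end='')
```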

UPDATE: Floris Bruynooghe notes in the comments that you can add a hash-bang line to a zipfile and make it executable:

$ cat > __main__.py
print('hi there')
^D
$ zip test.zip __main__.py
  adding: __main__.py (stored 0%)
$ cat > hashbang.txt
#!/usr/bin/env python3.0
^D
$ cat hashbang.txt test.zip > my_exec
$ chmod +x my_exec
$ ./my_exec
hi there
$
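A rough Python equivalent of those shell steps is sketched below. The function name make_executable_archive and the default interpreter line are assumptions. To stay independent of what is on the PATH, the sketch checks the result by handing the file to the current interpreter rather than executing it directly.

```python
import io
import os
import stat
import subprocess
import sys
import tempfile
import zipfile


def make_executable_archive(path, source, interpreter='/usr/bin/env python3'):
    """Write a hash-bang line followed by a zip that contains __main__.py."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as bundle:
        bundle.writestr('__main__.py', source)
    with open(path, 'wb') as target:
        target.write(b'#!' + interpreter.encode('utf-8') + b'\n')
        target.write(buffer.getvalue())    # zip data simply follows the first line
    # Equivalent of chmod +x for the owner (has little effect on Windows).
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)


with tempfile.TemporaryDirectory() as workdir:
    my_exec = os.path.join(workdir, 'my_exec')
    make_executable_archive(my_exec, "print('hi there')\n")
    with open(my_exec, 'rb') as handle:
        first_line = handle.readline()
    result = subprocess.run([sys.executable, my_exec],
                            capture_output=True, text=True, check=True)
    shebang_output = result.stdout

print(first_line, shebang_output, end='')
```

Later Python versions also ship a zipapp module in the standard library for building archives of this kind.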

Join the Glorious Python Programming Revolution

Ars Technica has a new article on learning Python:

The article itself is kind of meh, merely linking to six of the most well known online Python tutorials, but what is interesting is their rationale for promoting Python:

"Recently, Google has stepped up its presence in the cloud computing arena. Google's new App Engine (aka "AppSpot") lets you design and run web applications using Google's existing infrastructure." "At this time, App Engine uses Python as its primary programming language. Although Google is investigating other languages for future releases, if you want to get started with App Engine, you'll need to first master the Python scripting language."

Overwhelmingly the best part of the article is the image they use...

Ars Technica have a real problem with puns when doing articles on Python. Take their article introducing Python 3 for example:

Thissss article is pretty good, but as expected Python 3 is causing controversy both inside and outside the Python community. People wonder if the benefits of breaking backwards compatibility are worth the potential confusion it causes. The very best response I've seen is the following blog entry by James Bennett. It kicks off with a fantastic introduction that has nothing to do with Python but should be read by programmers and business managers of every strain.

Back to Ars Technica, they will always have a place in my heart. They've done a fantastic series of articles on the history and technology of the Amiga computer (the first 'desktop' computer with a pre-emptive multitasking operating system being one of the jewels in its crown of glory). If, like me, you are nostalgic for the beautiful usability of this machine (both as a user and a developer) then you'll love this series.
