Python Programming, news on the Voidspace Python Projects and all things techie.

Fun with Unicode, Latin-1 and a C1 Control Code

Unicode is a rabbit-warren of complexity; it's almost fractal in nature: the more you learn about it, the more complexity you discover. Anyway, all that aside, you can have great fun (i.e. pain) in fairly basic situations, even if you are trying to do the right thing.

This particular problem was encountered by Stephan Mitt, one of my colleagues at Comsulting. I helped him find the solution, and with a bit of digging (and some help from #python-dev) worked out why it was happening.

We receive data from customers as CSV files that need importing into a web application. The CSV files are received in latin-1 encoding and we decode and then iterate over them to process a line at a time. Unfortunately the data from the customers included some \x85 characters, which were breaking the CSV parsing.

One of the problems with the latin-1 encoding is that it uses all 256 bytes, so it is never possible to detect badly encoded data. Arbitrary binary data will always successfully decode:

>>> data = ''.join(chr(x) for x in range(256))
>>> data.decode('latin-1')
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f...'

If you iterate over a standard file object in Python 2 (i.e. one that reads data as bytestrings) then you iterate over it a line at a time. This splits lines on carriage returns ( \x0D ) and line feeds ( \x0A ). If you're on Windows then the sequence \x0D\x0A (CRLF) signifies a new line. If you're trying to do-the-right-thing, and decode your data to Unicode before treating it as text, then you might use code a bit like the following to read it:

import codecs
handle = codecs.open(filename, 'r', encoding='latin-1')
for line in handle:
    ...

This was the cause of our problem. When decoding using latin-1, \x85 is transcoded to u'\x85', which Unicode treats as a line break. So if your source data has \x85 embedded in it and you are splitting on lines, where the lines break will differ depending on whether you are using byte-strings or Unicode strings:

>>> d = 'foo\x85bar'
>>> d.split()
['foo\x85bar']
>>> u = d.decode('latin-1')
>>> u
u'foo\x85bar'
>>> u.split()
[u'foo', u'bar']

This could still be a pitfall in Python 3, where all strings are Unicode, particularly if you are porting an application from Python 2 to Python 3. Suddenly your data will behave differently when you treat it as Unicode. The answer is to do the split manually, specifying which character to use as a line break.
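To make the difference concrete, here's a small sketch (Python 3 shown; the same holds for unicode objects in Python 2): splitlines() honours NEL, while an explicit split('\n') does not.

```python
# latin-1 maps byte 0x85 to U+0085 (NEL), which Unicode treats as a newline
data = b'foo\x85bar\nbaz'.decode('latin-1')

# splitlines() breaks on NEL as well as \n:
assert data.splitlines() == ['foo', 'bar', 'baz']

# splitting explicitly on '\n' keeps NEL intact:
assert data.split('\n') == ['foo\x85bar', 'baz']
```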

The problem isn't restricted to \x85. The Unicode spec on newlines shows why: \x85 is referred to by the acronym NEL, a C1 control code: "NEL: Next Line. Equivalent to CR+LF. Used to mark end-of-line on some IBM mainframes."

In fact NEL belongs to a general class of characters known as Paragraph Separators (Category B). This category includes the characters \x1C , \x1D , \x1E , \x0D , \x0A and \x85 . Splitting on lines will split on any of these characters, which may not be what you expect. It certainly wasn't what we expected.
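In Python 3 the asymmetry is easy to see: str.splitlines() breaks on all the characters listed above, while bytes.splitlines() only breaks on CR and LF. A quick sketch:

```python
# Python 3: str.splitlines() treats all of these as line boundaries
text = 'a\x1cb\x1dc\x1ed\re\nf\x85g'
assert text.splitlines() == ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# bytes.splitlines() only splits on \r, \n and \r\n:
raw = b'a\x1cb\x1dc\x1ed\re\nf\x85g'
assert raw.splitlines() == [b'a\x1cb\x1dc\x1ed', b'e', b'f\x85g']
```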

For us the solution was simple; we just strip out any occurrence of \x85 in the binary data before decoding.
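A minimal sketch of that fix (Python 3 shown; the exact code we used differed):

```python
# drop the NEL bytes before decoding, so only \r and \n remain as breaks
raw = b'foo\x85bar\r\nbaz'
cleaned = raw.replace(b'\x85', b'')
text = cleaned.decode('latin-1')
assert text.splitlines() == ['foobar', 'baz']
```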

Note: Marius Gedminas suggests that the data was probably encoded as Windows-1252 rather than Latin-1. He is probably right. There are some interesting notes on Unicode line breaks in this Python bug report: What is an ASCII linebreak?.

Current State of Unladen Swallow (Towards a Faster Python)

I'm helping to organise the Python Language Summit that precedes the PyCon Conference this year. One of the first topics we'll be discussing is "Python 3 adoption and tools and the Python language moratorium" which will include representatives of the major Python implementations telling us the current state of their implementation and their plans or progress for supporting Python 3.

I was fortunate today to exchange emails with Collin Winter, one of the core developers of Unladen Swallow on this topic. He gave me a sneak preview of what he will have to say at the summit.

By way of introduction, Unladen Swallow is a project sponsored by Google to speed up Python. In particular it uses LLVM (the Low Level Virtual Machine) to provide a JIT (just-in-time) compiler. Google use Python a great deal and so have a direct commercial interest in making it faster. Shortly after the first release, Unladen Swallow was put to work serving YouTube traffic. The Unladen Swallow team see the project as a branch of CPython (the standard and reference implementation of the Python programming language) and their goal is to merge their changes back into Python once the work is complete.

Here is what Collin had to say:

Brief summary: Unladen Swallow is currently focused on the process of merging into CPython's 3.x line (PEPs, patches, etc). We've been focused on pushing all our necessary changes into upstream LLVM in time for LLVM's 2.7 release, so that CPython 3.x can be based on that release. We've met our goals of maintaining pure-Python and C extension compatibility while speeding up CPython, and have created a platform for future development that we believe can continue to yield increasing performance for years to come. As for py3k support, Unladen Swallow is currently based on CPython 2.6.1. We will seek to merge our changes into Python's 3.x line exclusively (without backporting to 2.x), updating our patches as necessary to correct for the 2.x->3.y skew. Once merger is completed, Unladen Swallow will cease to exist as an independent project.

(Emphasis added by me.) I also asked Collin whether the merge back into Python would be difficult, in particular because LLVM hasn't been well tested on Windows in the past, and because LLVM is written in C++ whereas the rest of CPython is all C. This potentially raises ABI compatibility issues:

Unladen currently builds and passes its tests on Windows (or did the last time I checked), so I don't think that will be an issue. The addition of C++ to the codebase will be more contentious, but the C++ stuff is restricted to the JIT-facing internals; the rest of the implementation is still straight C. We hope that will mitigate any concerns. "Impact on CPython Development" is a fairly lengthy section of the merger PEP I'm writing.

So, it looks like Unladen Swallow already offers a speedup that the team are happy with. I know there have been some interesting improvements recently, like this one that inlines certain binary operations. Additionally the merge with CPython is going to happen 'soon' (i.e. a PEP is being written) and it is likely to happen on the Python 3 branch. Perhaps that will give the Python community an incentive to make the jump...

NOTE: Jesse Noller also has a blog entry on this topic: Unladen Swallow: Python 3's Best Feature.

New Year's Python Meme

This is the blog entry I had nearly finished when I started messing around with Mock on Saturday. Started by Tarek Ziade, five short questions on you and Python in 2009 (or in this case me and Python)...

What's the coolest Python application, framework or library you have discovered in 2009?

One of the biggest things that happened in my life in 2009 was leaving Resolver Systems to become a freelance developer working for a German firm, Comsulting.de. I'm still working with IronPython, but now developing web applications with Django on the server and using IronPython in Silverlight on the client. It's great fun, and although I'd used both Silverlight and Django a bit previously, I'm now working with them full time.

What new programming technique did you learn in 2009?

At Resolver Systems we all took responsibility for architectural decisions. It was a great team to be part of and a great way to work. In my current project I've largely been responsible for building the application myself, although my colleague, who is an excellent designer, has recently been able to join me in the coding. That means I've made the architectural decisions for the application. This has stretched me, and structuring large applications is something I want to explore more.

What's the name of the open source project you contributed the most in 2009? What did you do?

Well, in 2009 I became a Python core-developer and the maintainer of unittest. The work I've done on unittest is definitely my most valuable contribution to open source in 2009. That reminds me, there are a bunch of open tickets with my name on them that I really ought to be looking at instead of doing this... Other than that I worked on a bunch of little projects of my own.

Try Python: This is probably the project I'm most proud of: a Python interpreter and tutorial that runs in the browser with Silverlight. As Silverlight comes from Microsoft, and last I heard was installed on around 30% of the world's browsers, Try Python isn't a runaway success, but I think it is very cool. Moonlight is now out, so in theory the site could work on Linux machines, but there is an issue with the version of IronPython I use. Hopefully I'll get around to updating the site soon and will also add an IronPython tutorial alongside the Python tutorial.

ConfigObj: A Python configuration file reader and writer that is easy to use but has about a gazillion extra features not found in ConfigParser. This is the most widely used code I've ever released, but I don't use it much myself these days. Thankfully it takes little maintaining; however, I have done a bunch of work on version 4.7, which is just waiting for me to pull my finger out and release.

Mock: A simple mocking library for testing Python code that makes a great companion to unittest. In my day job I'm now focusing on integration testing rather than unit testing, so I don't use Mock as much as I used to and it hasn't had the attention it deserves. It was nice to finally add support for magic methods so that you can mock numeric types, containers and so on.

I also wrote a lot of articles on IronPython and supporting example code to go with them.

What was the Python blog or website you read the most in 2009?

Like many of the other folk who answered these questions, I tend to keep in touch with Python news through Planet Python (in fact this year I sort of became responsible for some of the administration of the Planet when I joined the Python webmaster team). I enjoy a lot of the bloggers on the Planet and find it invaluable for keeping up to date with the Python world. There really are a lot of great bloggers contributing to the Planet, so I'm only going to call out one: Jacob Kaplan-Moss. A great blogger, both fun and on the ball technically. Of course, like the rest of us, he needs to pull his finger out and blog more often.

In fact in 2009 I went old school and (re)discovered the joys of IRC. I'm often on #python-dev and various other Python-related channels. Twitter is also still growing and I've had a lot of fun and learned a lot from the many tech folk I follow there.

What are the three top things you want to learn in 2010?

I'd like to learn more about web programming; in particular I want to get deeper into Django (and perhaps Pinax) and properly learn Javascript. There are lots of programming languages I'd like to learn: C, so I can contribute to CPython and just because it is everywhere; Haskell, so I can get functional enlightenment; maybe F#, so I can achieve the same thing but in a language that might actually be useful; Erlang, because all the cool kids are doing it and it seems to have the most practical approach to concurrency of the 'modern' languages; Lisp, to see what all the fuss is about; and probably a load more. In reality I'll only learn a programming language that I actually need to use, so I think Javascript is the language I'm most likely to have the opportunity to really dive into. Although I've tinkered with Javascript (who hasn't?) I haven't fully appreciated what it means to program idiomatically in a prototype-based language, so it is definitely of value. There are a huge number of libraries and frameworks I'd love to learn, including Twisted, multiprocessing and other web frameworks. I'd also like to do mobile application development, either for the iPhone or Android. That would give me a reason to use another language, though I have to say that Objective-C is more appealing than Java. I doubt I'll find time to do any of this in 'hobby-time', so hopefully they'll come up in a work context.

Python Surprises

In the last few days I've run into several things I didn't know about Python. Not necessarily bad or wrong, just new to me.

>>> object.__new__(int)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object.__new__(int) is not safe, use int.__new__()

The same happens for pretty much all the built-in types. I don't think you can achieve this effect from pure-Python code, which is why it is impossible (I think) to write a real singleton in pure-Python. From any singleton instance you can always do this:

object.__new__(type(the_singleton))
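For example, here's a sketch of a typical __new__-based singleton (the class and names are my own invention) and the loophole:

```python
class Singleton(object):
    _instance = None

    def __new__(cls):
        # cache a single instance and hand it back on every call
        if cls._instance is None:
            cls._instance = super(Singleton, cls).__new__(cls)
        return cls._instance

a = Singleton()
b = Singleton()
assert a is b                   # the guard works for normal construction

c = object.__new__(Singleton)   # ...but object.__new__ bypasses it entirely
assert c is not a
```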

Anyway, next surprise:

>>> class Meta(type):
...     __slots__ = ['foo']
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    nonempty __slots__ not supported for subtype of 'type'

This was annoying at the time, but caused me to find a better way to achieve what I wanted anyway. These first two surprises show that, despite the 'grand merger' of types and classes in Python 2.2, you can't treat the built-in types exactly as if they were user-defined classes.

The next one I actually ran into a while back:

>>> @EventHandler[HtmlEventArgs]
  File "<stdin>", line 1
    @EventHandler[HtmlEventArgs]
                 ^
SyntaxError: invalid syntax

This one is annoying. In IronPython EventHandler[HtmlEventArgs] would return a typed event handler for wrapping a function with. Decorator syntax would be very convenient but the only valid syntax is a name followed by optional parentheses and arguments - not any arbitrary expression.

The relevant part of the grammar is:

decorator ::= "@" dotted_name ["(" [argument_list [","]] ")"] NEWLINE

This grammar not only prevents indexing but means you can't (for example) define lambda decorators. All it would take is a grammar change and these could work, no actual code would need to be written in support. The reason that Guido didn't allow it is that he didn't want people writing code like:

@(F((foo + bar / 3)) / [x ** 2 for x in frobulator])
def function():
    ...

Guido did agree that the rules could be relaxed (here is the python-ideas thread where it was discussed), but then the language moratorium came into effect.
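In the meantime, the usual workaround is to bind the arbitrary expression to a plain name first; a sketch with a hypothetical decorator registry (the names are my own invention):

```python
# hypothetical registry of decorators, indexed by key
handlers = {'trace': lambda f: f}

# '@handlers["trace"]' is a SyntaxError under the restricted grammar,
# but binding the expression to a plain name first is always legal:
trace = handlers['trace']

@trace
def greet():
    return 'hello'

assert greet() == 'hello'
```

(As it happens, the restriction was eventually relaxed: since Python 3.9, PEP 614 allows any expression after the @.)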

The final surprise was that default object equality comparison is implemented inside the Python runtime instead of there being a default implementation in object . In fact object() instances don't even have the equality / inequality methods ( __eq__ / __ne__ ).

>>> object().__eq__(object())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'object' object has no attribute '__eq__'

However, if you look up __eq__ on the type, as you might if you were trying to delegate up to the default implementation that doesn't exist, then something weird happens:

>>> object.__eq__(object(), object())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected 1 arguments, got 2
>>> object.__eq__
<method-wrapper '__eq__' of type object at 0x141fc0>

When you look up __eq__ on object (the type rather than an instance) then you get the __eq__ method of its metaclass ( type ) bound to object which is an instance of type . As this is a bound method it only takes one argument and calling it with two arguments causes a TypeError .

In fact there is nothing special about __eq__ here, I just didn't realise that member resolution on types would check the metaclass after checking the base classes:

>>> class Meta(type):
...     X = 3
...
>>> class Something(object):
...     __metaclass__ = Meta
...
>>> Something.X  # from the metaclass
3
>>> Something.X = 4  # set on the type
>>> Meta.X
3
>>> class SomethingElse(Something): pass
...
>>> SomethingElse.X  # fetched from the base class, not the metaclass
4
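For reference, the same experiment behaves identically in Python 3, where __metaclass__ is replaced by the metaclass= keyword; a sketch:

```python
class Meta(type):
    X = 3

# Python 3 spelling: declare the metaclass in the class header
class Something(object, metaclass=Meta):
    pass

assert Something.X == 3       # found on the metaclass

Something.X = 4               # shadow it on the class itself
assert Meta.X == 3            # the metaclass attribute is untouched

class SomethingElse(Something):
    pass

assert SomethingElse.X == 4   # the base class wins over the metaclass
```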

Mocking Magic Methods and Preserving Function Signatures Whilst Mocking

So, I'm most of the way through one blog entry, my tax return is due, I have a PyCon talk to write and I have a release of ConfigObj just waiting for me to finish updating the docs. Naturally then I should mess around implementing new features for Mock.

These particular features were inspired by an email from Mock user Juho Vepsalainen who had a particular problem with Mock. In case you aren't familiar with it, Mock is a simple mocking library for unit testing. Mock makes creating mock objects, and patching out implementations with mocks at runtime, trivially easy.

I've spent a chunk of time today implementing a module that extends Mock to add new features. Eventually they will become part of Mock itself, but that would require a new release and tedious things like writing documentation:

Note: I've already improved the code in extendmock and merged it into the main mock module, so there's no need for a special MagicMock class any more. You can use mock.py from Subversion or wait for the release of version 0.7.

To implement their functionality (mocking any class and recording how they are used), mocks are instances of the Mock class. This can be a problem for code that uses introspection to determine whether something is a function, or that introspects the function signature. If you mock a function or method it will be replaced with a callable object with the signature (*args, **kwargs). This also means that code which calls it incorrectly won't raise an error; you will only catch this in your tests if you specifically check how the object is called (which you usually will, because that's the point of mocking it out - but still).

A solution to all these problems is the mocksignature function. This takes a function (or method) and a mock object. It creates a wrapper function with the same signature as the function you pass in. When called this wrapper function calls the mock, so instead of directly patching a mock to replace a function or method you use the function returned by mocksignature . Code that introspects the function you are patching out will still work. Here's an example:

from mock import Mock, patch
from extendmock import mocksignature
from some_module import some_function

mock = Mock()
mock_function = mocksignature(some_function, mock)

@patch('some_module.some_function', mock_function)
def test():
    from some_module import some_function
    some_function('foo', 'bar', 'baz')

test()
mock.assert_called_with('foo', 'bar', 'baz')

To make it more convenient to use I will build support for mocksignature into the patch decorator.

You can also use mocksignature on instance methods:

from mock import Mock
from extendmock import mocksignature

class Something(object):
    def method(self, a, b):
        pass

s = Something()
mock = Mock()
mock_method = mocksignature(s.method, mock)
s.method = mock_method
s.method(3, 4)
mock.assert_called_with(3, 4)

A limitation of mocksignature is that all arguments are passed to the underlying mock by position. If there are default values they will be explicitly passed in. Keyword arguments are only collected if the function uses **kwargs. See the tests for more details. The important fact is that the function signature is unchanged:

import inspect
from extendmock import mocksignature
from mock import Mock

def f(a, b, c='foo', **kwargs):
    pass

mock = Mock()
new_function = mocksignature(f, mock)
assert inspect.getargspec(f) == inspect.getargspec(new_function)

The limitation on keyword arguments sounds confusing (certainly the way I expressed it above), so it's easier to demonstrate in practice with the call_args attribute:

>>> from mock import Mock
>>> from extendmock import mocksignature
>>> mock = Mock()
>>> def f(a=None): pass
...
>>> f2 = mocksignature(f, mock)
>>> f2()
<mock.Mock object at 0x441d70>
>>> mock.call_args
((None,), {})
>>> mock.assert_called_with(None)

Even though we passed no arguments in, the argument with the default value (a) is passed to the mock as if None had been supplied explicitly. This affects the way you use assert_called_with when using Mock and mocksignature in concert. You can still use mocksignature with functions that collect arguments with *args and **kwargs:

>>> from extendmock import mocksignature
>>> from mock import Mock
>>> def f(*args, **kw): pass
...
>>> mock = Mock()
>>> mock.return_value = 3
>>> f2 = mocksignature(f, mock)
>>> f2(1, 'a', None, foo='fish', bar=1.0)
3
>>> mock.call_args
((1, 'a', None), {'foo': 'fish', 'bar': 1.0})

Another problem with Mock is that it currently doesn't support mocking the Python protocol methods (like __len__, __getitem__ and so on). extendmock contains a new class that adds magic method support to Mock: MagicMock. Here's an example of how you use it:

from extendmock import MagicMock

mock = MagicMock()
_dict = {}

def getitem(self, name):
    return _dict[name]

def setitem(self, name, value):
    _dict[name] = value

def delitem(self, name):
    del _dict[name]

mock.__setitem__ = setitem
mock.__getitem__ = getitem
mock.__delitem__ = delitem

try:
    mock['foo']
except KeyError:
    pass
else:
    raise AssertionError('KeyError expected')

mock['foo'] = 'bar'
assert _dict == {'foo': 'bar'}
assert mock['foo'] == 'bar'
del mock['foo']
assert _dict == {}

You mock magic methods by assigning a function (or a mock object) to the mock instance. Magic methods are looked up on the object class by the Python interpreter. MagicMock has all the magic methods implemented in a way that checks for corresponding instance variables, with sensible behaviour if the instance variable doesn't exist. However, the presence of these magic methods on the class could break some duck-typing (if it checks for the presence or absence of these methods), so I would rather have MagicMock be a separate class instead of integrating this into the Mock class. On the other hand there is no reason why I can't move MagicMock into the mock module next time I do a release.
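The class-versus-instance lookup rule that makes MagicMock necessary is easy to demonstrate on a plain class (Python 3 shown, where the same rule applies; the class is my own invention):

```python
class C(object):
    pass

c = C()
c.__len__ = lambda: 3        # set on the instance: ignored by len()
try:
    len(c)
except TypeError:
    pass                     # len() consults type(c), not the instance
else:
    raise AssertionError('len() should not see the instance attribute')

C.__len__ = lambda self: 3   # set on the class: found by len()
assert len(c) == 3
```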

For all magic methods you mock in this way you have to include self in the function signature. I might change this at a future date, so be warned: this is an experimental implementation. Also note that calls to mocked magic methods aren't recorded in method_calls and don't use object wrapping; all of this may change in the future.

One reason that some users have been requesting magic method support is for mocking context managers. Unfortunately __enter__ and __exit__ are looked up differently from the other magic methods in Python 2.5 and 2.6 (they aren't looked up on the class first but on the instance first like normal members). This makes the following technique still the correct way to mock the with statement.

Note: This is no longer true with the magic method support now in trunk; you mock __enter__ and __exit__ in exactly the same way as other magic methods. You can also mock magic methods by assigning a Mock instance to the method you are mocking. For example:

>>> from mock import Mock
>>> mock = Mock()
>>> mock.__getitem__ = Mock()
>>> mock.__getitem__.return_value = 'bar'
>>> mock['foo']
'bar'
>>> mock.__getitem__.assert_called_with('foo')

Mocking the with statement:

mock = Mock()
mock.__enter__ = Mock()
mock.__exit__ = Mock()
mock.__exit__.return_value = False

with mock as m:
    assert m == mock.__enter__.return_value

mock.__enter__.assert_called_with()
mock.__exit__.assert_called_with(None, None, None)
