Python Programming, news on the Voidspace Python Projects and all things techie.

Piping Objects: Bringing a Powershell-alike Syntax to Python

Harry Pierson recently twittered about how he missed the Powershell syntax for piping objects between commandlets when working in Python. After exchanging emails with him, it turned out he actually prefers something like the F# syntax, which uses '|>' rather than '|' as a pipe. I decided to see how far I could get with Python (settling on '>>' as the operator), and I think that what I've come up with is quite nice.

It enables you to create commandlets and pipe objects between them using '>>' (the right shift operator). Creating new commandlets is as easy as writing a function.

As I don't actually have a use case for this (!), this 'proof of concept' implementation is pretty specific to my example use case - but as it is around 60 lines of Python it is very easy to customise. The syntax is nice and declarative, so creating a library of commandlets could be useful for working at the interactive interpreter, or it could be used for creating Domain Specific Languages.

Suppose you have a set of data that you want to pass through several filters that also transform the data, and then perform an action on each record. With commandlets you can do things like:

some_data >> filter1 >> filter2 >> action

The normal Python technique would be to use list comprehensions. With list comprehensions each record has to go through the filter twice, as transforming and filtering have to be done separately. An equivalent of the above using list comprehensions looks like:

intermediate = [filter1(x) for x in some_data if filter1(x) is not ignored]
[action(filter2(x)) for x in intermediate if filter2(x) is not ignored]

Rolling that exactly into a single list comprehension means one big-ass ugly list comprehension.
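To make that concrete, here is a fused version with stand-in filters. The names (filter1, filter2, action) and the data are invented for this sketch; the ignored sentinel mirrors the commandlet convention used by my implementation:

```python
# Stand-in filters for this sketch: each returns a transformed value,
# or the 'ignored' sentinel to drop the record (all names invented).
ignored = object()

def filter1(x):
    return x * 2 if x % 2 == 0 else ignored    # keep evens, doubled

def filter2(x):
    return x + 1 if x < 10 else ignored        # keep small values, plus one

results = []
def action(x):
    results.append(x)

some_data = [1, 2, 3, 4, 5, 6]

# Fusing filtering and transforming into one expression: note that
# filter1 is called twice per record, and filter2 twice more.
[action(filter2(filter1(x))) for x in some_data
 if filter1(x) is not ignored and filter2(filter1(x)) is not ignored]

print(results)  # [5, 9]
```

Four calls per surviving record, and the condition repeats the transformation; that is exactly the ugliness the pipeline syntax avoids.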

As an example of the syntax it enables, I've implemented three simple commandlets that allow you to do:

listdir('.') >> notolderthan('2/3/08') >> prettyprint

I'm afraid that the example only works with IronPython (because the date handling is much nicer than CPython), but none of the rest of the code requires IronPython.

The first part of the chain shown above is a commandlet called listdir that returns a list of all the files in a directory (it delegates to os.listdir). Although it is lists that are piped between commandlets, the functions you write (which are wrapped in a Cmdlet class) only need to handle one argument at a time.

You create commandlets that take arguments (like notolderthan) in the same way you write decorators that take arguments: as a function that returns a function.

Here is the implementation of listdir and the notolderthan filter:

def f_listdir(path):
    def listdir():
        return [Path(path, member) for member in os.listdir(path)]
    return listdir


def f_notolderthan(date):
    datetime = System.DateTime.Parse(date)
    def notolderthan(member):
        if member.mtime >= datetime:
            return member
        return ignored
    return notolderthan


listdir = Cmdlet(f_listdir)
notolderthan = Cmdlet(f_notolderthan)

ignored is a special sentinel value that allows commandlets to act as filters. Commandlets can also perform an action instead of piping objects out. prettyprint is an example of this:

def f_prettyprint(val):
    print val


prettyprint = Action(f_prettyprint)

You can also pass in a generator (or any iterable) at the start of the chain. Here is an example that uses a recursive generator, listing all the files in a directory and its subdirectories, on the left hand side of the chain:

def recursive_walk(path):
    for e in os.listdir(path):
        p = os.path.join(path, e)
        if os.path.isfile(p):
            yield Path(path, e)
        else:
            for entry in recursive_walk(p):
                yield entry


recursive_walk('.') >> notolderthan('2/3/08') >> prettyprint

The Cmdlet class is a subclass of list, so a chain of commandlets returns a list (well, a Cmdlet) populated with the results of the call chain.

Here's the full implementation of Cmdlet and Action:

import os

import System


__version__ = '0.1.0'

__all__ = ['Action', 'Cmdlet', 'ignored', 'listdir', 'notolderthan', 'prettyprint']


ignored = object()


class Cmdlet(list):
    def __init__(self, function, _populated=False):
        self.function = function
        self._populated = _populated

    def __call__(self, *args, **keywargs):
        function = self.function
        if args or keywargs:
            function = self.function(*args, **keywargs)
        return Cmdlet(function)

    def __rshift__(self, other):
        if not self._populated:
            # a source commandlet: call it to populate the list
            self[:] = self.function()
        new = Cmdlet(other.function, True)
        vals = [other.function(m) for m in self]
        new[:] = [v for v in vals if v is not ignored]
        return new

    def __rrshift__(self, other):
        # fires when a plain iterable is on the left of '>>'
        new = Cmdlet(self.function, True)
        vals = [self.function(m) for m in other]
        new[:] = [v for v in vals if v is not ignored]
        return new

    def __repr__(self):
        return 'Cmdlet(%s)' % list.__repr__(self)


class Action(Cmdlet):
    def __rshift__(self, other):
        Cmdlet.__rshift__(self, other)
        return None

    def __repr__(self):
        return 'Action(%s)' % self.function.__name__

Nice.

Most of the magic is in __rshift__ and __rrshift__, but I'm also fond of __call__, which allows you to create commandlets that take arguments.
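As a stripped-down illustration of the reflected-operator trick (this is a sketch of mine, not the code above): __rrshift__ is what fires when a plain list sits on the left of >>, because list itself defines no __rshift__:

```python
class Pipe(object):
    """Minimal sketch of why __rrshift__ makes 'iterable >> pipe' work."""
    def __init__(self, function):
        self.function = function

    def __rrshift__(self, other):
        # Python first tries other.__rshift__(self); a plain list has no
        # __rshift__, so it falls back to this reflected method instead.
        return [self.function(item) for item in other]

double = Pipe(lambda x: x * 2)
print([1, 2, 3] >> double)  # [2, 4, 6]
```

The real Cmdlet does the same thing, plus filtering out the ignored sentinel and wrapping the result so the chain can continue.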

To run the examples you also need my homegrown Path class:

class Path(object):
    def __init__(self, path, entry):
        self.dir = path
        self.name = entry
        self.path = os.path.join(path, entry)
        # creation time is the ctime, last-write time the mtime
        self.ctime = System.IO.File.GetCreationTime(self.path)
        self.mtime = System.IO.File.GetLastWriteTime(self.path)

    def __repr__(self):
        start = 'File:'
        if os.path.isdir(self.path):
            start = 'Dir:'
        ctime = self.ctime
        mtime = self.mtime
        return "%s %s :ctime: %s :mtime: %s" % (start, self.path, ctime, mtime)

Apples, Django is Wired and Faster Python

Just a collection of interesting links from the intarwebz today:

Report: Mac sales up 60% in February. The report shows Mac unit sales up 60 percent from 2007; in dollar terms, NPD has Apple capturing a full 25 percent of the U.S. computer market last month.

Wired Magazine: Expired-Tired-Wired. Expired: ASP.NET; Tired: PHP; Wired: Django.

Issue 2459: speedup loops with better bytecode. A patch for CPython that speeds up for and while loops through better bytecode. The same technique promises to improve list comprehensions and generator expressions as well.

Python-safethread: Python 3 without the GIL. This is a big-assed patch and I wouldn't rate its chances of making it into the core too highly, but it is a damn impressive project. It allows you to compile a version of Python without the GIL (requiring changes to C extensions, of course). There is a performance cost for single-threaded code, but it allows multi-threaded code to scale across multiple cores.

SFI Conference: Erlang, Ruby and Java (and Concurrency)

A week before PyCon I was at a very different conference: the Academic IT Festival organised by three universities in Krakow, Poland. I was accompanied by my erstwhile colleague Jonathan Hartley (who has currently abandoned us for Mexico to get married). You can read his write-up in case you don't believe mine...

For a student conference this was fiercely well organised, with a fantastic array of speakers (myself and Jonathan notwithstanding). It was great to meet (amongst many others):

Chad Fowler, author and prominent member of the Ruby community

Maciek Fijalkowski who works on PyPy

Joe Armstrong, the creator of Erlang

Gilad Bracha, one of the architects of the JVM

The conference is free and open to all, and around half the talks are in English. Krakow is a beautiful city, so if you have the opportunity to be there next year then take it .

I'd like to share with you some of the things I learnt at the conference.

Chad Fowler

Chad is a great guy, I had a great time eating kebabs with him at one in the morning in Krakow. He is also a great speaker and I'd love to be as confident as him before speaking.

I've gradually been learning more about Ruby, and Chad filled in a few more gaps for me. One interesting point was that the Ruby community generally sees Rubinius (a kind-of-equivalent of PyPy for Ruby) as the future of Ruby. Currently the VM is written in C, but they aim to develop a static subset of Ruby so that the VM can be maintained in Ruby. This is an idea originating, I believe, in Slang for Smalltalk, but also seen in RPython for PyPy.

I would love to see the future of Python in PyPy. Maintaining Python in Python sounds much more fun than maintaining it in C, but also I see it as the only viable path that can lead us away from the GIL and reference counting.

I also learned that in some ways Ruby is more restrictive than Python. For example you can't change the type of objects (which you can do in Python by assigning to __class__ ), or the bases of a class (by manipulating __bases__ on the class). Being able to change the type of an object is one of the requirements Gilad Bracha has of a dynamic language, of which he is a big fan.
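For instance, reassigning __class__ really does change the type of an existing instance in Python:

```python
class A(object):
    def greet(self):
        return 'A'

class B(object):
    def greet(self):
        return 'B'

obj = A()
print(obj.greet())         # A
obj.__class__ = B          # change the type of the existing instance
print(obj.greet())         # B
print(isinstance(obj, B))  # True
```

(There are restrictions, e.g. between types with incompatible layouts, but for ordinary classes it just works.)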

Maciek Fijalkowski

Maciek is one of the PyPy developers. He first got involved through the Google Summer of Code. He emphasised that one of the differences between the PyPy team and the Python core developers is that while the Python-Dev team are fanatically interested in language design (which is good news for Python users), the PyPy team are much more interested in remaining language agnostic and producing the best possible VM.

There has been a sea-change in PyPy development recently. In the past PyPy has been a fantastic project for producing a wealth of not-quite-working-but-really-cool side projects. As a result PyPy has been at the 'nearly working' stage for a long time. They have recently pruned the codebase of anything unmaintained or that was blocking development. Some really cool things (like the ability to write CPython extensions in RPython) have gone (possibly to return), but the new focus on solidifying and completing the core is great.

Unsurprisingly though, the core Python developers, who have no familiarity with the PyPy code base, don't see it as a 'replacement' for the code base of CPython which they are very familiar with. Hopefully, as the JIT integration improves, PyPy will become more of a viable alternative in the not-too distant future. (Something I see as being potentially important to the concurrency with Python story.)

Maciek is also interested in the Resolver Systems Ironclad project, to see if parts of it could be reused to bring compatibility with CPython C extensions to PyPy - something that otherwise could be a barrier to adoption.

Joe Armstrong

Joe Armstrong is an eccentric Englishman living in Sweden. He started his talk with a look at the future of computing. Modern computing has long been based on the Von Neumann architecture. In recent years programmers of sequential programs ('normal' programs, which Joe quaintly calls 'legacy' programs) have seen their code get faster and faster as CPU clock speeds have improved (despite the fact that Intel clock speeds are largely fiddled anyway).

That has now started to change. To reduce power consumption, and maintain the ever decreasing proportion of a chip that can be reached in a single clock cycle, clock-speeds are starting to drop and processors are gaining more cores instead. Sequential programs that can only run on a single core will start getting slower rather than faster.

Single chips with hundreds of cores, the power of a super-computer, have already been produced experimentally by chip manufacturers (who aren't keen to sell them and wipe out their market for selling many processors for supercomputers). Joe has a project which is close to getting funding for blowing sixty MIPS cores onto an FPGA.

Threading with locking, as a concurrency solution, is very difficult (but not as painful as it is often painted by the Python community - particularly if you take care with your design). Some alternatives exist, like Transactional Memory (which when implemented as Software Transactional Memory without hardware support has performance costs) - but although this technique scales over multiple cores it doesn't scale over multiple machines.

Another alternative is the functional programming language Erlang. By using lightweight processes (which aren't OS level threads or OS level processes but run on top of a scheduler in the VM) and removing mutable state, Erlang programs scale to multiple cores or multiple machines as part of the language design.

Concurrency is going to be ever more important and Erlang is very trendy at the moment. It is used by ejabberd, which in turn is used by Twitter, and there are some very interesting 'massively distributed hash table' projects likely to surface soon.

In CPython, because of the Global Interpreter Lock, even threaded applications don't scale across multiple cores. The GIL is supposed to make writing C extensions easier, but it is an interesting design choice to make aspects of Python more suited to writing C than to writing Python. The Python community normally touts process-based concurrency as a better alternative. However, with a clean design you can limit the locking needed for multiple threads, and where you use threads for doing large calculations (and so have a lot of data to marshal back and forth) the costs of using multiple processes are high. Naturally Joe Armstrong thinks the answer is for us all to write in Erlang.

Interestingly, the work Chad Fowler has been doing recently has involved Ruby-without-Rails and Erlang. He says that writing Erlang test driven has been 'interesting' and has tended to make them use more processes (where in Python or Ruby you would expect TDD to make your code more modular). He is not sure whether this is a good thing or not, but it seems to be working for them.

Gilad Bracha

Gilad was one of the architects of the Java Virtual Machine, and also involved in Strongtalk, the phenomenal Smalltalk JIT that effectively became the Java hotspot JIT.

Random Aside: .NET vs Java It was very interesting to talk to some of the Microsoft .NET guys at PyCon about '.NET vs Java'. Their take was "they have the better JIT while we have the better garbage collection".

Gilad knows a lot about optimising dynamic languages. He firmly believes that dynamic languages can run as fast as statically typed languages; the reason it doesn't happen is that all the people who know enough about the subject are being paid to work on other things (this is rocket science, he says). One of the things that Maciek emphasised in his talk was that you have more information for optimising a dynamic language at runtime than you do for optimising a statically typed program at compile time.

Now that Gilad has left Sun (he is now working on a hush-hush project that involves creating a new language called Newspeak built on Smalltalk), he is happy to talk about the mistakes made with the JVM! (Unfortunately the slides aren't available.)

His talk was full of great quotes, and it is a shame that I can't remember them all. They include:

The problem with software is that you can add things, but you can't remove them. So it bloats (this is why his language is called Newspeak - in the book 1984 the language Newspeak was regularly revised to remove words).

Java has nine types of integer, and none of them do the right thing. (C# is basically no better, before .NET enthusiasts get excited.) For example, Integer has a public constructor and you can create new ones for the same value that aren't equal to each other! You don't have auto-promotion to big integers (Python longs), so you still have to know in advance how big the results of your calculations are going to be. The basic problem comes from having primitives that aren't objects and the consequent boxing and unboxing. A decent JIT can apply optimisations without needing primitives that aren't objects.
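Python's integers, by contrast, auto-promote to arbitrary precision, so you don't need to know in advance how big a result will be:

```python
big = 10 ** 30                  # far beyond any machine word
print(big * big == 10 ** 60)    # True: exact, no overflow

# Equal values are simply equal; there is no boxed/unboxed distinction
print(10 ** 30 == int('1' + '0' * 30))  # True
```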

Programs possible from languages with a static type system are a subset of all the programs that can be written. For some people this is enough.

Don't listen to your customer. Although it sounds like it, I don't think he was knocking the agile practice of involving customers in the design; he was more saying don't let your customers tell you how to do things.

Static type systems are an inherent security risk.

This last point is an interesting one, as security is one of the things that type-safety is supposed to be able to bring. He explained it with a justification followed by a story:

Formalising real-world type systems is very difficult. In order to formalise them, the authors of most type systems simplify them by making some assumptions. In practice those assumptions turn out to be wrong. Even if your formalisation is fully correct in theory, it is only safe if the implementation has no bugs...

Gilad told the story of a bug in the Java Mobile Edition type verifier. The byte-code has a 'jump' instruction, and the two conditions for the jump instruction are that the target exists and that the type-state at source and target agree. In Java ME, the type-state is tracked separately, and the type verifier only verified the type-state and forgot to verify that the target exists. A Polish programmer discovered that he could construct bytecode that would pass verification but could jump into data. He could then do things like overwrite the length of an array and effectively peek and poke into memory (on a Nokia phone). Using the type information in the operating system, he was then able to reverse engineer the whole Nokia operating system...

Gilad thinks the answer is for everyone to use Smalltalk of course (he also likes Self). He isn't too polite about other languages:

On Ruby: The performance is pathetic

On Python: I can't take seriously a language VM that uses reference counting

On Erlang and improving performance through concurrency: what Joe didn't say is that Erlang isn't exactly fast...

Gilad's latest blog entry on monkey-patching is well worth a read. Jonathan and I are sort-of-quoted as 'the-pythoners'.

Python 3, Bytes and Source File Encoding

After PyCon I stayed on for the sprints, spending most of the time with the core-Python guys. Brett's introduction to developing Python was very good. I feel much more confident about tackling bugs now - from compiling Python, to creating and submitting a patch, to running the tests. (Python's test framework is awful in my opinion, but I think it's getting better.)

It was great to renew friendships with people like Jack Diederich and Brett Cannon, plus make new friends (despite what you may think from reading the mailing list, the Python-Dev guys are a friendly bunch). On Monday I decided to help Trent Nelson, who bet Martin von Löwis a beer that he could get the buildbots green by the end of the sprints. This was some challenge, as at the time none of the 64-bit Windows buildbots had built successfully, let alone gone green.

A problem that seemed easy was a failure in the tests for the tokenize module in Python 3. The problem, that we assumed would be a four line fix, was that generate_tokens wasn't honouring the encoding declaration of Python source files. In Python 3, where string literals are Unicode, it meant that string literals would be incorrectly decoded.

I spent about a day and a half fixing it, a lot of it pairing with Trent on the work but some of it just under his watchful eye whilst he revamped the Windows build scripts and process.

generate_tokens has an odd API. It takes a readline method as its argument, either from a file or the __next__ method of a generator, so that the source you are tokenizing doesn't have to come from the filesystem. The problem is that if the file has been opened in text mode, then in Python 3 it will be decoded to Unicode using the platform default. On the Mac this is utf-8 (a nice sensible default), but on Windows it is 'cp1252'. If the Python source file has an 'encoding cookie' (# coding=latin-1 from pep-0263, or a utf-8 bom) then it will already have been (possibly incorrectly) decoded by the time we read the declaration.

I think this problem is fairly typical of moving from Python 2 to 3. If you save a text file with non-ASCII characters on the Mac and then open it in text mode on Windows, it could well fail on Python 3. Even writing arbitrary text to a file can fail on Windows, because the default encoding doesn't have a representation for every Unicode code point. Having Unicode strings is no panacea; you still have to know the encoding of your text files.
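Both failure modes are easy to demonstrate (this sketch is mine, not part of the tokenize work): utf-8 bytes decoded with cp1252 silently mojibake, and cp1252 can't encode every code point:

```python
# utf-8 encoding of u'caf\xe9' ('cafe' with an acute accent)
data = u'caf\xe9'.encode('utf-8')   # the bytes 'caf\xc3\xa9'

# Decoding with the Windows default codec silently mangles it:
# 0xc3 and 0xa9 are valid cp1252 bytes, just the wrong characters.
print(data.decode('cp1252') == u'caf\xc3\xa9')  # True

# Encoding arbitrary text with cp1252 can simply fail
try:
    u'\u2603'.encode('cp1252')       # SNOWMAN has no cp1252 byte
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True
```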

The answer for tokenize (at least the answer we implemented), was to move it to a bytes API. The readline method passed in to tokenize (the new name for generate_tokens ) should return bytes instead of strings. The correct encoding is then determined (defaulting to utf-8) and used to decode the source file. The encoding declaration can be on the first line or the second, to allow for a shebang line.

The function we wrote to detect encoding may be useful if you ever have to work with Python source files. Here's a Python 2 compatible version:

import re

from codecs import lookup


cookie_re = re.compile("coding[:=]\s*([-\w.]+)")


def detect_encoding(readline):
    """
    The detect_encoding() function is used to detect the encoding that should
    be used to decode a Python source file. It requires one argument, a
    readline function.

    It will call readline a maximum of twice, and return the encoding used
    and a list of any lines it has read in.

    It detects the encoding from the presence of a utf-8 bom or an encoding
    cookie as specified in pep-0263. If both a bom and a cookie are present,
    but disagree, a SyntaxError will be raised.

    If no encoding is specified, then the default of 'utf-8' will be returned.
    """
    utf8_bom = '\xef\xbb\xbf'
    bom_found = False
    encoding = None

    def read_or_stop():
        try:
            return readline()
        except StopIteration:
            return ''

    def find_cookie(line):
        try:
            line_string = line.decode('ascii')
        except UnicodeDecodeError:
            pass
        else:
            matches = cookie_re.findall(line_string)
            if matches:
                encoding = matches[0]
                if bom_found and lookup(encoding).name != 'utf-8':
                    # a bom means utf-8; a cookie that disagrees is an error
                    raise SyntaxError('encoding problem: utf-8')
                return encoding

    first = read_or_stop()
    if first.startswith(utf8_bom):
        bom_found = True
        first = first[3:]
    if not first:
        return 'utf-8', []

    encoding = find_cookie(first)
    if encoding:
        return encoding, [first]

    second = read_or_stop()
    if not second:
        return 'utf-8', [first]

    encoding = find_cookie(second)
    if encoding:
        return encoding, [first, second]

    return 'utf-8', [first, second]

Because the lines that have been read in still need to be tokenized, any lines that have been consumed by detect_encoding need to be 'buffered'. This is done using itertools.chain:

from itertools import chain

encoding, consumed = detect_encoding(readline)

def readline_generator():
    while True:
        try:
            yield readline()
        except StopIteration:
            return

chained = chain(consumed, readline_generator())
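Here's a self-contained sketch of the same buffering trick, standing a list of byte lines in for a real file (the names here are illustrative, not from the patch):

```python
from itertools import chain

lines = [b'# coding: latin-1\n', b'x = 1\n', b'y = 2\n']
it = iter(lines)
def readline():
    return next(it)

# Suppose an encoding detector consumed the first line while sniffing:
consumed = [readline()]

def readline_generator():
    while True:
        try:
            yield readline()
        except StopIteration:
            return

# chain replays the consumed lines before the rest of the stream,
# so the tokenizer still sees every line exactly once
replayed = list(chain(consumed, readline_generator()))
print(replayed == lines)  # True
```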

My first contribution to Python (except for a couple of trivial patches). As you can see from the check-in message, Trent was amused by my TDD approach to testing ... The change may not survive into Python 3 in the wild, but working on it was very satisfying.

It was great to work a bit with Python 3, particularly to understand the bytes / string / io changes. The new nonlocal keyword is wicked, especially for testing.

On Wednesday I spent some time with Jerry Seutter fixing some tests for urllib2 so that they no longer do things like test for redirection by going to an external server that we know does a redirect! That patch hasn't yet been applied.

Trent got most of the buildbots green and the 64bit Windows boxes not only building but passing! Not sure if he got the beer though...

Fun at PyCon 2008

I really enjoyed PyCon. It was great that the conference has grown so much but still has a real 'community' feeling where you can stop and talk to people you don't know.

Unfortunately the boss took ill at the last minute, which left Jonathan and I to man the Resolver Systems sponsor stand (we had some good conversations in the exhibition hall) and to do his talk. We survived though, and being totally unprepared even enjoyed giving the talk together.

Even more problematic, Jonathan has a Linux laptop and I use a Mac - Giles was bringing the Windows laptop! Resolver One runs fine on my Macbook under Parallels, which acted as a great stand-in until Van Lindberg came to the rescue with a loan laptop. (Many thanks Van!) This was a double rescue, as in an attempt to get his Linux laptop to work with a conference projector Jonathan managed to blow away his X-server configuration and couldn't boot his laptop an hour before his talk on Test Driven Development.

I couldn't find Jonathan before doing the Resolver Systems lightning talk though, so I did it from the Mac.

I didn't attend many talks in the end (at least I don't think I did - it's all a bit of a blur), but my favourite was Raymond Hettinger on Core Python Containers (which as it is specific to CPython wasn't really relevant to me as I'm spending most of my time buried deep inside IronPython these days).

As for the great 'sponsor controversy'... personally I didn't enjoy the sponsor keynote(s?) particularly, but not everyone agrees. I'm afraid that most of the lightning talks on Friday were pretty dull, but the ones on Saturday and Sunday were much better (even the sponsor ones). As usual the issue has been blown out of proportion, but the organisers are well aware of what worked and what didn't.

I think my talk went OK. I dashed from the talk with Jonathan on 'End User Computing and Resolver One', straight into my 'Python in the Browser Talk'. Rather than having any time to prepare myself, Chris McAvoy introduced me as I walked into the room. The audience was then treated to an undignified scramble as I tried to get my computer in a fit state to give the presentation.

The best part of the talk was showing the prototype 'Interactive Interpreter in the Browser' right at the end. This used some bugfixed IronPython binaries that Dino Viehland delivered to me the morning of my talk. As if it wasn't rushed enough already! I think that once the updated binaries are released, the 'interpreter in the browser' will be a great tool for teaching Python.

After the talk I gave a brief (7.5 minutes podcast) interview with Dr Dobb's Journal on Silverlight and IronPython.

There were many other highlights. I talked to a lot of great folks, too numerous to mention all of them, including folks like Maciek of PyPy (who I met at the SFI conference and will also be at RuPy):

And Dino the lead IronPython developer (he gave me some great reasons why Resolver Systems should upgrade to IronPython 2 including improved performance and startup time):

His talk was amazing (showing Django on IronPython using Silverlight). In order to maintain feature parity with Jython, Dino implemented from __future__ import GIL whilst I was watching! I presented the Ironclad Overview at the 'Python and .NET' Open Space organised by Feihong Hsu (who gave a great talk on Python.NET):

After the Open Space a bunch of us went to downtown Chicago (including Mahesh Prakriya who you can see lingering in the back of the picture above and Harry Pierson who would prefer I didn't call him the new commander-in-chief of dynamic languages at Microsoft). We went up the John Hancock Tower which has an astonishing view of the city.

After this it was onto the sprints. Jonathan and I went out with a crew of the 'Python-Dev' guys to a Mexican restaurant. Never have I seen an individual so excited to find good Mexican food!

Whilst hanging around in the Python-core sprint I got to meet Mark Hammond:

The sprinting was massively my favourite part of the whole event, and deserves a blog entry of its own...

IronPython & Silverlight 2 Tutorial with Demos and Downloads

Instead of posting my PyCon talk slides, I've turned them into a series of articles instead, which should be easier to follow. All the examples are available online and for download. This is everything you need to get started with IronPython and Silverlight 2.

All the articles, demos and downloads can be found at: Voidspace IronPython and Silverlight Pages.

You can experiment online with the IronPython & Silverlight Web IDE.

There is also a prototype of my Interactive Python Interpreter in the Browser. Due to a bug in the current Dynamic Silverlight binaries it can't handle indented blocks. I expect Dino will be able to post fixed binaries soon. If you want to help me make the Javascript cross-browser then get in touch! (Currently it is safari only I think.)

The Articles

These articles will take you through everything you need to know to write Silverlight applications with IronPython:

Downloads & Online Examples

You can download several of the examples used in the articles:

The easiest way of experimenting with the Silverlight APIs is through the Web IDE:

The following downloads use C#:

One of the most exciting things about Silverlight 2 is the possibility of embedding an interactive interpreter inside web pages:

If you find any bugs, typos or missing links then please let me know.
