Computing Thoughts

Parallel Python

by Bruce Eckel

September 11, 2007



Summary

My previous post led me to this library, which appears to solve the coarse-grained parallelism problem quite elegantly.


You can find the library here; it was written by Vitalii Vanovschi, a Russian chemist now doing graduate work at USC. He appears to have created the library to serve his own computational needs, and designed it to be simple so that his colleagues could use it.

Parallel Python is based on a functional model; you submit a function to a "job server" and then later fetch the result. It uses processes (just as I requested in my previous post) and IPC (interprocess communication) to execute the function, so there is no shared memory and thus no side effects leaking from one job into another.

The pp module will automatically figure out the number of processors available and by default create as many worker processes as there are processors.
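As a rough illustration of that default, the sketch below picks a worker count the way I assume such a library might: one worker per detected processor unless the caller asks for a specific number. The helper resolve_worker_count and its "autodetect" sentinel are hypothetical names for this sketch, not code taken from pp.

```python
import os


def resolve_worker_count(ncpus="autodetect"):
    """Return the number of worker processes to start.

    With the default sentinel, use one worker per processor the OS reports.
    """
    if ncpus == "autodetect":
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    ncpus = int(ncpus)
    if ncpus < 0:
        raise ValueError("worker count cannot be negative")
    return ncpus


if __name__ == "__main__":
    print("workers by default:", resolve_worker_count())
    print("workers when overridden:", resolve_worker_count(2))
```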

You provide the function, a tuple of that function's arguments, and tuples with the dependent functions and modules that your function uses. You can also provide a callback function called when your function completes. Here's what the syntax looks like:

    import pp

    job_server = pp.Server()  # Uses number of processors in system
    f1 = job_server.submit(func1, args1, depfuncs1, modules1)
    f2 = job_server.submit(func1, args2, depfuncs1, modules1)
    f3 = job_server.submit(func2, args3, depfuncs2, modules2)

    # Retrieve results:
    r1 = f1()
    r2 = f2()
    r3 = f3()
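Here is a sketch of what those placeholders might look like with real values, assuming the third-party pp package is installed (if it is not, the example falls back to a serial call). The functions isprime, sum_primes, and report are my own illustrations. My understanding is that the callback receives its callbackargs followed by the job's result; treat that as an assumption to check against the pp documentation.

```python
import math


def isprime(n):
    """Return True if n is prime, using trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, int(math.sqrt(n)) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def sum_primes(n):
    """Return the sum of all primes below n."""
    return sum(x for x in range(2, n) if isprime(x))


def report(label, result):
    """Callback: assumed to be called with callbackargs followed by the result."""
    print(label, result)


def main():
    try:
        import pp  # third-party; assumed to be installed
    except ImportError:
        print("pp not available; serial result:", sum_primes(100))
        return
    job_server = pp.Server()
    job = job_server.submit(
        sum_primes,            # the function to run
        (100,),                # tuple of its arguments
        (isprime,),            # dependent functions it calls
        ("math",),             # modules it needs imported in the worker
        callback=report,
        callbackargs=("sum of primes below 100:",),
    )
    print("fetched:", job())   # blocks until the job has finished


if __name__ == "__main__":
    main()
```

Listing isprime and "math" explicitly is what lets the job server ship everything the function needs to a separate process.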

What's even more interesting is that Vitalii has already solved the scaling problem. If you want to use a network of machines to solve your problem, the change is relatively minor. You start an instance of the Parallel Python server on each node machine:

    node-1> ./ppserver.py
    node-2> ./ppserver.py
    node-3> ./ppserver.py

Then create the Server() by handing it a list of nodes in the cluster:

    import pp

    ppservers = ("node-1", "node-2", "node-3")
    job_server = pp.Server(ppservers=ppservers)

Submitting jobs and getting results works the same as before, so switching from multiple cores to a cluster of computers is virtually effortless. Notice that it transparently handles the problem of distributing code to the remote machines. It is not clear to me, however, whether ppserver.py automatically makes use of multiple cores on the node machines, although you would think so.

This library allows you to stay within Python for everything you're doing, although you can easily do further optimizations by writing time-critical sections in C, C++, Fortran, etc. and effortlessly and efficiently linking to them using Python 2.5's ctypes.
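For a taste of what linking to compiled code through ctypes looks like, here is a minimal sketch that calls the C library's strlen. It assumes a POSIX system (Linux or macOS), where passing None to ctypes.CDLL exposes the symbols already linked into the interpreter; on Windows you would load a specific DLL instead. The wrapper name c_strlen is my own.

```python
import ctypes


def load_c_library():
    """Return a handle to the C symbols already loaded in this process (POSIX only)."""
    # Assumption: CDLL(None) works on Linux and macOS, but not on Windows.
    return ctypes.CDLL(None)


def c_strlen(data):
    """Return the length of a bytes object as computed by C's strlen."""
    libc = load_c_library()
    strlen = libc.strlen
    # Declare the C signature so ctypes converts arguments and results correctly.
    strlen.argtypes = [ctypes.c_char_p]
    strlen.restype = ctypes.c_size_t
    return strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"parallel"))
```

A time-critical routine of your own would be compiled into a shared library and loaded by path in the same way, with argtypes and restype declared for each function you call.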

This is an exciting development for anyone doing parallel processing work, and something I want to explore further once my dual-core machine comes online.

About the Blogger

Bruce Eckel (www.BruceEckel.com) provides development assistance in Python with user interfaces in Flex. He is the author of Thinking in Java (Prentice-Hall, 1998, 2nd Edition, 2000, 3rd Edition, 2003, 4th Edition, 2005), the Hands-On Java Seminar CD ROM (available on the Web site), Thinking in C++ (PH 1995; 2nd edition 2000, Volume 2 with Chuck Allison, 2003), C++ Inside & Out (Osborne/McGraw-Hill 1993), among others. He's given hundreds of presentations throughout the world, published over 150 articles in numerous magazines, was a founding member of the ANSI/ISO C++ committee and speaks regularly at conferences.

This weblog entry is Copyright © 2007 Bruce Eckel. All rights reserved.