Python is a programming language that I learnt fairly recently (some two or three years ago) and that I like very much. It is simple, to the point, and has several functional-like constructs that I am already familiar with. But Python is slow compared to other programming languages, and it was unclear to me just how slow. It just felt slow.

So I decided to investigate by comparing implementations of a simple, compute-bound problem: the eight queens puzzle, generalized to any board dimension. This puzzle is most easily solved using, as Dijkstra did, a depth-first backtracking program with bitmaps to test rapidly whether or not a square is free of attack¹. I implemented the same program in C++, Python, and Bash, and got help from friends for the C# and Java versions². I then compared the resulting speeds.

The eight queens puzzle consists in finding a way to place 8 queens on an 8×8 chessboard so that no queen checks another. By “checking”, we mean that if the other queen were of the opposing camp, you could capture it if it were your turn to play. This means that no queen is on the same row, column, or diagonal as another queen. Using a real chessboard and pawns in lieu of queens, you can easily find a solution in a few seconds. Now, to make things more interesting, we might want to enumerate all solutions (for the time being, neglecting to check whether a solution is a rotated or mirrored version of another).

The basic algorithm to solve the eight queens puzzle is a relatively simple recursive one. First, we place a queen somewhere on the first row and we mark the row it occupies as menaced; we do the same with its column and two diagonals. We then try to place a second queen somewhere on the second row, on a square that is menace-free, and mark the row, column, and diagonals of the new queen as menaced. We proceed in the same fashion for the other queens. But suppose that at the kth stage, we cannot find a menace-free square, preventing us from placing the kth queen. If this situation arises, we give up on the kth queen and backtrack to the (k-1)th queen. We try a new (never tried before) location for the (k-1)th queen and go forth trying to place the kth queen again. It may happen that there are no other positions for the (k-1)th queen either, which asks for the (k-2)th queen to be moved, which can also result in a dead end; and so forth, possibly all the way back down to the first queen.
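The procedure above can be sketched in a few lines of modern Python. This is a minimal illustration of the backtracking scheme only, not the benchmarked program; it tracks menaced columns and diagonals in sets:

```python
def count_solutions(n):
    """Count all ways to place n non-attacking queens on an n x n board."""
    cols, up, down = set(), set(), set()  # menaced columns and diagonals

    def place(row):
        if row == n:
            return 1  # all n queens placed: one complete solution
        total = 0
        for col in range(n):
            # a square (row, col) lies on diagonals row+col and row-col
            if col in cols or (row + col) in up or (row - col) in down:
                continue  # menaced: try the next column
            # mark this queen's column and diagonals as menaced
            cols.add(col); up.add(row + col); down.add(row - col)
            total += place(row + 1)  # recurse on the next row
            # unmark: backtrack and try the next column
            cols.discard(col); up.discard(row + col); down.discard(row - col)
        return total

    return place(0)
```

On an 8×8 board, `count_solutions(8)` counts the classic 92 solutions.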

Because the algorithm proceeds depth-first and can rear back quite a bit in looking for new solutions, it is called a backtracking algorithm. Backtracking algorithms are key to many artificial intelligence systems like, well, chess programs.

So I wrote the program for the queens puzzle on an n×n board in C quite some time ago (circa 1995), then ported it to C++ recently (2006). And just for kicks, I decided to port it to different languages, with help from friends for Java and C# (of which I know about zilch). There are two C++ versions: a generic version that accepts the board size as a command-line argument, and a constant-propagated version where the board size is determined at compile time. There are also two Python versions, which differ in how python-esque they are. The first variant is a rather literal translation of the C++ program. The second uses pythonesque idioms such as sets rather than bitmaps, and turns out to be quite a bit faster (about 40% faster, in fact). The Bash version is necessarily rather bash-esque, as Bash does not offer anything much more sophisticated than arrays in terms of data structures.
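To illustrate the difference between the two idioms (my own minimal reconstruction, not code from the archive): the literal translation tracks menaced columns in an integer used as a bitmap, while the pythonesque version uses set membership for the same job.

```python
# Two ways of tracking menaced columns, illustrated with columns 2 and 5.

# C++-style bitmap: bit j set means column j is menaced.
bitmap = 0
bitmap |= (1 << 2)                    # mark column 2
bitmap |= (1 << 5)                    # mark column 5
col3_free = (bitmap & (1 << 3)) == 0  # test column 3: free of attack?

# Pythonesque set: membership expresses the same test more directly.
menaced = set()
menaced.add(2)                        # mark column 2
menaced.add(5)                        # mark column 5
col3_free_set = 3 not in menaced      # same test, set idiom
```

Both tests agree that column 3 is free; in the full program the set idiom turned out to be measurably faster than bit twiddling in the interpreter.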

All implementations were compiled with all optimizations enabled (-O3, inlining, interprocedural optimizations, etc.). For C++, I used g++ 4.2.4; for C#, gmcs 1.2.6.0; for Java, gcj 4.2.4; Python 2.6.2; and, finally, Bash 3.2.39. These are all the latest versions for Ubuntu 8.04 LTS, which doesn't mean that they're the very latest versions.

So I ran all seven implementations with board sizes ranging from 1 to 15 on an AMD64 4000+ CPU (the bounds are somewhat arbitrary; I wanted to get good data but also limit the CPU time spent, as the run time increases factorially in the board size!). At first, the results are not very informative:

As expected, all times shoot up quite fast, with, unsurprisingly, Bash shooting up fastest, followed by the two Python implementations. At this scale, Java, C#, C++, and C++-fixed (the constant-propagated version) all seem more or less similar. To remove the factorial growth (as even with a log plot the data remains rather unclear), I scaled all timings relative to the C++ version. The C++ version is therefore 1.0 regardless of actual run time; for board size n, the time needed by Bash, for example, was divided by the time needed by C++ for board size n. We now get:

We can use a log scale for the y-axis to better separate the similar results:

We see three very strange things. The first is Bash shooting up wildly. The second is that the C# relative time goes down with increasing board size. The third is that below board size 9 or so, the results fluctuate. The first is easily explained: Bash is incredibly slow, about 10,000 times as slow as the optimized, compiled C++ version. It is so slow that the two last data points are estimated, as it would have taken an extra week or so to get them. The second anomaly needs a bit more analysis. Looking at the raw timing data, we can see that the C# version seems to need an extra 7 ms regardless of board size. I do not know where that comes from, as the timings do not include things such as program load and initialization, but only the solving of the puzzle itself. It may well be that the JIT takes a while to figure out that the recursive function is expensive, and that the run-time compilation costs 7 ms. In any case, were we to remove this mysterious extra 7 ms, the odd behaviour of the C# implementation would vanish. The third anomaly is easily explained by the granularity of the timer and the operating system: up to puzzle size 8, the times are much less than 100 μs for most implementations. It suffices to move the mouse and generate a couple of interrupts to throw the timing off considerably.

Removing the small board sizes yields:

which shows that the respective implementations are well behaved (except for C#, with its extra 7 ms, most probably JIT-related).

So we see that Bash is about 10,000 times slower than the optimized C++ version. The constant-propagated version is only a bit (~5%) faster than the generic C++ version. The Java version takes about 1.5× the time of the C++ version, which is way better than I expected; don't forget that this is a native-code version of the Java program, not one running on the JVM. The C# version is twice as slow as the C++ version, which is somewhat disappointing, but not terribly so. The Python versions are 120× (for the more pythonesque version) and 200× slower. That's unacceptably slow, especially since the Python programs aren't particularly fancy or complex. We do see that using pythonesque idioms yields a nice performance improvement of about 40%, but that's still nowhere near useful.

*

* *

So what does this tell us? For one thing, that Bash is slow. But that Bash is slow even when it doesn't use any external commands should not come as a surprise. From what I understand of Bash, data structures are limited to strings and arrays. Lists and strings are the same: basically, a list is merely a string with items separated by the IFS character(s), which causes all array-like accesses to lists to be performed in linear time, as the string is reinterpreted each time given the current IFS. So even though a construct such as ${x[i]} looks like random access, it is not. As for explicitly constructed arrays (as opposed to lists), there seems to be a real random-access capability, but it's still very slow. I do not think that Bash uses anything like a virtual machine and an internal tokenizer to speed up script interpretation. Maybe that'd be something to put on the to-do list for Bash 5? In any case, I also learnt that Bash is a lot more generic than I thought.

The other thing is that Python is not a programming language for compute-bound problems. This makes me question how far projects such as Pygame (which aims at developing a cromulent gaming framework for Python) can go. While all of the graphics and sound processing can be off-loaded to third-party libraries written in C or assembly language that interact with the platform-specific drivers, the central problem of driving the game itself remains. How do you provide strong non-player characters/opponents when everything is 100× slower than the equivalent C or C++ program? What about strategy games? How can you build an MMORPG with a Python engine? Is an MMORPG I/O-bound or compute-bound? Could you write a championship-grade chess engine in Python?

My guess is that you just can’t.

And that's quite sad, because I like Python as a programming language. I used it in a number of (I/O-bound) mini-projects and was each time delighted with the ease of coding compared to C or C++ (for those specific tasks). It pains me that Python is just too slow to be of any use whatsoever in scientific/high-performance computing. I wish for Python 4 to have a new virtual machine and interpreter that bring Python back on par with Java and C#, performance-wise. Better yet, why not have a true native compiler, like gcj, for Python?

*

* *

I am fully aware that the n queens on an n×n board puzzle is a toy problem of sorts. But its extreme simplicity and its somewhat universal backtracking structure make it an especially adequate toy problem. If a language can't handle such a simple problem very well, how can we expect it to scale to much more complex problems like, say, a championship-level chess engine?

*

* *

The raw data (do not forget that the two last timings for Bash are estimated). All times are in seconds.

Size        C++  C++-fixed         C#       Java       Python     Python-2           Bash
  1    0.000000   0.000000   0.007315   0.000001     0.000012     0.000012       0.003054
  2    0.000002   0.000001   0.007260   0.000002     0.000029     0.000028       0.010938
  3    0.000003   0.000001   0.008356   0.000002     0.000054     0.000045       0.006067
  4    0.000046   0.000007   0.007302   0.000003     0.000136     0.000109       0.011355
  5    0.000005   0.000005   0.007864   0.000005     0.000424     0.000318       0.026913
  6    0.000012   0.000011   0.007451   0.000015     0.001708     0.001275       0.091869
  7    0.000036   0.000033   0.007421   0.000049     0.006078     0.004125       0.347451
  8    0.000126   0.000104   0.008287   0.000198     0.026174     0.017448       1.472102
  9    0.000569   0.000500   0.008452   0.000895     0.115218     0.077854       6.483076
 10    0.002973   0.002369   0.012481   0.004244     0.546922     0.348823      30.813085
 11    0.016647   0.011468   0.033548   0.022947     2.822554     1.701767     163.061341
 12    0.080705   0.065407   0.154490   0.128498    15.405020     9.349661     891.828031
 13    0.448602   0.462506   0.877194   0.692349    91.170068    55.532111    5031.663741
 14    2.830424   2.765098   5.444645   4.457763   581.983903   339.244132   31746.000000
 15   18.648028  18.032639  35.769138  29.321677  3762.739785  2259.010794  209162.000000
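The relative slowdowns quoted above can be recomputed directly from the last row of the table (board size 15; recall that the Bash figure is an estimate):

```python
# Timings for board size 15, taken from the table above (seconds).
t_cpp    = 18.648028
t_python = 3762.739785   # literal translation of the C++ program
t_py2    = 2259.010794   # pythonesque version
t_bash   = 209162.0      # estimated

print("Python  : %7.0f x C++" % (t_python / t_cpp))  # about 200x
print("Python-2: %7.0f x C++" % (t_py2 / t_cpp))     # about 120x
print("Bash    : %7.0f x C++" % (t_bash / t_cpp))    # roughly 10,000x
```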

The More Pythonesque version:

    #!/usr/bin/python -O
    # -*- coding: utf-8 -*-
    import sys
    import time

    ########################################
    ##
    ## (c) 2009 Steven Pigeon (pythonesque version)
    ##
    diag45=set()
    diag135=set()
    cols=set()
    solutions=0

    ########################################
    ##
    ## Marks occupancy
    ##
    def mark(k,j):
        global cols, diag45, diag135
        cols.add(j);
        diag135.add(j+k)
        diag45.add(32+j-k)

    ########################################
    ##
    ## unmarks occupancy
    ##
    def unmark(k,j):
        global cols, diag45, diag135
        cols.remove(j);
        diag135.remove(j+k)
        diag45.remove(32+j-k)

    ########################################
    ##
    ## Tests if a square is menaced
    ##
    def test(k,j):
        global cols, diag45, diag135
        return not((j in cols) or \
                   ((j+k) in diag135) or \
                   ((32+j-k) in diag45))

    ########################################
    ##
    ## Backtracking solver
    ##
    def solve( niv, dx ):
        global solutions, nodes
        if niv > 0 :
            for i in xrange(0,dx):
                if test(niv,i) == True:
                    mark ( niv, i )
                    solve( niv-1, dx)
                    unmark ( niv, i )
        else:
            for i in xrange(0,dx):
                if (test(0,i) == True):
                    solutions += 1

    ########################################
    ##
    ## usage message
    ##
    def usage( progname ):
        print "usage: ", progname, " <size>"
        print "size must be 1..32"

    ########################################
    ##
    ## c/c++-style main function
    ##
    def main():
        if len(sys.argv) < 2:
            usage(sys.argv[0])
        else:
            try:
                size = int(sys.argv[1])
            except:
                usage(sys.argv[0])
                return

            if (size <= 32) & (size>0):
                start = time.time()
                solve(size-1,size)
                elapsed = time.time()-start
                print "%s %0.6f" % (solutions,elapsed)
            else:
                usage(sys.argv[0])

    #
    if __name__ == "__main__":
        main()

The other versions can be found here in super_reines.zip. The Bash and C++ versions are somewhat Linux-specific.

¹ A trick explained in detail in Brassard and Bratley, Fondements de l'algorithmique, Presses de l'Université de Montréal. Translated to English as Fundamentals of Algorithmics (at Amazon.com).

² Frédéric Marceau translated the program from C++ to C#. François-Denis Gonthier translated from C++ to Java.

*

* *

So I added a third Python version to the archive, and benchmarked it as well. It is quite a bit faster than the original one, and even faster than the ‘pythonesque’ version. For example:

    $ super-reines.py 12
    14200 15.966019
    $ super-reines-2.py 12
    14200 9.478235
    $ super-reines-3.py 12
    14200 6.845071

So a more ‘functional’ version performs about 2.3× faster than a direct translation from C++. For the same problem size, however, the C++ (fixed) version takes 0.069s… Roughly 100× faster than the 3rd version.
