100,000 tasklets: Stackless and Go

(Sorry for the repost. For some reason I used the wrong date. Now corrected. Since posting this yesterday several people have posted timings which confirm that with the code Pike posted, current Go is roughly 1/2 the performance of current Stackless Python.)

People are talking about Google's Go language. It supports concurrency, which here means light-weight user-space threads and inter-thread communications channels. In Stackless terms these are "tasklets" and "channels." See Richard Tew's more detailed description of the correspondences between these two languages.

In Rob Pike's Google Tech Talk on Go he shows an example which builds 100,000 "goroutines" (their tasklets, and a pun on the word 'coroutine') and does a simple operation in each goroutine. It's at 43:45 into the video. I've transcribed it here by hand, so this might contain typos:

    package main

    import ("flag"; "fmt")

    var ngoroutine = flag.Int("n", 100000, "how many")

    func f(left, right chan int) { left <- 1 + <-right }

    func main() {
        flag.Parse();
        leftmost := make(chan int);
        var left, right chan int = nil, leftmost;
        for i:= 0; i< *ngoroutine; i++ {
            left, right = right, make(chan int);
            go f(left, right);
        }
        right <- 0;  // bang!
        x := <-leftmost;  // wait for completion
        fmt.Println(x);  // 100000
    }

On the Stackless list, Richard Tew commented:

They have an nice example where they chain 100000 microthreads each wrapping the same function that increases the value of a passed argument by one, with channels inbetween. Pumping a value through the chain takes 1.5 seconds. I can't imagine that Stackless will be anything close to that, given the difference between scripting and compiled code.

I was curious so I wrote something up which suggested that Stackless Python was a lot faster than Go for this task. That was on a benchmark of my own devising, based on reading Richard's comment. Yesterday I finally tracked down the code and wrote a direct translation into Stackless:

    import stackless
    from optparse import OptionParser

    parser = OptionParser()
    parser.add_option("-n", type="int", dest="num_tasklets", help="how many", default=100000)

    def f(left, right):
        left.send(right.receive()+1)

    def main():
        options, args = parser.parse_args()
        leftmost = stackless.channel()
        left, right = None, leftmost
        for i in xrange(options.num_tasklets):
            left, right = right, stackless.channel()
            stackless.tasklet(f)(left, right)
        right.send(0)
        x = leftmost.receive()
        print x

    stackless.tasklet(main)()
    stackless.run()

It's a bit longer and more verbose than Go because there's no syntactical support for the Stackless additions to Python. Question is, how long does it take to run?

My reference is from when Pike did:

    wally:~ r$ goc goroutine.go
    wally:~ r$ 6.out
    100000
    wally:~ r$ time 6.out
    100000

    real    0m1.507s
    user    0m0.875s
    sys     0m0.626s
    wally:~ r$

He jokingly apologized for how long the 'goc' step took, which got a titter because it was quite fast, even on a "little Mac here, it's not very fast."

Let me reproduce that timing test in Stackless:

    josiah:~/src dalke$ uptime
    14:01  up 2 days, 13:53, 4 users, load averages: 0.46 0.68 0.74
    josiah:~/src dalke$ time spython go_100000.py
    100000

    real    0m0.655s
    user    0m0.512s
    sys     0m0.136s
    josiah:~/src dalke$

("spython" is the name of my local installation of Stackless.) Now, I did my tests on a 2.5-year-old MacBook Pro. That should be comparable to his Mac. My Stackless example was faster in every way than the reference Go code, and that includes the cost of Python parsing the .py file and compiling it to byte-code.

(Edit: decode contributed timing numbers done on the same machine, with Stackless being almost twice as fast as Go for this benchmark.)

Remember, Go is a compiled language designed for concurrency. I'm working with the C implementation of Python, which compiles only to byte codes, not machine code, and which was not designed for concurrency. Yet the Python code is faster.

Why then does Pike sound so proud about the performance of Go on this timing test? I don't know.

Edit: By the way, I did some other analysis. It takes 0.35 seconds to start up that many tasklets and 0.07 seconds to send a number through them all. I did not time tasklet teardown.
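For readers without a Stackless build, here is a rough sketch of how the setup and send phases can be timed separately. It uses the standard library's asyncio tasks and queues as stand-ins for tasklets and channels, which is a swapped-in technique and not what I measured. The names `chain` and `run_chain` are made up for this sketch, and its timings should not be expected to match the Stackless numbers above.

```python
import asyncio
import time

async def f(left, right):
    # Same job as the tasklet: receive from the right, send value+1 to the left.
    await left.put(1 + await right.get())

async def chain(n):
    # Phase 1: build a chain of n tasks joined by queues (stand-ins for channels).
    t0 = time.perf_counter()
    leftmost = asyncio.Queue()
    left, right = None, leftmost
    tasks = []
    for _ in range(n):
        left, right = right, asyncio.Queue()
        tasks.append(asyncio.create_task(f(left, right)))
    await asyncio.sleep(0)  # let every task run up to its first receive
    t1 = time.perf_counter()

    # Phase 2: push a value in at the right end and wait for it at the left end.
    await right.put(0)
    x = await leftmost.get()
    t2 = time.perf_counter()

    await asyncio.gather(*tasks)  # collect the tasks so none are left pending
    return x, t1 - t0, t2 - t1

def run_chain(n):
    # Returns (result, setup_seconds, send_seconds).
    return asyncio.run(chain(n))
```

Calling `run_chain(100000)` returns the value that came out of the left end along with the two elapsed times, so the split between creation cost and message-passing cost can be compared on any machine.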

Why do they talk about fast compile times with Go?

One other observation about Go. Pike's video and another I watched stressed the fast compile times. It seems modern development bogs down doing compiles. They gave numbers: 1,000 lines of Go code in 0.2 seconds on some sort of Mac. I took my copy of Python 0.9p1, at 25,000 lines of C code. It compiled in 7 seconds on my MacBook Pro laptop. Scaling up, the same number of lines of Go code would compile (assuming we have identical machines) in 5 seconds. If I compile with CFLAGS=-g then Python compiles in 2.75 seconds.

Since I'm not testing this on the same machine as them it's hard to say anything concrete, but I would be surprised to find out that my laptop was twice as fast as theirs, which it would have to be to make my numbers be worse than what they report for Go. Plus, I've heard the Intel compilers are a lot faster than gcc. (Edit: I tried to build Go on my laptop only to find it doesn't seem to support OS X 10.4.)
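The arithmetic behind those estimates, as a small sketch. It assumes compile time scales linearly with line count, which is my simplification and not something the Go videos claim; the variable names are only for illustration.

```python
# Figures quoted in the post.
go_lines, go_seconds = 1000, 0.2   # Go compile claim from the videos
c_lines = 25000                    # size of my Python source tree
c_seconds_default = 7.0            # default build
c_seconds_debug = 2.75             # CFLAGS=-g build

# Assumption: compile time grows in proportion to the number of lines.
go_estimate = c_lines * go_seconds / go_lines

# How much faster my laptop would have to be than theirs before the
# CFLAGS=-g build stopped beating the scaled Go estimate.
speedup_needed = go_estimate / c_seconds_debug

print("Go estimate: %.2f s" % go_estimate)
print("break-even speedup: %.2fx" % speedup_needed)
```

The break-even factor comes out a bit under 2, which is where the "twice as fast" figure comes from.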

Something doesn't make sense here. Why do they tout Go's performance both for goroutine message passing and for compilation as exceptional? The timings seem worse than existing comparables.

Edit based on feedback from Karl on my comments page: It seems that the real advantage is that Go handles large dependencies better than the C or Java language models. They don't want to maintain the Makefile dependencies by hand, and for the large tool sets at Google, simple Java recompiles are dominated by the compiler taking 90+ seconds just to read all the dependency information. Yet in the videos the praise is for the (relatively slow) compile times of a small set of files.

If you have ideas or thoughts, leave a comment.

Addendum

Here's an example of a Go sort:

    func Sort(data Interface) {
        for i := 1; i < data.Len(); i++ {
            for j := i; j > 0 && data.Less(j, j-1); j-- {
                data.Swap(j, j-1);
            }
        }
    }

    type Interface interface {
        Len() int;
        Less(i, j int) bool;
        Swap(i, j int);
    }

It's handled by the interface. Types/interfaces are only a partial specification. There's nothing which says that Len() returns a constant value, nor that the size of the array is constant while being sorted. What would happen if the size changes while the sort is in operation?

This was a bug in Python some years back. During list.sort() a user-defined comparison method might reach back to the container and change it. In the worst case situations, I think this caused a crash. This was fixed by making the container read-only during the duration of the sort:

    >>> class Evil(object):
    ...     def __lt__(self, other):
    ...         data.append(9)
    ...         return 1
    ...
    >>> data = [1, 2, Evil()]
    >>> data.sort()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: list modified during sort
    >>>

What does Go do for this case? Does it crash? The code doesn't expect errors so there's no way to use the multi-valued return types suggested in Effective Go. It doesn't have exceptions. Perhaps there's a global error handler?

If you know, leave a comment.
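To make the concern about the sort concrete, here is a rough Python rendering of the same insertion sort, written only against len/less/swap. It says nothing about what Go itself does, which is still my open question. `ListAdapter` and `ShrinkingAdapter` are hypothetical names invented for this sketch.

```python
def sort(data):
    # Insertion sort written against len/less/swap only, like the Go version.
    i = 1
    while i < data.len():  # len() is re-read on every pass
        j = i
        while j > 0 and data.less(j, j - 1):
            data.swap(j, j - 1)
            j -= 1
        i += 1

class ListAdapter:
    """Well-behaved container: its size stays fixed while sorting."""
    def __init__(self, items):
        self.items = items
    def len(self):
        return len(self.items)
    def less(self, i, j):
        return self.items[i] < self.items[j]
    def swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]

class ShrinkingAdapter(ListAdapter):
    """Hypothetical misbehaving container: drops its last element
    just as that element is about to be compared."""
    def less(self, i, j):
        if i == len(self.items) - 1:
            self.items.pop()
        return super().less(i, j)
```

Sorting a `ListAdapter` works as expected. Sorting a `ShrinkingAdapter` fails with an IndexError in this Python version, because the interface never promised that an index which was valid when len() was read is still valid when less() is called.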

Addendum #2 - building Go on my laptop

joesb on reddit asked why I didn't compile Go myself, so I could get head-to-head timings.

I went to the installation page thinking to do that. I was put off by the environment variables. This is 2009 and there should be no reason to set any variables. Honestly, that's what put me off.

Still, since this is making its way around the web, I decided to give it a go. The build requires $GOROOT, $GOOS and $GOARCH to be set. I did that. It says $GOBIN is optional, so I didn't set that. Ran the build and "$GOBIN is not a directory or does not exist". Doesn't seem so optional, does it?

I set GOBIN to ~/tmp and restarted the build. That got me to "make: quietgcc: Command not found". A bit of digging and I see that I missed that GOBIN needs to be on the PATH. It does mention that in the documentation, but since it was 'optional' I skipped it. Turns out that if I don't set the option then it uses ~/bin, which doesn't exist on my machine and is why I got the error in the previous paragraph. Only, well, I didn't set GOBIN so the error message is wrong - it should be that ~/bin is not a directory or does not exist.

Fixed my PATH and ended up with

    quietgcc -ggdb -I/Users/dalke/cvses/go/include -O2 -fno-inline -c /Users/dalke/cvses/go/src/cmd/cc/y.tab.c
    /usr/include/stdio.h:269: error: conflicting types for 'getc'
    /Users/dalke/cvses/go/src/cmd/cc/cc.h:573: error: previous declaration of 'getc' was here
    make: *** [y.tab.o] Error 1
    make: *** Waiting for unfinished jobs....

I've seen this before. It's because I'm running OS X 10.4.11 which has bison version 1.28 and Mac 10.5 updated bison to 2.3, and that introduced some changes. I did write that I have an old Mac, right? And I haven't upgraded it either.

In short, it doesn't look like Go installs on my machine. Perhaps I should get a new OS ... or a new laptop.

I'm not complaining about this limitation. I have an old OS. There's no reason for Google to expect to support it. I'm not interested enough to fix the problem myself. I think the numbers given in Pike's Tech Talk are comparable enough that my question is still valid - why does Pike emphasize the performance of Go's goroutine creation and channel communication when it seems to be slower than Stackless and definitely is not an order of magnitude faster?

Addendum #3 - comparisons on the same hardware

decode compared Go and Stackless on the same machine and reports:

    ~/development/go/dev$ time ./8.out
    100000

    real    0m0.632s
    user    0m0.288s
    sys     0m0.344s

    ~/install/stackless-2.6.4$ time ./python 100k.py
    100000

    real    0m0.350s
    user    0m0.292s
    sys     0m0.060s

making Python about twice as fast as Go. (The ratio is 1/1.8 if you want to be precise.)
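For anyone who wants to check that ratio, it is just the two "real" times from decode's run divided out. The variable names here are only for illustration.

```python
go_real = 0.632         # seconds, ./8.out
stackless_real = 0.350  # seconds, ./python 100k.py

# Wall-clock ratio: how many times longer the Go run took.
ratio = go_real / stackless_real
print("Go took %.1fx as long as Stackless" % ratio)
```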

Thanks decode!

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me