Stackless Python vs. Go

Note: I did a little more research, and it turns out that the gc runtime creates only one OS thread, adding further threads only to avoid blocking on I/O. The gccgo runtime, on the other hand, maps goroutines to pthreads on a 1:1 basis (as of now).

I had been lazily following the Go language project for a while when it suddenly came up in this post on “Appunti Digitali” (in Italian), which pointed to this post on Dalke Scientific.

It really got me that Stackless Python manages to beat Go so badly, so I decided to run my own tests on my trusty old IBM X-41 (Pentium M, 1.5 GHz).

I compiled the Go toolchain and Stackless Python, then ran the tests exactly as I found them on the Dalke page. Here are the results:

$ time ./8.out
100000

real 0m5.197s
user 0m1.508s
sys 0m1.352s

$ time /usr/local/bin/python2.6 test.py
100000

real 0m3.315s
user 0m1.556s
sys 0m0.148s

What really struck me is that the time spent in userland is roughly the same for both programs. What kills the Go execution time is that it spends almost as much time in kernel mode as it does in user mode. I wondered why…

Then I remembered that, in the Google Tech Talk, Pike said something about the Go runtime managing threads for the user, and I automatically assumed that it would associate user threads with OS threads in some way (as is the norm). In fact, that is what gccgo does (it uses NPTL).

On the other hand, reading the Stackless Python documentation, I found this page about Tasklets. It clearly states: “Tasklets are characterized by being very lightweight and portable, and make great alternatives to system threads or processes.”

So I am inclined to believe that Stackless Python Tasklets are managed by the Python VM itself, requiring no user-to-kernel mode transitions and no other kernel interaction… because they are not real threads (as opposed to Go goroutines).
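
If I wanted to check that hypothesis on a current toolchain, I could ask the Go runtime itself how many OS threads it has created for a given number of goroutines. Here is a minimal sketch of my own (not part of the benchmark above; it assumes a modern Go release, since the runtime/pprof "threadcreate" profile it relies on did not exist in 2009):

package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "sync"
)

func main() {
    const n = 100000

    var wg sync.WaitGroup
    stop := make(chan struct{})

    wg.Add(n)
    for i := 0; i < n; i++ {
        go func() {
            defer wg.Done()
            <-stop // park here so all n goroutines exist at the same time
        }()
    }

    // Goroutines known to the runtime vs. OS threads it has created so far.
    fmt.Println("goroutines:        ", runtime.NumGoroutine())
    fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())

    close(stop) // let every goroutine finish
    wg.Wait()
}

On a recent gc toolchain the second number stays small even with all those goroutines alive, which matches the note at the top of this post rather than the 1:1 mapping I assumed here.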

This, together with the fact that Go is mainly intended to cut compile time (not so much running time), makes me believe the comparison is pretty unfair, and I think Go delivers quite a performance!

As for certain unsafe situations, they are definitely a problem. Still, I believe they will be taken care of quickly: Go is a brand-new language, rough edges are to be expected here and there, and it has plenty of room for growth.

Update:

I modified the test programs so that each goroutine/Tasklet calls a second function that loops 1000 times and accumulates the result of “sum = sum + 1”.

The results are as expected: the compiled code is an order of magnitude faster than the interpreted Stackless Python code! I also replaced range with xrange, as suggested by Cesare Di Mauro, but that did not change things much… nor did trying Psyco (a JIT compiler for Python).


$ time ./test2 && time /usr/local/bin/python test2.py
100000

real 0m4.037s
user 0m1.040s
sys 0m0.840s
100000

real 0m19.567s
user 0m15.697s
sys 0m0.160s

The source code for Python is:
import stackless
from optparse import OptionParser

# Command-line option: how many tasklets to chain together
# (the option-parser setup was cut off in the original listing and is
# reconstructed here).
parser = OptionParser()
parser.add_option("-n", "--num_tasklets", dest="num_tasklets", type="int",
                  default=100000)

def f(left, right):
    loop()
    left.send(right.receive() + 1)

def loop():
    # burn some CPU so the test measures more than bare channel overhead
    sum = 0
    for i in xrange(1, 1000):
        sum = sum + 1

def main():
    options, args = parser.parse_args()
    leftmost = stackless.channel()
    left, right = None, leftmost
    # build a chain of tasklets, each pair of neighbours linked by a channel
    for i in xrange(options.num_tasklets):
        left, right = right, stackless.channel()
        stackless.tasklet(f)(left, right)
    right.send(0)           # inject the initial value at one end
    x = leftmost.receive()  # wait for it to travel through the whole chain
    print x

stackless.tasklet(main)()
stackless.run()

And the test code for Go is:
"fmt";
)

var ngoroutine = flag.Int("n", 100000, "how many")

func f(left, right chan int) {
loop();
left < - 1+<-right; } func loop() { var sum int; sum = 0; for i := 0; i < 1000; i++ { sum = sum + 1 } } func main() { flag.Parse(); leftmost := make(chan int); var left, right chan int = nil, leftmost; for i := 0; i < *ngoroutine; i++ { left, right = right, make(chan int); go f(left, right); } right <- 0; // bang! x := <-leftmost; // wait for completion fmt.Println(x); // 100000 }

8 thoughts on “Stackless Python vs. Go”

  1. Tasklets and Goroutines are both green threads (i.e. they are lightweight and scale well compared to OS-level threads), so there shouldn’t be much difference there.

    One thing I suspect might be different is that goroutines are supposed to be distributed over a set of threads, whereas tasklets only run in one thread (and, because of the GIL in Python, only on one core, which means Stackless wouldn’t scale on a multi-core system at all). (See the sketch after these comments.)

    Golang manages goroutines and can move a goroutine to a different OS thread if one goroutine blocks (on I/O, for instance). This could add some overhead. As far as I know, Golang doesn’t use this feature yet, since it’s not stable, so I’m not sure whether it has anything to do with the difference in speed.

    Your test with an added load is interesting, because the original test boils down to just a number of function calls, which Python can do quite well. The biggest difference between Golang and Python is in the type system, and the interpreter is pretty much the least slow part in that kind of test.

  2. Sorry, but Tasklets and Goroutines are both coroutines, not green threads (a thread containing only one routine is a coroutine as well).

    Tasklets are implemented as green threads (they are entirely managed by the Python VM), while Goroutines, as of now, are pthreads.

    I guess the idea for the future is, as you said, to have only a few OS threads and multiplex the Goroutines onto them, but that would require features that are not present in the runtime right now.

  3. Very interesting! It is certainly rare to find a language slower than Python! 🙂
    Is it possible that they are willing to sacrifice so much in exchange for fast compilation? At that point, why not use an interpreter and a lint? 😀

    (By the way, I had not heard of Stackless Python before; I would say it has some decidedly interesting features, worth looking into.)

  4. Your merits are starting to be recognized:

    Google: in 2010
    the smartphone arrives.
    It will be sold online
    and without intermediaries.
    It will be called Nexus One. Mobile phone services will be purchased separately.

  5. Yes, it has green threads (stackless) that let you quickly create many lightweight threads, as long as no operations are blocking (something like Ruby’s threads?). What is this great for? What other features does it have that I would want to use over CPython?
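
On the point raised in the first comment about distributing goroutines over a set of OS threads: here is a small sketch of my own (again assuming a current Go toolchain, where runtime.GOMAXPROCS no longer defaults to 1 as it did in 2009). It runs the same CPU-bound work with GOMAXPROCS set to 1 and then to the number of CPUs, so you can see whether the goroutines actually execute in parallel.

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// busy burns CPU so that parallelism (or its absence) shows up in wall time.
func busy(iters int) int {
    sum := 0
    for i := 0; i < iters; i++ {
        sum = sum + 1
    }
    return sum
}

// run starts `workers` goroutines doing CPU-bound work and returns the wall time.
func run(workers, iters int) time.Duration {
    start := time.Now()
    var wg sync.WaitGroup
    wg.Add(workers)
    for i := 0; i < workers; i++ {
        go func() {
            defer wg.Done()
            busy(iters)
        }()
    }
    wg.Wait()
    return time.Since(start)
}

func main() {
    const workers, iters = 4, 50000000

    runtime.GOMAXPROCS(1) // serialize all goroutines on a single running thread
    fmt.Println("GOMAXPROCS=1:", run(workers, iters))

    runtime.GOMAXPROCS(runtime.NumCPU()) // allow one running thread per CPU
    fmt.Printf("GOMAXPROCS=%d: %v\n", runtime.NumCPU(), run(workers, iters))
}

With GOMAXPROCS at 1 the goroutines are still multiplexed, but only one of them runs at any instant, which is roughly the situation of Stackless tasklets under the GIL; raising it lets the scheduler spread them over several threads and cores.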
