Posted 2009-08-09 03:50:00 GMT

The humiliation of having teepeedee2 play second fiddle to C implementations was weighing heavily on my mind, so today I spent a few hours squeezing a bit more fat out of the HTTP processing.

One of the major motivating factors for making tpd2 was the idea from the C10k website that it should be possible to get much better performance out of a webserver than is currently normal. The goal of 10k requests/s looked very far away at the beginning of the project, and many people said it was impossible in Lisp, which after all is a very dynamic language.

Yes, tpd2 has broken the 10k requests/s barrier on one core.

This is a big moment for me psychologically (and a testament to the excellent work done by the SBCL hackers on their Lisp implementation).

The significance is that teepeedee2 presents a new level of speed for dynamic websites. The processing of GET parameters, building up of dynamic HTML and so on take less than 0.1ms — on my laptop, probably even less on a modern server CPU. Additionally, because of its scalable timeouts and use of epoll, teepeedee2 can handle many AJAX polling clients extremely efficiently. This opens up a world of opportunity for interactive web applications that simply can't be implemented on traditional platforms.

The two competitive (but slower) web application frameworks, ULib and kloned, are based on custom template languages with the possibility to embed arbitrary C++ code.

The biggest obstacle was that the automatic code transforms from cl-cont mean that simply using local functions (i.e. flets and labels) causes memory to be allocated at runtime (inefficient funcallable/cc objects are created). Therefore I fiddled with the HTTP parsing to do more inside a without-call/cc. The result was a huge match-bind for cl-irregsexp.
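To illustrate the workaround, here is a minimal sketch, assuming cl-cont is already loaded (for example with (asdf:load-system :cl-cont)). The names count-digits and handle-line are hypothetical and are not taken from teepeedee2. The idea is to keep the flet in code that cl-cont does not rewrite, either by putting it in an ordinary function or by wrapping it in without-call/cc.

```lisp
;; Sketch only: assumes the cl-cont system is loaded.
;; COUNT-DIGITS and HANDLE-LINE are hypothetical names for illustration.

(defun count-digits (string)
  "Ordinary function, never seen by cl-cont's code walker, so the FLET
compiles to a normal local function."
  (flet ((digit-p (c) (char<= #\0 c #\9)))
    (count-if #'digit-p string)))

(defun handle-line (line)
  (cl-cont:with-call/cc
    ;; Code in this body is transformed by cl-cont.
    (let ((digits (cl-cont:without-call/cc
                    ;; This body is left untransformed.
                    (count-digits line))))
      (list :digits digits))))
```

The trade-off is that code inside without-call/cc cannot capture a continuation, so it only suits work that never needs to wait for more input, such as parsing data that has already arrived.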

Given a program has a (correct) performance-oriented design, it's generally not very useful to look at profiling data, except to locate performance bugs where the implementation does not meet the design (e.g. this issue with cl-cont), or to do micro-optimizations. I had mostly concentrated on getting a good architectural design for teepeedee2, and hadn't done much micro-optimization based on profiles till now. Based on the profile output from sb-profile, I inlined a few timeout related functions, a few miscellaneous functions and rewrote the IP address to string routine (these changes boosted performance by about 10% or so).
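As a rough sketch of that workflow (not the actual teepeedee2 code), the following declares a small helper inline and then profiles its caller with sb-profile. The function names deadline-passed-p and count-expired are hypothetical. The profiling part is guarded with #+sbcl because sb-profile is specific to SBCL.

```lisp
;; Hypothetical example: the names are not from teepeedee2.

;; The INLINE declamation must come before the definition so that the
;; compiler can record the function's inline expansion.
(declaim (inline deadline-passed-p))
(defun deadline-passed-p (deadline now)
  (declare (type fixnum deadline now))
  (<= deadline now))

(defun count-expired (deadlines now)
  "Count the deadlines in the list DEADLINES that are at or before NOW."
  (declare (type list deadlines) (type fixnum now))
  (let ((expired 0))
    (declare (type fixnum expired))
    (dolist (deadline deadlines expired)
      (when (deadline-passed-p deadline now)
        (incf expired)))))

;; Deterministic profiling of the caller. Calls that were inlined do not
;; show up as separate entries, since they are no longer function calls.
#+sbcl
(progn
  (sb-profile:profile count-expired)
  (count-expired '(1 5 9 12) 8)
  (sb-profile:report)
  (sb-profile:unprofile count-expired))
```

Inlining only pays off for small functions on a hot path, which is why it makes sense to pick the candidates from a profile rather than by guesswork.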

The result is this

$ schedtool -a 0 -e ab -n 100000 -c10 http://localhost:3000/test?name=John
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:
Server Hostname:        localhost
Server Port:            3000

Document Path:          /test?name=John
Document Length:        19 bytes

Concurrency Level:      10
Time taken for tests:   8.839 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Total transferred:      5800000 bytes
HTML transferred:       1900000 bytes
Requests per second:    11313.29 [#/sec] (mean)
Time per request:       0.884 [ms] (mean)
Time per request:       0.088 [ms] (mean, across all concurrent requests)
Transfer rate:          640.79 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       8
Processing:     0    1   0.5      1      39
Waiting:        0    1   0.5      1      39
Total:          0    1   0.5      1      39

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%     39 (longest request)
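For anyone wanting to check how the headline figures relate to each other, here is a small sketch that recomputes them from the request count, elapsed time and concurrency level reported by ab above. The function names are made up for this illustration. Because ab prints the elapsed time rounded to milliseconds, the recomputed rate differs slightly from the 11313.29 it reports.

```lisp
;; Figures copied from the ab output; the function names are illustrative.
(defparameter *requests* 100000)
(defparameter *seconds* 8.839d0)
(defparameter *concurrency* 10)

(defun requests-per-second (requests seconds)
  (/ requests seconds))

(defun ms-per-request (requests seconds &optional (concurrency 1))
  "Mean milliseconds per request. With CONCURRENCY greater than 1 this is
the latency seen by each client; with 1 it is the time across all clients."
  (/ (* 1000 seconds concurrency) requests))

(format t "~,1F requests/s~%"
        (requests-per-second *requests* *seconds*))
(format t "~,3F ms per request (per client)~%"
        (ms-per-request *requests* *seconds* *concurrency*))
(format t "~,3F ms per request (across all clients)~%"
        (ms-per-request *requests* *seconds*))
```

The last figure is the one behind the claim of under 0.1ms per request: it is simply the elapsed time divided by the number of requests.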

I started tpd2 like this

schedtool -a 1 -e sbcl --load bench.lisp

where bench.lisp was

(handler-bind ((error (lambda (c)
                        (declare (ignore c))
                        (invoke-restart 'CONTINUE))))
  (asdf:oos 'asdf:load-op 'teepeedee2))
(in-package #:tpd2.user)
(defsite *bench*)
(with-site (*bench*)
  (defpage "/test" (name) :create-frame nil
    (<h1 "Hello " name)))
(http-start-server 3000)
(event-loop)

The hardware is my aging Panasonic Y7 laptop — an Intel(R) Core(TM)2 Duo CPU L7700 @ 1.80GHz, running Linux 2.6.31-5-generic #24-Ubuntu, and SBCL 1.0.29.11.debian.

The (now) fastest is an awesome framework called ULib by Stefano Casazza. It is in C++, uses select for portability(!) to MS Windows, and of course compiles dynamic pages to machine code. It once scored 11169.22/s, which is just a smidgeon less than teepeedee2, but normally scores much less (about 9k/s), so teepeedee2 is the fastest in my book. However, I hope to be able to blog more about ULib because it's quite interesting and maybe Stefano will be able to improve it to topple teepeedee2 from the top spot.

I guess this means mission complete for teepeedee2. The external APIs need to be designed and documented if anybody wants to use it, and I would be delighted to accept patches.

UPDATE 20090819 — Kloned and Ulib do not use limited scripting languages. They can embed arbitrary C++. (Thanks to Stefano Barbato.)

UPDATE 20091028 — Added nginx's perl mode.

UPDATE 20091231 — Note that ULib is now the winner — dammit! :-(