Performance on AMD Opteron, 2.4GHz, 4 Cores

Server     Setup                          Requests per Second
Tornado    nginx, 4 frontends             8213
Tornado    1 single threaded frontend     3353
Django     Apache/mod_wsgi                2223
web.py     Apache/mod_wsgi                2066
CherryPy   standalone                     785


For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.





See http://highscalability.com/new-facebook-chat-feature-scales-70-million-users-using-erlang and also http://www.facebook.com/note.php?note_id=14218138919

tcpAcceptor(Srv, ListeningSocket) ->
    case gen_tcp:accept(ListeningSocket) of
        {ok, Sock} ->
            Pid = spawn(fun () ->
                receive
                    permission ->
                        inet:setopts(Sock, [
                            binary,
                            {packet, http_bin},
                            {active, true}
                        ])
                after 60000 -> timeout
                end,
                collectHttpHeaders(Srv, Sock,
                    tstamp()+?HTTP_HDR_RCV_TMO, [])
            end),
            gen_tcp:controlling_process(Sock, Pid),
            Pid ! permission,
            tcpAcceptor(Srv, ListeningSocket);
        {error, econnaborted} ->
            tcpAcceptor(Srv, ListeningSocket);
        {error, closed} -> finished;
        Msg ->
            error_logger:error_msg("Acceptor died: ~p~n", [Msg]),
            gen_tcp:close(ListeningSocket)
    end.

collectHttpHeaders(Srv, Sock, UntilTS, Headers) ->
    Timeout = (UntilTS - tstamp()),
    receive
        % Add this next header into the pile of already received headers
        {http, Sock, {http_header, _Length, Key, undefined, Value}} ->
            collectHttpHeaders(Srv, Sock, UntilTS,
                [{header, {Key,Value}}|Headers]);
        {http, Sock, {http_request, Method, Path, HTTPVersion}} ->
            collectHttpHeaders(Srv, Sock, UntilTS,
                [{http_request, decode_method(Method), Path, HTTPVersion}
                 | Headers]);
        {http, Sock, http_eoh} ->
            inet:setopts(Sock, [{active, false}, {packet, 0}]),
            reply(Sock, lists:reverse(Headers),
                fun(Hdrs) -> dispatch_http_request(Srv, Hdrs) end);
        {tcp_closed, Sock} -> nevermind;
        Msg ->
            io:format("Invalid message received: ~p~nAfter: ~p~n",
                [Msg, lists:reverse(Headers)])
    after Timeout ->
        reply(Sock, Headers, fun(_) ->
            [{status, 408, "Request Timeout"},
             {header, {<<"Content-Type: ">>, <<"text/html">>}},
             {html, "<html><title>Request timeout</title>"
                    "<body><h1>Request timeout</h1></body></html>"}]
        end)
    end.

[Graphs, one row per TCP backlog value. Columns: Non-SMP Yucan (1 core), SMP Yucan (4 cores), side notes. The side notes follow.]

Backlog of 1: about 3k requests per second for Non-SMP, and a surprising 2k RPS for SMP. Not good. Understandable. The red stuff means errors, normalized; the red value should be as close to zero as possible. 100 on this scale means 1% of requests never finished or finished badly.

Here, with a backlog of 5, we see almost 3k requests per second for the Non-SMP system and a satisfactory almost 4k for the SMP system.

Best backlog value (128)! A tiny bit better than before on the Non-SMP front and better by a great margin in the SMP configuration. 8k RPS for sure, maybe even an honest 8500.

Clearly, more is not always better. A 256-entry TCP backlog hurts performance noticeably on both SMP and Non-SMP systems. But we can still claim 3k/8k requests per second.
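The backlog varied in these tests is the argument a server passes to listen(2): it caps the queue of connections the kernel has completed but the application has not yet accepted. In Erlang it is the {backlog, N} option of gen_tcp:listen. A minimal Python sketch of the same knob follows; the function name, host and port are hypothetical and are not taken from Yucan or Tornado.

```python
import socket

def open_listener(host="127.0.0.1", port=0, backlog=128):
    """Return a listening TCP socket with the given accept-queue backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))   # port 0 lets the OS pick a free port
    sock.listen(backlog)      # the OS may silently cap this value
    return sock

if __name__ == "__main__":
    for backlog in (1, 5, 128, 256):   # the values tried in this post
        listener = open_listener(backlog=backlog)
        print(backlog, listener.getsockname())
        listener.close()
```

The requested value is only a hint to the kernel, so the same number can behave differently from one operating system to another.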










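httperf is driven at one fixed request rate per run, so a sweep like the one in this post (1000, 2000, …, 10000 requests per second, several attempts each) comes down to a loop that builds one command line per run. The sketch below only constructs the argument lists and does not execute anything. The host, port, run length and three attempts per rate are assumptions made for illustration, and the code is not the Perl wrapper used for these tests.

```python
def httperf_command(server, port, rate, num_conns, timeout=5, uri="/"):
    """Build one httperf invocation for a fixed request rate."""
    return [
        "httperf", "--hog",
        "--server", server,
        "--port", str(port),
        "--uri", uri,
        "--rate", str(rate),
        "--num-conns", str(num_conns),
        "--timeout", str(timeout),
    ]

def sweep(server, port, attempts=3, seconds=10):
    """Yield (rate, attempt, argv) for 1000..10000 req/s in steps of 1000."""
    for rate in range(1000, 10001, 1000):
        for attempt in range(1, attempts + 1):
            # num-conns = rate * seconds keeps each run roughly `seconds` long
            yield rate, attempt, httperf_command(server, port, rate, rate * seconds)

if __name__ == "__main__":
    for rate, attempt, argv in sweep("localhost", 8080):   # hypothetical target
        print(rate, attempt, " ".join(argv))
```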

ERROR:root:Exception in I/O handler for fd 5
Traceback (most recent call last):
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 189, in start
    self._handlers[fd](fd, events)
  File "/home/vlm/tornado-0.2/tornado/httpserver.py", line 94, in _handle_events
    connection, address = self._socket.accept()
  File "/usr/local/lib/python2.6/socket.py", line 195, in accept
    sock, addr = self._sock.accept()
error: [Errno 53] Software caused connection abort
Traceback (most recent call last):
  File "./ws.py", line 18, in <module>
    tornado.ioloop.IOLoop.instance().start()
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 173, in start
    event_pairs = self._impl.poll(poll_timeout)
  File "/home/vlm/tornado-0.2/tornado/ioloop.py", line 340, in poll
    self.read_fds, self.write_fds, self.error_fds, timeout)
ValueError: filedescriptor out of range in select()
[vlm@yucan ~/tornado-0.2]$
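The final ValueError is select()'s fixed descriptor ceiling: Python's select.select() rejects any file descriptor numbered FD_SETSIZE or above (commonly 1024), however few sockets are actually being watched. A hedged sketch of the usual way around it in present-day Python is below. selectors.DefaultSelector picks epoll, kqueue or poll where available and falls back to select only as a last resort. This is standard-library code written for illustration, not Tornado 0.2's ioloop, and the helper name is made up.

```python
import selectors
import socket

def wait_readable(sock, timeout=1.0):
    """Wait until sock is readable or the timeout expires.

    Returns (backend class name, whether the socket became readable).
    """
    with selectors.DefaultSelector() as sel:   # best backend the platform offers
        sel.register(sock, selectors.EVENT_READ)
        ready = sel.select(timeout)
        return type(sel).__name__, bool(ready)

if __name__ == "__main__":
    a, b = socket.socketpair()
    a.sendall(b"ping")
    print(wait_readable(b))
    a.close()
    b.close()
```

This addresses the crash only; it makes no claim about throughput on a small number of very busy descriptors.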

Rate,Received reply rate,Normalized error rate (1/100%),"Generated request rate (also, expected reply rate)",Error rate,Attempt 1, Attempt 2, Attempt 3, Error 1, Error 2, Error 3

1000 rps,1000,0,1000,0,1000,1000,1000,0,0,0

2000 rps,1999,0,2000,0,2000,2000,1999,0,0,0

3000 rps,2999,0,3000,0,2999,2999,3000,0,0,0

4000 rps,3997,0,4000,0,3997,3997,3997,0,0,0

5000 rps,4998,0,5000,0,4998,4998,4998,0,0,0

6000 rps,5999,0,6000,0,5999,5999,5999,0,0,0

7000 rps,7001,0,7000,0,7001,7001,7001,0,0,0

8000 rps,5725,1600,8000,16,5985,4049,7141,29,9,10

9000 rps,7364,1700,9000,17,7187,7044,7862,21,13,17

10000 rps,6070,2733,10000,27,5759,6333,6119,29,27,26
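The averaged columns in these rows can be reproduced from the three attempts. Judging by the numbers alone, the reply rate and error rate are truncated means of the attempts, and the normalized error rate is the mean error percentage multiplied by 100. The sketch below encodes that reading; it is inferred from the data and is not taken from webserver-benchmark.pl, and the function name is hypothetical.

```python
def summarize(replies, errors):
    """Collapse per-attempt reply rates and error percentages into the averaged columns."""
    n = len(replies)
    reply_rate = sum(replies) // n        # truncated mean of received replies per second
    error_rate = sum(errors) // n         # truncated mean error percentage
    normalized = sum(errors) * 100 // n   # 100 on this scale means 1% errors
    return reply_rate, normalized, error_rate

if __name__ == "__main__":
    # Attempts from the 10000 rps row
    print(summarize([5759, 6333, 6119], [29, 27, 26]))   # (6070, 2733, 27)
```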

Warning: epoll/select: If you think you have discovered a potential problem with my test, and this problem is the lack of epoll use in Tornado, you are right. However, while using epoll (I will have to find Linux somewhere, which is not a trivial task due to the relative scarcity of such systems) will almost certainly fix the Tornado crash problem, this is only part of the story. The other part is the baseline performance of Tornado compared to other web servers, and here is where it gets interesting. My assessment is that enabling epoll will not help with its baseline performance. Why? Read my replies to several commenters below regarding the number of open sockets during the tests. If you want to repeat my test on a comparable Linux system, you are encouraged to do so, since webserver-benchmark.pl is available. I'll gladly publish the results here.



Translation: if you think epoll is better than select on a tiny number of hyper-active file descriptors, you have some reading to do. See

http://www.cs.uwaterloo.ca/~brecht/papers/getpaper.php?file=ols-2004.ps

http://www.gelato.org/pdf/Illinois/gelato_IL2004_epoll_brecht.pdf

Good things sometimes happen to the open source community. Since Facebook's acquisition of FriendFeed, a bunch of technologies have been released into the wild, most notably Tornado, a web server written in Python. Tornado is touted as «a scalable, non-blocking web server and web framework». See the Wikipedia article http://en.wikipedia.org/wiki/Tornado_HTTP_Server for some details on the performance of that server, as well as some comparison with other web servers.

The chart taken from Wikipedia is the requests-per-second table at the top of this post.

The numbers looked interesting, so I decided to benchmark Tornado myself to see how it fares against some Erlang tools.

Keep in mind that the Erlang runtime itself is not the fastest beast in the woods. It is generally considered slower than many other interpreted languages (including Python), especially on file operations (due to the complexities of the io library doing most of the heavy lifting). However, network I/O, message passing and [green] process spawning are quite fast, so people use Erlang fairly extensively as a web backend. Facebook itself uses Erlang for the Facebook Chat application (see the quote above).

There are a few web servers for the Erlang VM, notably Yaws and Mochiweb. Yaws is positioned as the most general-purpose (and most mature) web server, resembling Apache of the imperative world. Mochiweb, in turn, is mostly a special-purpose embedded web server (though Yaws can be embedded too).

Here's a nice comparison of Yaws, Mochiweb and Nginx: http://www.joeandmotorboat.com/2009/01/03/nginx-vs-yaws-vs-mochiweb-web-server-performance-deathmatch-part-2/

Since I know Yaws performance very well (several thousand requests per second on modern hardware, generally a very competitive piece of software), I was interested in comparing it to Tornado using some sort of a stress test.

But soon I realized that I also wanted to measure some baseline Erlang performance.
Yaws does a bit of heavy lifting under the hood, which is not always valuable, especially in an embedded environment. We can do better. So I sat down today at Specialty's and implemented a small web server from scratch, using Erlang's newly documented http packet filter. Its name is Yucan (it does not mean anything).

So, meet Yucan. The front of the web server is a central TCP acceptor loop, tcpAcceptor (listed above). See how easy it is to spawn a process per connection.

Yucan's request header assembler, collectHttpHeaders (also listed above), uses the convenient http packet filter provided by Erlang.

I also wanted to get a feeling for the effect of the TCP listening backlog on that web server, so I ran a number of tests with different backlogs: 1, 5, 128, 256. And, for the sake of completeness, I also ran the stress tests against both single-threaded and SMP-enabled Erlang VM configurations.

For a testing engine, I drafted a Perl wrapper around the good old httperf tool. It throws 1000, 2000, …, 10000 requests per second at a web site a number of times, averages the data, captures error rates, and saves the result into a CSV file for colorful graphing. There's nothing fancy about this Perl wrapper; here it is: webserver-benchmark.pl.

The test bed was a 4-core 2.5 GHz Xeon L5420 running the web server, and another such system as the source of requests. FreeBSD-7.2. Erlang R13B01. HiPE did not make a sound difference; see my email to erlang-questions.

Here are the graphs for the different TCP listening backlog and SMP/Non-SMP variables. They show a backlog of 128 entries as the sweet spot irrespective of SMP mode. Incidentally, the Tornado web server also uses a backlog of 128 by default. Yaws uses 5, which is the default value of Erlang's gen_tcp.

Now, since we see that a TCP listening backlog of 128 is the sweet spot for at least Yucan, and is also the default setting for Tornado, let's fix the backlog at 128. First, let's compare Yucan and Yaws side by side:

Oh, my dear! What the hell is that?
Whereas Yucan runs close to 8500 requests per second on 4 cores, Yaws manages only 2k, maybe 2.5k per second on the same SMP system! This can be explained to a degree by the fact that I used the production configuration for Yaws, with a custom #arg rewriter that adds a bit to the running time. Also, Yaws itself is not the simplest piece of code, and has perhaps accumulated some inefficiencies over time that prevent it from scoring well against 180 lines of Yucan. But anyway, Yaws' 2k RPS is for a production configuration, not just a tiny benchmark.

Let's go to the Tornado web server test, which is clearly a tiny benchmark (see http://www.tornadoweb.org/ ; I just copied these 15 lines of code off that page and used them). We switch Yucan to Non-SMP mode to compare apples with apples.

In a single-thread configuration (listed as 3.3k RPS on the 2.4 GHz AMD), Tornado showed 4k RPS on my 2.5 GHz Xeon, which is clearly faster than Yucan's 3.5k RPS in the same single-thread configuration.

Tornado is touted as a scalable thing, but it does appear to require an nginx load balancer in front of a farm of independent Tornado processes (each will end up running on its own core, mostly) to show its scalability. This has a clear disadvantage in communication: in order to exchange data between these independent processes, a Tornado application will have to use some form of IPC (Thrift, JSON, XMLRPC, etc). Erlang's Yucan proves to be much better in this respect: it can scale up to 8k RPS just by giving the Erlang VM the -smp enable flag. That's it: no complex setup, just a flag, and no changes to the application whatsoever. Yucan was written with at least two contention points: the TCP acceptor and a dispatcher lookup table process.
And nevertheless, it scaled well, because Erlang found opportunities for parallelization even in that code.

Tornado has funneled under load!

At some point while httperf was doing a 6000 requests per second test round, the Tornado web server died with the diagnostics shown in the traceback above.

Neither Yucan nor Yaws allowed themselves such a liberty. Yes, even in Erlang certain things (in isolated processes) can go wrong, but Erlang is specifically designed to be resilient to programming failures by adopting share-nothing semantics, message passing, process linking and supervision, and other nice concepts. Taken together, these things greatly simplify the programmer's life, while the Erlang VM delivers more than acceptable out-of-the-box performance on real-life tasks.

So, here we are. The data are open to further interpretation.

Roberto Ostinelli has contacted me asking to perform the same set of tests against the trunk version of misultin. Misultin (pronounced mee-sul-teen) is an Erlang library for building fast lightweight HTTP servers. Because the same design criteria were used for misultin (e.g., embeddability and lean code), I presumed it would very closely match Yucan's performance. However, please note that the code uses a TCP backlog of 30 by default for some reason, which proved to be a bit less than optimal in my Yucan tests (I ran a Yucan test with 64 backlog entries and it was a tiny bit worse than the one with 128 entries).

Anyway, the data ( misultin-smp-30.csv ) is in the CSV rows above.

Looking at these numbers, it is clear that misultin and Yucan are very similar in performance and load handling. Yucan starts to turn up its nose at 9k RPS (5% errors); misultin does so a bit earlier, at 8k (16% errors). I can only applaud Roberto Ostinelli for developing this server, and recommend it to others, especially since it is incomparably more mature than my one-day experiment with Yucan.