





Nothing is possible without lunch

So Aman Gupta (tmm1) and I were eating lunch at the Oaxacan Kitchen on Tuesday and as usual, we were talking about scaling Ruby. We got into a small debate about which phase of garbage collection took the most CPU time.

Aman’s claim:

The mark phase, specifically the stack marking phase because of the huge stack frames created by rb_eval

My claim:

The sweep phase, because every single object has to be touched and some freeing happens.

I told Aman that I didn’t believe the stack frames were that large, and we bet on how big we thought they would be. Couldn’t be more than a couple kilobytes, could it? Little did we know how wrong our estimates were.

Quick note about Ruby’s GC

Ruby MRI has a mark-and-sweep garbage collector. As part of the mark phase, it scans the process stack. This is required because a pointer to a Ruby object can be passed to a C extension (like Eventmachine, or Hpricot, or whatever). If that happens, it isn’t safe to free the object yet. So Ruby does a simple scan and checks if each word on the stack is a pointer to the Ruby heap, if so, that item cannot be freed.



GDB to the rescue

We get back from lunch, launch our application, attach GDB and set a breakpoint. The breakpoint gets triggered and we see this seemingly innocuous stack trace [Note: To help with debugging, we compiled the EventMachine gem with -fno-omit-frame-pointer]:



#0 0x00007ffff77629ac in epoll_wait () from /lib/libc.so.6

#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461

#2 0x00007ffff6c0b86c in EventMachine_t::_RunOnce (this=0x158d7e0) at em.cpp:423

#3 0x00007ffff6c0bbd6 in EventMachine_t::Run (this=0x158d7e0) at em.cpp:404

#4 0x00007ffff6c06638 in evma_run_machine () at cmain.cpp:83

#5 0x00007ffff6c1897f in t_run_machine_without_threads (self=26066936) at rubymain.cpp:154

#6 0x000000000041d598 in call_cfunc (func=0x7ffff6c1896e , recv=26066936, len=0, argc=0, argv=0x0) at eval.c:5759

#7 0x000000000041c92f in rb_call0 (klass=26065816, recv=26066936, id=29417, oid=29417, argc=0, argv=0x0, body=0x18dba10, flags=0) at eval.c:5911

#8 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=29417, argc=0, argv=0x0, scope=2, self=26066936) at eval.c:6158

#9 0x00000000004160d5 in rb_eval (self=26066936, n=0x1940330) at eval.c:3514

#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357

#11 0x000000000041d196 in rb_call0 (klass=26065816, recv=26066936, id=5393, oid=5393, argc=0, argv=0x0, body=0x1941018, flags=0) at eval.c:6062

#12 0x000000000041e0ad in rb_call (klass=26065816, recv=26066936, mid=5393, argc=0, argv=0x0, scope=0, self=47127864) at eval.c:6158

#13 0x0000000000415d01 in rb_eval (self=47127864, n=0x2cf5298) at eval.c:3493

#14 0x00000000004148b2 in rb_eval (self=47127864, n=0x2cf4380) at eval.c:3223

#15 0x000000000041d196 in rb_call0 (klass=47127808, recv=47127864, id=5313, oid=5313, argc=0, argv=0x0, body=0x2cf4380, flags=0) at eval.c:6062

#16 0x000000000041e0ad in rb_call (klass=47127808, recv=47127864, mid=5313, argc=0, argv=0x0, scope=0, self=9606072) at eval.c:6158

#17 0x0000000000415d01 in rb_eval (self=9606072, n=0x194b2a0) at eval.c:3493

#18 0x00000000004148b2 in rb_eval (self=9606072, n=0x19587b0) at eval.c:3223

#19 0x000000000041072c in eval_node (self=9606072, node=0x19587b0) at eval.c:1437

#20 0x0000000000410dff in ruby_exec_internal () at eval.c:1642

#21 0x0000000000410e4f in ruby_exec () at eval.c:1662

#22 0x0000000000410e72 in ruby_run () at eval.c:1672

#23 0x000000000040e78a in main (argc=3, argv=0x7fffffffebd8, envp=0x7fffffffebf8) at main.c:48



Looks pretty normal, nothing to worry about, right?

We started checking the rb_eval frames because we assumed that those would be the largest stack frames. The rb_eval function inlines other functions and call itself recursively. So how big is one of the rb_eval frames?



(gdb) frame 10

#10 0x00000000004150b7 in rb_eval (self=26066936, n=0x1941018) at eval.c:3357

3357 result = rb_eval(self, node->nd_head);

(gdb) p $rbp-$rsp

$2 = 1904



1,904 bytes – pretty large. If all the stack frames are that large, we are looking at around 47,600 bytes. Pretty serious. Let’s verify that Ruby thinks the stack is a sane size. There is a global in the Ruby interpreter called rb_gc_stack_start . It gets set when the Ruby stack is created in Init_stack() . When Ruby calculates the stack size it subtracts the current stack pointer from rb_gc_stack_start [remember on x86_64, the stack grows from high addresses to low addresses]. Let’s do that and see how big Ruby thinks the stack is.



(gdb) p (unsigned int)rb_gc_stack_start - (unsigned int)$rsp

$3 = 802688



Wait, wait, wait. 802,688 bytes with only 23 stack frames? WTF?! Something is wrong. We started at the top and checked all the rb_eval stack frames, but none of them are larger than 2kb. We did find something quite a bit larger than 2kb, though.



(gdb) frame 1

#1 0x00007ffff6c0b220 in EventMachine_t::_RunEpollOnce (this=0x158d7e0) at em.cpp:461

461 s = epoll_wait (epfd, ev, MaxEpollDescriptors, timeout == 0 ? 5 : timeout);

(gdb) p $rbp-$rsp

$28 = 786816



Uh, the RunEpollOnce stack frame is 786,816 bytes? That’s got to be wrong. WTF?

Time to bring out the big guns.

objdump + x86_64 asm FTW

I pumped EventMachine’s shared object into objdump and captured the assembly dump:



objdump -d rubyeventmachine.so > em.S



I headed down to the RunEpollOnce function and saw the following:



2f12b: 48 81 ec 78 01 0c 00 sub $0xc0178,%rsp



Interesting. So the code is moving %rsp down by 786,808 bytes to make room for something big. So, let’s see if the EventMachine code matches up with the assembly output.



struct epoll_event ev [MaxEpollDescriptors];



Where MaxEpollDescriptors = 64*1024 and sizeof(struct epoll_event) == 12 . That matches up with the assembly dump and the GDB output.

Usually, doing something like that in C/C++ is (usually) OK. Avoiding the heap whenever you can is a good idea because you avoid heap-lock contention, fragmenting the heap, and memory overhead for tracking the memory region. When writing Ruby extensions, this isn’t necessarily true. Remember, Ruby’s GC algorithm scans the entire process stack searching for references to Ruby objects. This EventMachine code causes Ruby to search an extra ~800,000 bytes drastically slowing down garbage collection.

The patch

Get the patch HERE

The patch simply moves the stack allocated struct epoll_event ev to the class definition so that it is allocated on the heap when an instance of the class is created with new . This does not change the memory usage of the process at all. It just moves the object off the stack. This makes all the difference because Ruby’s GC scans the process stack and not the process heap.

On top of all that, this patch helps with Ruby’s green threads, too. If the epoll_wait causes a Ruby event to fire and that event creates a Ruby thread, that Ruby thread gets an entire copy of the existing stack. Each time that thread is switched into and out of, that thread stack has to be memcpy’d into and out of place. Reducing those memcpys by ~800,000 bytes is a HUGE performance win. Want to learn more about threading implementations? Check out my threading models post: here.

Fixing this turned out to be pretty simple. A six (6!!) line patch:

Speeds up GC by 2-3x because of the huge decrease in stack frame size.

because of the decrease in stack frame size. Fixes an open bug in EventMachine where using threads with Epoll causes lots of slowness. The reason is that each thread will inherit an ~800,000 byte stack that gets copied in and out every context switch .

that gets copied in and out . This results in an increase from 500 requests/sec to 7000 requests/sec when using Sinatra+Thin+Epoll+Threads. That is pretty ill.

Conclusion

All in all, a productive debugging session lasting about an hour. The result was a simple patch, with 2 big performance improvements.

A couple things to take away from this experience:

Spend time learning your debugging tools because it pays off, especially nm , objdump , and of course GDB .

, , and of course . Getting familiar with x86_64 assembly is crucial if you hope to debug complex software and optimize it correctly.

Keep your eyes open for up-coming blog posts about x86_64 assembly! Don’t forget to subscribe via RSS or follow me on twitter