I've been playing with RabbitMQ recently, comparing it to our current use of SNS+SQS as our message bus at work. One of the nice things about it is that, with the bunny gem, you subscribe to messages from a queue by passing a block telling what to do with that message:

queue.bind(exchange).subscribe do |delivery, metadata, message|
  do_things_with(message, metadata)
end

It started me down a rabbit hole of "how much performance does this need?" so I could figure out whether this should run in its own process. That's when I started looking too closely and checking out how we could maximize performance.

I wanted to understand the performance of the gem, especially since consumers of message queues should be fast and have minimal overhead, so I opened up the code and found that the method that invokes that block receives its arguments as splat args and then calls the block by splatting those same args.
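As a rough illustration of that pattern, here is a minimal sketch of capturing a block and calling it later with splatted arguments. This is not bunny's actual code; `Consumer`, `on_delivery`, and `deliver` are hypothetical names invented for the example.

```ruby
# Hypothetical sketch of the capture-then-splat pattern; not taken from bunny.
class Consumer
  # Capture the block as a Proc so it can be called later.
  def on_delivery(&block)
    @on_delivery = block
    self
  end

  # Gather whatever arguments arrive into an array (*args),
  # then splat that array back out when calling the captured block.
  def deliver(*args)
    @on_delivery.call(*args)
  end
end

received = []
consumer = Consumer.new
consumer.on_delivery do |delivery, metadata, message|
  received << [message, metadata]
end

# Simulate a message arriving.
consumer.deliver(:delivery_info, { content_type: "text/plain" }, "hello")
```

The splat on both sides keeps `deliver` agnostic about how many arguments the block expects, which is why the pattern is so common.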

NOTE: This is not a criticism of the bunny gem, splat args, or anything. This was simply an exploration of the performance characteristics of the pattern of taking a block and calling that block later, along with a few variations of that pattern. These are all common conventions in Ruby and I think it's useful to understand how well they perform.

The first thing I wondered was what the performance cost of calling procs was vs calling a PORO's call method, that is, a callable object.
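To make the two options concrete, here is a small sketch of a proc next to a callable PORO. `Doubler` and the variable names are made up for illustration.

```ruby
# A proc: a captured block of code.
double_proc = proc { |n| n * 2 }

# A PORO with a #call method: a plain "callable" object.
class Doubler
  def call(n)
    n * 2
  end
end
double_callable = Doubler.new

# Both respond to #call, so the calling code doesn't need to know which one it has.
proc_result     = double_proc.call(21)
callable_result = double_callable.call(21)
```

Because the interface is identical, the two can be swapped freely, which is what makes comparing their performance meaningful.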

Assumptions and Hypotheses

I had a feeling that procs would be slower. I didn't have anything on which to base that assumption, but Ruby implementations are very much optimized around the idea of sending messages to objects and procs aren't run-of-the-mill objects — they're basically a Ruby binding to some bytecode. I don't know how heavy those bindings are, but given that you can get all kinds of introspection out of them (including local variables), I assumed they'd be pretty heavy. So I'm assuming a lot here.

Less an assumption and more a hypothesis was that splat args would be slower than explicit arguments. Splat args have to allocate and populate an array, so there's a performance cost to them. Still, I wasn't completely certain of it, so it was at best a hypothesis.
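One way to see the allocation is to count allocated objects around each style of call. This is a sketch that assumes CRuby, since it relies on `GC.stat(:total_allocated_objects)`; the method names `explicit`, `splat`, and `allocations` are hypothetical.

```ruby
# Explicit parameters: arguments are passed without building an array.
def explicit(a, b, c)
  a
end

# Splat parameters: the arguments are gathered into a new Array on each call.
def splat(*args)
  args.first
end

# Count objects allocated while running the block (CRuby-specific statistic).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

iterations = 10_000
explicit_allocated = allocations { iterations.times { explicit(1, 2, 3) } }
splat_allocated    = allocations { iterations.times { splat(1, 2, 3) } }

puts "explicit args: #{explicit_allocated} objects allocated"
puts "splat args:    #{splat_allocated} objects allocated"
```

The exact counts depend on the Ruby version, so the useful signal is the difference between the two numbers, not their absolute values.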

Speculation about performance without benchmarks is a waste of time, so I wrote some, including calling both with splat args. Turns out my guesses were pretty close (click the link to see the benchmark code):
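The original benchmark code isn't reproduced here. As a stand-in, this is a dependency-free sketch of how such a comparison could be set up, using only `Process.clock_gettime`. The class names, labels, and iteration count are assumptions for the example, and a dedicated benchmarking library will give steadier numbers than this.

```ruby
# Hypothetical callables for the comparison.
class ExplicitCallable
  def call(a, b, c)
    a
  end
end

class SplatCallable
  def call(*args)
    args.first
  end
end

# Run the block `iterations` times and return iterations per second.
# The yield adds its own overhead, so treat the results as relative only.
def iterations_per_second(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  iterations / elapsed
end

explicit_callable = ExplicitCallable.new
splat_callable    = SplatCallable.new
explicit_proc     = proc { |a, b, c| a }
splat_proc        = proc { |*args| args.first }

iterations = 200_000
results = {
  "callable 3 args"         => iterations_per_second(iterations) { explicit_callable.call(1, 2, 3) },
  "callable splat args (3)" => iterations_per_second(iterations) { splat_callable.call(1, 2, 3) },
  "proc 3 args"             => iterations_per_second(iterations) { explicit_proc.call(1, 2, 3) },
  "proc splat args (3)"     => iterations_per_second(iterations) { splat_proc.call(1, 2, 3) }
}

# Print fastest first.
results.sort_by { |_label, ips| -ips }.each do |label, ips|
  puts format("%-24s %12.1f i/s", label, ips)
end
```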

Comparison:
         callable no arg: 10095848.2 i/s
       callable with arg:  9777103.9 i/s - same-ish: difference falls within error
         callable 3 args:  9460308.0 i/s - same-ish: difference falls within error
 callable splat args (0):  6773190.5 i/s - 1.49x slower
             proc no arg:  6747397.4 i/s - 1.50x slower
           proc with arg:  6663572.5 i/s - 1.52x slower
             proc 3 args:  6454715.5 i/s - 1.56x slower
 callable splat args (1):  5099903.4 i/s - 1.98x slower
     proc splat args (0):  5028088.6 i/s - 2.01x slower
 callable splat args (3):  4880320.0 i/s - 2.07x slower
     proc splat args (1):  4091623.1 i/s - 2.47x slower
     proc splat args (3):  4005997.8 i/s - 2.52x slower

This was disappointing for 2 reasons:

1. Proving yourself correct teaches you very little; proving yourself wrong teaches you a lot. At best, I proved a bunch of mildly educated assumptions correct.
2. Capturing and later calling blocks is such a common practice in Ruby that I wonder how much performance we're losing as a result.

On the bright side, I'd gone down enough rabbit holes to find this out. If I'd been wrong, I'd have gone down even more to understand why.

What Do?

It would be silly to say "never capture blocks because performance". Capturing blocks in Ruby might be a bit slower, but it's a powerfully expressive concept and it's unlikely that the difference in performance will make that much of an impact in your app — I was still getting 6.7 million calls per second with a proc. If you need to call a captured block on the order of millions of times per second, you'll probably benefit from this article. Otherwise, this is largely an academic exercise and that's okay, too.

If you want to optimize performance while still allowing block capture, you can do both by taking a callable or a block:

class ThingThatHasEvents
  def on event_name, handler=nil, &block
    @events[event_name] << (handler || block)
  end
end

You'll want to have a check in there to ensure you receive one or the other, but making affordances for passing either one will give you the expressive API of receiving a block while still accepting the faster path of callable objects. With a typical "event handler" style, where emitting an event calls each handler in turn, the benchmark shows callable handlers running up to 45% faster than blocks.
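A fuller sketch of that idea, with the check included, might look like the following. The `initialize`, the `emit` method, the error messages, and the `Recorder` class are assumptions added for the example; only `on` comes from the snippet above.

```ruby
class ThingThatHasEvents
  def initialize
    # Each event name maps to its own list of handlers.
    @events = Hash.new { |hash, name| hash[name] = [] }
  end

  # Register either a callable object or a block, but not both and not neither.
  def on(event_name, handler = nil, &block)
    if handler && block
      raise ArgumentError, "pass a handler or a block, not both"
    elsif handler.nil? && block.nil?
      raise ArgumentError, "a handler or a block is required"
    end

    @events[event_name] << (handler || block)
    self
  end

  # Call every registered handler with the payload (hypothetical emit method).
  def emit(event_name, payload)
    @events[event_name].each { |handler| handler.call(payload) }
  end
end

# A callable PORO handler (hypothetical).
class Recorder
  attr_reader :seen

  def initialize
    @seen = []
  end

  def call(payload)
    @seen << payload
  end
end

thing = ThingThatHasEvents.new
recorder = Recorder.new
from_block = []

thing.on(:created, recorder)                            # callable object
thing.on(:created) { |payload| from_block << payload }  # captured block
thing.emit(:created, "order-1")
```

Since procs and callable objects both respond to `call`, `emit` never has to distinguish between them.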

Unfortunately, the benchmark shows that a heterogeneous set of event handlers (some passed as blocks, some passed as callable POROs) is actually slower than procs-only, but only by about 10%, which is much less than the gap between procs-only and callables-only.

Always Benchmark

I may have been right about this, but performance claims without benchmarks are always bullshit. Always benchmark.

Even if you've done something similar before. Even if you've done the exact same thing before in a different app. Even if you've done the exact same thing before in the same app on a different Ruby VM.