I recently had to build a small service where performance and reliability were high priorities. Our main app is built on Rails, but Rails isn't a great fit for those requirements. Arguably, Ruby itself isn't a great fit either, but since the service would be exposing a library already built in Ruby, that part was a given. So I decided to try writing the service directly on top of Thin.

Building directly on Thin means very few dependencies, which means fewer things that can go wrong and an app that is easier to reason about. That helps with the reliability part. In case you are not aware, Thin is based on EventMachine and uses a reactor pattern for concurrency rather than threads. Essentially, everything happens within a single thread, with the benefit of being very fast and having very low resource overhead. This is the same reason Nginx is able to handle far more concurrent connections than, say, Apache.

The downside of writing code directly on Thin is that it has to be written in an evented fashion. For example, if you have to query a database, you can't just wait for the call to return; you have to pass a callback.

```ruby
class MyApp
  def call(env)
    Thin::AsyncResponse.perform(env) do |response|
      response.headers["Content-Type"] = "text/plain"

      db = Mysql2::EM::Client.new($mysql_configuration)
      defer = db.query "select now()"
      defer.callback do |result|
        now = result.first
        response.status = 200
        response << "The time is #{now}"
        response.done
      end
      defer.errback do |err|
        response.status = 500
        response << "Oops - Something went wrong: #{err}"
        response.done
      end
    end
  end
end

run MyApp.new
```

For smaller applications that's OK, but for anything more complex it quickly becomes unwieldy. For our use case it did fine, though, and the end result was pretty satisfactory: on a stock EC2 m3.medium the app was able to handle close to 2000 req/sec without buckling.

All good, except for one thing.

I'm a big fan of New Relic. There are alternatives out there, but nothing that gives quite the same level of detail and insight. However, the New Relic client-side library (the agent) doesn't work with Thin's async style, at least not out of the box. Looking at the source code of the agent makes it clear why:

```ruby
module NewRelic
  module Agent
    module Instrumentation
      module MiddlewareTracing
        ...
        def call(env)
          first_middleware = note_transaction_started(env)
          state = NewRelic::Agent::TransactionState.tl_get

          begin
            Transaction.start(state, category, build_transaction_options(env, first_middleware))
            events.notify(:before_call, env) if first_middleware

            result = (target == self) ? traced_call(env) : target.call(env)

            capture_http_response_code(state, result)
            events.notify(:after_call, env, result) if first_middleware

            result
          rescue Exception => e
            NewRelic::Agent.notice_error(e)
            raise e
          ensure
            Transaction.stop(state)
          end
        end
        ...
```

This is a piece of Rack middleware, which is how New Relic monitors your app. Essentially, it starts a transaction, delegates to the next handler in the chain and then stops the transaction again. For an async request, however, the call returns immediately, while the response only completes much later. The result is that NR falsely reports all requests as completing in zero time.
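To see the problem in isolation, here is a sketch of a timing middleware wrapped around an async app. This is illustrative stand-in code, not actual NR or Thin source; the one real detail it relies on is Thin's convention of returning a status of -1 immediately and finishing the response later:

```ruby
# Illustrative sketch of why a timing middleware sees async requests
# complete instantly: the app returns [-1, {}, []] right away and the
# real response is delivered later, outside the measured window.
class TimingMiddleware
  attr_reader :measured

  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = @app.call(env)  # returns immediately for an async app
    @measured = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end
end

# An "async" app: signals -1 now; the real work would happen much later
async_app = ->(env) { [-1, {}, []] }

middleware = TimingMiddleware.new(async_app)
status, _headers, _body = middleware.call({})
```

The middleware measures only the time until the `-1` comes back, which is effectively zero, no matter how long the actual request takes.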

Note that NR works fine with a traditional threaded concurrency model: the global state is kept thread-local through the single global accessor NewRelic::Agent::TransactionState.tl_get. Everything else NR does uses this to store state.
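The thread-local pattern itself looks roughly like this (a simplified stand-in, not the real TransactionState class):

```ruby
# Simplified stand-in for NR's thread-local state registry: each thread
# asking for the current state gets its own object, keyed on the
# current thread, so concurrent threads never share transaction state.
class StateRegistry
  def self.tl_get
    Thread.current[:tx_state] ||= { transaction: nil, metrics: [] }
  end
end

main_state = StateRegistry.tl_get
other_state = nil
Thread.new { other_state = StateRegistry.tl_get }.join
```

Within one thread, repeated calls return the same object; a different thread gets its own. In an evented app all requests share one thread, which is exactly why this scheme breaks down.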

Hacking the agent

Since I really wanted to use NR, I figured I would see if I could get it working with async.

The main problem seems to be that all state is stored globally. The usual solution would be to pass the state around as an object instance instead, but in this case that would be a massive change and would probably require the client code to have quite deep awareness of NR, which is not a good thing.

As a compromise I tried keeping the global state design, but instead of a true global I would make a switchable registry. Then all we have to do is switch it based on an identifier that is unique to the request. In the scope of Rack/Thin, the env hash contains everything specific to a request/response, so that seems like a good fit:

```ruby
module NewRelic
  module Agent
    # Patch TransactionState to be local to the currently scoped request
    class TransactionState
      def self.tl_get
        id = scope_id
        if id
          tl_state_for_scope(id)
        else
          tl_state_for(Thread.current)
        end
      end

      def self.tl_state_for_scope(scope_id)
        @transaction_states ||= {}
        state = @transaction_states[scope_id]
        if state.nil?
          state = TransactionState.new
          @transaction_states[scope_id] = state
        end
        state
      end

      def self.current_env=(env)
        @current_env = env
        unless @current_env['new_relic.scope_id']
          @new_relic_scope_ids ||= 0
          @new_relic_scope_ids += 1
          @current_env['new_relic.scope_id'] = @new_relic_scope_ids.to_s
        end
      end

      def self.scope_id
        @current_env ? @current_env['new_relic.scope_id'] : nil
      end
    end
  end
end
```

First we override the normal implementation of tl_get. If we don't have a scope_id, we fall back to the normal behaviour. Otherwise we call tl_state_for_scope, which is essentially a local registry of TransactionState instances. The unique identifier (scope_id) is stored on the env. To use this, we somehow have to set the current_env before we enter a handler and clear it again afterwards. That sounds like a good case for a block:

```ruby
...
def self.scope_env(env_or_request, &block)
  previous = @current_env
  self.current_env = env_or_request.respond_to?(:env) ? env_or_request.env : env_or_request
  result = yield
  @current_env = previous
  result
end
...
```

Notice that we keep a reference to the previous env and restore it afterwards, so this should be safe to nest, even if that won't normally happen in an evented app.
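The save/restore pattern can be seen in isolation with a generic stand-in (a hypothetical Scope class, not part of NR):

```ruby
# Generic save/restore scoping, the same shape as scope_env: each call
# remembers the value it replaced and puts it back afterwards, so
# nested calls unwind correctly.
class Scope
  class << self
    attr_reader :current

    def with(value)
      previous = @current
      @current = value
      yield
    ensure
      @current = previous
    end
  end
end

seen = []
Scope.with(:outer) do
  seen << Scope.current
  Scope.with(:inner) { seen << Scope.current }
  seen << Scope.current
end
```

One difference worth noting: this sketch restores the previous value in an ensure block, so the scope is also unwound if the block raises, which the straight-line version above does not guarantee.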

Armed with this, let’s patch the middleware to keep the transaction open:

```ruby
module NewRelic
  module Agent
    module Instrumentation
      module MiddlewareTracing
        ...
        def call(env)
          NewRelic::Agent::TransactionState.scope_env(env) do
            first_middleware = note_transaction_started(env)
            state = NewRelic::Agent::TransactionState.tl_get
            is_async = nil
            env['new_relic.middleware_depth'] ||= 0
            env['new_relic.middleware_depth'] += 1

            begin
              Transaction.start(state, category, build_transaction_options(env, first_middleware))
              events.notify(:before_call, env) if first_middleware

              result = (target == self) ? traced_call(env) : target.call(env)
              is_async = result[0] == -1

              capture_http_response_code(state, result)
              events.notify(:after_call, env, result) if first_middleware

              result
            rescue Exception => e
              NewRelic::Agent.notice_error(e)
              raise e
            ensure
              # If the request is async, we expect the handler to end these transactions
              Transaction.stop(state) unless is_async
            end
          end
        end
        ...
```

The key change here is that Transaction.stop(state) is not called for an async response, and the contents of the method are scoped to the current env using the code we wrote above. Of course, since the transaction is no longer stopped by the middleware, it becomes the responsibility of the client code to stop it when the request completes.

Let's first create a helper method for this; Transaction seems like a good place for it:

```ruby
module NewRelic
  module Agent
    class Transaction
      # End async request cycle
      def self.stop_async(env, response = nil)
        state = NewRelic::Agent::TransactionState.tl_get
        state.current_transaction.http_response_code = response.status if response && state.current_transaction
        env['new_relic.middleware_depth'].times do
          NewRelic::Agent::Transaction.stop(state)
        end
      end
    end # class Transaction
    ...
```

The middleware_depth counter is there to make sure any nested middleware calls are unwound correctly.
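The depth bookkeeping can be seen on its own with a stripped-down sketch (hypothetical names, not the actual NR middleware):

```ruby
# Each middleware in the chain bumps a counter on env on the way in;
# whoever finishes the request later can then unwind exactly that many
# nested transactions.
class DepthCountingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    env['depth'] ||= 0
    env['depth'] += 1
    @app.call(env)
  end
end

endpoint = ->(env) { env }
chain = DepthCountingMiddleware.new(
  DepthCountingMiddleware.new(
    DepthCountingMiddleware.new(endpoint)))
final_env = chain.call({})
```

With three middlewares stacked, the endpoint sees a depth of 3, which is exactly the number of Transaction.stop calls stop_async would need to make.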

Finally, we need the client code to wrap all handlers in scope_env. So our example from earlier now becomes:

```ruby
class MyApp
  def call(env)
    NewRelic::Agent::TransactionState.scope_env(env) do
      Thin::AsyncResponse.perform(env) do |response|
        response.headers["Content-Type"] = "text/plain"

        db = Mysql2::EM::Client.new($mysql_configuration)
        defer = db.query "select now()"
        defer.callback do |result|
          NewRelic::Agent::TransactionState.scope_env(env) do
            now = result.first
            response.status = 200
            response << "The time is #{now}"
            response.done
          end
        end
        defer.errback do |err|
          NewRelic::Agent::TransactionState.scope_env(env) do
            response.status = 500
            response << "Oops - Something went wrong: #{err}"
            response.done
          end
        end

        response.callback do
          NewRelic::Agent::TransactionState.scope_env(env) do
            NewRelic::Agent::Transaction.stop_async(env, response)
          end
        end
      end
    end
  end
end

run MyApp.new
```

Clearly, wrapping code in scope_env gets a bit tedious, so that would be another good candidate for abstracting away. I'll leave that as an exercise for the reader.
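As a hint, one possible shape for such a helper is a method that turns a block into a proc that re-enters the request scope before running. The `scoped` name is hypothetical, and FakeScope stands in for NewRelic::Agent::TransactionState so the sketch is self-contained:

```ruby
# FakeScope is a stand-in for NewRelic::Agent::TransactionState; it just
# records which scope ids were entered.
class FakeScope
  class << self
    def entries
      @entries ||= []
    end

    def scope_env(env)
      entries << env['new_relic.scope_id']
      yield
    end
  end
end

# Hypothetical helper: returns a proc that re-enters the request scope
# before invoking the original block, so EM callbacks could be written
# as `defer.callback(&scoped(env) { |result| ... })`.
def scoped(env, scope: FakeScope, &block)
  proc { |*args| scope.scope_env(env) { block.call(*args) } }
end

callback = scoped({ 'new_relic.scope_id' => '7' }) { |result| result * 2 }
value = callback.call(21)
```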

Method tracing

There are still a few loose ends, though. First, there is the name of the transaction as it appears within NR. Currently everything will just show up as MyApp#call, which is a problem if your app has more than one route. You can explicitly set the name with:

```ruby
class MyApp
  def call(env)
    NewRelic::Agent::TransactionState.scope_env(env) do
      txn = NewRelic::Agent::Transaction.tl_current
      txn.set_overriding_transaction_name txn.make_transaction_name("My Custom Transaction"), txn.category
      ...
```

We also have a bit of patching to do to get method tracers working correctly. Again, instead of the usual block-based code:

```ruby
trace_execution_scoped "Scope/Module/method" do
  ...
end
```

We need to start a timer manually and pass it when the trace commences:

```ruby
t0 = Time.now.to_f
ConnectionPool.instance.acquire do |db|
  # Expensive, async operation
  NewRelic::Agent::TransactionState.scope_env(env) do
    trace_execution_scoped "Database/ConnectionPool/acquire", :start_time => t0
    ...
```

This is fairly straightforward:

```ruby
module NewRelic
  module Agent
    module MethodTracer
      def trace_execution_scoped(metric_names, options = {}) #THREAD_LOCAL_ACCESS
        if block_given?
          NewRelic::Agent::MethodTracerHelpers.trace_execution_scoped(metric_names, options) do
            # Using an implicit block avoids object allocation for a &block param
            yield
          end
        else
          NewRelic::Agent::MethodTracerHelpers.trace_execution_scoped(metric_names, options)
        end
      end
    end # module MethodTracer

    # We override these in order to allow passing in start_time instead of relying on a block to time the call
    module MethodTracerHelpers
      def trace_execution_scoped(metric_names, options = {}) #THREAD_LOCAL_ACCESS
        state = NewRelic::Agent::TransactionState.tl_get
        return (block_given? ? yield : nil) unless state.is_execution_traced?

        metric_names = Array(metric_names)
        first_name = metric_names.shift
        return (block_given? ? yield : nil) unless first_name

        additional_metrics_callback = options[:additional_metrics_callback]
        start_time = options[:start_time] || Time.now.to_f
        expected_scope = trace_execution_scoped_header(state, start_time)

        begin
          result = block_given? ? yield : nil
          metric_names += Array(additional_metrics_callback.call) if additional_metrics_callback
          result
        ensure
          trace_execution_scoped_footer(state, start_time, first_name, metric_names, expected_scope, options)
        end
      end
    end # module MethodTracerHelpers
    ...
```

So, here we allow trace_execution_scoped to be called without a block, and we allow start_time to be passed in through options.

That’s all

I wrapped the patches up into a gem and put it on GitHub, in case anyone else finds this useful. Do note that this monkey-patches the main NR agent code directly; while transparently compatible today, it may break if the underlying code changes. The version of newrelic_rpm I have used is 3.13.2.302.

You can find the code at: github.com/troelskn/newrelic-async
