Since writing about our lackluster experience with Fiveruns awhile back, I had been keeping my eyes open for a good way to get reliable, comprehensive, and easy-to-interpret performance metrics for Bonanzle. Prior to our transition away from shared hosting in November, I would browse through our production logs and regularly find actions taking 10 seconds or more, but I had no way to get a good overall sense of what actions were taking the longest, how long those actions were taking, and how it was changing from day to day.

Now we have a solution, and it’s a dandy.

Many have heard of pl_analyze, but I think that few may realize how truly easy it is to get up and running on any Rails 2.X application (we’re now on 2.2). To start, Geoff has written a strikingly excellent blog about how to change your log format to be compatible with pl_analyze by adding a single file to your lib directory and changing a couple lines in environment.rb.

After you follow his instructions, here are a couple more key points that I found relevant to polish the rough edges:

* By default, when we were running 2.0 (don’t know if it’s the case with 2.2, since we were pre-patched when we upgraded), logging of DB times was disabled. This monkey patch fixes that.

* By default, when we moved to 2.2, the log format changed and broke pl_analyze. This fixes that.

* After enabling the Hodel logger as described by Geoff, our production log became extremely verbose. One of the helpful commenters on Geoff’s blog pointed out the solution to this: add the line “config.logger.level = Logger::INFO” to your production.rb

After that, you’re good to go. Took me less than five minutes to get it set up, after which I was able to get priceless info like the following:

Request Times Summary: Count Avg Std Dev Min Max

ALL REQUESTS: 1576 0.242 0.229 0.000 2.487

ItemsController#show: 685 0.328 0.141 0.008 0.751

ChatsController#service_user_list: 286 0.022 0.037 0.005 0.301

HomeController#index: 158 0.256 0.308 0.000 1.461

Well, doesn’t look so pretty on my crap blog (shouldn’t I have upgraded this by now?), but it’ll look like a champ in your terminal window.

After getting our basic setup established a couple weeks ago, I started experimenting today with archiving our daily performance info into the database so I can track how action times are changing from day to day. Geoff provides a great start for this as well, with his pl_analyze Rails plugin, which you can install by running

./script/plugin install http://topfunky.net/svn/plugins/mint

After you’ve installed the plugin, run

rake mint:create_analyze_table

to create your table, and then

rake mint:analyze PRODUCTION_LOG=/path/to/production.log

to actually extract and store the day’s data. Be sure to read my comment in Geoff’s blog (it’s about the 30th comment) for a slight modification you may need to make to the analyze task if you get an error that says something to the effect that “0 parameteres were expected, got 1” or somesuch when you run mint:analyze.

Also note that this will store everything in your production log, so if you haven’t already setup log rotating on your production log, you’ll want to do that (it can be as simple as this).

After you’re archiving performance metrics to your database, you are in a performance analysis wonderland — it’s just a matter of applying creativity to data. For starters, I plan to create reports that list the top 10 daily time users (= count of action * average time per action), and the top 10 biggest daily deltas in time taken for various actions. I plan to use the latter to help triangulate where code changes affect performance. It would be trivial to create a model that actually allowed daily notes to be added to a report, so if the items#show action starts running 20% slower one day, I could demarcate the code changes that were made on the day that happened (though our changelist makes such a demarcation unnecessary).

I’m not sure what other tools are being used by the community to keep an eye on performance, but with the quickness and reliability (= stable results reported for weeks) of getting a pl_analyze solution running, it gets high recommends from ’round these parts. I can’t imagine returning to a day where we didn’t know if our average request time was 300 ms, 3000 ms, or 30000 ms (yes, we had some very dark days before our server move).