Recently Godfrey Chan got Discourse working on Rails Master. It was a rather long task that involved some changes to Discourse internals and some changes to Rails internals.

Knowing Rails 4.2 was just around the corner I decided that it seemed like the perfect time to see how performance is. Seeing Rails 4.2 contains the adequate record patches, I was super eager to see how Discourse faired.

###The Discourse benchmark

Answering the question of “how fast is Ruby?” or “how fast is Rails?” is something people usually answer with the following set of steps.

Upgrade production to a certain version of Rails Look at New Relic reports Depending on how performance is either complain on Twitter and/or Blog or write a few kind words.

The trouble is that performance testing after release is the absolute worst time. Code is in production, fixing stuff is tricky, rolling back often impractical.

At Discourse I developed the Discourse benchmark. It is a simple script that loads up a few thousand users and topics and then proceeds to measure performance of various hot spots using apache bench.

I have found that the results of the benchmark are strikingly similar to real-world performance we see. If the benchmark is faster, it is very likely that it will also be faster in production. We use this benchmark to test Ruby before major Ruby releases. Koichi used this benchmark to help optimise the current GC algorithms in MRI.

It is often tempting to look at micro benchmarks when working on performance issues. Micro benchmarks are a great tool, but MUST be followed with bigger macro benchmarks to see the real impact. 1000% speedup for a routine that is called once a day in a background job has significantly less impact than 0.5% improvement to a routing that is called on every web request.

###How was Rails 4.2 Looking?

For over a year now, Discourse had the ability to dual boot. This allowed me to quickly run a simple benchmark to see where we were at (arel , rails):

% RAILS_MASTER=1 ruby script/bench.rb name: [percentile]: [time] categories_admin: 50: 123 75: 136 90: 147 99: 224 home_admin: 50: 111 75: 122 90: 147 99: 224 topic_admin: 50: 69 75: 81 90: 89 99: 185 categories: 50: 81 75: 126 90: 138 99: 211 home: 50: 53 75: 63 90: 100 99: 187 topic: 50: 20 75: 21 90: 23 99: 77 timings: load_rails: 3052 ruby-version: 2.1.2-p95 rss_kb: 241604

VS.

categories_admin: 50: 62 75: 66 90: 77 99: 193 home_admin: 50: 51 75: 53 90: 55 99: 175 topic_admin: 50: 27 75: 27 90: 28 99: 87 categories: 50: 53 75: 55 90: 98 99: 173 home: 50: 35 75: 37 90: 55 99: 154 topic: 50: 12 75: 12 90: 13 99: 66 timings: load_rails: 3047 ruby-version: 2.1.2-p95 rss_kb: 263948

Across the board we were running at around half speed. Pages that were taking 12ms today were taking 20ms on Rails 4.2.

Given these results I decided to take some time off my regular work and try to isolate what went wrong. Contribute a few patches and help the Rails team correct the issue prior to the impending RC release.

Before I continue I think it is worth mentioning that the lion’s share of the performance optimisations we ended up adding were authored by Sean Griffin so a big huge thanks to him.

###Cracking open the black box

I find flame graphs are instrumental at quickly discovering what is going on. Flame graphs in Ruby 2.1 now have better fidelity than they did in 2.0 thanks to feeding directly off stackprof.

My methodology is quite simple.

Go to a page in Rails 4.1.8 mode, reload a few times, then tack on ?pp=flamegraph to see a flamegraph Repeat the process on Rails 4.2 Zoom in on similar area and look for performance regressions.

For example here is the before graph on Rails 4.1.8 (home page)

rails_baseline.html (214.9 KB)

Compared to the after graph on Rails Master (home page)

rails_master_v1.html (253.6 KB)

The first point of interest is the sample count:

Rails 4.1.8

vs.

Master

Flamgraph attempts to grab a sample every 0.5ms of wall time, the reality is that due to depth of stack, timers and so on you end up getting a full sample every 0.5ms - 2ms. That said we can use the sample counts a good rough estimate of execution time.

It looks like Rails master is taking up 108 samples where current Rails was taking 74, a 46% increase in the above example.

We can see a huge amount of the additional overhead is spent in arel.

Next we zoom in to particular similar areas and compare:

Rails 4.1.8



Master



We can see expensive calls to reorder_bind_params that we were not seeing in Rails 4.2

Clicking on the frame reveals:

So 14.5% of our cost is in this method.

Additionally we see other new areas of overhead:

Like a method missing being invoked when casting DB values

We repeat the process for various pages in our app, and discover that 3.2% of the time on the categories page is spent in ActiveRecord::AttributeMethods::PrimaryKey::ClassMethods#primary_key and a pile of other little cuts.

With this information we can start working on various fixes:

There are a lot of little cuts and some big cuts, here are some examples

github.com/rails/rails Improve the performance of reading attributes We added a comparison to "id", and call to `self.class.primary_key` a *lot*. We also have performance hits from `&block` all over the... changed 12 files with 24 additions and 18 deletions.

github.com/rails/rails Reintroduce cache with tests changed 3 files with 56 additions and 14 deletions.

github.com/rails/rails Improve the performance of quoting table names on PG This caused a pretty major performance regression for 4.2, as this is a hotspot for query construction. We're still slightly slower than... changed 1 files with 15 additions and 4 deletions.

github.com/rails/rails rm `reorder_bind_params` Arel handles this for us automatically. Updated tests, as BindParam is no longer a subclass of SqlLiteral. We should remove the second... changed 9 files with 7 additions and 25 deletions.

The list goes on, slowly but surely we regain performance.

###Memory profiling

Another very important tool in our arsenal is memory profiler. If we include it in our Gemfile, rack-mini-profiler will wire it up for us, then appending pp=profile-gc-ruby-head to any page gives us a memory report (note this requires Ruby 2.1 as it provides the new APIs needed):

It reveals huge allocation increase in Rails master, which was allocating 131314 objects on the front page in Rails master vs 17651 object in Rails 4.1.8

We can dig down and discover a huge addition of allocations in arel and activerecord

arel/lib x 107607 activerecord/lib x 12778

We can see the vast majority is being allocated at:

arel/lib/arel/visitors/visitor.rb x 106435

We can track down the 3 particular lines:

arel/lib/arel/visitors/visitor.rb:11 x 74432 lib/arel/visitors/visitor.rb:13 x 22880 lib/arel/visitors/visitor.rb:12 x 7040

This kind of analysis helps us fix issues like:

and

github.com/rails/rails PERF: avoid string allocations 1 commits changed 1 files with 6 additions and 1 deletions.

It is very important to be able to prove you have a real issue before applying optimisations blindly. Memory profiler allows us to pinpoint specifics and demonstrate the actual gain.

It is fascinating to see which particular strings are allocated where. For example the empty string is allocated 702 times during a front page load:

"" x 702 activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb:433 x 263 activerecord/lib/active_record/type/boolean.rb:11 x 153 omniauth-1.2.2/lib/omniauth/strategy.rb:396 x 84

We can work through this information, and paper cut by paper cut eliminate a large number of allocations resulting in better performance.

Where are we with Rails 4.2 RC1

categories_admin: 50: 69 75: 74 90: 86 99: 165 home_admin: 50: 55 75: 57 90: 66 99: 133 topic_admin: 50: 27 75: 27 90: 28 99: 108 categories: 50: 56 75: 58 90: 60 99: 192 home: 50: 39 75: 40 90: 42 99: 165 topic: 50: 13 75: 13 90: 14 99: 75

So it looks like we are more or less 10% slower that 4.1 which is a huge improvement over where we were a couple of weeks ago (considering we were almost 100% slower).

There is still plenty of work left to do, but we are on the right path.

A taste of things to come

Lars Kanis a maintainer of the pg gem has been working for a while now on a system for natively casting values in the pg gem. Up till now the pg gem has followed libpq quite closely. When you query the DB using libpq you are always provided with strings. Historically the pg gem just echos it back to the Ruby program.

So for example:

ActiveRecord::Base.connection.raw_connection.exec("select 1").values[0][0] => "1"

The new pg release allows us to get back a number here, as opposed to a string. This helps us avoid allocating, converting and garbage collecting a string, just to get a number from our database. The cost of these conversions adds up quite a lot. It is much cheaper to do conversions in C code and avoid the allocation of objects that the Ruby GC needs to track.

cnn = ActiveRecord::Base.connection.raw_connection; puts cnn.type_map_for_results = PG::BasicTypeMapForResults.new cnn; puts cnn.exec('select 1').values[0][0] => 1

The tricky thing though is that people heavily over-query when using Active Record.

# no need to perform conversions for created_at and updated_at car_color = Car.first.color # above is more efficiently written as car_color = Car.limit(1).pluck(:color).first

If you perform conversions upfront for columns people never consume you can end up with a severe performance hit. In particular converting times is a very expensive operation which involves string parsing.

In my benchmarking it is showing a 5-10% improvement for the Discourse bench which in almost all cases means we are faster than Rails 4.1.

I think it is fairly likely that we can be 10-20% faster than Rails 4.1 by the time Rails 5 ships provided we watch the benchmarks closely and keep optimising.

Side Effects

A very interesting and rewarding side effect of this work was a series of optimisations I was able to apply to Discourse while looking at all the data.

In particular I built us a distributed cache powered by the message_bus gem. A Hash like structure that is usable from any process running Discourse (Discourse instances can span multiple machines). Using it I was able to apply caching to 3 places that we sorely needed to and avoid expensive database work.

Conclusion

The recent performance work strongly highlights the importance of a long running benchmark. There is no single magical method we can fix that will recover a year of various performance regressions. The earlier we pick up on regressions the easier they are to fix.

It also highlight the “paper cut” nature of this kind of work. It is easy to be flippant and say that 1% here and 0.5% percent there does not matter. However, repeat this process enough and you have a product that is 100% slower. We need to look at trends to see where we are at. Slow down dev and work on performance when performance regresses.

The Rails team did an admirable job at addressing a lot of the perf regressions we found in very short order. Sean did some truly impressive work. Godfrey did a spectacular job getting Discourse running on Rails master.

However, I am very sad that we do not have long running benchmark for either Ruby or Rails and worry that if we don’t we are bound to repeat this story again.