This article is about the Ruby GC. In particular it is about the GC present in Ruby MRI 2.0.

The Ruby GC has been through quite a few iterations, in 1.9.3 we were introduced to the lazy sweeping algorithm and in 2.0 we were introduced bitmap marking. Ruby 2.1 is going to introduce many more concepts and is out-of-scope for this post.

Heaps of heaps

MRI (Matz’s ruby interpreter) stores objects aka. RVALUEs in heaps, each heap is approx 16KB. RVALUE structs consume different amounts of memory depending on the machine architecture. On x64 machines they consume 40 bytes, on x32 machines they consume 20 to 24 bytes depending on the sub-architecture (some optimizations shave off a few extra bytes on say, cygwin using magic pragmas).

An RVALUE is a magical c struct that is a union of various “low level” c representations of Ruby objects. For example, in MRI, an RVALUE can accessed as a RRegexp or a RString or an RObject and so on. I strongly recommend the excellent Ruby Under a Microscope to get a handle of this, GC algorithms and MRI in general.

Given this, each heap in a x64 machine we are able to store about 409 Ruby objects give or take a few for heap alignment and headers.

[1] pry(main)> require 'objspace' => true [2] pry(main)> ObjectSpace.count_objects[:TOTAL] / GC.stat[:heap_used] => 406

A typical Rails application (like say Discourse) will have about 400 thousand objects in 1100 or so heaps (heaps can get fragmented with empty space), we can see this by running:

$ RAILS_ENV=production rails c > GC.start > GC.stat => {:count=>102, :heap_used=>1160, :heap_length=>1648, :heap_increment=>488, :heap_live_num=>369669, :heap_free_num=>102447, :heap_final_num=>0, :total_allocated_object=>3365152, :total_freed_object=>2995483}

About GC.stat in Ruby 2.0

GC.stat is a goldmine of information, it is the first place you should go to before doing any GC tuning, here is an overview of what they mean, unfortunately it is not documented some attempts are not that accurate, here is my go at it, after reading the GC source:

count: the number of times a GC ran (both full GC and lazy sweep are included)

heap_used: the number of heaps that have more than 0 slots used in them. The larger this number, the slower your GC will be.

heap_length: the total number of heaps allocated in memory. For example 1648 means - about 25.75MB is allocated to Ruby heaps. (1648 * (2 << 13)).to_f / (2 << 19)

heap_increment: Is the number of extra heaps to be allocated, next time Ruby grows the number of heaps (as it does after it runs a GC and discovers it does not have enough free space), this number is updated each GC run to be 1.8 * heap_used. In later versions of Ruby this multiplier is configurable.

heap_live_num: This is the running number objects in Ruby heaps, it will change every time you call GC.stat

heap_free_num: This is a slightly confusing number, it changes after a GC runs, it will let you know how many objects were left in the heaps after the GC finished running. So, in this example we had 102447 slots empty after the last GC. (it also increased when objects are recycled internally - which can happen between GCs)

heap_final_num: Is the count of objects that were not finalized during the last GC

total_allocated_object: The running total of allocated objects from the beginning of the process. This number will change every time you allocate objects. Note: in a corner case this value may overflow.

total_freed_object: The number of objects that were freed by the GC from the beginning of the process.

When will the GC run

The GC in Ruby 2.0 comes in 2 different flavors. We have a “full” GC that runs after we allocate more than our malloc_limit and a lazy sweep (partial GC) that will run if we ever run out of free slots in our heaps.

The lazy sweep takes less time than a full GC, however only performs a partial GC. It’s goal is to perform a short GC more frequently thus increasing overall throughput. The world stops, but for less time.

The malloc_limit is set to 8MB out of the box, you can raise it by setting the RUBY_GC_MALLOC_LIMIT higher.

Why a malloc limit?

Discourse at boot only takes up 25MB of heap space, however when we look at the RSS for the process we can see it is consuming way over 134MB. Where is all this extra memory?

sam@ubuntu:~/Source/discourse$ RAILS_ENV=production rails c irb(main):008:0> `ps -o rss= -p #{Process.pid}`.to_i => 134036 irb(main):009:0> (GC.stat[:heap_length] * (2 << 13)).to_f / (2 << 19) => 26.15625

The Ruby heaps store RVALUE objects, these objects at most can store 40 bytes. For Strings, Arrays and Hashes this means that small objects can fit in the heap, but as soon as they reach a threshold, Ruby will malloc extra memory outside of the Ruby heaps. We can see an example here:

sam@ubuntu:~/Source/discourse$ irb irb(main):001:0> require 'objspace' => true irb(main):002:0> ObjectSpace.memsize_of("a") => 0 irb(main):005:0> ObjectSpace.memsize_of("a"*23) => 0 irb(main):006:0> ObjectSpace.memsize_of("a"*24) => 24 # peace comes at a high cost irb(main):017:0> ObjectSpace.memsize_of("☮"*8) => 24

Note: in Ruby 2.0 memsize_of does not include the RVALUE size, hence 0, we are hoping to change this in Ruby 2.1 see: http://bugs.ruby-lang.org/issues/8984

Turns out that for Rails apps the vast majority of the RSS consumption is not by Ruby heaps but by attached information to objects allocated outside of the Ruby heap and general memory fragmentation.

$ RAILS_ENV=production rails c irb(main):005:0> size=0; ObjectSpace.each_object{|o| size += ObjectSpace.memsize_of(o) }; puts size/1024 67265

This fact puts a bit of a damper on the GC Bitmap Marking algorithm introduced in Ruby 2.0. For a large Rails app, at best, it is optimising reuse of 20% or so, further more this 20% can get fragmented which makes stuff worse.

We can explore the default malloc limit (it is 8MB out of the box). If we allocate 8 objects that are 1MB each we can trigger a GC:

$ irb irb(main):001:0> GC.start => nil irb(main):002:0> GC.count => 22 irb(main):003:0> 8.times { Array.new(1_000_000/8) } ; puts => nil irb(main):004:0> GC.count => 23 irb(main):005:0> require 'objspace' => true irb(main):006:0> ObjectSpace.memsize_of(Array.new(1_000_000/8)) => 1000000 irb(main):007:0>

Ruby protects your processes from using up all the available memory on your computer when making throw away copies of large objects.

However, this setting very outdated, it was introduced many years ago by matz when memory was scarce.

For an added bonus using very nasty hacks we can even raise this number in runtime.

sam@ubuntu:~/Source/discourse$ irb irb(main):001:0> 15.times { Array.new(16_000_000/8) }; puts => nil irb(main):002:0> GC.start; GC.count => 38 irb(main):003:0> 15.times { Array.new(1_000_000/8) }; puts => nil irb(main):004:0> GC.count => 38

MRI will raise the GC limit if it over-exhausted (a percentage each time). However, in the real world, in a real Rails app the GC limit is very unlikely to grow much during runtime, you just don’t allocate huge objects regularly. So, we usually use the environment var RUBY_GC_MALLOC_LIMIT to push this number up.

Every rails app should have a higher malloc limit. The default is too small, this tiny default means that many Rails apps in the wild are getting zero benefit from the faster “lazy sweep” algorithm implemented in Ruby 1.9.3. Further more, low malloc limits mean that the GC runs way too often. Typical Rails requests will regularly allocate a couple of megs of RAM.

What should you set it to? It totally depends on the app. For Discourse we recommend 50MB. The downside of setting this too high is that you are increasing general memory fragmentation.

How much memory is a page view allocating?

rack-mini-profiler (in master) contains a very handy report to get a handle of memory use in your various pages. Just apppend ?pp=profile-gc at the end of your url:

Overview ------------------------------------ Initial state: object count - 377099 , memory allocated outside heap (bytes) 76765247 GC Stats: count : 114, heap_used : 4283, heap_length : 4312, heap_increment : 0, heap_live_num : 459148, heap_free_num : 1283203, heap_final_num : 0, total_allocated_object : 6292870, total_freed_object : 5833722 New bytes allocated outside of Ruby heaps: 1458308 New objects: 38363 ObjectSpace delta caused by request: -------------------------------------------- String : 18638 Array : 10053 Hash : 3229 ActiveRecord::AttributeMethods::TimeZoneConversion::Type : 1297 Rational : 790 Time : 615 MatchData : 364 RubyVM::Env : 330

Here we can see that the front page is causing 1.45MB to allocate, so out-of-the-box, without any malloc tuning we can only handle 5 requests. 5 requests only generate 190k or so objects in the heap that is way below heap free num.

We spent a lot of time tuning Rails 4 to cut down on allocations, before we started tuning this we were easily allocating double the amount for a front page request.

note: running this report unavoidably is likely to cause your Ruby heaps to grow, due to iteration through ObjectSpace with GC disabled. It is recommended you cycle your processes in production after an analysis session.

The trouble with the heap growth algorithm

Ruby heaps will grow by a factor of 1.8 (times used heap size post GC) every time heap space is hits a threshold. This is rather problematic for real world apps. The number of heaps available may increase during an apps lifecycle, but it will never decrease. Say you have 1000 heaps in play, next time heaps grow you will jump to 1,800 heaps. However, your app may have optimal performance with 1,400 heaps. Remember, the more used heaps you have the slower it will take a GC to run.

note: the Ruby heap growth factor is configurable and adaptable in Ruby 2.1.

We have some control over the heap count using the RUBY_HEAP_MIN_SLOTS , we can tell ruby to pre-allocate heap space, unfortunately in Ruby 2.0 p247 this is a bit buggy and will result in over allocation, for example here we ask for 1000 heap slots but get 1803 in Ruby 2.0:

sam@ubuntu:~/Source$ rbenv shell ruby-head sam@ubuntu:~/Source$ RUBY_HEAP_MIN_SLOTS=$(( 408*1000 )) ruby -e "puts GC.stat[:heap_length]" 1000 sam@ubuntu:~/Source$ rbenv shell 2.0.0-p247 sam@ubuntu:~/Source$ RUBY_HEAP_MIN_SLOTS=$(( 408*1000 )) ruby -e "puts GC.stat[:heap_length]" 1803

So, you can use this setting but be careful with it, it will over commit heap space, meaning, slower GC times. See also: http://bugs.ruby-lang.org/issues/9134

We can also attempt to control heap space with RUBY_FREE_MIN . Unfortunately this setting does not work as expected.

sam@ubuntu:~/Source$ RUBY_FREE_MIN=$(( 408*10000 )) ruby -e " GC.start; p GC.stat[:heap_length]" 81 sam@ubuntu:~/Source$ RUBY_FREE_MIN=$(( 408*20000 )) ruby -e " GC.start; p GC.stat[:heap_length]" 81

All this setting does is forces Ruby to evaluate if it needs to grow a heap, more aggressively.

Out of the box this is how the algorithm works, more or less:

GC sweep runs Ruby checks if the free_num (the number of free objects in the used heaps) is smaller than free_min aka (RUBY_FREE_MIN) Ruby runs set_heaps_increment and heaps_increment set_heaps_increment checks to see if heaps_used * 1.8 is larger than heaps_length … if it is it will grow the heap by 0.8 * heaps_used.

The key here is that all free_num does is trigger a check. Out of the box free_min is dynamically adjusted to 20% of heaps_used. I can not think of any reason you would really play with this setting.

The implementation is much more more intuitive in Ruby 2.1 see: http://bugs.ruby-lang.org/issues/9137

###The holy grail of an out-of-band GC

A full GC can take a long time, in fact on a droplet at Digital Ocean, this blog can spend upwards of 100ms to perform a GC.

This GC stops the world and “stalls” your customers. In an ideal world you would be able to control the GC and run it between requests. As long as you have enough worker processes, this stall will be invisible to your customers.

The problem though is that it is very hard to predict when a GC will run, cause malloc information is totally invisible in Ruby 2.0. We are hoping to expose more information in Ruby 2.1.

This means that if RUBY_GC_MALLOC_LIMIT is set too low, you have no way of predicting when a GC will run.

There have been two attempts at an out-of-band-gc made public.

Unicorn OOBGC http://unicorn.bogomips.org/Unicorn/OobGC.html Passenger OOBGC https://github.com/phusion/passenger/blob/master/lib/phusion_passenger/rack/out_of_band_gc.rb

Both attempts are severely flawed. In modern web apps the amount of data a page can allocate varies wildly. Some pages may allocate a tiny amount of memory and objects others lots.

You can not deterministically guess when its best to run the GC based on request count alone. This means these attempt often run the GC way too often.

Worst still they often attempt to run GC.disable which has extreme possibility of creating rogue Ruby processes with massive heaps. Once you disable the GC all bets are off. A simple loop can create an very problematic process.

irb(main):008:0> GC.disable => false irb(main):009:0> 100_000_000.times{ "" } ; p => nil irb(main):010:0> GC.enable => true irb(main):011:0> GC.stat => {:count=>4472, :heap_used=>246240, :heap_length=>286126, :heap_increment=>39886, :heap_live_num=>100082676, :heap_free_num=>42424, :heap_final_num=>0, :total_allocated_object=>289369670, :total_freed_object=>189286994} iirb(main):014:0> t=Time.now; GC.start; puts (Time.now - t) 0.15620451

There, we now have a process that takes 156ms to run the GC on bleeding edge hardware.

And let’s not forget the obscene memory usage

sam@ubuntu:~/Source/discourse$ smem PID User Command Swap USS PSS RSS 8906 sam irb 0 3982736 3983692 3985700

Even with all the missing information, we can do better than a simple, flawed, request count. At Discourse I have been working on an out-of-band-GC that works quite successfully in production. Firstly we need to make sure malloc limit rarely affects us. We do so by raising it to 40MB.

Source is here: https://github.com/discourse/discourse/blob/master/lib/middleware/unicorn_oobgc.rb

It attempts to keep a running estimate of the live object count that will trigger a GC using:

# the closer this is to the GC run the more accurate it is def estimate_live_num_at_gc(stat) stat[:heap_live_num] + stat[:heap_free_num] end

This is extremely conservative and not that accurate. We can also experiment with:

# base on heap length def estimate_live_num_at_gc(stat) stat[:heap_length] * 408 # objects per slot end

The algorithm than tries to leave room for 2 “big” requests, if it notices there is not enough room, it will preempt a GC.

This worked very successfully for us at http://discourse.ubuntu.com as can be seen when running this in verbose mode.

OobGC hit pid: 28701 req: 56 max delta: 111782 expect at: 893328 67ms saved OobGC hit pid: 28680 req: 57 max delta: 50000 expect at: 893328 64ms saved OobGC hit pid: 28728 req: 45 max delta: 112105 expect at: 893328 61ms saved OobGC hit pid: 28687 req: 49 max delta: 50000 expect at: 949063 74ms saved OobGC hit pid: 28707 req: 66 max delta: 50000 expect at: 893328 71ms saved OobGC hit pid: 28695 req: 89 max delta: 50000 expect at: 893328 67ms saved OobGC hit pid: 28728 req: 20 max delta: 71807 expect at: 893328 61ms saved OobGC hit pid: 28680 req: 43 max delta: 62992 expect at: 893328 68ms saved OobGC hit pid: 28701 req: 75 max delta: 50000 expect at: 893328 73ms saved OobGC hit pid: 28707 req: 52 max delta: 50000 expect at: 893328 68ms saved OobGC hit pid: 28695 req: 34 max delta: 81301 expect at: 893328 61ms saved OobGC hit pid: 28687 req: 68 max delta: 50000 expect at: 949063 74ms saved OobGC hit pid: 28728 req: 69 max delta: 50000 expect at: 893358 69ms saved OobGC hit pid: 28701 req: 39 max delta: 73273 expect at: 893328 61ms saved OobGC hit pid: 28695 req: 47 max delta: 115067 expect at: 893328 65ms saved OobGC hit pid: 28707 req: 48 max delta: 185909 expect at: 893328 68ms saved OobGC hit pid: 28680 req: 85 max delta: 50000 expect at: 893328 68ms saved OobGC hit pid: 28695 req: 20 max delta: 52118 expect at: 893328 62ms saved OobGC hit pid: 28687 req: 63 max delta: 50000 expect at: 949063 73ms saved OobGC hit pid: 28728 req: 42 max delta: 64944 expect at: 893328 63ms saved OobGC hit pid: 28680 req: 41 max delta: 138184 expect at: 893328 65ms saved OobGC hit pid: 28701 req: 50 max delta: 50000 expect at: 893328 70ms saved OobGC miss pid: 28707 reqs: 50 max delta: 50000

Once in a while you get a miss, cause it is impossible to predict malloc and potentially massive requests, however, in general it helps a lot. You can see the out-of-band-gc kicking in at different request counts, sometimes we can handle 20 requests between GCs, other times 80. As an added bonus, you don’t need to run unicorn_killers and risk is very low.

Keep exploring

Given the built in tooling and Mini Profiler, you are not running blind, you can do quite a lot to investigate and understand your GC behavior.

Try running these snippets and tools, try exploring.

Many very exciting changes both to GC algorithms and tooling are forthcoming in Ruby 2.1 thanks to work by Koichi Sasada, Aman Gupta and others. I hope to blog about it.

Special thank you to Koichi for reviewing this article.