This post is based on Jeremy Evans’s Optimization Techniques Used by the Benchmark Winners presentation. Slides | YouTube at Ruby Kaigi 2019. It’s better to watch his fabulous presentation, but you can come back here for written references. These techniques coming from maintaining Sequel and Roda. These two gems have always been 0 issues. It’s probably the best maintained gems. Sequel and Roda are the leaders for Ruby in TechEmpower benchmark results. Let’s see what lessons can we learn to optimize Ruby.

These techniques may not be the best to write at application code level but good when building libraries, optimize critical path of your application. The ideas and principles are universally applicable.

Avoid processing

2:24

No code is faster than no code

-- Merb Motto

Fastest code does the least. Execute less code. Run code once instead of many times, once is better than many. In big-o terms: O(1) > O(n) .

Take initialize model instance as example, the Sequel is definitely faster than Active Record because it does much less things.

Active Record:

# https://github.com/rails/rails/blob/8ae8e19a33f4749e529b1649f4f0ace1dbb37285/activerecord/lib/active_record/core.rb#L357-L369 def init_with_attributes ( attributes , new_record = false ) # :nodoc: @new_record = new_record @attributes = attributes init_internals yield self if block_given? _run_find_callbacks _run_initialize_callbacks self end

Sequel:

# https://github.com/jeremyevans/sequel/blob/4a146796f273ce14e121954ad4e7f3134209386d/lib/sequel/model/base.rb#L221 def call ( values ) o = allocate o . instance_variable_set ( :@values , values ) o end

Delay Computations

For web app, it means to do at initialization time is better than doing at Runtime. See Hound applies this for putting environment variables into constants.

Example: Active Record & Sequel Callbacks

Active Record runs find and initialize callbacks during initialization. Even when there are no callbacks defined on the model, so it slows down:

# lib/active_record/core.rb def init_with_attributes ( attributes , new_record = false ) ... _run_find_callbacks _run_initialize_callbacks ... end

Sequel does this by having a Plugin System . In Sequel, if you need callbacks, add plugin :after_initialize to your model. Nothing gets run if you’re not using callbacks.

Inheritance gives every subclass things that they may not need. Prefer Composition over Inheritance. Plugins give all subclass a chance to only take what they need. Sequel and Roda use plugins for most features so they are faster.

Reduce Object Allocations

10:28

The init_internals in Active Record set few instance variables to nil :

def init_internals @primary_key = self . class . primary_key @readonly = false @destroyed = false @marked_for_destruction = false @destroyed_by_association = nil @_start_transaction_state = nil @transaction_state = nil self . class . define_attribute_methods end

Jeremy found avoid @instance_variable = nil resulted in 150% faster in Sequel and Roda. This is controversial because only in verbose mode, using instance variable without define first will emit warnings. Emit warnings makes it slower. He made a patch to remove such warnings but got rejected for backward compatibility.

You can find another example of delay allocation of instance variables in this commit from rails/rails. Delay setting instance variable to hash by setting it to nil first.

Another common place to reduce object allocations is string. Before Ruby 2.3, we need to call freeze on strings. Since 2.3, we can add a frozen string literal in a file and by default all strings in that file are frozen.

Reduce String Allocations

Before frozen string literal (Ruby < 2.3), introduce a constant for common places of strings SELECT = 'SELECT'.freeze was okay:

SELECT = 'SELECT' . freeze SPACE = ' ' . freeze FROM = 'FROM' . freeze def select_sql sql = String . new sql ‌ << SELECT ‌ << SPACE sql ‌ << literal ( columns ) sql ‌ << SPACE << ‌ FROM << ‌ SPACE sql ‌ << literal ( table ) end

Since Ruby 2.3:

# frozen-string-literal: true def select_sql sql = String . new sql << "SELECT " sql << literal ( columns ) sql << " FROM " sql << literal ( table ) end

It’s simpler and easier to understand. Avoids string allocations in the meantime with less operations ( << ), hence faster.

Reduce Hash allocations

Avoid default hash in method definition:

class Sequel::Dataset def union ( dataset , opts = {}) compound_clone ( :union , dataset , opts ) end end

Introduce an empty frozen hash constant:

class Sequel::Dataset OPTS = {}. freeze # 100% faster than `opts={}`. def union ( dataset , opts = OPTS ) compound_clone ( :union , dataset , opts ) end # Keyword argument is slower than Optimal hash constant. def union ( dataset , ** opts ) compound_clone ( :union , dataset , opts ) end end

Right now double splat keyword arguments are slower (2.7) than above empty frozen hash constant technique. Keyword arguments also have maintenance barriers the method caller and callee all need to change when keyword changes.

Reduce Proc Allocation

15:03

For proc does not depend on something else or any runtime state:

def indifferent_params Hash . new { | h , k | h [ k . to_s ] if k . is_a? ( Symbol ) } end

can extract into a constant:

IND = proc { | h , k | h [ k . to_s ] if k . is_a? ( Symbol ) } def indifferent_params2 Hash . new ( & IND ) end

About 300%+ faster.

Benchmark . ips do | x | x . report ( "inline proc" ) { indifferent_params } x . report ( "constant proc" ) { indifferent_params2 } x . compare! end Comparison : constant proc: 5645873.6 i / s inline proc: 1469374.7 i / s - 3.84 x slower

Reduce Array Allocation

This is not from the presentation but I think it’s great to also include an example here.

Use transform_values when modifying hashes. See this commit from rails/rails:

def escape(params) - Hash[params.map { |k, v| [k, Rack::Utils.escape(v)] }] + params.transform_values { |v| Rack::Utils.escape(v) } end

Minimize Indirection

Say a common operation in our app is to convert a string into Integer. We could have a lambda , a kernel method, a dedicated object with call to do so, or an object that aliased call to Integer() :

string = "42" using_lambda = lambda { | str | Integer ( str ) } using_kernel = Kernel . method ( :Integer ) using_object = Object . new def using_object . call ( str ); Integer ( str ); end no_indirection_object = Object . new class << no_indirection_object alias call Integer public :call end Benchmark . ips do | x | x . report ( "lambda" ) { using_lambda . call ( string ) } x . report ( "Kernel" ) { using_kernel . call ( string ) } x . report ( "Object" ) { using_object . call ( string ) } x . report ( "Object v2" ) { no_indirection_object . call ( string ) } x . compare! end

Comparison: Object v2: 6258404.2 i/s Object: 5674514.7 i/s - same-ish: difference falls within error lambda: 5199998.6 i/s - 1.20x slower Kernel: 5130274.6 i/s - 1.22x slower

We see using Kernel.method is slowest here. A lambda is slightly faster than that. Define a method on an object is faster than calling a lambda. Avoid indirection (by alias call to Integer ) further makes it ~10% faster. The benchmark example above aligns with Jeremy’s observations on benchmarking Sequel from the presentation. A similar observation is also found from graphql-ruby land that they’re now in favor of method call instead of using procs.

Defining Methods

19:02

def foo 1 end # 50% slower than `def foo; 1; end` define_method ( :foo ) do 1 end

So we always want to define method with def . Sometimes we want to define at runtime. Take Sequel as example, it needs to dynamically define model columns’s setters and getters, can achieve with class_eval + def .

# def_column columns . each do | column | class_eval ( "def #{ column } ; @values[: #{ column } ]; end" ) class_eval ( "def #{ column } =(v); @values[: #{ column } ] = v; end" ) end

This works for column name without spaces "name" , with space "employee name" it does not work. We need to use define_method :

columns . each do | column | column = column . to_sym define_method ( column ) do @values [ column ] end define_method ( :" #{ column } =" ) do | v | @values [ column ] = v end end

which is slower than using class_eval + def .

These columns with "bad" names are not the majority of the case but changing the implementation to this hurts performance. Here comes another general optimization principle.

Separate Common from Uncommon

21:18

We usually have a fast approach for the simple cases but it fails in more complex cases. Assume the simple cases are more common, we can speed up by separating the two cases. Sequel separates the common from the uncommon, to have the best of both worlds. Below are simplified from Sequel’s column definitions:

def def_column_accessor ( * columns ) columns , bad_columns = columns . partition { | x | /\A[A-Za-z_][A-Za-z0-9_]*\z/ . match ( x . to_s ) } bad_columns . each { | x | def_bad_column_accessor ( x )} # define_method columns . each do | column | # faster with `def` end end

In general def > define_method , but there is one case where define_method is preferred for performance. Let’s see an example:

class Def def self . def_numbers ( first , last ) class_eval ( "def numbers; ( #{ first } .. #{ last } ).to_a.freeze; end" ) end def self . def_numbers2 ( first , last ) array = ( first .. last ). to_a . freeze define_method ( :numbers2 ) do array end end end

Benchmark . ips do | x | x . report ( "def" ) { Def . def_numbers ( 1 , 10 ) } x . report ( "define_method" ) { Def . def_numbers2 ( 1 , 10 ) } x . compare! end Comparison : define_method: 542551.0 i / s def: 62085.7 i / s - 8.74 x slower

So prefer def over define_method unless you can access local variables from surrounding scope to avoid computations, only then define_method > def .

Related to this. If we need to accept blocks and store blocks, and use instance_exec to execute these blocks. Let’s implement a before hook to demonstrate this idea.

def self . before ( & block ) before_hooks << block end def before # execute all blocks passed to self.before self . class . before_hooks . each { | block | instance_exec ( & block ) } end

This works but instance_exec creates singelton class for the instance, it’s faster to use methods. We can define before hooks methods by naming them based on their position in the array, then we pass to define_method :

def self . before ( & block ) meth = :"_before_hook_ #{ before_hooks . length } " define_method ( meth , & block ) before_hooks << meth end def before self . class . before_hooks . each { | m | send ( m ) } end

Execute these before hooks could be a send which is faster. This is good but we can do better. Since we know which methods will be executed, we can define the before to execute these before hook methods by:

def self . before ( & block ) # ... class_eval ( "def before; #{ before_hooks . join ( ';' ) } ; end" ) end

This is faster because it avoids the need to call each on before hooks array and each method invoked directly instead of one indirection send . But we can still do better.

def self . before ( & block ) # ... if before_hooks . length > 1 class_eval ( "def before; #{ before_hooks . join ( ';' ) } ; end" ) else class_eval ( "alias before #{ before_hooks . first } " ) end end

If there is only a single before hook, we can alias to that method hence minimized indirection.

This approach to implementing before hooks compare to using define_method , is about 200% faster because it avoids a lot of internal indirections. One caveat of this approach is that it does not work with before { |x| } . But we can fix this by:

unless block . arity == 0 || block . arity == - 1 b = block block = lambda { instance_exec ( & b ) } end

Optimize Inner Loops

26:16

Another best place to start optimizing is inside any inner loops. Let’s see an example from Sequel’s Anywhere adapter:

# https://github.com/jeremyevans/sequel/blob/ae50b5e98efb3ad902fb37f370cbf237644a12d3/lib/sequel/adapters/sqlanywhere.rb#L157-L187 while db . api . sqlany_fetch_next ( rs ) == 1 max_cols = db . api . sqlany_num_cols ( rs ) h2 = {} max_cols . times do | cols | h2 [ col_map [ db . api . sqlany_get_column_info ( rs , cols )[ 2 ]] || db . api . sqlany_get_column_info ( rs , cols )[ 2 ]] = cps [ db . api . sqlany_get_column_info ( rs , cols )[ 4 ]]. nil? ? db . api . sqlany_get_column ( rs , cols )[ 1 ] : db . api . sqlany_get_column_info ( rs , cols )[ 4 ] != 500 ? cps [ db . api . sqlany_get_column_info ( rs , cols )[ 4 ]]. call ( db . api . sqlany_get_column ( rs , cols )[ 1 ]) : convert ? cps [ db . api . sqlany_get_column_info ( rs , cols )[ 4 ]]. call ( db . api . sqlany_get_column ( rs , cols )[ 1 ]) : db . api . sqlany_get_column ( rs , cols )[ 1 ] end yield h2 end unless rs . nil?

And there are some hidden ternary operators without parentheses...So this is complex, but not the reason why it’s slow. It’s slow because:

db.api.sqlany_get_column_info(rs, cols) repeated 5 times with same arguments in the inner loop. Another one is db.api.sqlany_get_column(rs, cols)[1] repeated 4 times. After a serious refactoring, now we arrive at:

def fetch_rows ( sql ) db = @db cps = db . conversion_procs api = db . api execute ( sql ) do | rs | convert = convert_smallint_to_bool col_infos = [] api . sqlany_num_cols ( rs ). times do | i | _ , _ , name , _ , type = api . sqlany_get_column_info ( rs , i ) cp = if type == 500 cps [ 500 ] if convert else cps [ type ] end col_infos << [ output_identifier ( name ), cp ] end self . columns = col_infos . map ( & :first ) max = col_infos . length if rs while api . sqlany_fetch_next ( rs ) == 1 i = - 1 h = {} while ( i += 1 ) < max name , cp = col_infos [ i ] v = api . sqlany_get_column ( rs , i )[ 1 ] h [ name ] = cp && v ? cp . call ( v ) : v end yield h end end end self end

The inner loop now becomes much simpler:

while ( i += 1 ) < max name , cp = col_infos [ i ] v = api . sqlany_get_column ( rs , i )[ 1 ] h [ name ] = cp && v ? cp . call ( v ) : v end

All operations inside are stored on local variables. This seems not a big deal but if you are retrieving 100,000 rows with each row of 10 columns, this makes a big difference. By define local variables, this saves millions of method calls. This is another general optimization principle.

Prefer Local Variables

29:02

Prefer local variables whenever possible especially in inner loop as demostrated above.

Local variables > Instance variables

Local variables > Constant

Local variables > Method Calls

local variables are faster because it minimize the number of indirections.

while v.s. each

Let’s go back to the above refactoring example. The deliberate use of while in the inner loop:

while ( i += 1 ) < max name , cp = col_infos [ i ] v = api . sqlany_get_column ( rs , i )[ 1 ] h [ name ] = cp && v ? cp . call ( v ) : v end

Inner loop like this is one of the few places that make sense to use while instead of each because using each for inner loops can hurt performance, because it requires a seperate a stackframe to push and pop for each iteration.

Choose Faster Algorithms

31:07

Sinatra routes (by pattern) v.s. Roda’s routing tree

Given thousands of routes. Sinatra will be busy matching against all routes. O(N) Roda’s routing tree would take route segments. Filter out the segment to the right tree. O(log(N))

Roda . route do | r | r . on "foo" do # /foo ... end r . on "bar" do # /bar ... end end

But the worst case is still O(N) and Jeremy does not accept the worst case being O(n) ... Here comes multi route:

Roda . plugin :multi_route Roda . route ( "foo" ) { | r | # /foo... } Roda . route ( "bar" ) { | r | # /bar... } Roda . route { | r | r . multi_route }

Multi route builds a regexp to match initialial segment ( /foo , /bar , ...) then dispatch:

Roda . route { | r | r . multi_route }

which brings the performance back to O(log(n)) . But he wants O(1) ...

Roda . plugin :static_routing Roda . static_get ( "foo" ) { | r | # /foo... } Roda . static_get ( "bar" ) { | r | # /bar... } Roda . route { | r | }

But O(1) gives up many advantages. In the end, he created hash_routes plugin to have the best of both worlds:

Roda . plugin :hash_routes Roda . hash_routes do on ( "foo" ) do | r | r . hash_routes end is ( "foo/bar" ) do | r | end Roda . route do | r | r . hash_routes end

Cache when Possible

39:40

Introduce cache is the highest ratio of the following formula:

% Increase in Performance ----------------------------- Lines of Code Changed

Maximum performance increase over minimum lines of code changed.

Use case: Sequel Symbol Literalization

40:00

:column_name # => '"column_name"'

def self . split_symbol ( sym ) v = case s = sym . to_s when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/ [ $1 . freeze , $2 . freeze , $3 . freeze ]. freeze when /\A((?:(?!___).)+)___(.+)\z/ [ nil , $1 . freeze , $2 . freeze ]. freeze when /\A((?:(?!__).)+)__(.+)\z/ [ $1 . freeze , $2 . freeze , nil ]. freeze else [ nil , s . freeze , nil ]. freeze end v end

Introduce symbol cache:

SPLIT_SYMBOL_CACHE = {} def self . split_symbol ( sym ) unless v = Sequel . synchronize { SPLIT_SYMBOL_CACHE [ sym ] } v = case s = sym . to_s when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/ [ $1 . freeze , $2 . freeze , $3 . freeze ]. freeze when /\A((?:(?!___).)+)___(.+)\z/ [ nil , $1 . freeze , $2 . freeze ]. freeze when /\A((?:(?!__).)+)__(.+)\z/ [ $1 . freeze , $2 . freeze , nil ]. freeze else [ nil , s . freeze , nil ]. freeze end Sequel . synchronize { SPLIT_SYMBOL_CACHE [ sym ] = v } end v end

This is over 1000% faster after introduced this cache.

Globally Frozen, Locally Mutable

41:50

Freeze global state like classes, objects that is accessed by multiple threads. Local objects instantiated per request remain mutable for ease of use (cache). This approach is mainly to improve reliability (thread safe). This also improves performance because frozen object means it is easy to cache them.

Note that Frozen is not the same as Immutable. Frozen means Immutable State + Mutable Cache.

# https://github.com/jeremyevans/sequel/blob/cc5c1612fb4c42e7cbf6db6b0572001fd5bf6b58/lib/sequel/dataset/misc.rb#L25-L30 OPTS = {} def initialize ( db ) @db = db @opts = OPTS # state in frozen hash cache = {} # <-- mutable cache freeze # <-- freeze this object end

Make sure access to mutable cache is protected by Mutex for thread safety.

Optimization Through Metaprogramming

46:11

class Album < Sequel :: Model dataset_module do def by_name order ( :name ) end def released where ( released: true ) end end end # Album.where(released: true).order(:name).first # => Album.released.by_name.first

This is good to DRY up code but does not improve performance. Jeremy introduced metaprogramming for users to define like following and speed up by introduced dataset cache to these metaprogramming generated methods:

class Album < Sequel :: Model dataset_module do order :by_name , :name # def by_name # cache { order(:name) } # end where :released , released: true # def released # cache { where(released: true) } # end end end

Example: Roda String Matching

48:41

Roda was forked from Cuba. The matching of routing segment was:

def match_string ( str ) consume ( Regexp . escape ( str )) end # https://github.com/soveran/cuba/blob/1dce54d0926c5d6e5daab033a8e9685c6faa4227/lib/cuba.rb#L214-L226 def consume ( pattern ) matchdata = env [ Rack :: PATH_INFO ]. match ( /\A\/( #{ pattern } )(\/|\z)/ ) return false unless matchdata path , * vars = matchdata . captures env [ Rack :: SCRIPT_NAME ] += "/ #{ path } " env [ Rack :: PATH_INFO ] = " #{ vars . pop }#{ matchdata . post_match } " captures . push ( * vars ) end private :consume

This code looks perfectly fine. But let’s see how Jeremy optimized it. First is to avoid allocations. The first change was to include the proceeding slash in the first capture to avoid extra array allocation for the captures:

- /\A\/(#{pattern})(\/|\z)/ + /\A(\/(#{pattern}))(\/|\z)/ ... - path, *vars = matchdata.captures + vars = matchdata.captures - env[Rack::SCRIPT_NAME] += "/#{path}" + env[Rack::SCRIPT_NAME] += vars.shift

Next is to use a regular expression feature, positive lookahead assertion instead of a capture:

- /\A(\/(#{pattern}))(\/|\z)/ + /\A(\/(#{pattern}))(?=\/|\z)/

So we no longer need to pop the last element of capture variables ( vars ):

- env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}" + env[Rack::PATH_INFO] = matchdata.post_match

Also avoid the allocation of additional string by using post match directly.

He then profile and found generating a new regular expression every time consume was called takes a large portion of time. Then he introduced cached matcher:

def match_string(str) - consume(Regexp.escape(str)) + consume(self.class.cached_matcher(str) { Regexp.escape(str) }) end

This change is possible because consume was a private method. Another optimizations technique is to keep most methods private.

Keep Most Methods Private

Only makes a method public if it needs to be public. When method is private, you are free to change its API to improve performance. Public methods have limit options to optimize.

Prefer String operations over Regexp operations

In Roda 3.0, the matching of string segment changed to use string operations instead of regular expression matching:

def match_string ( str ) rp = @remaining_path if rp . start_with? ( "/ #{ str } " ) # <---- last = str . length + 1 case = rp [ last ] # <---- when "/" @remaining_path = rp [ last , rp . length ] # <---- when nil @remaining_path = "" end end end

But still, there are two String allocations: rp.start_with?("/#{str}") and rp[last] .

Prefer Integer operations over String operations

Iterating by index and checking if first character of string is 47 ( 47 is ascii code of / ):

def _match_string ( str ) rp = @remaining_path length = str . length match = case rp . rindex ( str , length ) # <------ when nil return when 1 rp . getbyte ( 0 ) == 47 # <--- start_with?("/") else length == 0 && rp . getbyte ( 0 ) == 47 # <--- start_with?("/") end if match length += 1 case rp . getbyte ( length ) when 47 @remaining_path = rp [ length , 100000000 ] when nil @remaining_path = "" end end end

The most common failure case when nil also moved up because it happens more often. Another trick is to use a large number that is larger than any reasonable path length: rp[length, 100000000] .

Finally...

Remember that optimizations come last.

First make it work, then make it correct.

After you make it fun, then you can make it fast.

—Jeremy Evans

Profile to find where to optimize. Then Benchmark your change.

That’s all from Jeremy’s presentation.

There is no silver bullet. It depends on the context. We must try ourselves. Verify it’s faster than the other instead of directly quoting anything here.

The faults are probably mine. I’m happy to fix if you would let me know. You can reach out to him on twitter for more questions: @jeremyevans0 or GitHub to see more of his work.

Thanks for reading!