Even though registers are a low-level CPU concept, having some knowledge about them can help write faster code. Simply put, a CPU register is a storage for a single variable. CPU can keep data in memory or cache or in registers and registers are often much faster. Furthermore, some operations are possible only when the data is in registers. Hence, the OCaml compiler tries to keep as many variables as it can in the registers.

Code with more than 13 variables is slow?

Consider this primitive statistics computation:

let stats xs ys = let len = Array.length xs in if len Array.length ys then raise not_of_same_length; let sx = ref 0 in let sy = ref 0 in let sxx = ref 0 in let syy = ref 0 in let sxy = ref 0 in let sax = ref 0 in let say = ref 0 in for i = 0 to len - 1 do let x = Array.unsafe_get xs i in let y = Array.unsafe_get ys i in let ax = abs x in let ay = abs y in let xx = x * x in let yy = y * y in let xy = x * y in sx := !sx + x; sy := !sy + y; sxx := !sxx + xx; syy := !syy + yy; sxy := !sxy + xy; sax := !sax + ax; say := !say + ay; done; !sx, !sy, !sax, !say, !sxx, !syy, !sxy

Rearranging just a few lines produces code 1.5-2x faster:

let x = Array.unsafe_get xs i in sx := !sx + x; let xx = x * x in sxx := !sxx + xx; let ax = abs x in sax := !sax + ax; let y = Array.unsafe_get ys i in sy := !sy + y; let xy = x * y in sxy := !sxy + xy; let ay = abs y in say := !say + ay; let yy = y * y in syy := !syy + yy;

Why is that? CPU has just a few registers:

13 which can contain integers and pointers to arrays and records

16 which can contain floating point numbers

If the code has more than that many variables, OCaml compiler has to park the extra variables in memory and this parking is called spilling. Actually, spilling may happen even when there are less variables, because for example some operations like integer multiplication use extra registers.

Therefore, it’s good to try to keep the number of frequently used variables to 13 or less, or to rearrange the code so that fewer variables are used at the same time.

The OCaml compiler can show spilled variables when called with the option -dreload.

Calling a single function makes subsequent code slower?

If a function is called, all of the active registers are spilled because it is not known whether the called function will need those registers. That can often be the largest penalty when calling a function, assuming the function is not inlined.

Let’s change the previous function:

let stats xs ys = let sx = ref 0 in let sax = ref 0 in let sxx = ref 0 in let sy = ref 0 in let say = ref 0 in let syy = ref 0 in let sxy = ref 0 in let len = Array.length xs in if len <> Array.length ys then failwith (Printf.sprintf "Arrays not of the same length: %d %d" len (Array.length ys)); for i = 0 to len - 1 do ...

This produces 1.35x slower code simply because OCaml compiler will spill all of the variables because of Printf.sprintf. In each iteration, OCaml will pull sx from the memory and store it back.

It’s a pity that this is actually not necessary. OCaml has to pull sx from the memory and store it back just once, not in each iteration. Looks like that can be improved in the OCaml compiler.

Recursive functions with more parameters are faster?

A function can get data from input parameters, from closure environment and from memory. Input parameters are normally stored in registers. Therefore, using input parameters will generally be slightly faster than the closure environment and memory. This can come in handy in recursions.

Take for example this function which finds the index of the given integer in the array:

let index (a : int array) e = let len = Array.length a in let rec loop i = if i < len then begin if Array.unsafe_get a i = e then Some i else loop (i + 1) end else None in loop 0

The variables e and len are pulled from closure environment in each iteration. Moving them to input parameters speeds up the function 1.15-1.33 times:

let index (a : int array) e = let rec loop e len i = if i < len then begin if e = Array.unsafe_get a i then Some i else loop e len (i + 1) end else None in loop e (Array.length a) 0

Further reading

Processor Register

Register Allocation