Some notes before the blood starts flowin’

CAUTION: What you are about to read is dangerous, non-portable, and (in most cases) stupid.

What you are about to read is dangerous, non-portable, and (in most cases) stupid. The code and article below refer only to the x86_64 architecture.

architecture. Grab some gauze. This is going to get ugly.

TLDR

If you enjoy this article, subscribe (via RSS or e-mail) and follow me on twitter.

This article shows off a Ruby gem which has the power to overwrite a Ruby binary in memory while it is running to allow your code to execute in place of internal VM functions. This is useful if you’d like to hook all object allocation functions to build a memory profiler.

This gem is on GitHub

Yes, it’s on GitHub: http://github.com/ice799/memprof.

I want a memory profiler for Ruby

This whole science experiment started during RubyConf when Aman and I began brainstorming ways to build a memory profiling tool for Ruby.

The big problem in our minds was that for most tools we’d have to include patches to the Ruby VM. That process is long and somewhat difficult, so I started thinking about ways to do this without modifying the Ruby source code itself.

The memory profiler is NOT DONE just yet. I thought that the hack I wrote to let us build something without modifying Ruby source code was interesting enough that it warranted a blog post. So let’s get rolling.

What is a trampoline?

Let’s pretend you have 2 functions: functionA() and functionB() . Let’s assume that functionA() calls functionB() .

Now also imagine that you’d like to insert a piece of code to execute in between the call to functionB() . You can imagine inserting a piece of code that diverts execution elsewhere, creating a flow: functionA() –> functionC() –> functionB()

You can accomplish this by inserting a trampoline.

A trampoline is a piece of code that program execution jumps into and then bounces out of and on to somewhere else.

This hack relies on the use of multiple trampolines. We’ll see why shortly.

Two different kinds of trampolines

There are two different kinds of trampolines that I considered while writing this hack, let’s take a closer look at both.

Caller-side trampoline

A caller-side trampoline works by overwriting the opcodes in the .text segment of the program in the calling function causing it to call a different function at runtime.

The big pros of this method are:

You aren’t overwriting any code, only the address operand of a callq instruction.

instruction. Since you are only changing an operand, you can hook any function. You don’t need to build custom trampolines for each function.

This method also has some big cons too:

You’ll need to scan the entire binary in memory and find and overwrite all address operands of callq . This is problematic because if you overwrite any false-positives you might break your application.

and find and all address operands of . This is problematic because if you overwrite any false-positives you might break your application. You have to deal with the implications of callq , which can be painful as we’ll see soon.

Callee-side trampoline

A callee-side trampoline works by overwriting the opcodes in the .text segment of the program in the called function, causing it to call another function immediately

The big pro of this method is:

You only need to overwrite code in one place and don’t need to worry about accidentally scribbling on bytes that you didn’t mean to.

this method has some big cons too:

You’ll need to carefully construct your trampoline code to only overwrite as little of the function as possible (or some how restore opcodes), especially if you expect the original function to work as expected later.

You’ll need to special case each trampoline you build for different optimization levels of the binary you are hooking into.

I went with a caller-side trampoline because I wanted to ensure that I can hook any function and not have to worry about different Ruby binaries causing problems when they are compiled with different optimization levels.

The stage 1 trampoline

To insert my trampolines I needed to insert some binary into the process and then overwrite callq instructions like this:

41150b: e8 cc 4e 02 00 callq 4363dc [rb_newobj] 411510: 48 89 45 f8 ....

In the above code snippet, the byte e8 is the callq opcode and the bytes cc 4e 02 00 are the distance to rb_newobj from the address of the next instruction, 0×411510

All I need to do is change the 4 bytes following e8 to equal the displacement between the next instruction, 0×411510 in this case, and my trampoline.

Problem.

My first cut at this code lead me to an important realization: the callq instructions used expect a 32bit displacement from the function I am calling and not absolute addresses. But, the 64bit address space is very large. The displacement between the code for the Ruby binary that lives in the .text segment is so far away from my Ruby gem that the displacement cannot be represented with only 32bits.

So what now?

Well, luckily mmap has a flag MAP_32BIT which maps a page in the first 2GB of the address space. If I map some code there, it should be well within the range of values whose displacement I can represent in 32bits.

So, why not map a second trampoline to that page which can contains code that can call an absolute address?

My stage 1 trampoline code looks something like this:

/* the struct below is just a sequence of bytes which represent the * following bit of assembly code, including 3 nops for padding: * * mov $address, %rbx * callq *%rbx * ret * nop * nop * nop */ struct tramp_tbl_entry ent = { .mov = {'\x48','\xbb'}, .addr = (long long)&error_tramp, .callq = {'\xff','\xd3'}, .ret = '\xc3', .pad = {'\x90','\x90','\x90'}, }; tramp_table = mmap(NULL, 4096, PROT_WRITE|PROT_READ|PROT_EXEC, MAP_32BIT|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (tramp_table != MAP_FAILED) { for (; i < 4096/sizeof(struct tramp_tbl_entry); i ++ ) { memcpy(tramp_table + i, &ent, sizeof(struct tramp_tbl_entry)); } } }

It mmap s a single page and writes a table of default trampolines (like a jump table) that all call an error trampoline by default. When a new trampoline is inserted, I just go to that entry in the table and insert the address that should be called.

To get around the displacement challenge described above, the addresses I insert into the stage 1 trampoline table are addresses for stage 2 trampolines.

The stage 2 trampoline

Setting up the stage 2 trampolines are pretty simple once the stage 1 trampoline table has been written to memory. All that needs to be done is update the address field in a free stage 1 trampoline to be the address of my stage 2 trampoline. These trampolines are written in C and live in my Ruby gem.

static void insert_tramp(char *trampee, void *tramp) { void *trampee_addr = find_symbol(trampee); int entry = tramp_size; tramp_table[tramp_size].addr = (long long)tramp; tramp_size++; update_image(entry, trampee_addr); }

An example of a stage 2 trampoline for rb_newobj might be:

static VALUE newobj_tramp() { /* print the ruby source and line number where the allocation is occuring */ printf("source = %s, line = %d

", ruby_sourcefile, ruby_sourceline); /* call newobj like normal so the ruby app can continue */ return rb_newobj(); }

Programatically rewriting the Ruby binary in memory

Overwriting the Ruby binary to cause my stage 1 trampolines to get hit is pretty simple, too. I can just scan the .text segment of the binary looking for bytes which look like callq instructions. Then, I can sanity check by reading the next 4 bytes which should be the displacement to the original function. Doing that sanity check should prevent false positives.

static void update_image(int entry, void *trampee_addr) { char *byte = text_segment; size_t count = 0; int fn_addr = 0; void *aligned_addr = NULL; /* check each byte in the .text segment */ for(; count < text_segment_len; count++) { /* if it looks like a callq instruction... */ if (*byte == '\xe8') { /* the next 4 bytes SHOULD BE the original displacement */ fn_addr = *(int *)(byte+1); /* do a sanity check to make sure the next few bytes are an accurate displacement. * this helps to eliminate false positives. */ if (trampee_addr - (void *)(byte+5) == fn_addr) { aligned_addr = (void*)(((long)byte+1)&~(0xffff)); /* mark the page in the .text segment as writable so it can be modified */ mprotect(aligned_addr, (void *)byte+1 - aligned_addr + 10, PROT_READ|PROT_WRITE|PROT_EXEC); /* calculate the new displacement and write it */ *(int *)(byte+1) = (uint32_t)((void *)(tramp_table + entry) - (void *)(byte + 5)); /* disallow writing to this page of the .text segment again */ mprotect(aligned_addr, (((void *)byte+1) - aligned_addr) + 10, PROT_READ|PROT_EXEC); } } byte++; } }

Sample output

After requiring my ruby gem and running a test script which creates lots of objects, I see this output:

... source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 source = test.rb, line = 8 ...

Showing the file name and line number for each object getting allocated. That should be a strong enough primitive to build a Ruby memory profiler without requiring end users to build a custom version of Ruby. It should also be possible to re-implement bleak_house by using this gem (and maybe another trick or two).

Awesome.

Conclusion

One step closer to building a memory profiler without requiring end users to find and use patches floating around the internet.

It is unclear whether cheap tricks like this are useful or harmful, but they are fun to write.

to write. If you understand how your system works at an intimate level, nearly anything is possible. The work required to make it happen might be difficult though.

Thanks for reading and don't forget to subscribe (via RSS or e-mail) and follow me on twitter.

References