A few years back I found myself reverse-engineering a vendor lock-in handshake in a proprietary application. It was clear to me: I needed a code tracer.

In my mind the ultimate code tracer should be able to trace code with minimal slowdown, it should provide high granularity and not require me to trace the entire process, and it should be able to cope with anti-debug trickery. The conventional approaches relying on software and hardware breakpoints/traps have some issues. Software breakpoints modify the original code in memory and may be detected by applications that periodically checksum their own code. Hardware breakpoints may be detected or sabotaged quite easily. More importantly though, each trap will require a context switch to the kernel, then to the debugger process which takes an action, back to the kernel, then back to the target process again. That’s a lot of round-trips, and even for branch-level tracing it’s just way too much overhead. Granularity-wise it should be really easy to start tracing when the application calls an API of interest, for example recv(), and stop tracing on the next call. Now, how do we achieve all of this? The answer is dynamic recompilation. Enter the Stalker.

The idea is quite simple. As a thread is about to execute its next instructions, you make a copy of those instructions and interlace the copy with the logging code that you need. You thus leave the original instructions untouched in order to keep anti-debug checksum logic happy, and execute the copy instead. For instance, before each instruction you would put a CALL to the log handler and give it details about the instruction that’s about to get executed.

Here’s a simplified illustration:

The beauty of dynamic recompilation in this context is performance. Want to trace every single instruction? Add logging code between every single instruction that you copy. Only interested in CALL instructions? Add logging code next to those instructions only. This means you add runtime overhead based on what you actually need.

There are some challenges though. First off, instructions on x86 are variable length, so you need a really good disassembler with support for the latest instructions. Capstone to the rescue. Next, you can’t just copy instructions elsewhere in memory and expect them to work like before. Lots of instructions are position-dependent and will have to be adjusted for their new location in memory. Also, the code being traced might be hostile, so only copy instructions up until the first branch instruction (one basic block at a time). Then just replace that instruction with a control transfer back into the engine, telling it to generate code for the branch target, and resume execution there.

This obviously has a cost, as it means we have to save and restore lots of state for each such context-switch. What we can easily optimize though, is to re-use the code we already recompiled before. Also, when a block ends with a static branch, like jmp +56, we can backpatch that with a direct branch to the recompiled version of the target. However, we have to be really careful, as the application might be modifying its own code at runtime. Because of this we introduce a trust-threshold parameter that can be adjusted based on the application. The idea is that we keep checking whether a block changed every time we’re about to execute it, and once it’s been successfully recycled trust-threshold times we consider it trusted, and backpatch static branches with a branch straight to the recompiled target. Zero overhead.

There is still room for further optimization, though. As only static branches are currently optimized, those involving a register or memory location require a context-switch back into the engine. A future improvement here would be to implement inline caching, where there’s some padding NOPs in front of the branch into the engine, and those would get dynamically overwritten with something along the lines of:

CMP branch_target, 1234

JZ precompiled_block_a

CMP branch_target, 5678

JZ precompiled_block_b

<NOP-padding for future cache entries>

JMP enter_engine_for_dynamic_lookup

The block recycling and static branch optimization go a long way performance-wise though. I put together a synthetic benchmark and measured roughly 14000x slowdown before these optimizations, and with them it was improved to ~30x.