Code obfuscation by embedding a virtual machine with a secret instruction set is one of the strongest types of obfuscation. Recovering the architecture and instruction set of the virtual machine by reverse engineering the VM interpreter code requires a great deal of tedious work by an analyst. However, some information can be obtained by analyzing the binary code of the program written for the VM before the interpreter itself is analyzed. We propose a number of heuristics for this task. The heuristics are based on reasonable assumptions about the typical structure of compiler-generated binary code. The information obtained may facilitate further analysis of the virtual machine itself.

Some of these heuristics are implemented in the heuristic binary analyzer module of the SmartDec decompiler (http://decompilation.info), which is being developed by the authors. In this presentation we perform a partial reconstruction of the instruction set in the following steps:
· Initial markup of the binary program: identification of data sections and code sections. Prior information about the purpose of the program, or even hints from documentation, may be used. At this step the entry point (or several entry points) to the code is identified.
· Reconstruction of the subroutine structure by identifying subroutine borders. The subroutine return (RET) instruction is identified. It is natural to expect that the last instruction in the code segment is a return instruction. RET instructions normally separate subroutines, so we may expect that in many (or even most) cases a CALL instruction passes control to the address right after a RET instruction.
· Identification of the unconditional jump (JMP) instruction, using the assumption that code execution starts at some fixed address.
· Identification of the CALL instruction. Call instructions are similar to unconditional jumps. By investigating the initialization code, several candidates for the CALL instruction can be identified; a single candidate remains after validation against the whole code.
· Recovery of absolute and relative jumps and calls by looking at the bit encoding of an instruction and checking whether it could be an offset relative to the next instruction word. In this way relative jumps, calls and candidates for conditional jumps are identified.
· Identification of memory load and store instructions by observing load-store patterns for memory-to-memory copy operations.
· Observations on the virtual machine register structure and register width: how many registers the VM probably has and how wide they are.
· Observations on the arithmetic and logic operations by pairing them with the identified conditional jumps.
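
The RET and relative-offset heuristics can be illustrated with a minimal sketch. It is not the SmartDec implementation; it assumes a hypothetical VM with fixed-width 16-bit big-endian instruction words, a 4-bit opcode in the top bits and a signed operand, counted in words, in the remaining bits. All function names and the encoding are assumptions made for illustration only. The sketch takes the last word of the code segment as the RET candidate, collects the addresses that follow a RET, and counts for each opcode how often its operand, read as an offset relative to the next instruction word, lands on such an address.

```python
from collections import Counter


def split_words(blob, word_bytes=2, byteorder="big"):
    """Split a code segment into fixed-width instruction words (assumed layout)."""
    return [
        int.from_bytes(blob[i:i + word_bytes], byteorder)
        for i in range(0, len(blob) - word_bytes + 1, word_bytes)
    ]


def guess_ret(words):
    """Heuristic: the last instruction of the code segment is probably RET."""
    return words[-1]


def subroutine_starts(words, ret_word):
    """Word indices directly after a RET are likely subroutine entry points."""
    return {i + 1 for i, w in enumerate(words) if w == ret_word and i + 1 < len(words)}


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def score_relative_opcodes(words, word_bits=16, opcode_bits=4):
    """Count, per opcode, operands that point right after a RET when read
    as an offset relative to the next instruction word."""
    ret_word = guess_ret(words)
    starts = subroutine_starts(words, ret_word)
    operand_bits = word_bits - opcode_bits
    mask = (1 << operand_bits) - 1
    hits = Counter()
    for i, word in enumerate(words):
        if word == ret_word:
            continue  # a RET carries no branch target
        target = i + 1 + sign_extend(word & mask, operand_bits)
        if target in starts:
            hits[word >> operand_bits] += 1
    return hits


if __name__ == "__main__":
    # Toy program for the hypothetical encoding: opcode 0x2 plays the role of
    # a relative CALL, 0xF000 the role of RET, opcode 0x1 is filler.
    toy = [0x2003, 0x1001, 0x1005, 0xF000, 0x1002, 0x2FFE, 0x1000, 0xF000]
    for opcode, count in score_relative_opcodes(toy).most_common():
        print(f"opcode {opcode:#x}: {count} operand(s) point after a RET")
```

Opcodes with many hits are candidates for relative CALL or jump instructions. On real code the ranking is only a hint and has to be validated against the whole program, since data operands can land after a RET by coincidence.
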
Then we will show some examples of binary code that can be deobfuscated using the presented method and will discuss possibilities for automation. At the end of the talk, related features of the SmartDec decompiler, such as partial decompilation of a partially reconstructed assembly program, will be demonstrated.