Object-oriented programs present considerable challenges to reverse engineers. For example, C++ classes are high-level structures that lead to complex arrangements of assembly instructions when compiled. These complexities are exacerbated for malware analysts because malware rarely has source code available; thus, analysts must grapple with sophisticated data structures exclusively at the machine code level. As more and more object-oriented malware is written in C++, analysts are increasingly faced with the challenges of reverse engineering C++ data structures. This blog post is the first in a series that discusses tools developed by the Software Engineering Institute's CERT Division to support reverse engineering and malware analysis tasks on object-oriented C++ programs.

Identifying C++ classes in assembly code requires understanding low-level implementation details, such as specialized calling conventions, that are not directly related to program functionality. To help analyze object-oriented software (including malware), we have been developing a suite of binary static program analysis tools. Our framework, Pharos, is built on top of Lawrence Livermore National Laboratory's (LLNL) ROSE compiler infrastructure. The Pharos tool suite includes many extensions to the binary analysis features of ROSE that we've jointly developed with LLNL. The Pharos tools use static analysis techniques, such as control flow analysis and dataflow analysis, to reason about the behavior of, and data structures in, binary files.

One of our Pharos tools, ObjDigger, supports analysis and recovery of object-oriented data structures from 32-bit Microsoft Windows binary files. Specifically, ObjDigger can help recover C++ classes/structures including class members, methods, and virtual functions. ObjDigger can also help reason about class relationships, such as composition and inheritance. This posting describes the ObjDigger tool, how it works, and how to use it to help reverse engineer object-oriented C++ code.

For example, consider the simple C++ program below in Code Listing 1 that instantiates an object and then calls a virtual function. This relatively straightforward code results in many implicit operations when compiled.



Code Listing 1: Sample C++ Program
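As a rough illustration of the kind of program described here, the following is a minimal sketch, assuming a MathOp base class with a virtual Execute method and an AddOp subclass with two members, x and y. The class bodies, the operand values, the virtual destructor, and the helper name runSample are all assumptions made for illustration; this is not the original listing.

```cpp
// Hypothetical sketch: class bodies and operand values are assumptions.
class MathOp {
public:
    MathOp() {}
    virtual ~MathOp() {}
    virtual int Execute() = 0;   // implemented by each concrete operation
};

class AddOp : public MathOp {
public:
    AddOp(int a, int b) : x(a), y(b) {}
    int Execute() override { return x + y; }
private:
    int x;
    int y;
};

// Stands in for the body of main(): instantiate an AddOp through a MathOp
// pointer, then call the virtual function.
int runSample() {
    MathOp* add = new AddOp(2, 3);
    int result = add->Execute();
    delete add;
    return result;
}
```

Even in a sketch this small, the compiler must emit an allocation, two constructor calls, a virtual function table, and an indirect call, none of which appear explicitly in the source.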

The program in Code Listing 1 was compiled using a 32-bit version of Microsoft Visual C++ 2010 with no command-line options except for the /FAsc option to generate x86 assembly code. The disassembly of the main function is shown below in Code Listing 2. The listing begins by instantiating an AddOp object and assigning it to a MathOp pointer, although the types are not immediately clear from the listing. Instantiating the AddOp object requires multiple steps that are not visible in source code. First, object memory is allocated from addresses 0x00401006 through 0x00401017. Second, the AddOp object is initialized from address 0x00401019 through the call to the constructor at address 0x00401020.



Code Listing 2: Section of main function that includes object instantiation



Note that the allocated memory for the AddOp object is passed to the AddOp constructor via the ECX register. This code is consistent with the __thiscall calling convention that is used to implement class mechanics, such as method invocation and member access in assembly code. Passing function arguments through registers, however, can be hard to detect. The body of the AddOp constructor, which is shown below in Code Listing 3, demonstrates additional complexities.
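To make the hidden argument concrete, here is a minimal sketch that pairs a member function with a free function taking the object address explicitly. The names AddOpLike, sum, sumExplicit, and sameResult are hypothetical. The code is portable C++ and does not itself force any calling convention; it only shows the extra parameter that __thiscall would carry in ECX.

```cpp
struct AddOpLike {
    int x;
    int y;
    // Member function: the compiler passes the object address implicitly.
    int sum() const { return x + y; }
};

// Free-function equivalent with the hidden argument made explicit.
// Under __thiscall on 32-bit MSVC, `self` would travel in ECX rather
// than on the stack.
int sumExplicit(const AddOpLike* self) { return self->x + self->y; }

// Both forms compute the same value from the same object.
bool sameResult() {
    AddOpLike obj{2, 3};
    return obj.sum() == sumExplicit(&obj);
}
```

Because the object address never appears in the source-level argument list, an analyst reading the disassembly has to notice the register use to recognize a method call.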



Code Listing 3: AddOp Constructor Assembly Listing



At the source-code level, the AddOp constructor simply initializes two class members (x and y). At the assembly-code level, however, more substantial setup is necessary to account for implicit operations needed by C++ objects. First, because AddOp extends MathOp, a MathOp object is automatically constructed at address 0x0040105A. Moreover, because both MathOp and AddOp contain virtual functions, a virtual function pointer is installed at address 0x00401062. Virtual function pointers are the mechanism that Microsoft Visual C++ uses to invoke the correct function in a polymorphic class arrangement. The virtual function pointer points to a table of pointers to actual virtual function implementations. Consider Code Listing 4 below, which shows the invocation of the virtual function Execute.

Code Listing 4: Virtual Function Call Invocation



Invoking the Execute function requires dereferencing the virtual function pointer installed during construction and then fetching the appropriate virtual function implementation. To reason about program control flow correctly, the virtual function invocations must be resolved. Determining the target of virtual functions is hard and often forces reverse engineers to completely recover and reason about a program's class structures. In large C++ programs that contain many interrelated classes, this analysis can be a tedious and time-consuming task.
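The implicit constructor steps and the two-step dispatch described above can be modeled by hand. The sketch below is an approximation, assuming a single-inheritance layout with the virtual function pointer at offset 0; all names (Obj, kMathOpVtable, constructAddOp, and so on) are hypothetical, and real compiler output differs in detail.

```cpp
struct Obj;                               // forward declaration
using VirtualFn = int (*)(Obj*);          // every slot takes an explicit `this`

struct Obj {
    const VirtualFn* vptr;                // virtual function pointer at offset 0
    int x;
    int y;
};

int mathOpExecute(Obj*) { return 0; }     // placeholder parent implementation
int addOpExecute(Obj* self) { return self->x + self->y; }

const VirtualFn kMathOpVtable[] = { mathOpExecute };
const VirtualFn kAddOpVtable[]  = { addOpExecute };

void constructMathOp(Obj* self) { self->vptr = kMathOpVtable; }

void constructAddOp(Obj* self, int a, int b) {
    constructMathOp(self);        // 1. parent constructor runs first
    self->vptr = kAddOpVtable;    // 2. derived class overwrites the vptr
    self->x = a;                  // 3. user-written member initialization
    self->y = b;
}

int callExecute(Obj* self) {
    const VirtualFn* table = self->vptr;  // first dereference: fetch the table
    VirtualFn target = table[0];          // second dereference: fetch slot 0
    return target(self);                  // indirect call with explicit `this`
}

int demoManualDispatch() {
    Obj o{};
    constructAddOp(&o, 2, 3);
    return callExecute(&o);
}
```

The call site in callExecute never names addOpExecute, which is why the target cannot be read directly off the call instruction and must be recovered from the table installed during construction.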

ObjDigger automates recovery of C++ data structures. It identifies potential class structures and methods and resolves virtual function calls where it can. Compiling the program in Code Listing 1 and running the executable through ObjDigger with the --report option produces the output shown in Table 1.

Table 1: ObjDigger Output



Note that the names used in the source code are removed during compilation, but the tool is able to identify class data structures. For example, ObjDigger identifies AddOp as a data structure with a constructor, a virtual function pointer and virtual function table, and two members.

ObjDigger is able to reason about the relationship between AddOp and MathOp by identifying that the MathOp constructor (address 0x004010B0) is called from within the AddOp constructor (address 0x00401050). Calling the MathOp constructor from within the AddOp constructor indicates a relationship between MathOp and AddOp. ObjDigger identifies the relationship as inheritance (hence the parent class designation). In this case, ObjDigger correctly identifies MathOp as the parent of AddOp because the MathOp constructor is called before the AddOp virtual function pointer is installed at address 0x00401062. This heuristic is one way to distinguish class inheritance from class composition.
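The ordering heuristic can be sketched on a toy event list. This is a simplified illustration, not ObjDigger's implementation: it assumes the constructor body has already been reduced to an execution-ordered sequence of constructor calls and virtual function pointer installs, and it assumes that a constructor call after the install indicates composition. The types CtorEvent and Finding and the function names are hypothetical; the addresses in the demo come from the text above.

```cpp
#include <cstdint>
#include <vector>

enum class EventKind { CallConstructor, InstallVptr };

struct CtorEvent {
    EventKind kind;
    std::uint32_t address;  // where the instruction lives
    std::uint32_t target;   // called constructor; unused for InstallVptr
};

enum class Relationship { Inheritance, Composition };

struct Finding {
    std::uint32_t constructor;
    Relationship relationship;
};

// Classify each constructor call by whether it occurs before or after
// the enclosing constructor installs its own virtual function pointer.
std::vector<Finding> classifyRelationships(const std::vector<CtorEvent>& events) {
    std::vector<Finding> findings;
    bool vptrInstalled = false;
    for (const CtorEvent& e : events) {
        if (e.kind == EventKind::InstallVptr) {
            vptrInstalled = true;
            continue;
        }
        findings.push_back(Finding{
            e.target,
            vptrInstalled ? Relationship::Composition : Relationship::Inheritance});
    }
    return findings;
}

// Events from the AddOp constructor as described in the text.
std::vector<Finding> demoAddOpConstructor() {
    return classifyRelationships({
        {EventKind::CallConstructor, 0x0040105A, 0x004010B0},
        {EventKind::InstallVptr,     0x00401062, 0},
    });
}
```

For the AddOp constructor, the single call to 0x004010B0 precedes the install at 0x00401062, so the sketch reports inheritance, matching the result described above.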

Using ObjDigger Output

ObjDigger includes options to generate JavaScript Object Notation (JSON) specifications for recovered object schemas that are suitable for automated processing with reverse engineering tools. To better support reverse engineers, we've included an IDA Pro plugin named PyObjdigger that applies ObjDigger results to an IDA database. Table 2 shows IDA Pro screenshots of the constructor for AddOp before and after adding ObjDigger-generated class information. Again, class names are typically not preserved during compilation, so PyObjdigger uses generic names, such as Cls0, Cls1, etc. In the future, we plan to parse run-time type information (RTTI) where possible to leverage more realistic class names.





Table 2: AddOp Constructor Disassembly before and after Running the PyObjdigger Plugin

One of the more useful PyObjdigger features is its ability to annotate virtual function calls with clickable labels. For example, Table 3 shows an IDA Pro screenshot containing disassembly for the virtual function call add->Execute(). In the disassembly listing, the clickable comment (4010e0) inserted by PyObjdigger identifies the target of this virtual function call.

Table 3: Annotated Virtual Function Call

ObjDigger Under the Hood

ObjDigger uses definition-use analysis to identify object pointers, known as this pointers. The analysis process works as follows:

First, ObjDigger uses ROSE to gather a list of functions in the executable file. ObjDigger analyzes each function to determine whether it is a class method based on whether it follows the __thiscall calling convention. In __thiscall functions, the this pointer for an object is passed as an argument in the ECX register. ObjDigger detects the this pointer passed into the function by identifying reads of the ECX register without prior initialization. Once the set of __thiscall functions is identified, ObjDigger further analyzes this-pointer usage in the body of each function to identify possible class members and methods.
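The "read of ECX without initialization" test can be sketched as follows. This is a deliberately simplified illustration that assumes a straight-line instruction sequence, where each instruction is reduced to the sets of registers it reads and writes; a real analysis would need to handle control flow, and the Insn type and function names here are hypothetical rather than part of ROSE or Pharos.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy instruction model: only register reads and writes are recorded.
struct Insn {
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// Returns true when ecx is read before any instruction in the body writes it,
// which suggests the caller supplied a value in ecx (a candidate this pointer).
bool readsEcxBeforeWrite(const std::vector<Insn>& body) {
    for (const Insn& insn : body) {
        // An instruction reads its operands before writing its result,
        // so the read check comes first.
        if (insn.reads.count("ecx")) return true;
        if (insn.writes.count("ecx")) return false;
    }
    return false;
}

// mov [ebp-4], ecx : saves the incoming ecx, as a method prologue might.
bool demoMethod() {
    std::vector<Insn> body{ Insn{{"ecx", "ebp"}, {}} };
    return readsEcxBeforeWrite(body);
}

// mov ecx, [ebp+8] ; mov eax, [ecx] : ecx is defined locally before use.
bool demoPlainFunction() {
    std::vector<Insn> body{
        Insn{{"ebp"}, {"ecx"}},
        Insn{{"ecx"}, {"eax"}},
    };
    return readsEcxBeforeWrite(body);
}
```

The second demo shows why the order matters: a function that uses ECX only as a scratch register writes it before reading it, so it is not flagged as a __thiscall candidate.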

The Pharos binary analysis infrastructure provides information on which program instructions influence (i.e., read and write) computations on subsequent instructions. This abstract interpretation of instructions makes it possible to track values through an assembly listing. Reasoning about abstract values as they are accessed through a program enables identification of object-oriented constructs. For example, a call to a virtual function requires two pointer dereferences:

one to access the virtual function table

one to access the appropriate virtual function

In Code Listing 4, dereferencing the virtual function table pointer and fetching the correct virtual function correspond to the pointer accesses at addresses 0x0040103A and 0x0040103F, respectively. For each indirect call found in the binary (i.e., a call on a register or memory address), ObjDigger searches for two previous dereferences connected through common pointers (i.e., a pointer that refers to another pointer that refers to a known class virtual function). That is, if the call instruction is preceded by two pointer dereferences and these pointers trace back to a known class structure with a virtual function table that contains a valid class method, then this arrangement is labeled as a virtual function call and bound to the class structure. The target of the call is determined by examining the virtual function table for the known class structure.
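The two-dereference pattern can be sketched with a small symbolic tracker. This is an illustrative approximation under several assumptions: instructions are reduced to register loads and register calls, table slots are 4 bytes wide (32-bit code), and the class and its table are already known. The types Sym, Insn, and ClassInfo, the register assignments in the demo, and all function names are hypothetical and do not reflect the Pharos API.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Sym;
using SymPtr = std::shared_ptr<Sym>;

// Symbolic value: a known object pointer, or a load through another value.
struct Sym {
    enum Kind { ObjectPtr, Load } kind;
    int classId;          // meaningful when kind == ObjectPtr
    SymPtr base;          // meaningful when kind == Load
    std::uint32_t disp;   // displacement applied to base before the load
};

SymPtr objectPtr(int classId) {
    return std::make_shared<Sym>(Sym{Sym::ObjectPtr, classId, nullptr, 0});
}
SymPtr load(SymPtr base, std::uint32_t disp) {
    return std::make_shared<Sym>(Sym{Sym::Load, -1, std::move(base), disp});
}

struct Insn {
    enum Op { LoadReg, CallReg } op;
    std::string dst;      // destination register (LoadReg only)
    std::string src;      // base register (LoadReg) or call target (CallReg)
    std::uint32_t disp;
};

struct ClassInfo {
    std::uint32_t vptrOffset;
    std::vector<std::uint32_t> vtable;   // addresses of virtual functions
};

// Returns the target of the first indirect call that traces back, through
// two loads, to a known object pointer and a valid slot in its table.
std::optional<std::uint32_t> resolveVirtualCall(
        const std::vector<Insn>& insns,
        std::map<std::string, SymPtr> regs,
        const std::map<int, ClassInfo>& classes) {
    for (const Insn& insn : insns) {
        auto src = regs.find(insn.src);
        if (insn.op == Insn::LoadReg) {
            if (src != regs.end()) {
                SymPtr loaded = load(src->second, insn.disp);
                regs[insn.dst] = loaded;
            } else {
                regs.erase(insn.dst);    // value is no longer tracked
            }
            continue;
        }
        if (src == regs.end()) continue;
        const SymPtr slotLoad = src->second;
        if (slotLoad->kind != Sym::Load) continue;
        const SymPtr tableLoad = slotLoad->base;
        if (!tableLoad || tableLoad->kind != Sym::Load) continue;
        const SymPtr obj = tableLoad->base;
        if (!obj || obj->kind != Sym::ObjectPtr) continue;
        auto cls = classes.find(obj->classId);
        if (cls == classes.end() || tableLoad->disp != cls->second.vptrOffset) continue;
        std::size_t slot = slotLoad->disp / 4;
        if (slotLoad->disp % 4 != 0 || slot >= cls->second.vtable.size()) continue;
        return cls->second.vtable[slot];
    }
    return std::nullopt;
}

// mov eax, [ecx] ; mov edx, [eax] ; call edx   (register choice is assumed)
std::optional<std::uint32_t> demoResolve() {
    std::map<int, ClassInfo> classes{{0, ClassInfo{0, {0x004010E0}}}};
    std::map<std::string, SymPtr> regs{{"ecx", objectPtr(0)}};
    std::vector<Insn> insns{
        {Insn::LoadReg, "eax", "ecx", 0},
        {Insn::LoadReg, "edx", "eax", 0},
        {Insn::CallReg, "", "edx", 0},
    };
    return resolveVirtualCall(insns, regs, classes);
}
```

In the demo, the call through edx traces back to slot 0 of the table for class 0, so the sketch resolves it to 0x004010E0, the same target that the PyObjdigger annotation shows for add->Execute().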

ObjDigger uses similar dataflow analysis to identify class members, methods, and class relationships. A more thorough, if slightly dated, discussion of ObjDigger's data structure recovery algorithms is available in our paper titled Recovering C++ Objects From Binaries Using Inter-Procedural Data-Flow Analysis, which was published at the ACM SIGPLAN Program Protection and Reverse Engineering Workshop in 2014.

This post shows how binary static analysis tools, such as ObjDigger, can help reverse engineers and malware analysts by automatically reasoning about C++ data structures at the binary level. Processing an object-oriented executable file with ObjDigger helps the analyst quickly identify and understand the data structures in an executable file. Automatically recovering information about C++ data structures enables analysts to focus on reasoning about program functionality and spend less time analyzing low-level constructs that are inserted during compilation and have little bearing on program behavior. Automatic analysis of executables has applications beyond object-oriented code. In subsequent posts in this series, we will discuss some of the other tools in the Pharos suite that automatically identify and reason about program behaviors.

We welcome your feedback on our work in the comments section below.

Additional Resources

We recently released ObjDigger publicly, and those who are interested in evaluating ObjDigger can download it from the Pharos Static Analysis Tools site.

We have also created a GitHub repository for Pharos, and plan to release selected components of our framework for inclusion back into the ROSE infrastructure.