In recent years, there has been a considerable amount of research activity to develop analysis tools to find bugs and security vulnerabilities. However, most of the effort has been on analysis of source code, and the issue of analyzing executables has largely been ignored. In the security context, this is particularly unfortunate, because performing analysis on the source code can fail to detect certain vulnerabilities due to the WYSINWYX phenomenon: ``What You See Is Not What You eXecute''. That is, there can be a mismatch between what a programmer intends and what is actually executed on the processor.

Even though the advantages of analyzing executables are appreciated and well-understood, there is a dearth of tools that work on executables directly. The overall goal of our work is to develop algorithms for analyzing executables, and to explore their applications in the context of program understanding and automated bug hunting. Unlike existing tools, we want to provide useful information about memory accesses, even in the absence of debugging information. Specifically, the dissertation focuses on the following aspects of the problem:

Developing algorithms to extract intermediate representations (IR) from executables that are similar to the IR that would be obtained if we had started from source code. The recovered IR should be similar to that built by a compiler, consisting of the following elements: (1) control-flow graphs (with indirect jumps resolved), (2) a call graph (with indirect calls resolved), (3) the set of variables, (4) values of pointers, (5) sets of used, killed, and possibly-killed variables for control-flow graph nodes, (6) data dependences, and (7) types of variables: base types, pointer types, structs, and classes.

Using the recovered IR to develop tools for program understanding and for finding bugs and security vulnerabilities.

Because executables do not have a notion of variables similar to the variables in programs for which source code is available, one of the important aspects of IR recovery is to determine a collection of variable-like entities for the executable. The quality of the recovered variables affects the precision of an analysis that gathers information about memory accesses in an executable, and therefore, it is desirable to recover a set of variables that closely approximate the variables of the original source-code program. On average, our technique is successful in identifying correctly over 88% of the local variables and over 89% of the fields of heap-allocated objects. In contrast, previous techniques, such as the one used in the IDAPro disassembler, recovered 83% of the local variables, but 0% of the fields of heap-allocated objects.

Recovering useful information about heap-allocated storage is another challenging aspect of IR recovery. We propose an abstraction of heap-allocated storage called recency-abstraction, which is somewhere in the middle between the extremes of one summary node per malloc site and complex shape abstractions. We used the recency-abstraction to resolve virtual-function calls in executables obtained by compiling C++ programs. The recency-abstraction enabled our tool to discover the address of the virtual-function table to which the virtual-function field of a C++ object is initialized in a substantial number of cases. Using this information, we were able to resolve, on average, 60% of the virtual-function call sites in executables that were obtained by compiling C++ programs.

To assess the usefulness of the recovered IR in the context of bug hunting, we used CodeSurfer/x86 to analyze device-driver executables without the benefit of either source code or symbol-table/debugging information. We were able to find known bugs (that had been discovered by source-code analysis tools), along with useful error traces, while having a low false-positive rate.

(Click here to access the dissertation: PDF.)