Reverse-Engineering Database - An IDA-Pro Plug-in

By Yaron Kaner, Barak Sternberg and Zion Nahici

Intro - Incentive

Reverse engineering (RE) is the process of discovering the technological principles of a device, object, or system through analysis of its structure, function, and operation. Software RE involves taking apart a program's machine code (the string of 0's and 1's that is sent to the processor) and analyzing its workings in detail to study how the program performs certain operations. This is done in order to improve the performance of a program, to fix a bug, to identify malicious content, or to adapt a program written for one microprocessor for use with another.

A person practicing software RE may require several tools in order to disassemble a program. One tool is a hexadecimal dumper, which displays the binary code constituting a program in hexadecimal format (making it easier to read). Another common tool is the disassembler. A disassembler is a computer program that translates machine language into assembly language. Disassembly, the output of the disassembler, is often formatted for human-readability rather than suitability for input to an assembler.

The Interactive Disassembler, more commonly known as IDA, is a disassembler for computer software which generates assembly language source code from machine-executable code. It supports a variety of executable formats for different processors and operating systems. It also can be used as a debugger for certain executable file types. A decompiler plug-in for programs compiled with a C/C++ compiler is also available.

RE is a tedious process that takes a lot of time and dedication, as well as close familiarity with assembly language, compilers, and operating systems. Over the past few years, many tools have been built to speed up the reverse engineering process and to lower the bar on the knowledge required to begin practicing it.

One of the problems that remain is identifying functions in the disassembled code according to their functionality; this is still done manually today. Since code is often recycled, reused, and rewritten, a lot of the RE effort is redundant. Many people have reverse engineered the same pieces of code over the past years without realizing the potential of sharing their work with others. This project aims to solve this problem by allowing users to share their RE findings.

About the Plug-in

Our solution consists of two main parts:

1. A client-side Python plug-in for IDA.

2. A complementary Django server (hosting a database).

The two combined allow sharing of findings between those who practice RE.

When installed, the plug-in offers two main functionalities:

1. Submitting the user's description (function's name and comments).

2. Requesting public descriptions (submitted by other reverse engineers).

The client was written in Python and uses IDAPython, "an IDA Pro plugin that integrates the Python programming language, allowing scripts to run in IDA Pro".

The server uses the Django Web framework and is also written in Python.

A difficulty we tried to tackle is that the same source code compiles into different byte code under different circumstances. The resulting byte code depends on many factors, such as the compiler, its configuration, and the context in which the source code appears. The smallest change in the source code, the compiler, or its configuration results in different byte code, and therefore in a completely different checksum. A simple checksum of a function's byte code therefore does not suffice for our needs.
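The checksum problem described above can be illustrated with a minimal sketch. The two byte sequences below are hypothetical stand-ins for two builds of the same function that differ in a single byte; they are not taken from the project.

```python
import hashlib

# Hypothetical machine code for the "same" function from two builds.
# The only difference is one byte (a stack frame size of 0x10 vs 0x18).
build_a = bytes.fromhex("5589e583ec10c745fc00000000c9c3")
build_b = bytes.fromhex("5589e583ec18c745fc00000000c9c3")

md5_a = hashlib.md5(build_a).hexdigest()
md5_b = hashlib.md5(build_b).hexdigest()

# Count positions where the two builds disagree.
differing_bytes = sum(1 for x, y in zip(build_a, build_b) if x != y)
checksums_match = md5_a == md5_b

print("differing bytes:", differing_bytes)
print("checksums match:", checksums_match)
```

A single differing byte is enough to make the two MD5 digests unrelated, which is why an exact checksum can only confirm identical functions and says nothing about similar ones.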

Before executing both the Submit and Request functionalities, the plug-in will collect several attributes which characterize the function the user is working on. When submitting, these attributes are sent alongside the user's description, and when requesting descriptions, these attributes are sent to the server so it can look for similar functions.

Upon receiving a request the server compares the user's function with the functions stored in its Database. For each function in the DB it calculates a Matching Grade. This grade is simply a weighted average of grades, given to each pair of attribute instances (one attribute extracted from the user's function and one from the function taken from the DB). Descriptions for functions with the highest similarity grade will be returned to the user.

A pair of attribute instances is compared using what we have defined, in this project's context, as heuristics.

Heuristics and Attributes

Heuristics

1. List Similarity – Given two lists of objects, gives a grade to the two lists' similarity. Uses Python's SequenceMatcher.

2. Dictionary Similarity – Given two dictionaries of (object, number of occurrences) pairs, gives a grade to the two dictionaries' similarity. Uses the following algorithm:

Given two dictionaries A and B such that

\[ A = \{(a_1, x_1), (a_2, x_2), \ldots, (a_n, x_n)\}, \qquad B = \{(b_1, y_1), (b_2, y_2), \ldots, (b_k, y_k)\} \]

where a_1, …, a_n and b_1, …, b_k are keys and x_1, …, x_n and y_1, …, y_k are values.

Define:

\[ \{c_1, \ldots, c_m\} := \{a_1, \ldots, a_n\} \cup \{b_1, \ldots, b_k\}, \qquad m := \left| \{a_1, \ldots, a_n\} \cup \{b_1, \ldots, b_k\} \right| \]

Compute, for each i = 1, …, m:

\[ d_i := \frac{\min(A[c_i], B[c_i])}{\max(A[c_i], B[c_i])}, \qquad f_i := \min(A[c_i], B[c_i]) + \max(A[c_i], B[c_i]) \]

\[ D := d_1 f_1 + d_2 f_2 + \ldots + d_m f_m, \qquad F := f_1 + f_2 + \ldots + f_m \]

Return D / F.

3. Graph Similarity – Given two graphs, gives a grade to the two graphs' similarity. See the Graph Similarity document.

4. Integer Equality – Given two integers, simply checks if they are equal.

5. String Equality – Given two strings, simply checks if they are equal.
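The heuristics listed above (except Graph Similarity, which is described in its own document) might be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, SequenceMatcher is called with its default arguments, a key missing from one dictionary is assumed to count as 0, and two empty dictionaries are assumed to be identical.

```python
from difflib import SequenceMatcher


def list_similarity(first, second):
    """Grade in [0, 1] for two ordered lists of hashable items."""
    return SequenceMatcher(None, first, second).ratio()


def dict_similarity(first, second):
    """Grade in [0, 1] for two {item: count} dictionaries (D / F in the text)."""
    weighted_sum = 0.0  # D
    total_weight = 0.0  # F
    for key in set(first) | set(second):
        a = first.get(key, 0)   # assumption: a missing key counts as 0
        b = second.get(key, 0)
        low, high = min(a, b), max(a, b)
        if high == 0:
            continue            # skip keys with no occurrences on either side
        weighted_sum += (low / high) * (low + high)  # d_i * f_i
        total_weight += low + high                   # f_i
    # assumption: two empty dictionaries are treated as identical
    return weighted_sum / total_weight if total_weight else 1.0


def equality(first, second):
    """Integer Equality and String Equality: 1.0 if equal, else 0.0."""
    return 1.0 if first == second else 0.0


list_grade = list_similarity(["push", "mov", "call", "ret"], ["push", "mov", "ret"])
dict_grade = dict_similarity({"mov": 4, "call": 2}, {"mov": 2, "call": 2, "ret": 1})
print(round(list_grade, 3), round(dict_grade, 3), equality(7, 7))
```

In the dictionary example, "mov" contributes d = 0.5 and f = 6, "call" contributes d = 1 and f = 4, and "ret" contributes d = 0 and f = 1, so D = 7 and F = 11. Keys with larger counts weigh more in the final grade, so a mismatch on a rare item costs less than a mismatch on a frequent one.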

Attributes

General Attributes

Executable's Name - the executable's name.
Executable's MD5 - the executable's MD5 signature.
First Address - the function's first instruction's address in the executable.
Number of Instructions - the number of assembly instructions in the function.
Function's MD5 - the function's MD5 signature.

Instruction-related Attributes

Instruction Data List - an ordered list of integers representing the instructions themselves.
Instruction Type List - an ordered list of instruction types.
Instruction Type Dictionary - a dictionary of (instruction type, count) pairs.

String-related Attributes

String List - an ordered list of the strings which appear in the function.
String Dictionary - a dictionary of (string, count) pairs.

Library Calls-related Attributes

Library Calls List - an ordered list containing the library function names for library calls which occur in a function.
Library Calls Dictionary - a dictionary of (library function name, count) pairs.

Immediates-related Attributes

Immediates List - an ordered list of immediate values.
Immediates Dictionary - a dictionary of (immediate value, count) pairs.

Control-Flow-related Attributes

Graph Representation - a representation of the function's control-flow.
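Each dictionary attribute above holds (item, count) pairs for the same kind of item that the matching list attribute holds in order. A minimal sketch of that relationship follows, assuming the dictionary can be derived by counting the list's items; the source does not say how the plug-in builds either one, and the sample values and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical ordered attribute lists for a single function.
instruction_types = ["push", "mov", "sub", "mov", "call", "mov", "ret"]
library_calls = ["malloc", "memcpy", "free"]

# Assumption: each dictionary attribute is the item count of its list.
instruction_type_dict = dict(Counter(instruction_types))
library_calls_dict = dict(Counter(library_calls))

print(instruction_type_dict)
print(library_calls_dict)
```

The list keeps ordering information and the dictionary discards it, so the two attributes respond differently when a compiler reorders instructions without changing which instructions are used.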

Lists are compared using the List Similarity heuristic, dictionaries using the Dictionary Similarity heuristic, graph representations using the Graph Similarity heuristic, integers using the Integer Equality heuristic and strings using the String Equality heuristic.

Matching Grade

The server calculates the similarity grade for a pair of functions using several attributes. For each attribute, two of its instances are given as input to a heuristic, which in return outputs a grade. The final similarity grade is a weighted-average of the grades returned by the heuristics.
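The weighted average can be sketched as below. The attribute names, grades, and weights are hypothetical placeholders chosen for illustration; the source does not list the weights the server actually uses.

```python
def matching_grade(grades, weights):
    """Weighted average of per-attribute grades.

    Both arguments map an attribute name to a number; every attribute
    in `grades` is expected to have an entry in `weights`.
    """
    total_weight = sum(weights[name] for name in grades)
    if total_weight == 0:
        return 0.0  # assumption: no weighted attributes means no match
    weighted = sum(grades[name] * weights[name] for name in grades)
    return weighted / total_weight


# Hypothetical grades returned by the heuristics for one pair of functions.
grades = {"instruction_type_list": 0.8, "string_dict": 1.0, "graph": 0.5}
# Hypothetical weights; not the project's real values.
weights = {"instruction_type_list": 2.0, "string_dict": 1.0, "graph": 1.0}

grade = matching_grade(grades, weights)
print(round(grade, 3))
```

Here the result is (0.8 * 2 + 1.0 * 1 + 0.5 * 1) / 4 = 0.775. Because each heuristic returns a grade between 0 and 1 and the weights are non-negative, the final grade also stays between 0 and 1.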

One of the questions that arose during development was how important each attribute is when comparing functions, that is, what weight each attribute should be given. For further information see the Testing the Project document.

Testing the project

After the initial development we wanted to quantify the project's success. That is, we wanted to check how often the server returns relevant descriptions when asked, and how many relevant descriptions it fails to return even though they exist in its DB. For further information see the Testing the Project document.

Contact us at redb.project.tau.secws12@gmail.com