Before deploying binaries to third-party environments, it is very common to strip them of any information that is not required for them to function properly. This is done to make reverse engineering the binary more difficult. Among the information erased from the binary are the boundaries of each function. For someone who wants to reverse engineer a binary, this information can be extremely useful.

Function identification is a task in the reverse engineering field: given a compiled binary, one should determine the addresses of the boundaries of each function. The boundaries of a function are its start address and its end address.

Why Neural Networks?

There are no simple rules for recognizing the boundaries, especially when it comes to binaries which have been optimized during compilation.

A huge amount of data is available: it is very easy to find on the internet source code to compile, or binaries already compiled with debug information, from which to create our dataset.

Almost no domain knowledge is required! One of the big advantages of neural networks (especially deep ones) is that they are capable of processing raw data well and no feature extraction is required.

The idea of using neural networks for function identification is not new. It was first introduced in a paper called Recognizing Functions in Binaries with Neural Networks, written by Shin et al. The authors used a bidirectional RNN to learn the function boundaries. According to their paper, they not only achieved similar or better results than the former state of the art but also reduced the computation time from 587 hours to 80 hours. I think this research really demonstrates the power of neural networks.

So Why CNN? (CNN vs RNN)

A CNN (Convolutional Neural Network) is highly popular in tasks regarding computer vision. One of the reasons is that a CNN captures only local features.

Local features describe the input patches (key points in the input). For example, for an image, it can be any feature regarding a specific area in the image such as a point or an edge. Global features are features that describe the input as a whole.

RNN, on the other hand, is a “stronger” model in the sense that it can learn both local and global features.

But stronger is not always better. Using a model capable of learning both local and global features for a task that requires learning only local features might lead to overfitting and increase the training time.

For function identification, it is enough, for each byte in the binary, to look at the 10 bytes before it and the 10 bytes after it to determine whether it is the start or end of a function. This property suggests that a CNN should achieve good results on this task.

With that being said, there are global features that can help determine the boundaries of a function. For example, call opcodes can help us determine the start of a function. However, even an RNN will have a hard time learning those features, as RNNs do not perform well on long sequences (that is why, in the Shin et al. paper, they train their network on random 1,000-byte sequences from the binary and not on the whole binary).

In addition, unlike an RNN, which is a sequential model, a CNN can run in parallel, which means both training and testing the network should be faster.

We are done with the introduction. Let’s code!

Code

The code is implemented in Python 3.6 using the PyTorch library.

For simplicity, we are going to implement a model that identifies the beginning of each function, but the same code can be applied to identify the ending as well.

The full code is available here:

The Data

We are going to use the same dataset Shin et al. used in their paper.

The dataset was originally created for a paper called ByteWeight: Learning to Recognize Functions in Binary Code. Shin et al. used the same dataset in order to compare their results to those reported in the original paper.

The dataset is available at http://security.ece.cmu.edu/byteweight

The dataset consists of a set of binaries compiled with debug information.

We are going to use the elf_32 dataset, but the same code can be applied to the elf_64 dataset as well (and to the PE dataset, though that would require a different debug-info parsing procedure).

The dataset can be downloaded by running:

wget --recursive --no-parent --reject html,signature http://security.ece.cmu.edu/byteweight/elf_32

Preprocessing the Data

First, we need to extract from each binary its code section and its function addresses.

ELF files are composed of sections, each containing different information. The sections we are interested in are the .text section and the .symtab section.

.text contains the opcodes that are executed when running the binary.

.symtab contains information about the functions in the binary (and more).

Note that the information in the .symtab section can be stripped from the binary. This project is useful for those cases.

For parsing the sections of the binaries we are going to use the pyelftools library.

First, let’s extract the .text section data:

For each byte in the code of the binary, we need to extract whether it is the start of a function.

Now let’s iterate over the binaries in our dataset.

We will use the tqdm library to get a nice progress bar for our preprocessing with zero effort!

Great! We have our data and its tags.

To feed the data into the model, we should not just feed it file by file. Instead, we should choose the size of the blocks we would like to train the model on and split our data into blocks of that size.

Also, if we want the CNN to output a vector the same size as tags, we need to pad the input according to the CNN kernel size.

Let’s wrap it up under a torch.utils.data.Dataset class:
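A minimal sketch of such a Dataset class, assuming a block size of 1,000 bytes and a kernel spanning 21 bytes (10 bytes of context on each side, matching the window discussed earlier); the class name and default sizes are assumptions:

```python
import torch
from torch.utils.data import Dataset

START, END = 256, 257  # extra symbols marking the start and end of a file

class FunctionDataset(Dataset):
    """Splits each file into fixed-size blocks, padded so the convolution
    output length matches the tag length."""

    def __init__(self, files, file_tags, block_size=1000, kernel_size=21):
        pad = (kernel_size - 1) // 2          # context bytes on each side
        self.blocks, self.tags = [], []
        for code, tags in zip(files, file_tags):
            padded = [START] * pad + list(code) + [END] * pad
            for i in range(0, len(code), block_size):
                # Each block carries `pad` extra bytes on both sides, so a
                # valid convolution emits exactly one output per code byte.
                self.blocks.append(padded[i:i + block_size + 2 * pad])
                self.tags.append(tags[i:i + block_size])

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        return (torch.tensor(self.blocks[idx], dtype=torch.long),
                torch.tensor(self.tags[idx], dtype=torch.long))
```

Note that the last block of a file can be shorter than block_size, so when batching you would either drop it or use a custom collate function.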

Building the Model

The input of the model is going to be a vector where each value is between 0 and 257 (0-255 for the byte value, 256 is a symbol for the start of a file, and 257 is a symbol for the end of a file).

The output of the model is going to be a matrix where each row contains two values: the probability that the byte is the start of a function and the probability that it is not (the two values sum to 1).

Since every byte value represents a different symbol, we would like to convert each value to a vector. The way to do this is to use an embedding layer.

A guide for embedding:
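As a quick illustration, the embedding layer maps each of the 258 symbols to a dense vector (the embedding dimension of 64 here is an assumption):

```python
import torch
import torch.nn as nn

# 258 symbols: byte values 0-255, plus 256 (file start) and 257 (file end).
embedding = nn.Embedding(num_embeddings=258, embedding_dim=64)

block = torch.tensor([256, 0x55, 0x89, 0xE5, 257])  # a tiny padded example
vectors = embedding(block)
print(vectors.shape)  # torch.Size([5, 64])
```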

After the embedding layer, we are ready to add the convolution layer with a ReLU activation function.

Notice we want the convolution to work on whole bytes, so the kernel size should be: the number of bytes we want to look at × the size of each byte's vector (the output dimension of the embedding layer).
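In PyTorch, the same effect can be achieved with nn.Conv1d by treating the embedding dimension as the input channels, so each kernel position covers the whole vector of a byte (the filter count and kernel size below are assumptions):

```python
import torch
import torch.nn as nn

emb_dim, num_filters, kernel_size = 64, 32, 21   # assumed sizes
conv = nn.Conv1d(emb_dim, num_filters, kernel_size)

x = torch.randn(1, 120, emb_dim)   # (batch, padded length, embedding dim)
x = x.transpose(1, 2)              # Conv1d expects (batch, channels, length)
out = torch.relu(conv(x))
print(out.shape)                   # torch.Size([1, 32, 100])
```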

Now we add a fully connected output layer with a softmax activation function.

The whole architecture:
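Putting the layers together, the whole model might look like the following sketch; the class name and all layer sizes are assumptions rather than the original code's exact values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunctionBoundaryCNN(nn.Module):
    """Embedding -> Conv1d + ReLU -> per-byte linear layer -> softmax."""

    def __init__(self, emb_dim=64, num_filters=32, kernel_size=21):
        super().__init__()
        self.embedding = nn.Embedding(258, emb_dim)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, 2)

    def forward(self, x):                      # x: (batch, padded length)
        x = self.embedding(x).transpose(1, 2)  # (batch, emb, padded length)
        x = F.relu(self.conv(x))               # (batch, filters, code length)
        x = self.fc(x.transpose(1, 2))         # (batch, code length, 2)
        return F.softmax(x, dim=-1)            # per-byte probabilities

model = FunctionBoundaryCNN()
probs = model(torch.randint(0, 258, (1, 120)))
print(probs.shape)  # torch.Size([1, 100, 2])
```

If you train with nn.CrossEntropyLoss, return the raw scores and drop the final softmax, since that loss applies log-softmax internally.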