As briefly mentioned here-above, our method aims at a vectorization of a graph via specific function, a kernel. The proposed kernel was originally designed for labeled graphs. A labeled graph is a graph which nodes are simple identifiers (strings or integers for instance), not necessarily unique. Since we choose to represent each function through its CFG, the graphs we would like to analyze are made of basic-blocks. Thus, we first need to give a label to each basic block before computing its kernel representation. Trivially, we could use a simple cryptographic hash function to give the same label to the same basic-blocks. Such a labelization scheme would not only lead to an exponential increase of different labels, but also distinguish very similar basic-blocks. For instance, consider two basic-blocks where the instructions have been equivalently reordered, or where a same instruction uses different registers. Using cryptographic hash function, those basic-blocks would be considered totally different whereas we might want them to be considered alike. In order to permit few changes in a basic block to be given the same label, we will use another labelization scheme, known as Locality-Sensitive Hashing (LSH) .

Following these insights, we represent a basic-block as a set of instructions independently from its execution context or the object it may refer to (addresses, strings, etc.). Using such representation, each instruction (mnemonic + potential operands) can be seen as an element of a finite (though large) alphabet (or dictionary). Thus, each basic-block can be considered as a bag-of-words (the number of appearance of each word in a text) each of which have been drawn from this alphabet (see Figure 2). Note that the alphabet itself might be arbitrarily defined such that, for instance, look-alike instructions refer to the same words, or known sequence of instructions patterns are merged together into a single word. We now have a proper representation of a basic-block that limits the influence of most alteration factors and permits a large scale comparison with others. Note that this representation can be viewed as a large vector recording the number of all possible words in the alphabet. Let us now introduce our proposed labelization framework.

Figure 2: Left: In its original form, the basic-block contains many alteration factors that might heavily bias the retrieval of its nearest neighbors. Center: To limit the impact of these factors, we first proceed to a simplification by gathering the mnemonic by types as well as unifying any register names, address references or hard-coded string or numbers. Right: Once simplified, the basic-block is represented as a bag-of-words in order to ignore the instructions order. Note that the proposed simplification does not consider optimization schemes or known patterns. For instance, the penultimate instruction ``xor esi esi`` is semantically equivalent to ``mov esi 0`` and could have been reformulated as such.

It is well known that the compilation of a same source-code may output different machine-codes that would though have the same overall behavior. Such resulting binaries are said to be semantically equivalent while syntactically different. These variations may be due to different targeted OS or architectures, different compiler versions, different compiling optimizations, etc. Usually, we consider four main possible alteration factors : addresses attribution, register usage, instructions reordering and optimization scheme (such as peephole optimization, constant folding, loop optimization, function inlining, etc.). Thus, any measure of similarity between two basic-blocks should be as much independent as possible from these alteration factors. In our opinion, a satisfying solution to limit the impact of the three first factors is to simply ignore them (drop any address information, consider registers indifferently, ignore the order of instructions). However, efficiently detect and interpret alterations due to the optimization schemes is a much harder problem. If some heuristics could be used to track down known patterns, designing an exhaustive and automated method was out of scope in our work. Thus, we will mostly focus on a syntactic-based measure of similarity rather than a semantic-based one. It is worth noticing that even if not properly corrected during the simplification process, some alterations due to optimization schemes can be overcame by design in other steps of the method (the choice of the similarity measure, of the relabelization scheme, etc.).

Let us first introduce our representation of a basic-block as well as our definition of similarity among them. Note that these definitions are purely arbitrary and that other considerations may be just as much legitimate as ours.

Random Projections

Recall that we seek at designing a labelization framework that gives the same label to similar inputs. A naive approach would consist in computing the distances from the input to all the \(n\) known basic-blocks of the database and to determine a collision based on a threshold (if similar enough). Such approach would require at least \(\mathcal{O}(n)\) operations and would thus quickly become prohibitive for a \(n\) large enough. On the other hand, LSH techniques use probabilistic hash functions that collide similar inputs with high probability. Any label can thus be computed in \(\mathcal{O}(1)\) operations.

Any LSH scheme is based on an arbitrary distance and can be seen as a dimension reduction that tends to preserve this distance. Similar inputs in the original space become equivalent items in the reduced one. Roughly speaking, it consists in a partition of the original vector space into "buckets", so that spatially close items (in the sense of the distance) are likely to end up in the same bucket, while dissimilar ones lie on different buckets. This framework trivially implies a surjection from the buckets to the labels. Unlike, conventional classification or clustering methods, this space partitioning is mostly computed irrespective of the input data and implicitly defines a great (or infinite) number of different buckets.

In practice, LSH techniques are widely used for nearest-neighbors search and near-duplicate detection problems. In this case, the hashing scheme is often completed by a query mechanism to limit the number of retrieved candidates (precision) while ensuring the presence of the nearest-neighbors among them (recall) . Note that in our case, we only need to assign a label to each basic-block, so we will mostly focus on the hashing step and leave the query step untold.

In many LSH schemes, the space partitioning is computed using random projections. The idea is to randomly generate some hyperplanes and to encode each input with its position with regards to these hyperplanes. Close (similar) items have then higher probability to end up in the same bucket (space partition) since they are more likely to have the same relative position to the hyperplanes. The definition of the generative distribution of the hyperplanes as well as the encoding scheme depend on the distance one is willing to approximate. The number of hyperplanes (as well as the query mechanism) depends on the desired sensibility of the hashing process (maximizing precision or recall).

Figure 3: The probability that a random hyperplane (line in 2-dimensionnal space) lies between two points is proportional to the angle they form with the origin.

Two particular metrics are usually used for document near duplicate detection, the cosine similarity and the Jaccard index. In our work, we choose to use the cosine similarity to evaluate the distances between two basic-blocks. To approximate this similarity with LSH, we use uniform random projection. The idea is to compute the angle two points are forming with the origin. It is easy to show that this angle is equal to half the probability that a randomly generated hyperplane lies between the two points (see Figure 3).

Figure 4: Left: The estimated cosine similarity using 256 random projections. Right: The relation between the estimated angle and the cosine similarity. For similar enough inputs, the angle accurately characterizes the cosine similarity. This later can thus be directly approximated using the estimated angle or corrected using a cosine function.

To efficiently encode the relative position of a point with respect to an hyperplane, one can simply keep the sign of its projection to this hyperplane. Hence, if two points share this same sign, it means that they both lie on the same side of the hyperplane, ie. it does not pass through them. Hence, by comparing the signs of the projections of two points to multiple hyperplanes, one can efficiently approximate the angle they are forming with the origin (see Figure 4). Depending on the desired result, one can approximate the angle using fast bit-wise (Hamming) comparisons or find the nearest neighbors using a bucket-based query mechanism.

Figure 5: Left: Finding the k-most similar vectors to a query requires the computation of all pairwise cosine similarity scores. Right: With a partition of the space using uniformly generated hyperplanes, one can immediately retrieve some of the most relevant candidates. This partition can be use directly to hash the inputs or as a first filter to retrieve similar items.

In our case, we only seek at giving a label to each basic-block such that similar enough items have high probability to collide (see Figure 5). To define the desired ratio similarity/probability of collision, we usually use AND/OR amplification schemes. AND amplification schemes consist in the concatenation of several sign codes into a single label such that the probability of collision decrease exponentially with the similarity. On the other hand, OR schemes aims at allowing some sign differences, mostly by assigning several buckets to a same point. In practice, AND schemes usually improve the precision score but decrease the recall while OR schemes mostly do the opposite. Usually, both hash structures are used simultaneously. The resulting amplification can be easily computed (see Figure 6).

Figure 6: The probability of collision of two inputs according to their similarity. The desired accuracy can be set using AND/OR amplification schemes.

As an illustration, in a primary experimental work (10000 basic-blocks randomly taken from a large binary), we decided to target a 10%-sensitive hash function (we would like the function to give the same label to every basic blocks with cosine similarity above 0.9). Thus, we chose to use 32 randomly generated hyperplanes as well as an AND structure such that each label can be encoded in a 32 bits (unsigned) integer. Though very simple, this scheme achieved good results. We obtained a 98.4% precision score (the ratio of equally labelized basic blocks that are similar enough) with a 82.9% recall score (the ratio of similar enough basic blocks that are actually given the same label). Note that more complex schemes lead to better results but they require some query mechanisms that are out of scope in this blogpost.

Note that these results are given as an illustration of the precision/recall trade-off and do not augur of the overall performance of the proposed labelization technique in much larger and more heterogeneous configurations (different binaries, ages, developer's practices, compilers, optimizations, etc.). Note also that, while we wait for an in-depth analysis, it remains unclear to us if one should favors precision or recall (a tighter of larger partition of the space) to get the better results using the Weisfeiler-Lehman graph kernel that we now introduce.