Inspiration

Look at the three pictures above. Which of the query pictures, 1 or 2, is more similar to the target picture? Can you say why?

If you thought (as we did) that image 2 is more similar to the target, the pictures above explain why. They mark and number the regions that each query shares with the target image. Image 2 is more similar to the target simply because the two share more regions, and these regions are large and non-trivial (for instance, the black background in the pictures should not be used as evidence of similarity). This observation is inspired by work by Michal Irani and Oren Boiman in the field of image processing (dancer images courtesy of Prof. Michal Irani). Esh shows that the same principles of similarity apply to code!

Consider the assembly code in the diagram above, extracted from binaries stripped of debug information. The blocks show partial assembly code of three procedures: two of them (t and q2) contain the “Heartbleed” vulnerability, and the third (q1) is unrelated. The two vulnerable procedures were compiled using different compilers. Finding similarity between the procedures using syntactic techniques is challenging, as different compilers can produce significantly different assembly code.

Instead, we decompose each procedure into strands, semantically compare the strands, and lift the results to procedures. In the diagram, the target code t and the query code q1, q2 share three matching strands, numbered in the figure as 1, 2, and 3 (within circles). Each strand is a sequence of instructions, and two strands are considered a match when they perform an equivalent computation, even if they are syntactically different. In the figure, matching strands are marked with the same circled number.
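
The decompose, compare, and lift idea can be illustrated with a toy sketch. This is not Esh's implementation. Every name below is hypothetical, each strand is modeled as a small Python function rather than real assembly, and semantic equivalence is only approximated by checking that two strands agree on random 32-bit inputs, which is a stand-in for a real equivalence check. The procedure score is simply the fraction of target strands that have an equivalent strand in the query.

```python
import random

MASK = 0xFFFFFFFF  # model 32-bit register arithmetic


# Hypothetical strands: each one stands for a short instruction sequence
# that computes a single output value from two input values.
def strand_shift_add(x, y):
    # e.g. a shift followed by an add: (x << 1) + y
    return ((x << 1) + y) & MASK


def strand_lea(x, y):
    # e.g. a single address computation: y + x * 2
    return (y + x * 2) & MASK


def strand_xor(x, y):
    return (x ^ y) & MASK


def strand_sub(x, y):
    return (x - y) & MASK


def probably_equivalent(s1, s2, trials=200, seed=0):
    """Approximate semantic equivalence by random testing (an assumption
    of this sketch, not a proof of equivalence)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.getrandbits(32), rng.getrandbits(32)
        if s1(x, y) != s2(x, y):
            return False
    return True


def procedure_similarity(target, query):
    """Lift strand matches to procedures: the fraction of target strands
    that have at least one equivalent strand in the query."""
    if not target:
        return 0.0
    matched = sum(
        1 for t in target if any(probably_equivalent(t, q) for q in query)
    )
    return matched / len(target)


# Procedures are modeled as lists of strands.
target = [strand_shift_add, strand_xor]
q1 = [strand_sub]              # unrelated code
q2 = [strand_lea, strand_xor]  # same computation, different syntax

print(procedure_similarity(target, q1))
print(procedure_similarity(target, q2))
```

In this sketch, strand_shift_add and strand_lea are written differently but compute the same value, so they count as a match, and q2 scores higher than q1. A more faithful version would also discount trivial strands that appear in almost every procedure, in the same way that the black background should not count as evidence of similarity between images.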