We evaluated MemPick on two sets of applications. For the first set, we gathered as many popular libraries for lists and trees as we could find and exercised their most relevant functions using test programs from the libraries’ test suites. These synthetic tests allow us to control exactly which features we test. We also evaluated the quiescent period detection mechanism on these libraries to identify the requirements for gap size selection. Next, we evaluated MemPick on a set of real-world applications, such as Chromium, Lighttpd, Wireshark, and the Clang compiler, and measured the analysis time for these applications to show the scalability of the proposed approach. Finally, we looked into low-level system code, including two major file system implementations, ZFS and NTFS.

Popular Libraries

We tested MemPick on 16 popular libraries that feature a diverse set of implementations for a wide range of data structures. Including libraries in the evaluation has multiple benefits. First, they provide strong ground-truth guarantees, since the data structures and their features are well documented. Second, they provide a means to evaluate a wider range of implementation variants, since in practice most applications rely on a few standard implementations such as STL and GLib. For all the libraries we tried to use the built-in self-test functionality. Only when no such test was available did we build simple test harnesses to probe all functionality.

In the following we summarize the reasoning behind some of the library choices. The evaluation set contains 4 major STL variants and GLib, the libraries typically used by major Linux applications. In addition, libavl brings a large variety of both balanced and unbalanced binary trees, with different overlay configurations, like the presence of parent pointers or threadedness. Several libraries (like UTlist, BSD queues, and Linux lists) implement inline data structures with no explicit interface in the final binary, typically by offering access macros instead of functions. We also include the Google implementation of in-memory B-Trees to validate the ability of MemPick to detect balanced non-binary trees. B-Trees are typically implemented in database applications, which operate on persistent storage, leading to a lack of pointers in the data structure nodes.

Table 2 presents a summary of our results gathered from the libraries. We do not present results for individual data structure partitions as that number is dependent on the specific test applications. In all scenarios MemPick classified all partitions for any given data structure the same way.

Table 2 MemPick’s evaluation across 16 libraries

Across all tests, we encountered a total of two misclassifications; all other data structures were successfully identified by MemPick (no false negatives). In the case of GDSL, the shape of the misclassified binary tree is detected correctly, but MemPick reports perfect balancedness because the tree is limited to 3 nodes. The result is still valuable, since MemPick reports all other classification details accurately, and this error is unlikely to occur in applications that deal with real data. The misclassification in GLib is more subtle. The implementation of the N-ary tree uses parent, left child and sibling pointers. For optimization purposes the authors also include a previous pointer in the sibling list. MemPick correctly identifies the N-ary parent-pointer tree and the binary child-pointer tree (left child + next sibling), but it also detects an overlay using the left child and previous sibling pointers. This overlay does not match any basic shape and is reported as a graph, which turns the overall classification into a graph. Since MemPick also reports the overlay classification to the user, a human reverse engineer can accurately interpret the results. Alternatively, the user can add a (trivial) refinement classifier for this scenario, since the presence of two overlays does imply more structure than a generic graph, but we wanted to keep the number of refinement classifiers to a minimum. One observation is that both errors were found in libraries supporting a large variety of data structure implementations. This is not surprising, since the chance of encountering non-standard data structures increases with the size of the library. Still, our results show that the overlay-based classifier is resilient to unexpected data structure shapes, as it correctly classifies all basic overlays contained within them. Even if the overall classification fails, the partial results are still beneficial as an anchor for the reverse engineering process.
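The GLib case can be illustrated with a small sketch. It is not MemPick's classifier: the `Node` class, the field names and the degree-based `overlay_shape` test are assumptions made for illustration, and the test ignores cycles. It only shows why the parent overlay and the (first child, next sibling) overlay look like trees, while the (first child, previous sibling) overlay matches neither basic tree shape.

```python
from collections import Counter

class Node:
    """Hypothetical N-ary tree node with four link fields."""
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = None  # first (leftmost) child
        self.next = None      # next sibling
        self.prev = None      # previous sibling

def add_children(parent, kids):
    """Attach kids to parent as a doubly linked sibling list."""
    parent.children = kids[0]
    for i, kid in enumerate(kids):
        kid.parent = parent
        kid.prev = kids[i - 1] if i > 0 else None
        kid.next = kids[i + 1] if i + 1 < len(kids) else None

def overlay_shape(nodes, fields):
    """Crude degree-based shape test on the edges induced by `fields`."""
    in_deg, out_deg = Counter(), Counter()
    for n in nodes:
        for f in fields:
            target = getattr(n, f)
            if target is not None:
                out_deg[n.name] += 1
                in_deg[target.name] += 1
    max_in = max(in_deg.values(), default=0)
    max_out = max(out_deg.values(), default=0)
    if max_out <= 1:
        return "parent-pointer tree" if max_in > 1 else "list"
    if max_in <= 1:
        return "child-pointer tree"
    return "graph"

# Root A with children B, C, D; C has children E, F.
a, b, c, d, e, f = (Node(x) for x in "ABCDEF")
add_children(a, [b, c, d])
add_children(c, [e, f])
nodes = [a, b, c, d, e, f]

parent_overlay = overlay_shape(nodes, ("parent",))
next_overlay = overlay_shape(nodes, ("children", "next"))
prev_overlay = overlay_shape(nodes, ("children", "prev"))
```

In the last overlay, node B is referenced both by A (first child) and by C (previous sibling), while C itself has two outgoing links, so neither tree shape applies.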

When testing the utlist library, we ran across an interesting classification report. MemPick reported a cyclic and a non-cyclic list overlay for the non-cyclic doubly linked list in the library. This behavior was confirmed to be a design decision, when correlating the results with the source code. This example shows the importance of the overlay based classification employed in MemPick. Without this approach the observed shape and behavior would not match any assumption about linked lists as the overall structure is neither properly cyclic nor non-cyclic.
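A minimal sketch of this pattern follows, assuming the convention used by utlist's doubly linked lists in which the head's `prev` pointer refers to the tail while the tail's `next` pointer stays null. The `Elem` class and the helper names are hypothetical; the code only mirrors that convention to show why the two pointer fields yield different overlay shapes.

```python
class Elem:
    """Hypothetical list element with next/prev links."""
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None

def dl_append(head, elem):
    """Append elem; head.prev always tracks the tail (assumed convention)."""
    if head is None:
        elem.prev = elem   # a single element is its own tail
        elem.next = None
        return elem
    tail = head.prev
    tail.next = elem
    elem.prev = tail
    elem.next = None
    head.prev = elem
    return head

def is_cyclic(start, field):
    """Follow one pointer field from start and report whether a node repeats."""
    seen = set()
    node = start
    while node is not None:
        if id(node) in seen:
            return True
        seen.add(id(node))
        node = getattr(node, field)
    return False

head = None
for v in range(4):
    head = dl_append(head, Elem(v))

next_is_cyclic = is_cyclic(head, "next")  # forward overlay ends at the tail
prev_is_cyclic = is_cyclic(head, "prev")  # backward overlay wraps head -> tail
```

Following `next` terminates at the tail, whereas following `prev` from the head wraps around to the tail and eventually returns to the head, which is consistent with one non-cyclic and one cyclic overlay.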

Summarizing these results, we see that MemPick successfully deals with a large variety of data structure implementations. It is capable of correctly identifying the underlying type, independent of the presence of interface functions and independent of overlay variations. The results also show the efficiency of classifying balanced binary trees based only on shape information, provided the tree is sufficiently large.

Next we aimed to identify the impact of the gap size percentile used with quiescent periods. In Section 6 we suggested using the 1% longest gaps as the signal for quiescent periods. In this part of the evaluation we vary this number all the way to 20% and observe its impact on classification accuracy. We perform this part of the evaluation on libraries, instead of applications, since they offer more precise ground truth information. Table 3 presents the results, counting all the data-structure implementations for which the classification was degraded in comparison to the original proposal. In some instances this degradation took the form of missing overlays, while in other instances MemPick was unable to offer any valid classification. In general, one can observe that search trees are more sensitive to the gap size, especially the implementations within the libavl library. This library offers a cloning interface for trees, which in some implementation variants does not preserve the validity of the tree throughout the operation. If a quiescent period intervenes during the clone operation, the system will observe an invalid tree. One must also take into account that we tested the libraries using the built-in unit tests whenever they were available. In this scenario data-structure operations are typically executed in quick succession, without any intervening application code. This explains the gap size sensitivity for some of the list implementations. Overall, these results suggest that quiescent periods should be selected conservatively, especially in the presence of heavily used, complex data-structures. Our suggestion when performing manually assisted reverse engineering is to start with a highly conservative gap size, which can progressively be increased to detect data-structures potentially missed by the initial setting.
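The percentile-based selection can be sketched as follows. This is an illustration under stated assumptions, not the implementation from Section 6: the function name `quiescent_gaps`, the use of plain numeric timestamps for accesses, and the rounding rule (always keep at least one gap) are all hypothetical.

```python
import math

def quiescent_gaps(timestamps, fraction=0.01):
    """Return (start, end) pairs for the longest `fraction` of access gaps."""
    events = sorted(timestamps)
    # One gap between every pair of consecutive accesses.
    gaps = [(b - a, a, b) for a, b in zip(events, events[1:])]
    if not gaps:
        return []
    keep = max(1, math.ceil(len(gaps) * fraction))
    longest = sorted(gaps, reverse=True)[:keep]
    return sorted((a, b) for _, a, b in longest)

# Hypothetical trace: three bursts of accesses separated by two long pauses.
trace = [0, 1, 2, 3, 50, 51, 52, 53, 54, 200, 201]

conservative = quiescent_gaps(trace, fraction=0.01)
permissive = quiescent_gaps(trace, fraction=0.20)
```

With the conservative setting only the longest pause is treated as quiescent; raising the fraction admits shorter gaps, which is where a snapshot may land inside an operation such as the tree cloning described above.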

Table 3 MemPick’s gap size evaluation across 16 libraries

Applications

MemPick is designed as a powerful reverse-engineering tool for binary applications, so it is natural to evaluate its capabilities on a number of frequently used real applications. For this purpose we have selected 10 applications from a wide range of classes, including a compiler (Clang), a web browser (Chromium), a web server (Lighttpd), and multiple networking and graphics applications. Table 4 presents the number of lines of code for each of these applications, giving an idea of their size.

Table 4 Number of C/C++ lines of code for the 10 real-life applications, excluding potential third party libraries

As discussed in Section 4, MemPick operates under the assumption that it can track all memory allocations. Two of the selected applications, namely Clang and Chromium, use custom memory allocators to manage the heap. In the case of Clang we also instrumented the custom memory allocators to gain insight into the internal data structures. For Chromium we were unable to perform such instrumentation. MemPick was still able to detect a large number of data structures that are defined in third-party libraries which still employ the system allocation routines. In principle, it would be straightforward to detect custom memory allocators automatically using techniques developed by Chen et al. (2013).

Table 5 presents an overview of the results from all applications. It is important to note that for applications there exists no ground-truth information that we can compare against. For every data structure reported by MemPick we manually checked the corresponding source code to confirm the classification. We report two types of errors in Table 5. Typing errors occur when a given data structure is misclassified by MemPick. Partition errors refer to data structures that were classified accurately overall, but for which a number of their partitions contained errors.

Table 5 MemPick’s evaluation across 10 real-life applications

The accuracy of MemPick is demonstrated by the fact that only 3 type misclassifications were detected in all tests on all 10 applications. MemPick was successful in identifying a wide range of data structures, from custom-designed singly-linked lists to large n-ary trees used for ray-tracing. MemPick also highlights different developer trends in the use of data structures. Some application developers prefer static storage such as arrays over complex heap structures. Examples of this pattern include wget and lighttpd. To ensure that this observation is not the result of false negatives, we manually inspected these two applications for undetected data structure implementations. As far as our inspection could determine, no data structures were missed by MemPick in these two applications.

Now let us focus our attention on the analysis of the erroneous classifications reported by MemPick. The first example is a type misclassification in one of the linked list implementations in Chromium. In this scenario MemPick reported a parent-pointer tree between the memory nodes. Browsing the source reveals the root of the error to be a programming decision. Nodes removed from the list never have their internal data cleared, nor are they freed until the end of the application. These stale links stay resident in memory and confuse our shape analysis. A potential solution for this problem is a more advanced heap tracking mechanism with garbage collection, which would identify dead objects in memory and ensure that they are removed from the analysis. However, we feel that this is beyond the scope of the current paper.

The other two type misclassifications both stem from composite data structures. Templated libraries such as STL make it possible for the programmer to build composite data structures like list-of-trees or list-of-lists. MemPick correctly identifies the data structure boundaries in situations where node types are mixed, but is unable to do so if both components have the same type, as in a list-of-lists. Without such boundaries, MemPick evaluates the shape of the data structure as a whole. Intuitively, the resulting data structure still has a consistent shape, but features increased complexity. A combination of singly-linked lists turns into a child-pointer tree, while binary trees turn into ternary trees with the addition of the “root of sub-tree” pointer. This is also exactly what MemPick reports in these two scenarios. Pure shape analysis is not sufficiently expressive to distinguish between this pattern and regular child-pointer or ternary trees, respectively. A reverse engineer using MemPick can still identify this pattern with good confidence, by observing that the other partitions of the same type are classified as lists or trees.
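The list-of-lists effect can be reproduced with a toy edge list. The node names and the `max_degrees` helper below are hypothetical and only compare pointer degrees; they do not model how MemPick partitions or classifies the heap.

```python
def max_degrees(edges):
    """Return (max in-degree, max out-degree) for a list of (src, dst) edges."""
    in_deg, out_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    return max(in_deg.values(), default=0), max(out_deg.values(), default=0)

# Two inner singly-linked lists: a0 -> a1 -> a2 and b0 -> b1.
inner_a = [("a0", "a1"), ("a1", "a2")]
inner_b = [("b0", "b1")]

# Outer list o0 -> o1 built from nodes of the same type; each outer node
# also points at the head of one inner list.
outer = [("o0", "o1"), ("o0", "a0"), ("o1", "b0")]

single_list = max_degrees(inner_a)
merged = max_degrees(inner_a + inner_b + outer)
```

Taken alone, an inner list has at most one incoming and one outgoing link per node. Once the boundary between outer and inner lists is invisible, some nodes have two outgoing links while every node still has at most one incoming link, which is the degree profile of a child-pointer tree.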

Looking at the partition errors in Table 5, the reader can notice that the vast majority belong to binary trees. We focus our attention on this class of errors first. For all misclassifications of this category, MemPick erroneously detects AVL balancedness instead of the weaker red-black or unbalanced properties. As presented previously in Section 8, measuring the balancedness of a tree carries uncertainty if the tree is too small. We confirmed that for each of the erroneous partitions, the tree contained no more than 7 nodes, a number too small to identify the difference between the two tree types. For all trees larger than this size our algorithm has an error rate of 0%.
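To make the small-tree ambiguity concrete, the sketch below checks the AVL height rule (subtree heights differ by at most one at every node) from shape alone. It is an assumed, simplified stand-in for the balancedness measure of Section 8; `TreeNode`, `avl_height` and `looks_avl` are hypothetical names.

```python
class TreeNode:
    """Hypothetical binary tree node carrying only shape information."""
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def avl_height(node):
    """Return the subtree height, or -1 if any node breaks the AVL rule."""
    if node is None:
        return 0
    lh = avl_height(node.left)
    rh = avl_height(node.right)
    if lh < 0 or rh < 0 or abs(lh - rh) > 1:
        return -1
    return 1 + max(lh, rh)

def looks_avl(root):
    """True if the shape alone is consistent with an AVL tree."""
    return avl_height(root) >= 0

# A 3-node tree with two leaves under the root.
tiny = TreeNode(TreeNode(), TreeNode())

# A 3-node chain that leans entirely to the right.
chain = TreeNode(None, TreeNode(None, TreeNode()))

tiny_is_avl = looks_avl(tiny)
chain_is_avl = looks_avl(chain)
```

A tiny tree such as `tiny` satisfies the AVL rule no matter which insertion algorithm produced it, so a shape-only check has no evidence for choosing a weaker balancing scheme until the tree grows.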

Outside of the 3 main groups of errors, MemPick reports a few more misclassified partitions. Considering the total number of partitions reported across the 10 applications, these errors represent less than 1 % and do not impact the overall analysis.

As part of the evaluation, we also look at the analysis times required when processing these applications. We broke down the analysis times into different stages to identify potential problem areas within the analysis pipeline. We exclude the tracing component from this evaluation, since none of the proposed contributions relate to application tracing. The explicit tracing overhead can also be mitigated when combined with multi-path analysis. The KLEE family of multi-path analysis tools (Cadar et al. 2008; Marinescu and Cadar 2012; 2013) is a prime example within the software engineering research community. Tools within the KLEE family emulate memory operations by first looking up detailed information about the allocation site at the target address. The tracing within MemPick performs a similar look-up to identify the target heap object, while also performing a look-up on the value as well. Thus, the desired tracing functionality could also be integrated within tools from the KLEE family with an additional 2X overhead in the worst case.

Table 6 presents the running time of the different analysis stages. The TypeGen stage includes type inference and the detection of the quiescent periods. The GraphGen stage represents graph generation, while the OverlayGen stage identifies all potential overlays. Finally, the Classification stage is the time it takes to perform the final classification. Applications with limited heap usage finish within a matter of seconds, as expected. Once the heap usage increases, so does the analysis time, especially for the TypeGen stage. This stage operates on raw traces and its execution time is unaffected by the semantics of the heap objects. This is highlighted by Lighttpd and Pachi, which make good use of heap memory, but in which few heap objects are members of high-level data-structures. For these applications the bulk of the analysis is performed within the first stage, after which all non-desirable heap objects are purged from further analysis. Another particular application is Clang, where the OverlayGen stage is significantly more costly than the rest. This behavior is due to some heap objects featuring a large set of pointer elements. Overlay identification requires testing an exponential number of pointer combinations, but for most data-structures (except B-trees) the number of pointers is limited. Since we don’t expect B-trees to come up often during analysis, this behavior can be considered an outlier and not the general case. Finally, for applications with heavy data-structure usage, such as Tor and Wireshark, the execution time can increase to the range of minutes, but the total analysis time is still only around 30 minutes. These execution times suggest that the proposed methodology is well suited for the offline analysis of complex applications. Further optimizations can also be applied to reduce the analysis time within a production setting. For more detailed discussions about scalability, we refer the reader to Section 11.
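The exponential growth mentioned for OverlayGen can be illustrated by enumerating pointer-field subsets. This sketch assumes that every non-empty subset of a node's pointer fields is a candidate overlay; the actual enumeration strategy and any pruning in MemPick may differ, and the field names are hypothetical.

```python
from itertools import combinations

def candidate_overlays(pointer_fields):
    """Yield every non-empty subset of pointer fields as a candidate overlay."""
    for size in range(1, len(pointer_fields) + 1):
        for subset in combinations(pointer_fields, size):
            yield subset

# A node with four link fields, as in a typical N-ary tree node.
small_node = ["parent", "children", "next", "prev"]
small_count = sum(1 for _ in candidate_overlays(small_node))

# A hypothetical wide node with 16 pointer slots, as a B-tree node might have.
wide_node = [f"child{i}" for i in range(16)]
wide_count = sum(1 for _ in candidate_overlays(wide_node))
```

A node with k pointer fields yields 2^k - 1 candidates under this assumption, so moving from 4 to 16 fields raises the count from 15 to 65535, which matches the observation that objects with many pointer elements dominate the cost of this stage.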

Table 6 MemPick’s analysis time evaluation across 10 real-life applications

System Code

One of the proposed use cases for MemPick was to analyze low-level system code for potentially vulnerable data structures. In this section we analyze the effectiveness of MemPick when dealing with this application class. MemPick relies on the PIN (Intel 2011) framework for dynamic instrumentation, and thus currently cannot analyze kernel-space code. However, this does not mean that the mechanics behind MemPick are not applicable to system code. To overcome this technical limitation we leverage the FUSE project (Szeredi, http://fuse.sourceforge.net), which allows file system implementations to reside in user space. Two major file system implementations, NTFS-3g and ZFS-FUSE, are built on top of this framework on Linux. For this evaluation we chose two additional projects: s3fs, which allows mounting buckets from Amazon’s S3 online storage service, and sshfs, which allows mounting remote folders via ssh.

Table 7 presents the overview of the results from MemPick when analyzing these four systems. The format is the same as the one used for the evaluation of real-world applications. For these four systems no typing errors were observed, only partition errors where small red-black trees were mistakenly classified as being AVL trees. This type of error does not affect the ability of the reverse engineer to identify the underlying data structure since the results offer a comprehensive summary of all partitions, including the right classification. One peculiar detail is the lack of complex data structures for the two systems dealing with real file systems, NTFS-3g and ZFS-FUSE. No tree-like data structures related to inodes were discovered in the case of these two systems. By examining the intermediate results, we discover that MemPick correctly identifies the inode objects, but detects no direct pointer links between them. This discovery was confirmed by examining the underlying source code, which uses additional levels of indirection between inode objects. This programming pattern does not match our initial definition of homogeneous data structures. Future work may look into the discovery of heterogeneous data structures consisting of different object types. We conclude that MemPick was successful in analyzing these four systems and shows great promise in handling system code.