Being able to find files that are similar to a particular file is quite useful, although it can be difficult to handle at scale. It can often require an infeasible number of comparisons, which need to take place outside of a database. In an attempt to make this task more manageable, Brian Wallace has devised an optimization to ssDeep comparisons, which drastically decreases the time required to compare files.

ssDeep

ssDeep [1] is a fuzzy hashing algorithm which employs a similarity digest in order to determine whether the hashes that represent two files have similarities. For instance, if a single byte of a file is modified, the ssDeep hashes of the original file and the modified file are considered highly similar. ssDeep scores range from zero (no similarity, or negligible similarity) to 100 (very similar, if not an exact match).

ssDeep works by computing a fuzzy hash of each piece of data supplied to it (string, file, etc.). Most implementations of ssDeep refer to this computation of the fuzzy hash as 'compute'. The output of the compute function is an ssDeep hash, such as the following:

768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU

Once we have computed hashes for more than one input, we can use the comparison method (generally referred to in implementations as 'compare') to compare two hashes. This similarity comparison is done completely independently of the files the hashes are based on, allowing simple high-level comparisons without the need to compare each file byte by byte.

ssDeep is useful when searching for similar files. For instance, two malware samples generated by the same builder, which statically inserts configuration data into a stub sample, may be easy to identify as highly similar.

In the past, I have used ssDeep to preprocess a large number of samples. For instance, during Cylance's Operation Cleaver investigation [2], there were an almost insurmountable number of malware samples to reverse engineer. A number of methods were required to reduce the sample set into clusters, one of which was ssDeep. Using ssDeep clustering, I was able to see which files were similar, making it simpler to determine which samples were from the same family, as well as to identify when one sample was embedded in another.

There are services that utilize ssDeep, but they tend to supply limited ssDeep functionality, such as reduced search capabilities or no automated queries. The likely reason for this is how ssDeep scales, which is covered in the next section.

There are also alternative fuzzy hashing methods that may be worth exploring, but they are outside the scope of this article.

Scaling issues

The largest issue with ssDeep as it stands is that it does not scale particularly well. In order to compare a fixed ssDeep hash against a set of other ssDeep hashes, the ssDeep compare function must be called for each hash being tested. This means that if you are comparing an ssDeep hash against 1,000 other ssDeep hashes, you need to call the ssDeep comparison function 1,000 times. This becomes a problem when the hashes are stored in a database: every hash must be retrieved in order to look up similar hashes.

Furthermore, clustering (or grouping) based on ssDeep requires every ssDeep hash to be compared against every other hash. This means that if you are clustering 1,000 ssDeep hashes, 499,500 (the number of pairs among 1,000 elements) ssDeep comparison function calls are required.

Considering these issues, it is clear that ssDeep becomes computationally heavy at scale. This is likely what leads services to provide only limited ssDeep functionality.
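The quadratic clustering cost mentioned earlier is easy to sanity-check. The sketch below counts the pairwise comparisons needed to cluster 1,000 hashes; the strings are placeholders, not real ssDeep hashes.

```python
import itertools

# Placeholder values standing in for 1,000 ssDeep hashes.
hashes = ["hash_%d" % i for i in range(1000)]

# Clustering requires comparing every hash against every other hash.
pairs = list(itertools.combinations(hashes, 2))

print(len(pairs))       # number of compare() calls needed to cluster
print(1000 * 999 // 2)  # the same value, from n * (n - 1) / 2
```

Both lines print 499,500, matching the figure above.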

Scaling optimizations

My methodology for optimizing ssDeep comparisons at scale focuses on reducing the number of ssDeep hashes that need to be compared, shrinking the search space. This approach avoids the need for a custom-developed library to conduct ssDeep comparisons. Note that these optimizations target the application of ssDeep at scale, not the ssDeep algorithm itself.

In order to develop optimizations to decrease the search space for similar hashes, we need to inspect how these ssDeep hash comparisons are made. Since the source code for ssDeep is publicly available, this does not require a great deal of reverse engineering. The comparisons conducted by the fuzzy_compare function are our primary focus [3]. While I will not cover everything this function does, I will cover the relevant portions.

Testing methodology

In the following sections, I will describe the optimization methods I developed for utilizing ssDeep at scale. When developing these methods, I used an isolated testing environment that is easy to reproduce. In order to maintain this high level of reproducibility, all benchmarks are computed in a single-threaded application executed on an Odroid XU4, which was isolated from any networks to maintain a sanitized testing environment. No timed portion of the code relies on accessing resources on disk. In order to reproduce an environment where a database is being queried, a simple SQLite database, hosted completely in memory, is used. No advanced features of SQLite are employed, and all methods are easily available to anyone wishing to reproduce these results or optimization methods.

In order to test the optimization methods, they need to accomplish a task. For this reason, all benchmarks include two tasks that are needed to use ssDeep. One of these tasks, 'Lookup', is searching for any ssDeep hash in our database where the comparison value is greater than zero. This task is computed for 1,000 hashes in order to increase the accuracy of the benchmarks for smaller database sizes. The other task, 'Cluster', requires every hash in our database to be compared against every other hash. More specifically, the algorithm must return every file comparison where the value is greater than zero. The methods for clustering the resulting matrix of distance values are independent of these optimizations. Additionally, all data points are tested five times, and an average is taken over them all. The code used to collect the benchmarks as well as the specific optimization implementations can be found at [4].
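As a simplified illustration of the unoptimized baseline these benchmarks measure, the sketch below implements the 'Lookup' task over an in-memory SQLite database. The real benchmarks use the compare function from the ssDeep bindings; here a hypothetical stand-in comparison is used so the sketch runs without them, and the stored values are placeholders.

```python
import sqlite3

def fake_compare(a, b):
    # Hypothetical stand-in for the ssDeep compare function (returns 0-100).
    return 100 if a == b else 0

def plain_lookup(conn, query_hash, compare):
    """Baseline 'Lookup' task: every stored hash must be retrieved from
    the database and passed to the comparison function."""
    matches = []
    for (stored,) in conn.execute("SELECT hash FROM hashes"):
        if compare(query_hash, stored) > 0:
            matches.append(stored)
    return matches

# In-memory SQLite database, as in the benchmark environment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (hash VARCHAR UNIQUE)")
for h in ["hash_a", "hash_b", "hash_c"]:  # placeholder values
    conn.execute("INSERT INTO hashes VALUES (?)", (h,))

print(plain_lookup(conn, "hash_b", fake_compare))
```

Note that every row is scanned and compared regardless of the query, which is exactly the scaling behaviour the following optimizations attack.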

Chunksize

An ssDeep hash is formatted as follows:

chunksize:chunk:double_chunk

The chunksize is an integer that describes the size of the chunks in the following parts of the ssDeep hash. Each character of the chunk represents a part of the original file of length chunksize. The double_chunk is computed over the same data as the chunk, but with a chunk size of chunksize * 2. This is done so that ssDeep hashes computed with adjacent chunk sizes can be compared.

This tells us that if we are looking to perform comparisons against an ssDeep hash with chunksize = n, the comparison will not return any value other than zero unless the chunksize of the other hash is n/2, n or 2 * n. This is our first optimization. By only retrieving ssDeep hashes with a compatible chunksize, we reduce the number of comparisons being made, unless our dataset is extremely homogeneous. In order to do this, we must store our ssDeep hashes with their chunksize. For instance, we can use the following SQL schema:

CREATE TABLE hashes (chunksize INT, hash VARCHAR UNIQUE);

With this smaller search space, we need to compute far fewer ssDeep comparisons and retrieve far fewer ssDeep hashes from our database. When benchmarks are compared with the unoptimized method, it is immediately clear that this simple optimization is effective (see Figure 1 and Figure 2).

Figure 1. 1,000 ssDeep lookups over database (plain vs. chunksize).

Figure 2. ssDeep cluster over database (plain vs. chunksize).

Clustering optimization

We can further optimize our clustering method based on chunksize. By iterating over the chunk sizes incrementally and obtaining all ssDeep hashes of a certain chunksize, we can perform our comparisons locally in an efficient manner. As long as the hashes for the previous chunksize are kept as well, all comparison values greater than zero can be determined.

This method works by first comparing each hash against every other hash with the same chunksize. Each hash is then compared against every hash with chunksize / 2. The data set retrieved for the previous chunksize is reused, unless a chunksize has no representative hashes. As this is done incrementally, this computes all comparisons for chunksize / 2, chunksize and chunksize * 2 (see Figure 3).

Figure 3. ssDeep cluster over database (plain vs. chunksize with cluster optimization).
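A minimal sketch of the chunksize pre-filter, using the schema above. The hash values are placeholders, and in practice each returned candidate would still be passed to the full ssDeep comparison.

```python
import sqlite3

def compatible_candidates(conn, ssdeep_hash):
    """Retrieve only hashes whose chunksize is n/2, n or 2*n; any other
    chunksize is guaranteed to produce a comparison value of zero."""
    n = int(ssdeep_hash.split(":", 1)[0])
    cur = conn.execute(
        "SELECT hash FROM hashes WHERE chunksize IN (?, ?, ?)",
        (n // 2, n, 2 * n),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (chunksize INT, hash VARCHAR UNIQUE)")
# Placeholder hashes at four different chunk sizes.
for h in ["384:aaa:bbb", "768:ccc:ddd", "1536:eee:fff", "3072:ggg:hhh"]:
    conn.execute("INSERT INTO hashes VALUES (?, ?)", (int(h.split(":")[0]), h))

# A query hash with chunksize 768 only needs candidates at 384, 768 and 1536.
print(compatible_candidates(conn, "768:ccc:ddd"))
```

The 3072-chunksize row is never retrieved, which is the entire point: the database does the filtering before a single ssDeep comparison is made.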

IntegerDB

Further into the comparison process, each ssDeep hash goes through independent modifications (which will be mentioned in a later section). After these modifications, there is one final check before an edit distance algorithm is applied: a check for any seven-character string that is common to both hashes. If there is none, the comparison value for the two hashes will be zero. This means that if we can search a database for any ssDeep hash sharing a seven-character string with our query hash, we can greatly optimize our searching. Consider the following example:

Hash 1: 768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU

Hash 2: 768:C7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTV6:ThT8C+fuioHq1KEFoAj6

Hash 3: 768:t2m3D9SlK1TVYatO/tkqzWQDG/ssC7XkZDzYYFTdqiP1msdT1OhN7UmSaED7Etnc:w7atyfzWgGEXszYYF4iosdTE1zz2+Ze

Since they all have the same chunksize, these three hashes can be compared. Now let's search for common seven-character strings by generating all seven-character substrings of each chunk of each hash.
Hash 1: Chunk: set([['v7XINhX', '7XINhXz', 'XINhXzn', 'INhXznV', 'NhXznVJ', 'hXznVJ8', 'XznVJ8C', 'znVJ8CC', 'nVJ8CC1', 'VJ8CC1r', 'J8CC1rB', '8CC1rBX', 'CC1rBXd', 'C1rBXdo', '1rBXdo0', 'rBXdo0z', 'BXdo0ze', 'Xdo0zek', 'do0zekX', 'o0zekXU', '0zekXUd', 'zekXUd3', 'ekXUd3C', 'kXUd3Cd', 'XUd3CdP', 'Ud3CdPJ', 'd3CdPJx', '3CdPJxB', 'CdPJxB7', 'dPJxB7m', 'PJxB7mN', 'JxB7mNm', 'xB7mNmD', 'B7mNmDZ', '7mNmDZk', 'mNmDZkU', 'NmDZkUK', 'mDZkUKM', 'DZkUKMK', 'ZkUKMKZ', 'kUKMKZQ', 'UKMKZQb', 'KMKZQbF', 'MKZQbFT', 'KZQbFTi', 'ZQbFTiK', 'QbFTiKK', 'bFTiKKA', 'FTiKKAZ', 'TiKKAZT', 'iKKAZTy']]) Double chunk: set(['ShT8C+f', 'hT8C+fu', 'T8C+fui', '8C+fuio', 'C+fuioH', '+fuioHq', 'fuioHq1', 'uioHq1K', 'ioHq1KE', 'oHq1KEF', 'Hq1KEFo', 'q1KEFoA', '1KEFoAU']) Hash 2: Chunk: set ['C7XINhX', '7XINhXz', 'XINhXzn', 'INhXznV', 'NhXznVJ', 'hXznVJ8', 'XznVJ8C', 'znVJ8CC', 'nVJ8CC1', 'VJ8CC1r', 'J8CC1rB', '8CC1rBX', 'CC1rBXd', 'C1rBXdo', '1rBXdo0', 'rBXdo0z', 'BXdo0ze', 'Xdo0zek', 'do0zekX', 'o0zekXU', '0zekXUd', 'zekXUd3', 'ekXUd3C', 'kXUd3Cd', 'XUd3CdP', 'Ud3CdPJ', 'd3CdPJx', '3CdPJxB', 'CdPJxB7', 'dPJxB7m', 'PJxB7mN', 'JxB7mNm', 'xB7mNmD', 'B7mNmDZ', '7mNmDZk', 'mNmDZkU', 'NmDZkUK', 'mDZkUKM', 'DZkUKMK', 'ZkUKMKZ', 'kUKMKZQ', 'UKMKZQb', 'KMKZQbF', 'MKZQbFT', 'KZQbFTi', 'ZQbFTiK', 'QbFTiKK', 'bFTiKKA', 'FTiKKAZ', 'TiKKAZT', 'iKKAZTV', 'KKAZTV6']) Double chunk: set(['ThT8C+f', 'hT8C+fu', 'T8C+fui', '8C+fuio', 'C+fuioH', '+fuioHq', 'fuioHq1', 'uioHq1K', 'ioHq1KE', 'oHq1KEF', 'Hq1KEFo', 'q1KEFoA', '1KEFoAj', 'KEFoAj6']) Hash 3: Chunk: set ['t2m3D9S', '2m3D9Sl', 'm3D9SlK', '3D9SlK1', 'D9SlK1T', '9SlK1TV', 'SlK1TVY', 'lK1TVYa', 'K1TVYat', '1TVYatO', 'TVYatO/', 'VYatO/t', 'YatO/tk', 'atO/tkq', 'tO/tkqz', 'O/tkqzW', '/tkqzWQ', 'tkqzWQD', 'kqzWQDG', 'qzWQDG/', 'zWQDG/s', 'WQDG/ss', 'QDG/ssC', 'DG/ssC7', 'G/ssC7X', '/ssC7Xk', 'ssC7XkZ', 'sC7XkZD', 'C7XkZDz', '7XkZDzY', 'XkZDzYY', 'kZDzYYF', 'ZDzYYFT', 'DzYYFTd', 'zYYFTdq', 'YYFTdqi', 'YFTdqiP', 'FTdqiP1', 'TdqiP1m', 'dqiP1ms', 'qiP1msd', 'iP1msdT', 
'P1msdT1', '1msdT1O', 'msdT1Oh', 'sdT1OhN', 'dT1OhN7', 'T1OhN7U', '1OhN7Um', 'OhN7UmS', 'hN7UmSa', 'N7UmSaE', '7UmSaED', 'UmSaED7', 'mSaED7E', 'SaED7Et', 'aED7Etn', 'ED7Etnc']) Double chunk: set([['w7atyfz', '7atyfzW', 'atyfzWg', 'tyfzWgG', 'yfzWgGE', 'fzWgGEX', 'zWgGEXs', 'WgGEXsz', 'gGEXszY', 'GEXszYY', 'EXszYYF', 'XszYYF4', 'szYYF4i', 'zYYF4io', 'YYF4ios', 'YF4iosd', 'F4iosdT', '4iosdTE', 'iosdTE1', 'osdTE1z', 'sdTE1zz', 'dTE1zz2', 'TE1zz2+', 'E1zz2+Z', '1zz2+Ze']) Now that we have all the seven-character strings from the chunks, we want to find any overlap between the sets with the same chunk sizes. If we have any overlap, a comparison between them will return a result greater than zero. Hash 1 chunk & Hash 2 chunk: set(['mNmDZkU', '0zekXUd', '1rBXdo0', '8CC1rBX', 'd3CdPJx', 'CC1rBXd', 'ekXUd3C', 'o0zekXU', 'PJxB7mN', 'B7mNmDZ', '3CdPJxB', 'FTiKKAZ', 'C1rBXdo', 'ZkUKMKZ', 'dPJxB7m', 'Ud3CdPJ', 'kUKMKZQ', 'XINhXzn', 'INhXznV', 'kXUd3Cd', 'znVJ8CC', 'UKMKZQb', '7XINhXz', 'nVJ8CC1', 'ZQbFTiK', 'Xdo0zek', 'JxB7mNm', 'KMKZQbF', 'XznVJ8C', 'MKZQbFT', 'QbFTiKK', 'rBXdo0z', 'CdPJxB7', 'TiKKAZT', 'NmDZkUK', 'J8CC1rB', 'VJ8CC1r', 'hXznVJ8', 'bFTiKKA', 'do0zekX', 'DZkUKMK', 'BXdo0ze', 'zekXUd3', 'mDZkUKM', 'KZQbFTi', 'XUd3CdP', '7mNmDZk', 'xB7mNmD', 'NhXznVJ']) Hash 1 double_chunk & Hash 2 double_chunk: set(['oHq1KEF', 'uioHq1K', 'C+fuioH', '+fuioHq', 'q1KEFoA', 'Hq1KEFo', '8C+fuio', 'T8C+fui', 'hT8C+fu', 'fuioHq1', 'ioHq1KE']) Hash 1 chunk & Hash 3 chunk: set([]) Hash 1 double_chunk & Hash 3 double_chunk: set([]) Hash 2 chunk & Hash 3 chunk: set([]) Hash 2 double_chunk & Hash 3 double_chunk: set([]) With these values, we should see that the comparison between Hash 1 and Hash 2 results in a value greater than zero, but the comparisons between Hash 1 and Hash 3, and between Hash 2 and Hash 3, will result in comparisons that equal zero. 
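The seven-character-string check can be reproduced directly. The sketch below generates the substring sets for the three example hashes and intersects them; it mirrors the description above with plain Python sets rather than ssDeep's internal implementation.

```python
def seven_grams(part):
    """All seven-character substrings of one part of an ssDeep hash."""
    return {part[i:i + 7] for i in range(len(part) - 6)}

def gram_sets(ssdeep_hash):
    """(chunksize, chunk grams, double_chunk grams) for an ssDeep hash."""
    chunksize, chunk, double_chunk = ssdeep_hash.split(":")
    return int(chunksize), seven_grams(chunk), seven_grams(double_chunk)

h1 = "768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU"
h2 = "768:C7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTV6:ThT8C+fuioHq1KEFoAj6"
h3 = "768:t2m3D9SlK1TVYatO/tkqzWQDG/ssC7XkZDzYYFTdqiP1msdT1OhN7UmSaED7Etnc:w7atyfzWgGEXszYYF4iosdTE1zz2+Ze"

_, c1, d1 = gram_sets(h1)
_, c2, d2 = gram_sets(h2)
_, c3, d3 = gram_sets(h3)

# Hash 1 and Hash 2 share seven-character strings, so compare() can be non-zero.
print(bool(c1 & c2), bool(d1 & d2))
# Hash 3 shares none with either, so those comparisons are guaranteed to be zero.
print(bool(c1 & c3), bool(d1 & d3), bool(c2 & c3), bool(d2 & d3))
```

The first line prints two True values and the second prints four False values, matching the intersections listed above.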
>>> ssdeep.compare("768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU", "768:C7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTV6:ThT8C+fuioHq1KEFoAj6")
97
>>> ssdeep.compare("768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU", "768:t2m3D9SlK1TVYatO/tkqzWQDG/ssC7XkZDzYYFTdqiP1msdT1OhN7UmSaED7Etnc:w7atyfzWgGEXszYYF4iosdTE1zz2+Ze")
0
>>> ssdeep.compare("768:C7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTV6:ThT8C+fuioHq1KEFoAj6", "768:t2m3D9SlK1TVYatO/tkqzWQDG/ssC7XkZDzYYFTdqiP1msdT1OhN7UmSaED7Etnc:w7atyfzWgGEXszYYF4iosdTE1zz2+Ze")
0

In order to perform this comparison optimally, we will store all seven-character string values in a database for each hash. These can be reduced to integers, as each string consists of base64 characters which, when decoded, can be compactly represented as 42-bit integers in our database. The resulting schema is as follows:

CREATE TABLE ssdeep_hashes (hash_id INTEGER PRIMARY KEY, hash VARCHAR UNIQUE);
CREATE TABLE chunks (hash_id INTEGER, chunk_size INTEGER, chunk INTEGER);

Now our database consists of all the integers representing seven-character strings that reside in each ssDeep hash chunk (entries derived from a double_chunk are stored with their doubled chunk size). In order to query against this database, we need to split our query hash into its integers and chunksize, then query for any hash that shares an integer at the same chunksize. Since these are queries over integers, when used with a simple index, this can be powerfully effective (see Figure 4 and Figure 5).

Figure 4. 1,000 ssDeep lookups over database.

Figure 5. ssDeep cluster over database (plain vs. IntegerDB).

This method is so effective that, in comparison with the other methods, it is difficult to visualize the curve.
For this reason, additional benchmarks must be gathered on a greater scale (see Figure 6, Figure 7, Figure 8 and Figure 9).

Figure 6. 1,000 ssDeep lookups over database (IntegerDB).

Figure 7. ssDeep cluster over database (IntegerDB).

Figure 8. 1,000 ssDeep lookups over database (extended).

Figure 9. ssDeep cluster over database (extended).
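Putting the pieces together, the sketch below indexes the three example hashes into the IntegerDB schema and performs a lookup. It assumes ssDeep's base64 alphabet (A-Z, a-z, 0-9, '+', '/') when packing each seven-character string into a 42-bit integer; in practice each candidate hash_id returned would still be verified with a full ssDeep comparison.

```python
import sqlite3
import string

# Assumed base64 alphabet for ssDeep hash characters.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def gram_to_int(gram):
    """Pack a seven-character base64 string into a 42-bit integer."""
    n = 0
    for ch in gram:
        n = (n << 6) | B64.index(ch)
    return n

def hash_grams(ssdeep_hash):
    """(chunk_size, integer) pairs for every seven-character substring;
    double_chunk entries are stored under their doubled chunk size."""
    chunksize, chunk, double_chunk = ssdeep_hash.split(":")
    cs = int(chunksize)
    pairs = []
    for part, size in ((chunk, cs), (double_chunk, cs * 2)):
        pairs += [(size, gram_to_int(part[i:i + 7])) for i in range(len(part) - 6)]
    return pairs

def index_hash(conn, hash_id, ssdeep_hash):
    conn.execute("INSERT INTO ssdeep_hashes VALUES (?, ?)", (hash_id, ssdeep_hash))
    conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)",
                     [(hash_id, size, value) for size, value in hash_grams(ssdeep_hash)])

def candidates(conn, ssdeep_hash):
    """hash_ids sharing at least one integer at a compatible chunk size."""
    ids = set()
    for size, value in hash_grams(ssdeep_hash):
        cur = conn.execute(
            "SELECT hash_id FROM chunks WHERE chunk_size = ? AND chunk = ?",
            (size, value))
        ids.update(row[0] for row in cur)
    return ids

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ssdeep_hashes (hash_id INTEGER PRIMARY KEY, hash VARCHAR UNIQUE)")
conn.execute("CREATE TABLE chunks (hash_id INTEGER, chunk_size INTEGER, chunk INTEGER)")
conn.execute("CREATE INDEX chunks_idx ON chunks (chunk_size, chunk)")

h1 = "768:v7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTy:ShT8C+fuioHq1KEFoAU"
h2 = "768:C7XINhXznVJ8CC1rBXdo0zekXUd3CdPJxB7mNmDZkUKMKZQbFTiKKAZTV6:ThT8C+fuioHq1KEFoAj6"
h3 = "768:t2m3D9SlK1TVYatO/tkqzWQDG/ssC7XkZDzYYFTdqiP1msdT1OhN7UmSaED7Etnc:w7atyfzWgGEXszYYF4iosdTE1zz2+Ze"
for hash_id, h in enumerate((h1, h2, h3), start=1):
    index_hash(conn, hash_id, h)

print(candidates(conn, h1))  # Hash 1 matches itself and Hash 2, but not Hash 3
```

The index on (chunk_size, chunk) is what makes the lookup fast: each query gram becomes a single indexed integer lookup rather than a scan over every stored hash.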