MD5 is known to be broken for more than a decade now. Practical attacks have been shown since 2006, and public collision generator tools are also available since that time. The dangers of the developed collision attacks were demonstrated by academia and white-hat hackers too, but in case of the Flame malware we’ve also seen malicious parties exploiting the weaknesses in the wild.

And while most have already moved away from MD5, there is still a notable group that heavily uses this obsolete algorithm: security vendors. It seems that MD5 became the de-facto standard of fingerprinting malware samples and the industry doesn’t seem to be willing to move away from this practice. Our friend Zoltán Balázs collected a surprisingly long list of security vendors using MD5, including the biggest names of the field.

The list includes for example Kaspersky, the discoverer of Flame who just recently reminded us that MD5 is dead, but just a few weeks earlier released a report including MD5 fingerprints only – ironically even the malware they analysed uses SHA-1 internally…

And in case you think that MD5 “good enough” for malware identification let’s take another example. The following picture shows the management console of a FireEye MAS – take a good look at the MD5 hases, the time delays and the status indicators:

Download sample files

As you can see, binaries submitted for analysis are identified by their MD5 sums and no sandboxed execution is recorded if there is a duplicate (thus the shorter time delay). This means that if I can create two files with the same MD5 sum – one that behaves in a malicious way while the other doesn’t – I can “poison” the database of the product so that it won’t even try to analyze the malicious sample!

After reading the post of Nat McHugh about creating colliding binaries I decided to create a proof-of-concept for this “attack”. Although Nat demonstrated the issue with ELF binaries, the concept is basically the same with Windows (PE) binaries that security products mostly target. The original example works by diverting the program execution flow based on the comparison of two string constants. The collision is achieved by adjusting these constants so that they match in one case, but not in the other.

My goal was to create two binaries with the same MD5 hash; one that executes arbitrary shellcode (wolf) and another that does something completely different (sheep). My implementation is based on the earlier work of Peter Selinger (the PHP script by Nat turned out to be unreliable across platforms…), with some useful additions:

A general template for shellcode hiding and execution;

RC4 encryption of the shellcode so that the real payload only appears in the memory of the wolf but not on the disk or in the memory of the sheep;

Simplified toolchain for Windows, making use of Marc Stevens fastcoll (Peter used a much slower attack, fastcoll reduces collision generation from hours to minutes);

The approach may work with traditional AV software too as many of these also use fingerprinting (not necessarily MD5) to avoid wasting resources on scanning the same files over and over (although the RC4 encryption results in VT 0/57 anyway…). It would be also interesting to see if “threat intelligence” feeds or reputation databases can be poisoned this way.

The code is available on GitHub. Please use it to test the security solutions in your reach and persuade vendors to implement up-to-date algorithms before compiling their next marketing APT report!

For the affected vendors: Stop using MD5 now! Even if you need MD5 as a common denominator, include stronger hashes in your reports, and don’t rely solely on MD5 for fingerprinting!