Dropbox recently clarified (via their blog and privacy policy) that they “de-duplicate” user files. This has been known for quite a while, and is obvious to anyone who’s had a large file “upload” instantly. But how exactly does Dropbox store files? Are they really de-duplicated or just single-instanced? I set out to discover the answer.

Single Instance Storage

It’s fairly simple for a system to eliminate duplicate data by storing only a single instance of multiple identical files. In other words, if you and I both upload “Presentation.pptx” and it’s bit-for-bit identical, it would be a simple matter to store just one copy.

Dropbox definitely does this. I proved it with a simple experiment:

Create a new 10 MB encrypted disk image in TrueCrypt (so it’ll be 100% unique, random data) Move it to the Dropbox folder and wait a few minutes as it uploads Copy the file with a new name to the folder and notice that it “uploads” instantly

Dropbox is at least single-instancing storage. This helps users, since it speeds uploads and reduces bandwidth usage. It helps Dropbox in the same way, but goes further since they still “charge” files against your account whether they’re single-instanced or not.

Note that this single-instancing works across users and geographies. I gave a file to a friend to upload to a different Dropbox account, and saw the same “acceleration effect.” This would be quite useful to users and the company for files like iTunes songs which are identical and widespread.

Clashing MD5 Hashes?

A global single-instance storage system sounds great, but it opens the door to hash collision issues. Imagine if you and I both uploaded identical files. Both would have the same “fingerprint” and Dropbox would only store it once. Now imagine instead that, out of coincidence or malice, I uploaded a file with the same fingerprint as yours but different contents. This is not so far-fetched as it seems, and could lead to all sorts of security nightmares.

A common and compromised file checksum method is MD5, so I decided to test how Dropbox handles files of identical size, name, and MD5 hash using the “Nostradamus Attack” PDFs generated by Marc Stevens. My tests show that Dropbox correctly handled the files I tried, and no combination of uploading and naming could force it to incorrectly store the right file. So Dropbox either doesn’t use MD5 or uses a combination of hashing and other mechanisms. Testing other schemes is left as an exercise to the reader.

One more thought: The fact that de-duplication is mentioned in the “privacy” section of the Dropbox policies raises my eyebrows, since it indicates that they see this hash collisions as a matter of privacy rather than data corruption. This indicates that Dropbox is both aware of and susceptible to hash collision attacks generally, though obviously not as simply as creating a bogus MD5 match.

Note: Dropbox is well aware of this issue, having recently squashed an open-source exploit called Dropship!

Sub-File De-Duplication

Data de-duplication is like single-instancing, but it applies to some subset of data. Some enterprise storage systems de-duplicate at multi-megabyte levels, while others are far more granular.

To test whether Dropbox de-duplicates data, I devised a simple experiment:

Create a new local copy of my existing random TrueCrypt file Add a single byte to the end using the “cat” command Copy the resulting file to Dropbox Watch as Dropbox takes just a few seconds to upload the new file

This test proves that Dropbox does indeed de-duplicate at the sub-file level. Since it took a bit longer to upload that would be expected for a single byte, we can see that Dropbox “chunks” files for hashing and uploading.

De-Duplication Granularity

The next question is just what size chunks or blocks Dropbox uses to de-duplicate data. To test this, I created various blocks of random data using TrueCrypt and experimented to see where the “stair-steps” were in terms of upload time.

My tests used four basic building blocks of 512 KB, 1024 KB, 2048 KB, and 4096 KB in size. Guessing that Dropbox used one of these sizes for their chunking system, I assumed these would quickly demonstrate the answer.

First, I uploaded each file individually and watched as Dropbox took about 30 seconds per MB. This will vary greatly, of course, but the absolute performance doesn’t matter. Only relative performance matters for demonstrating chunking.

Next, I concatenated each file with itself to create a new file twice as large. This would be ideally “chunkable” since it consists of exactly identical data with a nice, clean, evenly-divisible “border”. I uploaded each of these and noticed that the “4096 KB x 2” file uploaded nearly instantly, while all others took the expected amount of time.

I repeated this with “x 3”, “x 4”, and “x 8” files and noticed that the 4096 KB (4 MB) “barrier” was very obvious. Whenever a file contained 4096 KB or less of data Dropbox had seen before, it single-instanced it. Any time it saw a unique “block” smaller than this, it uploaded it fresh.

This proves, at least in the case of my own Mac OS X install of Dropbox, that a 4 MB chunk size is used for de-duplication.

Stephen’s Stance

Dropbox is a very useful service, and I appreciate the technology they use to make it work. By single-instancing storage, the company is able to keep costs and transfer time in check and offer a basic service for free for many users. Despite the recent security issue, I continue to use Dropbox myself and would not hesitate to recommend it. But I do suggest using your own encryption for any sensitive data, as demonstrated in my recent post, Mac Users, Secure Your Stuff in Dropbox.

I remain somewhat concerned about the privacy and security implications of global de-duplication of shared random data. If they use SHA-1 hashes alone, which I suspect, there is a chance that an object will not be stored correctly once 2^80 (or perhaps 2^69 or even 2^52) objects are stored. This would lead to issues of data corruption or inadvertent disclosure. This is a very remote chance indeed, but “birthday problems” like this work against hashing systems. I would love to hear from Dropbox regarding how they prevent this from happening, including disclosure of their methods of hashing data. It’s nice to see the company taking responsibility by disclosing this in their privacy policy, though!