The case of the mystery data blocks

Let me tell you the story of the trickiest bug I’ve ever encountered at MokaFive.

We were having a problem where our virtual disk would detect block corruption: a block's signature would not match its contents. And this was not just any corruption. We saw partially encrypted blocks. We saw unencrypted data. We saw executable code and snippets of text files. In short, something really strange was going on.

It didn’t help that this problem was very hard to reproduce. We had thousands of users using the product day in and day out, and we would get a new report every week or so. So we scoured our code for race conditions, added a ton of invariant checking, built and ran every stress test we could think of, read every block back immediately after it was written to verify it, ran scans on affected machines to check for host filesystem corruption, checked and double-checked for use-after-free and resource bugs, and a bunch of other stuff. This went on for a few months.

Eventually, we noticed a trend. First off, this was only happening on Windows 7 machines. Not Windows XP, not Mac. Second, the corruption always happened on 4K boundaries. And finally, after getting a copy of one of these corrupted disks, we found the biggest smoking gun: our tdsk file contained 4K blocks of data from a completely unrelated file on the host, a file that our processes could never have read. Not only that, but we found 4K blocks in the middle of our file that contained NTFS directory listing structures for unrelated directories on the host. So what were they doing in the middle of our file?

Eventually we pulled on that thread, and after a ton of intimate time with ProcMon, we tracked it down to a bug in the Windows 7 kernel. More specifically, it was a race condition between NTFS and the Cache Manager: a write (WriteFile, NtWriteFile, IRP_MJ_WRITE) issued for a file that has a pending flush (FlushBuffersFile, NtFlushBuffersFile, IRP_MJ_FLUSH_BUFFERS) can, in some circumstances, cause arbitrary host OS pages to be written to the target file.
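The triggering pattern boiled down to one thread writing to a file while another thread flushes the same file. Here's a minimal sketch of that call pattern in portable Python terms (this is our illustrative reconstruction, not MokaFive's actual repro, which drove the Win32 APIs directly; on a fixed kernel, or any other OS, it completes without corruption):

```python
import os
import threading

# Hypothetical sketch: one thread writes 4K blocks while another thread
# repeatedly flushes the same file -- the call pattern that triggered
# the Windows 7 NTFS / Cache Manager race.
PATH = "racy.bin"
BLOCK = b"\xab" * 4096          # the corruption always landed on 4K boundaries
NUM_BLOCKS = 1000

def writer(fd, done):
    for _ in range(NUM_BLOCKS):
        os.write(fd, BLOCK)      # analogous to WriteFile / IRP_MJ_WRITE
    done.set()

def flusher(fd, done):
    while not done.is_set():
        os.fsync(fd)             # analogous to FlushBuffersFile / IRP_MJ_FLUSH_BUFFERS

fd = os.open(PATH, os.O_CREAT | os.O_TRUNC | os.O_WRONLY)
done = threading.Event()
threads = [threading.Thread(target=writer, args=(fd, done)),
           threading.Thread(target=flusher, args=(fd, done))]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)

# On a correct kernel, every 4K block reads back exactly as written.
with open(PATH, "rb") as f:
    data = f.read()
assert len(data) == NUM_BLOCKS * 4096
assert all(data[i:i + 4096] == BLOCK for i in range(0, len(data), 4096))
os.remove(PATH)
```

The read-back check at the end is exactly the kind of signature verification that kept failing for us: on an affected Windows 7 system, some of those 4K blocks would come back holding someone else's pages.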

The workaround was relatively easy but unfortunate: we now use a mutex to block writes to the file while a sync operation is in progress. This hurts concurrent performance, but hey, data integrity is more important. We reported the bug to Microsoft and included a short program that easily reproduced the issue. Little did we know that the adventure was just beginning!
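In pseudocode terms, the workaround amounts to putting the write path and the flush path under one lock, so a write can never overlap an in-flight flush. A minimal Python sketch (the class and names are hypothetical illustrations, not our actual driver code):

```python
import os
import threading

class SerializedFile:
    """Hypothetical sketch of the workaround: a mutex guarantees that a
    write can never overlap a pending flush on the same file."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY)
        self._lock = threading.Lock()    # the mutex that costs us concurrency

    def write(self, data):
        with self._lock:                 # writes wait out any in-flight flush
            return os.write(self._fd, data)

    def flush(self):
        with self._lock:                 # flushes wait out any in-flight write
            os.fsync(self._fd)

    def close(self):
        os.close(self._fd)

f = SerializedFile("safe.bin")
f.write(b"\x00" * 4096)
f.flush()
f.close()
os.remove("safe.bin")
```

The cost is that previously independent writers now serialize behind every flush, which is exactly the concurrency hit described above.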

Think about what this bug means: a non-privileged process can cause arbitrary operating system pages to be written to a file that it controls. In our minds, this is a really bad bug. Unprivileged processes can get access to secret data and files on the same system that they normally would not have access to. And even beyond the security considerations, successfully writing data to a file without having it corrupted seems like a pretty fundamental responsibility of an operating system kernel.

However, Microsoft didn’t seem to agree, or at least, didn’t seem to understand the issue. Their responses were shocking.

Priceless comments from the Microsoft rep:

MS rep: Why would anyone do something like that? (simultaneous sync and write). Nobody would/should do that, so it’s not a serious issue.

Our response: We have a good reason for doing it. In any case, you guys are the operating system; you should work no matter how we call the standard APIs, and certainly not let us access the contents of random files! That’s a security issue.

MS rep: It is not a security issue because anyone who has data worth securing will encrypt it or overwrite it with zeroes.

Our response: (speechless) Ummmm, what? That’s news to us and every other software developer out there. You mean to say that programs shouldn’t rely on Windows to enforce file system protections or permissions? Besides, overwriting a file with zeroes won’t necessarily overwrite the data on the disk. What if the file moved due to defrag? Then the old data won’t be overwritten.

MS rep: Sounds like a minor corner case. We will get around to it when we get around to it.

Now, to give Microsoft credit, they did eventually fix this bug, although not directly because of us (apparently some Microsoft Live software hit the same issue, so they decided to actually fix it). It seems to have been a race condition introduced in the run-up to the Windows 7 release, since it did not happen on XP or Vista. We reported this bug in November 2009, and the fix eventually rolled out as part of Windows 7 Service Pack 1, released in February 2011. So it only took them 15 months to get a fix released, and now we don’t need that mutex anymore (at least on systems with SP1 installed). Three months for us to find, one day to implement a workaround, and 15 months to wait for the real fix. Priceless.

And that was the trickiest bug I’ve encountered at MokaFive.