A few weeks ago, some coworkers were complaining about the relative performance of Mercurial cloning on Windows. I investigated on my brand new i7-6700K Windows 10 desktop machine and sure enough they were correct: cloning times on Windows were several minutes slower than Linux on the same hardware. What gives?

I performed a few clones with Python under a profiler. It pointed to a potential slowdown in file I/O. I wanted more details so I fired up Sysinternals Process Monitor (strace for Windows) and captured data for a clone.

As I was looking at the raw system calls related to I/O, something immediately popped out: CloseFile() operations were frequently taking 1-5 milliseconds whereas other operations like opening, reading, and writing files only took 1-5 microseconds. That's a 1000x difference!

I wrote a custom Python script to analyze an export of Process Monitor's data. Sure enough, it said we were spending hundreds of seconds in CloseFile() operations (it was being called a few hundred thousand times). I posted the findings to some mailing lists. Follow-ups on Mozilla's dev-platform list pointed me to an old MSDN blog post that documents behavior similar to what I was seeing.
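
For illustration, here is a minimal sketch of that kind of analysis. It is not the script I actually used. It assumes the Process Monitor capture was exported as CSV with an "Operation" column and a "Duration" column holding seconds (the Duration column has to be enabled in Process Monitor before exporting); the function name and sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def summarize_durations(csv_text):
    """Return {operation: (count, total_seconds)} for a CSV export.

    Assumes "Operation" and "Duration" columns, with Duration in seconds.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["Operation"]]
        entry[0] += 1
        entry[1] += float(row["Duration"])
    return {op: (count, total) for op, (count, total) in totals.items()}


# Made-up sample rows in the assumed export layout.
SAMPLE = """Operation,Path,Duration
CreateFile,data/a.i,0.0000030
WriteFile,data/a.i,0.0000020
CloseFile,data/a.i,0.0020000
CreateFile,data/b.i,0.0000040
CloseFile,data/b.i,0.0030000
"""

summary = summarize_durations(SAMPLE)
# Print operations ordered by total time spent, slowest first.
for op, (count, total) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
    print("%-12s calls=%d total=%.6fs" % (op, count, total))
```

Sorting by total time rather than call count is what makes an outlier like CloseFile() stand out.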

Long story short, closing file handles that have been appended to is slow on Windows. This is apparently due to an implementation detail of NTFS. Writing to a file in place is fine and only takes microseconds for the open, write, and close. But if you append to a file, closing the associated file handle is going to take a few milliseconds. Even if you are using Overlapped I/O (async I/O on Windows), the CloseHandle() call to close the file handle blocks the calling thread! Seriously.
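
If you want to check this on your own machine, a rough micro-benchmark along these lines should do. It is only a sketch: the function names, file count, and payload size are arbitrary choices, and the numbers will vary with the filesystem, antivirus software, and hardware. It flushes Python's buffer before starting the timer so that only the close itself is measured.

```python
import os
import tempfile
import time


def time_closes(directory, count=200, append=True):
    """Return total seconds spent in close() across `count` files."""
    payload = b"x" * 4096
    total = 0.0
    for i in range(count):
        path = os.path.join(directory, "f%d.bin" % i)
        if not append:
            # Pre-size the file so the timed handle overwrites existing bytes.
            with open(path, "wb") as fh:
                fh.write(payload)
        fh = open(path, "ab" if append else "r+b")
        fh.write(payload)
        fh.flush()  # push buffered data to the OS before timing the close
        start = time.perf_counter()
        fh.close()
        total += time.perf_counter() - start
    return total


def compare(count=200):
    """Time closes after appending versus after writing in place."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        return {
            "append": time_closes(a, count, append=True),
            "in_place": time_closes(b, count, append=False),
        }


for name, seconds in compare().items():
    print("%-8s %.6fs total in close()" % (name, seconds))
```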

This behavior is in stark contrast to Linux and OS X, where system I/O functions take microseconds (assuming your I/O subsystem can keep up).

There are two ways to work around this issue:

1. Reduce the number of file closing operations on appended files.
2. Use multiple threads for I/O on Windows.

Armed with this knowledge, I dug into the guts of Mercurial and proceeded to write a number of patches that drastically reduced the number of file I/O system calls during clone and pull operations. While I intend to write a blog post with the full details, cloning the Firefox repository with Mercurial 3.6 on Windows is now several minutes faster. Pretty much all of this is due to reducing the number of file close operations by aggressively reusing file handles.
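
To give a flavor of what reusing file handles means, here is a hypothetical sketch. It is not how the Mercurial patches are implemented. The AppendHandleCache class and its max_open limit are invented for illustration: the idea is simply to keep append handles open across writes and only pay for a close when a handle is evicted or at the very end.

```python
from collections import OrderedDict


class AppendHandleCache:
    """Keep append-mode handles open and close them lazily (illustrative only)."""

    def __init__(self, max_open=64):
        self._max_open = max_open
        self._handles = OrderedDict()  # path -> open file, least recently used first
        self.closes = 0  # number of close() calls actually issued

    def append(self, path, data):
        fh = self._handles.pop(path, None)
        if fh is None:
            fh = open(path, "ab")
        self._handles[path] = fh  # re-insert as most recently used
        fh.write(data)
        if len(self._handles) > self._max_open:
            # Too many open handles: close the least recently used one.
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
            self.closes += 1

    def close_all(self):
        while self._handles:
            _, fh = self._handles.popitem()
            fh.close()
            self.closes += 1
```

Appending three chunks to each of two files through this cache issues two closes instead of six. The max_open limit is there because the process can only hold so many handles open at once.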

I also experimented with moving file close operations to a separate thread on Windows. While this change didn't make it into Mercurial 3.6, the results were very promising. Even on Python (whose threads can't execute Python code in parallel due to the GIL), moving file closing to a background thread freed up the main thread to do the CPU-heavy work of processing data. This made clones several minutes faster. (Python does release the GIL when performing an I/O system call.) Furthermore, simply creating a dedicated thread for closing file handles made Mercurial faster than 7-zip at writing tens of thousands of files from an uncompressed tar archive. (I'm not going to post the time for tar on Windows because it is embarrassing.) That's a Python process on Windows faster than a native executable that is lauded for its speed (7-zip). Just by offloading file closing to a single separate thread. Crazy.
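
The idea can be sketched in a few lines. This is a simplified illustration, not the code from my Mercurial experiment; BackgroundCloser and write_files are hypothetical names. The main thread hands finished file objects to a queue, and a single worker thread calls close() on them.

```python
import os
import queue
import threading


class BackgroundCloser:
    """Close file objects on one worker thread (illustrative helper)."""

    def __init__(self):
        self._queue = queue.Queue()
        self._errors = []
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        while True:
            fh = self._queue.get()
            if fh is None:  # sentinel: no more handles will arrive
                return
            try:
                # close() also flushes any buffered data, so that write
                # happens on this thread too.
                fh.close()
            except OSError as exc:
                self._errors.append(exc)

    def close(self, fh):
        """Queue a file object for closing; returns immediately."""
        self._queue.put(fh)

    def finish(self):
        """Wait for all queued closes and re-raise the first failure."""
        self._queue.put(None)
        self._thread.join()
        if self._errors:
            raise self._errors[0]


def write_files(directory, count):
    """Write `count` small files, deferring every close to the worker."""
    closer = BackgroundCloser()
    handles = []
    for i in range(count):
        fh = open(os.path.join(directory, "%d.txt" % i), "wb")
        fh.write(b"data")
        closer.close(fh)
        handles.append(fh)
    closer.finish()
    return handles
```

Note that the queue here is unbounded. If the writer gets far ahead of the closer, the process can run into the limit on open handles, so a real implementation would probably want to cap the queue size.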

I can optimize file closing in Mercurial all I want. However, Mercurial's storage model relies on several files. For the Firefox repository, we have to write ~225,000 files during clone. Assuming 1ms per file close (which is generous), that's 225s (or 3:45) wall time performing file closes. That's not going to scale. I've already started experimenting with alternative storage modes that initially use 1-6 files. This should enable Mercurial clones to run at over 100 MB/s (yes, Python and Windows can do I/O that quickly if you are smart about things).
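
The back-of-the-envelope math above, spelled out (close_overhead is just an illustrative helper):

```python
def close_overhead(file_count, seconds_per_close):
    """Return (total_seconds, "M:SS") spent closing `file_count` files."""
    total = file_count * seconds_per_close
    minutes, seconds = divmod(round(total), 60)
    return total, "%d:%02d" % (minutes, seconds)


# ~225,000 files at an assumed 1 ms per close.
total, formatted = close_overhead(225_000, 0.001)
print("%.0fs (%s)" % (total, formatted))
```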

My primary takeaway is that creating/appending to thousands of files is slow on Windows and should be addressed at the architecture level by not requiring thousands of files and at the implementation level by minimizing the number of file close operations after write. If you absolutely must create/append to thousands of files, use multiple threads for at least closing file handles.

My secondary takeaway is that Sysinternals Process Monitor is amazing. I used it against Firefox and immediately found performance concerns. It can be extremely eye-opening to see how your higher-level code is translated into function calls into your operating system and where the performance hot spots are or aren't at the OS level.