As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.

The result was decidedly not what we expected. On average we saw about a 10%–30% reduction in speed, just from the use of async operations. So we decided to test this.

The test is simple: read a large file (1.4 GB) from a network drive without buffering. The synchronous code is:

private static void SyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    long size = 0;
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open,
        FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}

To make things interesting, we are reading 32 MB in 4 KB chunks and computing the SHA1 hash of each chunk. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, each with a different start position.

for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state => { SyncWork(copy); }).Start();
    Thread.Sleep(250);
}

The basic idea was to simulate incoming work doing different things, with both slow I/O and computation. Since 16 threads is more than the number of CPU cores, we’ll get some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).

The output of this code is:

Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243

Now, let us look at the equivalent async code:

private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open,
        FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}

Note that here I’m opening the file with an async handle, to allow for better concurrency. My expectation was that this code would interleave I/O and CPU work, resulting in fewer context switches, better CPU utilization, and faster responses overall.
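The post doesn’t show how the 16 async copies were started; a minimal, hypothetical launcher mirroring the thread loop from the sync test (with `AsyncWork` replaced by a stub so the sketch is self-contained) might look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class AsyncLauncherSketch
{
    // Stub standing in for the AsyncWork(pos) method from the post.
    static async Task AsyncWork(int pos)
    {
        await Task.Delay(10); // placeholder for the real read/hash loop
        Console.WriteLine($"copy {pos} done");
    }

    static void Main()
    {
        var tasks = new List<Task>();
        for (int i = 0; i < 16; i++)
        {
            var copy = i;             // capture the loop variable, as in the sync test
            tasks.Add(AsyncWork(copy));
            Thread.Sleep(250);        // stagger the starts, as in the sync test
        }
        Task.WaitAll(tasks.ToArray());
    }
}
```

Unlike the sync version, this doesn’t dedicate a thread per copy; the awaits release their threads back to the pool while the I/O is in flight.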

Here is the network utilization during the async test:

And here is the network utilization during the sync test:

Trying the async version with 64 KB buffers gives better results:

And output of:

Read 32 mb async 8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129

So it is significantly worse than the sync version when using 4 KB buffers. The bad thing is that when using a 64 KB buffer in the sync version, we have:

And the whole process completed in about 2 seconds.

I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.

Short summary: the solution is to throw all of the async code away in favor of pure sync code, because it is so much faster. Banish async, all hail the synchronous overlord.

However, the plot thickens!

Before declaring death to asynchronicity, with thunderous applause, I decided to look further into things and pulled out my trusty profiler.

Here is the sync version:

As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:

This is interesting, because no I/O actually appears here. At first I thought that this was because we are using async I/O, so all of the missing time (notice that this is just 625 ms) is lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.

Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:

And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I had accidentally opened the file in sync mode, turning the async call into a sync call. That didn’t explain the difference in costs from the sync version, though, and I verified that the calls are actually async.
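As an aside, this is part of why async code is hard to profile: everything after the first await runs as a continuation, typically on some thread-pool thread, so a call-tree profiler won’t attribute that time to the original caller. A minimal, self-contained illustration (the names here are mine, not from the benchmark):

```csharp
using System;
using System.Threading.Tasks;

class ContinuationDemo
{
    // After the first await, the remainder of this method resumes as a
    // scheduled continuation, usually on a different thread-pool thread,
    // so its cost shows up detached from the caller's call tree.
    static async Task Work()
    {
        Console.WriteLine($"before await: thread {Environment.CurrentManagedThreadId}");
        await Task.Delay(10);
        Console.WriteLine($"after await: thread {Environment.CurrentManagedThreadId}");
    }

    static void Main() => Work().Wait();
}
```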

Then I looked deeper:

Why do we have so many seeks?

The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.

Calling SetFilePointer is typically very fast, since the OS just needs to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it needs to communicate with the remote share to update it on its position. The sad thing is that this is all wasted anyway, since the file position isn’t used in async calls; each actual call to ReadFileNative is given the offset to read from directly.
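For completeness: later versions of .NET (6 and up, so well after this benchmark) expose that explicit-offset style of I/O directly through the RandomAccess class, which passes the offset with every read and never touches the file pointer at all. A small sketch under that assumption (the helper name and temp-file setup are mine):

```csharp
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;

class OffsetReadSketch
{
    // Read a chunk at an explicit offset without moving the file pointer.
    // RandomAccess (added in .NET 6) supplies the offset with each read,
    // so no SetFilePointer call is issued at all.
    static byte[] ReadAt(string path, long offset, int count)
    {
        using SafeFileHandle handle = File.OpenHandle(path, FileMode.Open,
            FileAccess.Read, FileShare.Read);
        var buffer = new byte[count];
        int read = RandomAccess.Read(handle, buffer, offset);
        Array.Resize(ref buffer, read);
        return buffer;
    }

    static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 1, 2, 3, 4, 5, 6, 7, 8 });
        var chunk = ReadAt(path, 4, 4);
        Console.WriteLine(string.Join(",", chunk)); // 5,6,7,8
        File.Delete(path);
    }
}
```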