Single-stepping through code in Visual Studio is usually fast, but sometimes it is torturously painfully slow. When I ran into this recently – twice – I used xperf to find the sources of the slowdowns, and then work around them. These slowdowns were seen in VS 2012 – the latest and greatest – as well as in VS 2010.

In this post we:

Use xperf to investigate and fix one cause of slow Visual Studio single stepping

Briefly mention another slowdown investigation and its fix

Recap how to interpret call stacks in xperf’s WPA visualizer

Show how sampling data can sometimes be used to infer call counts

Show how xperf’s usually low overhead distorts the cost of some functions

This post is one in a series of investigative reporting articles, using xperf and other tools to understand and work around problems in other people’s software.

Getting a trace

I always have xperf recording in the background, recording all file, disk, and CPU activity. It runs all day, every day, no exceptions. I have it recording to circular buffers in memory which can be saved to disk at any point. If I encounter a performance problem I can save those buffers to disk, thus giving me the ability to do retroactive profiling of the previous 30-60 seconds. I have the recording attached to a global hot key to make it easy to save the buffers. I also have an optional key logger that puts mouse and keyboard events in the trace, to make it easier to interpret the traces.

Pro-tip: use UIforETW to record to circular buffers in memory.

So, when Visual Studio’s single stepping was taking about a second per step it was trivial to record an xperf trace for later analysis.

Measuring the slowdown

The screen shot below of WPA (Windows Performance Analyzer, part of the xperf toolkit) records me single-stepping in Visual Studio. The diamonds along the top record the keystrokes – you can see that I used the single-step keyboard shortcut three times.

Pro-tip: putting input data in traces makes them easier to interpret – UIforETW does this by default.

The graph at the bottom represents CPU consumption in devenv.exe, clearly associated with each keystroke. The numbers along the bottom represent elapsed time. With this graph we can see that each single-step action takes about 700-750 ms, which is plenty slow enough to be annoying:

But why?

It’s interesting to be able to measure precisely how bad Visual Studio’s single-step performance is, but it’s more satisfying if we can actually do something about it. So let’s dig in and see what we find. Luckily WPA automatically loads symbols from Microsoft’s symbol servers, and this includes symbols for Visual Studio and Windows.

The first thing I did was to zoom in on one of the 750 ms CPU consumption spikes and look at the CPU Usage (Sampled) data. I have xperf configured to record samples, including call stacks, on all running CPUs at a rate of about 1.1 KHz. WPA does a great job of collating these stacks, but it takes a bit of practice to interpret them.

In the screen shot below I’m exploring the CPU sampling data grouped by Process, Thread ID, then collated call Stack. I’ve got the Count and Weight columns to the right of the grouping bar and I’m sorting by Weight. Microsoft doesn’t document these columns, but I have. This isn’t the default layout for the sampling data, but it should be. I’ve saved this as a startup profile so for me it is the default layout. Recommended.

Pro-tip: use WPA’s Save Startup Profile option to get better defaults, or make sure you use the UIforETW startup profiles (available from the Settings dialog).

The Count column is how many samples are summarized on that row. We can see that the whole devenv process got 786 samples and thread 6148 got 663. The [Root] row represents how many samples have call stacks associated with them, and the next row (#5, _RtlUserThreadStart) is the 650 samples that have call stacks that go all the way back to the beginning of the thread. Those are the good samples so we’ll focus on them. The other samples for thread 6148 are in the bottom four rows. I only point them out to show that everything adds up. See this post for more details on interpreting CPU Usage (Sampled) call stacks in WPA.

The Weight column is an estimate of how many ms of CPU time are summarized on that row – the Weight column is useful because its units are always ms, even when you increase the sampling rate. But, always keep in mind that it is an estimate that requires many samples to be accurate.

The 650 samples of interest all share the top five functions on their call stacks and then they diverge – the indentation and the minus sign indicate this. Since the CPU sampling data exclusively records data about when the CPU is executing code it may seem odd that WaitForSingleObject consumes as much time as it does, given that the purpose of this function is to wait. Obviously there is some cost associated with the implementation of WaitForSingleObject, but the reason it costs as much as it does in this case is because when I enable xperf tracing I request call stacks on CSWITCH+READYTHREAD. Obtaining these call stacks increases the cost of WaitForSingleObject – xperf both interferes with the performance that it is measuring, and measures that interference.

Pro-tip: understand how xperf tracing distorts your performance

However, the majority of time is spent in ProcessSyncTask and ProcessAsyncTask. If we drill down into ProcessSyncTask we can see that the majority of that time is spent in GetSourcePosition:

Intrinsic properties of sampled data

The data we have been looking at is based on CPU sampling. One intrinsic property of a sampling based profiler is that it shows how expensive a function or call path is, but it doesn’t say why. A function might be expensive because it is called thousands of times, or because it runs for hundreds of milliseconds, but a sampling profiler cannot tell you call-counts so it is mostly mute on this important distinction.

However, if instead of expanding ProcessSyncTask we just select ProcessAsyncTask we get this interesting pattern:

When the ProcessAsyncTask entry in the call stack is selected WPA helpfully highlights all of the time periods represented by the samples. It looks like there might be a pattern of a couple of samples, and then a gap, and then a couple of samples, and so on. If this is true (and be careful about making these inferences) then we can try to estimate how many times ProcessAsyncTask is called. It works about to about sixty (not all shown above), which happens to be roughly the number of threads in the process being debugged. Together with seeing GetSourcePosition consuming lots of time this leads me to the following conclusion:

The profiling data suggests that Visual Studio is spending 750 ms on single stepping. This includes about sixty calls to ProcessAsyncTask and probably a similar number of calls to ProcessSyncTask. The calls to ProcessSyncTask include calls to GetSourcePosition. Since these appear to be happening once per thread this suggests that Visual Studio is obtaining the source position of each thread.

If we do a web search for “get source position thread visual studio” then a bit of poking around finds pages like this which mention the Show Threads in Source button. This button could also be described as the slow-down-debugging button, if you have a lot of threads.

Pro-tip: uncheck Show Threads in Source – it isn’t worth it

Here is a a WPA screenshot from the same debugging session but with Show Threads in Source unchecked – the scale is the same as in the first screenshot. CPU usage is now moderate, and I can single-step as fast as I want:

Better call count inferences

My original data only had a couple of samples per call to ProcessAsyncTask. This means that the sampling data could easily miss some calls completely, or miss other code executing in-between calls. This means that it’s hard to distinguish between a real pattern of repeated calls, and sampling artifacts. In order to increase the confidence in my analysis I recorded a fresh trace with an 8 KHz sampling rate. In the screen shot below you can see that we have hundreds of samples and a clear pattern of the ProcessAsyncTask samples being collected into three groups. Our conclusion that this represents three calls to ProcessAsyncTask seems reasonably justified now.

Pro-tip: click Fast sampling in UIforETW to set an 8 KHz sampling rate

Single step slowdown second cause

While writing this up I ended up debugging some C# code and I hit another crazy slowdown – 2.5 s per single-step. The Show Threads in Source button wasn’t checked so I grabbed another trace and started guessing. It turns out that having the disassembly window open was the problem. I always have the disassembly window open for native code debugging but apparently it can be painfully expensive for managed code.

Pro-tip: close the disassembly window when debugging managed code.