Earlier this month I wrote about how Windows 10 holds a lock during too much of process destruction, which both serializes this task and causes mouse-cursor hitches and UI hangs (because the same lock is used for these UI tasks).

I thought I’d use this as an excuse to dig slightly deeper into what is going on, using a clunky-but-effective ETW profiling technique. This technique shows that a 48-instruction loop is consuming a huge chunk of the CPU time while the lock is held – the 80/20 rule is alive and well. And, thanks to some discussion on Hacker News I now have an idea of what that function does and why it got so much more expensive in Windows 10 Anniversary Edition.

I started by grabbing a trace of my ProcessCreateTests program (source is here), zooming into the serialized part of process destruction, and then viewing the sampling data grouped by Process Name/Module/Function. This gives me a view of where the samples are actually hitting – almost entirely in win32kbase.sys, especially in HmgNextOwned:

I then added the Address column and sorted by address, giving me a heat map of where in the function the samples are hitting:

This is usually too low-level, but sometimes it’s pretty handy to be able to see which instructions are consuming the most CPU time. Except, WPA won’t show me the actual CPU instructions.

I tried using livekd to disassemble those addresses but it didn’t work – I don’t know why. Probably user error because I am actually not a kernel hacker.

So, I ran dumpbin /disasm win32kbase.sys. For some reason this refused to show symbol names which made finding HmgNextOwned in the output more challenging. But, I just searched for “CA90:” in the output (DLLs are relocated on 64 KB boundaries so the last four hex digits stay consistent) and I quickly found a function that started at such an address (it was the only one) and which had instruction addresses consistent with all of the samples. I then manually lined up the samples with the instructions in a spreadsheet and added commentary.

The annotated disassembly can be found here (the main loop is in bold) and an onlinedisassembler.com version with branch diagrams can be found here.

It’s hard to draw any concrete conclusions because I don’t know what data structures the code is traversing and I don’t care enough to investigate further. It seems clear that the function must be traversing some large arrays because the code appears to spend a lot of time waiting on cache misses:

But the main thing I always realize when using this technique is that modern CPUs are weird and confusing. Because CPUs are massively out-of-order and super-scalar it is not at all clear what it means for a sampling interrupt to hit “on” a particular instruction. If an instruction is particularly expensive then samples are more likely to hit “near there” but I’m not sure where “near there” actually is:

If there are three instructions executing simultaneously when the interrupt fires then which one “gets the blame”?

If a load instruction misses in the cache and forces the CPU to wait then will the samples show up on the load instruction, or on the first use of the data? Both seem to happen.

If a branch is mispredicted then will the samples show up on the branch instruction or on the branch target?

What’s going on with the expensive cmp instruction on line 24 of the spreadsheet?

If anyone has a good model for what happens to the CPU pipelines when a sampling interrupt happens I would appreciate that. Ideally that would explain the relationship between clusters of samples and expensive instructions. I suppose I could write test programs to explore this, but not today. If anyone wants to run the function through one of Intel’s emulators and share the results then that would be great.

I hope that this lock-contention bug (apparently Windows: 12699333) is fixed by the Fall Creators Update of Windows 10 so that I don’t need to reboot my machine so frequently.

In the original trace where I discovered this issue I instead see win32kbase.sys!MultiUserGreCleanupHmgOwnRemoveAllLocks consuming a lot of CPU time – I don’t know what the significance of that difference (more CPUs? a different Windows 10 version?) might be.

There’s some good Hacker News discussion of this post here. In particular, this comment links to this article which explains the Windows 10 Anniversary Edition changes which are probably the root cause of the problem. It sounds like Anniversary Edition changed the management of GDI objects to avoid leaking kernel addresses, and that made HmgNextOwned more expensive, thus triggering the visible serialization of process destruction.

The other hilarious thing about this bug, as far as I can tell, is that all of this iteration over the GDI object table happens even when the processes in question have zero GDI objects. Maybe somebody needs to add an early-out for that case, because right now it looks like a lot of CPU time is spent looking for objects that have in fact never been allocated!

If my test program loads user32.dll (which loads gdi32.dll) then user and GDI objects are created and things seem to get even worse, but they are plenty bad even without doing that. Here is task manager when user32.dll/gdi32.dll are loaded (in the test code just call PeekMessage instead of PeekMessage_f):