How is Thread.SpinWait actually implemented?

I’m always drawn into disassembling stuff and learning how something works under the hood. The Thread.SpinWait is something I’m going to explore. Because .NET Core is open source I can attack this from side of both sources as well as pure disassembly.

Sources

Let’s start simply from sources. Following where the Thread.SpinWait goes, I eventually ended up in internal (yes, internal ) class Thread , that derives from RuntimeThread , where the SpinWait method is. This method calls SpinWaitInternal , that is conveniently right above. That’s where the C# code ends and we need to go lower (in this case the “VM”).

The implementation is in comsynchronizable.cpp file, using the FCIMPL1 macro (which I think is an abbreviation for ” fastcall function implementation with one argument”). It simply checks what the number of iterations is. If it’s over 100000 the preemptive mode is used to avoid stalling a GC, else the code stays in cooperative mode. In both cases the YieldProcessorNormalized is called passing result from YieldProcessorNormalizationInfo .

The YieldProcessorNormalized method calls YieldProcessor number of times based on YieldProcessorNormalizationInfo.yieldsPerNormalizedYield . The YieldProcessor is again a macro, defined in gcenv.base.h (together with MemoryBarrier ). Looking at it shows that the implementation differs based on platform. For example on AMD64 using Visual C++ it uses _mm_pause intrinsic. This eventually puts pause instruction into the resulting code. For x86 it simply uses rep nop . The important part of that file is included at the bottom as a reference.

Looks like I have the implementation. On platforms where I’m running my code most often, it’s simply pause instruction.

Disassembly

All the above is nice, but what if I’ve made some mistake? I should be able to see the result in pure disassembly, right?

I compiled a simple .NET Framework (non-Core) console application with full optimizations enabled and loaded it into WinDbg. Using the Disassembly and F11 went deeper and deeper into the code. Eventually I ended in this piece of code for 32bit.

7217f56c 8bf1 mov esi,ecx 7217f56e 8975e4 mov dword ptr [ebp-1Ch],esi 7217f571 bf60f51772 mov edi,offset clr!ThreadNative::SpinWait (7217f560) 7217f576 897de0 mov dword ptr [ebp-20h],edi ss:002b:00f3ee58=00000000 7217f579 81fe40420f00 cmp esi,0F4240h 7217f57f 0f8f26492100 jg clr!ThreadNative::SpinWait+0x35 (72393eab) 7217f585 85f6 test esi,esi 7217f587 7e07 jle clr!ThreadNative::SpinWait+0x123 (7217f590) 7217f589 f390 pause 7217f58b 83ee01 sub esi,1 7217f58e 75f9 jne clr!ThreadNative::SpinWait+0x29 (7217f589)

The pause instruction is nicely there and the esi register is used to count (down) the iterations.

For 64bit, the code obviously still uses pause , but the looping is done slightly differently.

00007ffe`b5a2b556 33db xor ebx,ebx 00007ffe`b5a2b558 81f940420f00 cmp ecx,0F4240h 00007ffe`b5a2b55e 7f0e jg clr!ThreadNative::SpinWait+0x4e (00007ffe`b5a2b56e) 00007ffe`b5a2b560 3bd9 cmp ebx,ecx 00007ffe`b5a2b562 0f8dc5010000 jge clr!ThreadNative::SpinWait+0x20d (00007ffe`b5a2b72d) 00007ffe`b5a2b568 f390 pause 00007ffe`b5a2b56a ffc3 inc ebx 00007ffe`b5a2b56c ebf2 jmp clr!ThreadNative::SpinWait+0x40 (00007ffe`b5a2b560)

The ebx ( rbx ) register is incremented and compared with ecx ( rcx ) where the total number of interations is stored.

The decision for cooperative or preemptive mode is visible in both with cmp with 0F4240h value.

Summary

True, for day-to-day programming in .NET one does not need to know this, heck one does not need Thread.SpinWait at all, and I know it. So what’s the reason for all this? I like such disassembling (pun intended). It keeps my brain occupied and sometimes stretches my abilities, thus I’m learning new stuff.

Appendix

YieldProcessor macro etc. in gcenv.base.h