Poor Visual Studio code generation for reading from and writing to member variables

I spend quite a lot of time trying to write fast code, but it is something of an uphill battle with Microsoft's Visual Studio C++ compiler.

Here is an example of how MSVC struggles with member variable reads and writes in some cases.

Consider the following code. In the method writeToArray(), we write the value 'v' into the array pointed to by the member variable 'a'. The length of the array is stored in the member variable 'N'.

class CodeGenTestClass
{
public:
    CodeGenTestClass()
    {
        N = 1000000;
        a = new TestPair[N];
        v = 3;
    }

    __declspec(noinline) void writeToArray()
    {
        for(int i=0; i<N; ++i)
            a[i].first = v;
    }

    __declspec(noinline) void writeToArrayWithLocalVars()
    {
        TestPair* a_ = a; // Load into local var
        int v_ = v; // Load into local var
        const int N_ = N; // Load into local var
        for(int i=0; i<N_; ++i)
            a_[i].first = v_;
    }

    TestPair* a;
    int v;
    int N;
};

Compiler is Visual Studio 2015, x64 target with /O2.

The disassembly for writeToArray() looks like this:

---------------------------------------------------------------------------
        for(int i=0; i<N; ++i)
0000000140055350  xor         r8d,r8d
0000000140055353  cmp         dword ptr [rcx+0Ch],r8d
0000000140055357  jle         js::CodeGenTestClass::writeToArray+28h (0140055378h)
0000000140055359  mov         r9d,r8d
000000014005535C  nop         dword ptr [rax]
            a[i].first = v;
0000000140055360  mov         rdx,qword ptr [rcx]        // Load this->a
0000000140055363  lea         r9,[r9+8]
0000000140055367  mov         eax,dword ptr [rcx+8]      // Load this->v into eax
000000014005536A  inc         r8d
000000014005536D  mov         dword ptr [r9+rdx-8],eax   // Store value in eax into array
0000000140055372  cmp         r8d,dword ptr [rcx+0Ch]    // Load this->N and compare with loop index.
0000000140055376  jl          js::CodeGenTestClass::writeToArray+10h (0140055360h)
    }
0000000140055378  ret
---------------------------------------------------------------------------

The inner loop runs from address 0000000140055360 to the jl at 0000000140055376. I have added some comments.

rcx here holds the 'this' pointer, so [rcx], [rcx+8] and [rcx+0Ch] are 'a', 'v' and 'N' respectively. Inside the loop, all three are reloaded from memory on every iteration, which is wasteful.

Let's compare with the disassembly for writeToArrayWithLocalVars():

---------------------------------------------------------------------------
        TestPair* a_ = a; // Load into local var
        int v_ = v; // Load into local var
        const int N_ = N; // Load into local var
        for(int i=0; i<N_; ++i)
0000000140054AE0  movsxd      rdx,dword ptr [rcx+0Ch]
0000000140054AE4  xor         eax,eax
0000000140054AE6  mov         r8,qword ptr [rcx]
0000000140054AE9  mov         r9d,dword ptr [rcx+8]
0000000140054AED  test        rdx,rdx
0000000140054AF0  jle         js::CodeGenTestClass::writeToArrayWithLocalVars+1Eh (0140054AFEh)
            a_[i].first = v_;
0000000140054AF2  mov         dword ptr [r8+rax*8],r9d   // Store value 'v' (in r9d register) into the array
0000000140054AF6  inc         rax                        // increment loop index
0000000140054AF9  cmp         rax,rdx                    // Compare loop index with N
0000000140054AFC  jl          js::CodeGenTestClass::writeToArrayWithLocalVars+12h (0140054AF2h)  // branch
    }
0000000140054AFE  ret
---------------------------------------------------------------------------

Here the inner loop runs from address 0000000140054AF2 to the jl at 0000000140054AFC. Again I have added some comments.

As you can see, the member variables are not repeatedly loaded in the inner loop, but are instead kept in registers. This is much better, and executes roughly 1.4 times faster:

test_class.writeToArray():              0.000541 s (1.84977 B writes/sec)
test_class.writeToArrayWithLocalVars(): 0.000380 s (2.63310 B writes/sec)
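The timing harness itself is not shown above, so here is a minimal sketch of how numbers like these could be reproduced. Several things in it are assumptions rather than the original test code: TestPair is guessed to be a pair of two ints (which matches the 8-byte stride in the disassembly), the NOINLINE macro stands in for __declspec(noinline) so the code also builds with GCC and Clang, the destructor is added only so the sketch does not leak, and timeSeconds() and runBenchmark() are hypothetical helper names.

```cpp
#include <chrono>
#include <cstdio>

// Portable stand-in for __declspec(noinline) (assumption, not in the original).
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Assumed layout: the original TestPair definition is not shown.
struct TestPair { int first; int second; };

class CodeGenTestClass
{
public:
    CodeGenTestClass() { N = 1000000; a = new TestPair[N]; v = 3; }
    ~CodeGenTestClass() { delete[] a; } // added so the sketch does not leak
    CodeGenTestClass(const CodeGenTestClass&) = delete;
    CodeGenTestClass& operator=(const CodeGenTestClass&) = delete;

    NOINLINE void writeToArray()
    {
        for(int i=0; i<N; ++i)
            a[i].first = v;
    }

    NOINLINE void writeToArrayWithLocalVars()
    {
        TestPair* a_ = a; // Load into local var
        int v_ = v;       // Load into local var
        const int N_ = N; // Load into local var
        for(int i=0; i<N_; ++i)
            a_[i].first = v_;
    }

    TestPair* a;
    int v;
    int N;
};

// Hypothetical helper: runs f once and returns the elapsed wall-clock seconds.
template <class F>
double timeSeconds(F f)
{
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical driver: times each method once and prints billions of writes/sec.
void runBenchmark()
{
    CodeGenTestClass test_class;
    const double t1 = timeSeconds([&] { test_class.writeToArray(); });
    const double t2 = timeSeconds([&] { test_class.writeToArrayWithLocalVars(); });
    std::printf("test_class.writeToArray():              %f s (%f B writes/sec)\n",
                t1, test_class.N / t1 * 1.0e-9);
    std::printf("test_class.writeToArrayWithLocalVars(): %f s (%f B writes/sec)\n",
                t2, test_class.N / t2 * 1.0e-9);
}
```

A single run of a sub-millisecond loop is noisy, so in practice you would probably want to repeat each call several times and take the minimum.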

Needless to say, Clang gets this right. Here's the inner loop for writeToArray() (see https://godbolt.org/g/juzpfV):

---------------------------------------------------------------------------
.LBB1_4:                                # =>This Inner Loop Header: Depth=1
        mov     dword ptr [rdx + 8*rsi], ecx
        mov     dword ptr [rdx + 8*rsi + 8], ecx
        mov     dword ptr [rdx + 8*rsi + 16], ecx
        mov     dword ptr [rdx + 8*rsi + 24], ecx
        mov     dword ptr [rdx + 8*rsi + 32], ecx
        mov     dword ptr [rdx + 8*rsi + 40], ecx
        mov     dword ptr [rdx + 8*rsi + 48], ecx
        mov     dword ptr [rdx + 8*rsi + 56], ecx
        add     rsi, 8
        cmp     rsi, r8
        jl      .LBB1_4
---------------------------------------------------------------------------

Why this happens

It's hard to say for sure without seeing the source code for MSVC. But I think it's probably a failure of alias analysis.

Basically, a C++ compiler has to assume the worst. In particular, it must assume that any pointer can be pointing at anything else in the program's memory space, unless it can prove that this is not possible under the rules of the language (e.g. because it would be undefined behaviour).

In this particular case, we have two pointers in play - the 'this' pointer, and the 'a' pointer, and since we have a write through the 'a' pointer, it looks like MSVC is unable to determine that 'a' does not point to 'v', or 'this', or 'N'.
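To make the worst case concrete, here is a small sketch of a situation where a pointer member really does point at another member of the same object. It is not the code from this post: Filler and aliasingDemo() are hypothetical names, and the array element type is plain int so that the aliasing is legal C++. In a case like this the compiler genuinely has to reload N on every iteration, because the store changes it.

```cpp
// Hypothetical class: same shape as the example in this post,
// but with an int* array so that the store may legally alias the int members.
struct Filler
{
    int* a; // destination array
    int v;  // value to write
    int N;  // number of elements to write

    void fill()
    {
        for(int i=0; i<N; ++i)
            a[i] = v; // an int store may legally modify this->v or this->N
    }
};

// Returns the final value of N after filling through an aliasing pointer.
int aliasingDemo()
{
    Filler f;
    f.v = 0;
    f.N = 1000;
    f.a = &f.N; // 'a' deliberately points at the member N
    f.fill();   // a[0] = v overwrites N with 0, so the loop stops after one iteration
    return f.N;
}
```

If the compiler had hoisted the load of N out of the loop here, it would keep writing past the single int that 'a' points at. This is the situation a compiler has to rule out before it can keep member values in registers.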

To be able to prove that a write through 'a' does not overwrite the value in 'N' or 'v', MSVC needs to be able to do what is called alias analysis. I believe in this case it would be best done with type-based alias analysis (TBAA).

In C++ it is undefined behaviour to write through a pointer with one type (in this case TestPair*) and read the same memory through a pointer with another, unrelated type (CodeGenTestClass* for the 'this' pointer?). Therefore the write through this->a cannot store a value that is later read from this->v or this->N. Unfortunately MSVC's TBAA is either absent or not strong enough to work this out.

(I may be wrong about the analysis pass required here. Compiler experts, please feel free to correct me!)

Moving the values into local variables, as in writeToArrayWithLocalVars(), allows the compiler to determine that they are not aliased. (It can determine this quite simply by noting that the addresses of the local variables are never taken, so no pointer can point at them.) This allows the values to be kept in registers.

One thing to note is that MSVC can do the aliasing analysis and produce a fast loop when the 'a' array is of a simple type such as int instead of TestPair. (Edit: Actually this is not the case; MSVC fails in this case also.)

Impact

This kind of code is pretty common in C++. I extracted this particular example from some hash table code I was writing.

You will see this kind of problem with MSVC whenever a loop reads member variables while also writing through a member pointer (depending on the exact types, etc.). So I would say this is a pretty serious performance/codegen problem for MSVC.

Edit: Comment thread on reddit. In this comment Gratian Lup clarifies that MSVC does not do TBAA.