One of my “favorite” code-gen peculiarities in Visual Studio (VS) 2010 is its inconsistent handling of memcmp. VS 2010 had a really nice optimization that could generate faster and smaller code, but it only applied the optimization if you lucked onto the ‘correct’ ways of calling memcmp – and only if you were optimizing for size.

This bug was reported long ago and I wanted to check and see whether VS 2012 handled this task any better. The results, I am disappointed to say, are decidedly mixed.

The optimization that VS 2010 sometimes applies is this: if you only care whether two buffers are equal – not how they sort relative to each other – then the generated code can cheaply compare four bytes at a time instead of one byte at a time. That is both faster and smaller. See section 1 of the original post for details.
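As a sketch of the difference (my own illustration with hypothetical function names, not the compiler's actual output):

```cpp
#include <cstdint>
#include <cstring>

// Byte-at-a-time equality check -- the shape of the unoptimized code.
bool EqualBytes(const void* p1, const void* p2, size_t n) {
    const unsigned char* a = static_cast<const unsigned char*>(p1);
    const unsigned char* b = static_cast<const unsigned char*>(p2);
    for (size_t i = 0; i < n; ++i)
        if (a[i] != b[i])
            return false;
    return true;
}

// Four-bytes-at-a-time equality check -- the shape of the optimization.
// Assumes n is a multiple of four; memcpy sidesteps alignment/aliasing issues.
bool EqualDwords(const void* p1, const void* p2, size_t n) {
    const char* a = static_cast<const char*>(p1);
    const char* b = static_cast<const char*>(p2);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t x, y;
        memcpy(&x, a + i, 4);
        memcpy(&y, b + i, 4);
        if (x != y)
            return false;
    }
    return true;
}
```

Note that this transformation is only valid because equality testing doesn't care about sort order – comparing DWORDs on a little-endian machine would give the wrong ordering, but equal-versus-unequal comes out the same.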

The past: Visual Studio 2010

In the chart below there are three semantically identical statements that compare two buffers and return true if they are equal. I compiled each statement as a function, optimized for size (/Os) and then optimized for time (/Ot), for VS 2010, and looked at the function sizes:
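The chart itself isn't reproduced here, but the three statements were presumably of this shape – semantically identical ways of asking “are these buffers equal?” (the parameter names and the 16-byte size are my guesses, not the original code):

```cpp
#include <cstring>

// Three semantically identical equality tests over 16-byte buffers.
// Which spelling you choose should not matter -- but in VS 2010 it did.
bool Equal1(const void* p1, const void* p2) { return memcmp(p1, p2, 16) == 0; }
bool Equal2(const void* p1, const void* p2) { return !memcmp(p1, p2, 16); }
bool Equal3(const void* p1, const void* p2) { return memcmp(p1, p2, 16) ? false : true; }
```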

The first two functions were close to perfect when optimized for size, but the third variation missed the optimization. Its code was not only a lot larger, it also ran about four times slower. Oops.

When optimizing for time you would always get suboptimal code. Visual Studio 2010 would generate a whopping 108 bytes of code for all three of these expressions, and most of this code was completely superfluous.

In short, your odds of getting optimal code were pretty poor.

The present: Visual Studio 2012

VS 2012 changes all this. The byte-by-byte comparisons are gone in all six cases. However, a new problem has cropped up – one that threatens to make VS 2012 code slower in many cases. Here is the chart with the VS 2012 results added:

The results when optimizing for size are consistent – but consistently worse!

The optimized for time functions are significantly better but, it turns out, not as good as they should be.

Boolean logic made complicated

All of these functions ultimately return true or false in the al register. That should be a simple matter of loading one or zero into al. A few bytes of code, and no memory accesses. Instead VS 2012 generates variations of this rather verbose code for setting the return value when optimizing for size:

Equal:
    mov  DWORD PTR tv79[ebp], 1
    jmp  BothCases
NotEqual:
    and  DWORD PTR tv79[ebp], 0
BothCases:
    mov  al, BYTE PTR tv79[ebp]

Ouch. The worst of these is the first instruction, which is seven bytes long because the ‘one’ is embedded in the instruction as a 32-bit constant. In total these instructions add up to sixteen bytes, and they read and write memory! At the very least the compiler could trivially simplify the code to this:

Equal:
    mov  al, 1
    jmp  BothCases
NotEqual:
    mov  al, 0
BothCases:

A little setne trickery could probably do even better than this, but the optimized-for-time versions actually do worse, using two instructions totaling ten bytes to set al to zero, and an identically expensive sequence to set it to one. Peculiar.
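To show what the setne/sete idea buys, here is a hand-written sketch (hypothetical code of my own, not VC++ output) where the equality result is accumulated without any branches, so the final boolean can be materialized with a single test/sete rather than stores and loads through a stack slot:

```cpp
#include <cstdint>
#include <cstring>

// Branchless 16-byte equality: OR together the word-wise differences,
// then convert the accumulated value to a bool in one step.
bool Equal16Branchless(const void* p1, const void* p2) {
    const char* a = static_cast<const char*>(p1);
    const char* b = static_cast<const char*>(p2);
    uint32_t acc = 0;
    for (int i = 0; i < 16; i += 4) {
        uint32_t x, y;
        memcpy(&x, a + i, 4);
        memcpy(&y, b + i, 4);
        acc |= x ^ y;  // non-zero iff this DWORD pair differs
    }
    return acc == 0;  // ideally compiles to test/sete -- no jumps, no memory round-trip
}
```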

NotEqual:
    mov  DWORD PTR tv79[ebp], 0
    mov  al, BYTE PTR tv79[ebp]
    pop  esi
    mov  esp, ebp
    pop  ebp
    ret  0
Equal:
    mov  DWORD PTR tv79[ebp], 1
    mov  al, BYTE PTR tv79[ebp]
    pop  esi
    mov  esp, ebp
    pop  ebp
    ret  0

Code size and instruction count are not accurate measures of performance, but additional instructions with no advantages are going to be neutral at best. Unnecessary code bloat, especially when optimizing for size, is clearly a defect, and this bug – which I have only seen when using memcmp – wastes at least ten bytes per function.

To repe or not repe

The other reason the optimized-for-size versions got bigger is that they no longer use “repe cmpsd”. The rep prefix is a very compact way of expressing operations like moving or filling memory. However, my understanding (section 16.10, String instructions) is that using the rep prefix with cmpsd is too slow. Even when optimizing for size the compiler cannot completely ignore performance, so I am sure this decision was intentional and carefully made.

Summary

With VS 2012 the results are at least more consistent. Sort of.

All three versions are 47 bytes when optimized for time.
All three versions are 46 bytes when optimized for size.
All three versions are at least ten bytes larger than they should be.

The time-optimized versions of this comparison all went from 108 bytes to 47 bytes. That’s a great improvement, and it probably helped to mask the fact that these versions should have been about 37 bytes. Whatever template churns out the inline code for memcmp apparently does not cooperate with VS 2012’s code generator.

I tested with g++ and clang++ on Linux but I could not convince either one of them to inline memcmp. It looks like they got burned by the cases where an inlined memcmp could end up slower, and decided to abandon the whole experiment.

Update

I wanted to address a couple of comments that I heard about this post:

Why are there no timing measurements? I previously verified that comparing memory byte-by-byte is four times slower than comparing a DWORD at a time, all else being equal. The main defect remaining is that the code is unnecessarily large. This will probably have no effect on any benchmark, but larger code always increases i-cache miss rates, which are significant on many large projects. Since the bloated code to set the return value has no advantages, it must cost some amount of performance in some cases, and there is no useful way to measure this.

Why am I measuring memcmp given that std::equal is what all the cool kids use? std::equal can end up hitting many of the same issues, and in fact in some cases (on POD types in VC++) it ends up calling memcmp, which gets us back to where we started. The bloated code in VS 2010 with /Ot and VS 2012 with all options is then a real code-size cost even if you don’t call memcmp directly.
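That last point about std::equal can be made concrete. A minimal sketch (my own example, not code from the original measurements): for POD element types the VC++ standard library can dispatch std::equal to memcmp, so both of the functions below can end up generating the same inline comparison code.

```cpp
#include <algorithm>
#include <cstring>

// For POD element types, std::equal may be implemented in terms of memcmp,
// so both of these can hit the exact same code-gen path in the compiler.
bool SameViaStdEqual(const int (&a)[4], const int (&b)[4]) {
    return std::equal(a, a + 4, b);
}
bool SameViaMemcmp(const int (&a)[4], const int (&b)[4]) {
    return memcmp(a, b, sizeof(a)) == 0;
}
```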