I'd be interested in hearing if people have good solutions for this. Runtime CPU detection is mandatory for any vectorized code I write since I target a pretty wide range of x86 CPUs, and this would require some unacceptable contortions in source code organization. I'd like to raise the portability of my source code even though I don't plan on using anything except Visual C++ in the near future, but issues like this are a bit more than I'd like to take on. Without a solution for issues like this I can only hope that Clang will turn out more reasonable than GCC has historically been.

...the version of GCC I tried, 4.6.2, produces an SSE2 CVTTSS2SI instruction in code that has no explicit SSE2 usage. This is great if you're trying to build an entire executable that requires SSE2. It's not so great if you are trying to build a module that does dynamic dispatch to multiple paths based on CPU runtime detection. Apparently, the recommendation is to split your source code into multiple files and compile each of them with different settings, which is lame. First, I hit this in a CPU detection routine, so I'd rather not take a one-page function and split it across three files. Second, at least with Visual C++, doing that is a recipe for getting nasty bugs silently introduced into your program. The problem is that inlined functions and template methods can be compiled with different compile settings from different modules, and when the linker merges them it can choose a version that uses instructions not valid on all calling paths. This would be fine if the compiler and linker would work together to segregate the code paths, but as far as I know only Intel C++ does that and only for non-intrinsics based code.

This blows the compile if SSE2 instructions aren't enabled in the build (-msse2). The problem with enabling that flag is that it apparently also gives the compiler license to use SSE2 instructions in any code, not just code using SSE2 intrinsics. For instance, if you have this simple function in one of your modules:
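The snippet itself didn't survive in this copy of the post; as a reconstruction of the kind of function being described (the name truncate_to_int is mine, not the post's), it is just a plain float-to-int cast with no intrinsics anywhere in sight:

```cpp
// No SSE2 intrinsics anywhere -- yet when the module is built with
// -msse2, GCC is free to compile this cast as CVTTSS2SI instead of
// using x87 code.
int truncate_to_int(float f) {
    return (int)f;
}
```

A cast like this is exactly where the CVTTSS2SI mentioned above can show up, even though the source never touches an intrinsic.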

At first, it seemed to be going pretty well, since I had already done a pass with the Clang static analyzer and had already cleaned up some of the C++ transgressions that VS2010 had allowed through. A sticky point was the definition of CRITICAL_SECTION in the MinGW Win32 headers, since for some reason the MinGW headers define separate _CRITICAL_SECTION and _RTL_CRITICAL_SECTION types instead of typedef'ing one to another like the official headers do. This breaks code that manually forward declares InitializeCriticalSection() so as to avoid bringing windows.h into portable code.

Comments

I think the gcc solution to this was to add the "target" function attribute in 4.4, which lets you enable/disable -msse* on a per-function basis - http://gcc.gnu.org/onlinedocs/gcc/Functi.. gcc has historically refused patches that would allow you to compile intrinsics for optional instruction sets unless you also allow the compiler to use said optional extensions wherever it pleases (see -faltivec vs. -maltivec)

ducky - 28 12 11 - 18:28

That would be great if they hadn't put the blasted #error in the header file. That means the only way you could leverage the target attribute for this problem would be to set the highest target level on the command line and immediately drop it to the floor in the .cpp file. That's lame.

Phaeron - 28 12 11 - 18:32

The only meaningful thing about a file is that it is a container for the concept of a TU. You seem to have applied some other meaning to what goes in/out of each file; otherwise you'd have no problem placing your different paths into different files. Is the problem to do with your text editor giving meaning to filenames that the compiler does not acknowledge as meaningful? Perhaps you need to reassess how you view the concept of a file as a container. What's the alternative -- the compiler is supposed to guess that (float)myInt; is supposed to use the x86 instruction set and not SSE? Why would you leave anything up to a guess when you can specify it? If you don't want to specify -msse2, then you leave it up to the compiler to guess the platform.

boris. - 28 12 11 - 21:51

I'm well aware of the concept of a translation unit, and it's precisely that boundary that I don't want involved here. What I want is for the compiler to use only the base x86 instruction set and still allow me to use intrinsics. It doesn't allow me to do that. I have to enable both SSE2 intrinsics and SSE2 code generation at the same time. That would be fine if I could control it on a per-code-path or per-function basis, but because of the #error in the header the only choice I have is per translation unit. That leads to the following problems:

- It's a mess from a code organization standpoint. If I have to segregate code, it should be at function level and not translation unit level.
- If I wanted to use precompiled headers, I'd have to have a different one for each platform level, even though I don't want anything pulled in from those headers to be affected.
- It's ambiguous what happens to inlines or templates. If std::min(float, float) is compiled from both x86 and SSE2 modules, which version should the linker pick? If it doesn't merge them, it's very wasteful (and, I think, potentially a standard violation).
- How is this supposed to work with link-time optimization?

Basically, as far as I'm concerned, forcing the programmer to switch compile settings for different parts of a project is broken. The One Definition Rule essentially says that this causes inherent and unavoidable conflicts during the compilation process.

Phaeron - 29 12 11 - 08:58

Did you try -mfpmath=387 ?

anne onymous - 29 12 11 - 22:12

Just tried it -- doesn't work. Even if it did fix the cast, it probably wouldn't help with other cases where the compiler decides to use SSE2 insns, such as integer math. I wasn't able to trip a case with SSE2 insns, but I did manage to get conditional moves to show up, which would require at least a PPro (686).

Phaeron - 30 12 11 - 08:26

I wrote about the same frustration when I began using GCC (see link on my name). I have found there is no good solution. What I did to work around it is to split the source into separate files, and only use -msse2 for those files. Of course, if you want it to work on non-SSE2 platforms and older compiler versions, you also have to test whether the compiler is capable of compiling with SSE2, and then have proper tests and a fallback if not.

Klaus Post (link) - 03 01 12 - 02:24

So compiling VirtualDub in MinGW, even MinGW-w64, is a no-go? Damn, and I was happy that I was able to compile a ton of other stuff. Especially SDL.

diki - 03 01 12 - 06:35

Phaeron, sorry for the slight off-topic, but I just discovered an atrocious bug in the MSVC compiler (cl.exe version 16.00.40219.01) which you should be aware of. If you have:

_mm_lfence();
v.QuadPart = __rdtsc();

and you compile with /O2, the compiler will change the instruction order (i.e. emit LFENCE _after_ RDTSC)! I know that you are testing new versions of Visual Studio and that you report your fair share of bugs to Microsoft, so I would really appreciate knowing whether this optimizer bug has been fixed or they don't know about it yet.

Igor Levicki (link) - 04 01 12 - 02:42

> So compiling VirtualDub in MinGW, even MinGW-w64, is a no-go? Damn, and I was happy that I was able to compile a ton of other stuff. Especially SDL.

Well, while this is a pretty annoying roadblock for MinGW compilation, it turns out there's another one: the Win32 headers are a bit screwed up and out of date. Some of the definitions are split between winnt.h and winbase.h when they shouldn't be, and more importantly, a whole bunch of definitions for APIs added since Vista are missing. The Direct3D 10.1+ headers also don't compile properly, and apparently the toolchain doesn't support __declspec(selectany), so it's only going to get worse when I add more Windows 8 support. It's unfortunate, but I have to say it: using the non-native compiler on a platform is asking for it. To make that work you need a compiler vendor that is highly dedicated to tracking the primary compiler, like the Intel compiler group, whereas GCC almost goes out of its way to do otherwise.

SDL... oof. I helped someone debug some issues in the Win32-specific SDL code, and some of the code there... is a bit fragile. Seeing a maximum size of 0x0 from WM_GETMINMAXINFO was not a high point.

@Igor Levicki:
I don't really have inside info on VC++, so your guess is as good as mine. I'd suggest filing it. Maybe adding _ReadWriteBarrier() would help?

Phaeron - 04 01 12 - 15:10

What if you just #define __SSE2__ before including the header? I know it could mess things up pretty badly, but I'd still be curious.

GrayShade - 04 01 12 - 18:01

@Phaeron:
Adding _ReadBarrier(), _WriteBarrier(), or _ReadWriteBarrier() indeed helps, for reasons beyond my grasp. However, _mm_lfence() (the LFENCE instruction) is itself a read barrier, and the compiler should not reorder it!

Regarding compilation of VirtualDub under Windows and Linux, your best bet is to go with the Intel compiler, because its feature set is mostly identical on both platforms, and it is even possible to use inline assembler in Intel syntax under Linux. It also covers some obscure stuff that is specific to Microsoft and GCC.

Igor Levicki (link) - 06 01 12 - 11:58

Hi friend, this is really great work. Request: could you share your email id? I need your guidance to build a codec converter. Thanks, Shree Ammu

Shree Ammu - 18 02 12 - 05:04

I just have to ask: why do you need to run your code on a non-SSE2 system? Everything made in the past decade has SSE2.

toyotabedzrock - 04 03 12 - 14:54

> Everything made in the past decade has SSE2.

This is not true. The Athlon XP 2500+ was released in December 2004 and does not support SSE2.

Phaeron - 04 03 12 - 15:10

I tried clang version 2.8 for a basic SAD function:

1) First of all, it accepts SSE2 intrinsics without complaint, even with a simple command line like "clang++ -c -O3 test.cpp -o test.o". The pmmintrin.h header, however, contained one of the aforementioned error directives.

2) It accepts MSVC-isms like __forceinline without complaint.

3) When I forgot to include pmmintrin.h for the lddqu intrinsic, but had emmintrin.h, I got error messages of the form: "error: use of undeclared identifier '_mm_lddqu_si128'; did you mean '_mm_loadu_si128'?" Same deal when I had the header included, but omitted an option to allow SSE3.

4) Given code like:

__m128i a, b;
a = _mm_load_si128((__m128i*)ptr1);
b = _mm_loadu_si128((__m128i*)ptr2);
b = _mm_sad_epu8(a, b);

clang figured out that it could emit something like:

movdqu xmm0,[ptr2]
psadbw xmm0,[ptr1]

I haven't seen gcc/MSVC do that -- those two compilers always wanted to do an explicit load into a register.

Jeff - 12 03 12 - 05:58

Ok, looking at the headers -- emmintrin.h does have an error directive, just like all the other intrinsic headers. Clang just happened to predefine __SSE2__ (i.e. enable SSE2 by default) on the Core 2 Duo I'd tested it on.

Jeff - 12 03 12 - 06:05

I know you love to support everything, but really, it's time to dump non-SSE2 machines. Lots of programs now require SSE2, and as you have found, the new Visual Studio 11 compiles with SSE2 by default.

I was surprised, when looking into it, how few non-SSE2 systems are still running nowadays (less than 0.5%). I know you have specialised paths for newer architectures, but there's still plenty of common code that would see benefits (albeit minor) from SSE2. Why have your code run less efficiently for the 99.5% majority?

You can always let people without SSE2 know that there are earlier builds of VirtualDub for older architectures which will work for them.

Not trying to flame, just saying maybe it's time to move on :)

roo - 13 04 12 - 23:41

I see what you're saying, but this wouldn't really help as much as you think it would.

First, SSE2 isn't the only instruction set extension I support. Raising the bar to SSE2 would solve that particular issue, but there's still the question of how to conditionally support SSE3, SSSE3, and SSE4.1, and AVX2 will be especially lucrative when it arrives. Raising the minimum requirement to those platforms is not as realistic. The only way to ease that with such an approach is simply to compile different versions of the executable for each supported CPU level -- tempting in some ways, but a bit of a testing and deployment nightmare.

Second, supporting non-SSE2 platforms isn't hard. I don't necessarily optimize for all of the platforms I support, and if you're still running on an original Pentium, you'll get some pretty slow code. The reason for still supporting this at all is that I now make it a point to always have a reference scalar C implementation for any optimized routine, and as long as I have that, it makes sense to use it as the baseline -- not only does it permit support back to the P5, but it also allows the reference code to be exercised in the live environment. Optimizations, however, are targeted at MMX at minimum, and now more usually SSE2 and above.

Third, most code simply isn't in the hot path (the 90/10 rule). The profile is very highly skewed in VirtualDub's case because of the large data sets it works on (frame buffers); this means that global optimizations don't mean as much. The debug build still runs fast because the vast majority of the non-assembly routines don't run that much in comparison. In addition, VirtualDub uses fixed point far more than floating point, and the optimizations that /arch:SSE2 enables in non-FP cases aren't very significant. This means that /arch:SSE2 is not that attractive.

When I used to have a P4 build generated with the Intel compiler with P4 optimizations enabled, I found that a lot of the code the Intel compiler was able to optimize with SSE2... was initialization loops. Not only were these usually not performance critical, but it would actually have been better to compile them optimized for *size* instead of speed. Profile-guided optimization (PGO) does some nice analysis for this, but I've avoided it up to this point because of how much it complicates the build process.