It doesn't.

The Conclusion

A couple of days later, I got a mail back from Julian, basically telling me: "Uhm, what should I improve?" It turned out that he ran the tests on his own system and all curves, save for the one based on non-temporal stores, more or less converged. Trying to reproduce Julian's results, I re-ran the benchmarks on my new notebook and got the curves in Fig. 2. So, is GCC 4.7 the silver bullet, case closed?

I decided that I needed more data. The plots for Nehalem (Fig. 3 and 4) are remarkably clear. As Julian suggested, GCC's tree vectorizer is doing an outstanding job. Just for completeness, I decided to return to Core 2 -- just to see how well GCC 4.7 would fare on that machine. Fig. 5 is hardly different from Fig. 1. Terrible, terrible, terrible.

My takeaway message is this: if you run a modern Intel CPU and a newer compiler, you probably don't need to worry so much about vectorization. But since LibGeoDecomp needs to run on various architectures (AMD Bulldozer, IBM BG/Q) and we are sometimes tied to older compilers (GCC 4.6 because of CUDA, anyone?) or even to compilers for which we cannot know how well they vectorize stencils (e.g. Cray or Intel's compilers for MIC), we need a more robust scheme. So again, I'm looking forward to Julian's next results -- possibly next time based on Boost.SIMD.