Here's some code that GCC 6 and 7 fail to optimize when using std::array:

    #include <array>

    static constexpr size_t my_elements = 8;

    class Foo {
    public:
    #ifdef C_ARRAY
        typedef double Vec[my_elements] alignas(32);
    #else
        typedef std::array<double, my_elements> Vec alignas(32);
    #endif
        void fun1(const Vec&);
        Vec v1{{}};
    };

    void Foo::fun1(const Vec& __restrict__ v2) {
        for (unsigned i = 0; i < my_elements; ++i) {
            v1[i] += v2[i];
        }
    }

Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:

    vmovapd ymm0, YMMWORD PTR [rdi]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi]
    vmovapd YMMWORD PTR [rdi], ymm0
    vmovapd ymm0, YMMWORD PTR [rdi+32]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi+32]
    vmovapd YMMWORD PTR [rdi+32], ymm0
    vzeroupper

That's basically two unrolled iterations, each adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:

    mov     rax, rdi
    shr     rax, 3
    neg     rax
    and     eax, 3
    je      .L7

The code generated in this case (using std::array instead of a plain C array) seems to check the alignment of the input array at run time, even though the typedef specifies it as aligned to 32 bytes.

It seems that GCC doesn't understand that the contents of a std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
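To make the alignment claim above concrete, here is a minimal sketch that checks it at run time. The names `kCheckElements` and `elements_share_object_alignment` are made up for this illustration, and it uses `alignas` on a variable rather than on a typedef. It relies on the first element sitting at the same address as the array object, which I believe holds for the mainstream standard libraries but is an assumption rather than something the standard spells out.

```cpp
#include <array>
#include <cassert>   // only needed for the usage line shown after this block
#include <cstddef>
#include <cstdint>

// Hypothetical constant for this sketch; same value as my_elements above.
constexpr std::size_t kCheckElements = 8;

// Returns true if an over-aligned std::array starts on a 32-byte boundary
// and its first element lives at the same address as the array object.
bool elements_share_object_alignment() {
    alignas(32) std::array<double, kCheckElements> a{};
    const auto object_addr = reinterpret_cast<std::uintptr_t>(&a);
    const auto data_addr = reinterpret_cast<std::uintptr_t>(a.data());
    return object_addr % 32 == 0 && data_addr == object_addr;
}
```

Calling `assert(elements_share_object_alignment());` should pass if the elements really do inherit the object's alignment, which is the information GCC appears not to use here.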

Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:

    void Foo::fun2(const Vec& __restrict__ v2) {
        typedef double V2 alignas(Foo::Vec);
        const V2* v2a = static_cast<const V2*>(&v2[0]);
        for (unsigned i = 0; i < my_elements; ++i) {
            v1[i] += v2a[i];
        }
    }
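Another way to express the same hint might be the `__builtin_assume_aligned` extension that GCC and Clang provide. The sketch below is a free-function variant, not the original `Foo` member; `AlignedVec`, `n_elements` and `add_assume_aligned` are names invented for this example. I have not verified that GCC 6 or 7 then emits the aligned code shown earlier, so treat it as something to try rather than a confirmed fix.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names for this sketch; n_elements mirrors my_elements.
constexpr std::size_t n_elements = 8;
using AlignedVec = std::array<double, n_elements>;

// Adds src into dst element by element.
// Precondition (assumed, not enforced in release builds): both arrays are
// 32-byte aligned and do not overlap.
void add_assume_aligned(AlignedVec& dst, const AlignedVec& src) {
    // Debug-build check of the alignment precondition.
    assert(reinterpret_cast<std::uintptr_t>(dst.data()) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src.data()) % 32 == 0);

    // __builtin_assume_aligned returns its first argument and lets the
    // compiler assume the result is aligned to the given boundary.
    double* __restrict__ d =
        static_cast<double*>(__builtin_assume_aligned(dst.data(), 32));
    const double* __restrict__ s =
        static_cast<const double*>(__builtin_assume_aligned(src.data(), 32));

    for (std::size_t i = 0; i < n_elements; ++i) {
        d[i] += s[i];
    }
}
```

The caller would declare its arrays as, say, `alignas(32) AlignedVec a{};` before passing them in. Passing a misaligned or overlapping array is undefined behaviour, so this only moves the hack into a helper; it does not remove it.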

Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.

You can see it live here: https://godbolt.org/g/IXIOst