Spent a little bit of time tweaking the RDE vector class again. As I already mentioned, the container itself is not terribly fascinating, there are not too many choices here. There's another battle going on at the lower level, though; it's interesting to see how an innocent method like size() can be a cause of a slowdown. Typically, a vector class has 3 properties it needs to keep track of:

buffer properties (pointer and size),

number of stored elements

We can safely ignore the buffer begin property - it has to be a pointer. Capacity and the number of stored elements are more interesting, though. We can store each of them either as an integer or as another pointer marking the end of a range. What are the consequences? In the first case, all the operations requiring an "end" pointer will be more costly (we don't know this pointer offhand, we need to add our integer to the begin pointer). If we decide to go with the second solution, the size() and/or capacity() methods will have to convert a pair of pointers into a number of elements. (A third approach would be to store both, but this means additional bookkeeping, so it's rarely used.) Most implementations I've seen use the latter approach (pointers), as we tend to operate on the array "end" quite often. The consequence, as mentioned, is that the size() method now requires a conversion from a pointer difference to an element count, which usually means an integer division. Consider this seemingly innocent code:
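A minimal sketch of the two layouts (illustrative names only, not RDE's actual members):

```cpp
#include <cassert>
#include <cstddef>

// Option 1: begin pointer + integer counters.
// size() is free, end() needs an addition.
struct VecCounts
{
    int*        m_begin;
    std::size_t m_size;
    std::size_t m_capacity;

    std::size_t size() const { return m_size; }
    int*        end() const  { return m_begin + m_size; }
};

// Option 2: three pointers (what most implementations pick).
// end() is free, size() needs a pointer subtraction, i.e. a
// division of the byte distance by sizeof(T).
struct VecPointers
{
    int* m_begin;
    int* m_end;
    int* m_capacityEnd;

    int*        end() const  { return m_end; }
    std::size_t size() const { return static_cast<std::size_t>(m_end - m_begin); }
};
```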

struct Foo
{
    char ch[13];
};

void Test(const rde::vector<Foo>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        printf("%d\n", v[i].ch[0]);
    }
}

Here’s generated code (inner loop only):

00DEE420 0F B6 04 0F       movzx  eax,byte ptr [edi+ecx]
00DEE424 50                push   eax
00DEE425 68 6C 3D E1 00    push   0E13D6Ch
00DEE42A FF 15 10 58 E4 00 call   dword ptr [__imp__printf (0E45810h)]
00DEE430 8B 46 04          mov    eax,dword ptr [esi+4]
00DEE433 59                pop    ecx
00DEE434 59                pop    ecx
00DEE435 8B 0E             mov    ecx,dword ptr [esi]
00DEE437 2B C1             sub    eax,ecx
00DEE439 6A 0D             push   0Dh
00DEE43B 99                cdq
00DEE43C 5D                pop    ebp
00DEE43D F7 FD             idiv   eax,ebp    ***
00DEE43F 43                inc    ebx
00DEE440 83 C7 0D          add    edi,0Dh
00DEE443 3B D8             cmp    ebx,eax
00DEE445 72 D9             jb     `anonymous namespace'::Test+1Dh (0DEE420h)

Yuck! To be fair, the compiler is sometimes smart enough to recognize that size() has no side effects and the vector cannot be modified during iteration, so it moves this calculation outside the loop. It couldn't do that in this case, as the vector is not local. Observations:

for iteration, prefer iterators (doh) to indices, especially if they're raw pointers underneath, as they should be (even better on platforms with load/store+update instructions),

if you know the array is not modified in the loop (the vast majority of cases) - compute the size once.
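In source form, the two recommended loop shapes look roughly like this (using std::vector for a self-contained illustration; the same applies to rde::vector, and TestIter/TestOnce are made-up names):

```cpp
#include <cstdio>
#include <cstddef>
#include <vector>

struct Foo { char ch[13]; };

// Iterator version: the loop condition compares pointers directly,
// no pointer-to-index conversion (division) needed.
void TestIter(const std::vector<Foo>& v)
{
    for (std::vector<Foo>::const_iterator i = v.begin(); i != v.end(); ++i)
        printf("%d\n", i->ch[0]);
}

// Index version with size() hoisted out of the loop: one division total,
// instead of one per iteration.
void TestOnce(const std::vector<Foo>& v)
{
    for (std::size_t i = 0, s = v.size(); i < s; ++i)
        printf("%d\n", v[i].ch[0]);
}
```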

Same loop, version with iterators:

00A5DC58 0F B6 06          movzx  eax,byte ptr [esi]
00A5DC5B 50                push   eax
00A5DC5C 68 70 3D AB 00    push   0AB3D70h
00A5DC61 FF 15 10 58 AE 00 call   dword ptr [__imp__printf (0AE5810h)]
00A5DC67 59                pop    ecx
00A5DC68 59                pop    ecx
00A5DC69 83 C6 0D          add    esi,0Dh
         for (rde::vector<Foo>::const_iterator i = v.begin(); i != v.end(); ++i)
00A5DC6C 3B 77 04          cmp    esi,dword ptr [edi+4]
00A5DC6F 75 E7             jne    `anonymous namespace'::TestVIter+0Ah (0A5DC58h)

..and with size only calculated once:

1 //for(std::size_t i = 0, s = v.size(); i < s; ++i) 2 00 C5F64E 8 B 06 mov eax,dword ptr [esi] 3 00 C5F650 0F B6 04 07 movzx eax,byte ptr [edi + eax] 4 00 C5F654 50 push eax 5 00 C5F655 68 6 C 3 D CA 00 push 0 CA3D6Ch 6 00 C5F65A FF 15 10 58 CD 00 call dword ptr [__imp__printf ( 0 CD5810h)] 7 00 C5F660 59 pop ecx 8 00 C5F661 83 C7 0 D add edi, 0 Dh 9 00 C5F664 4 B dec ebx 10 00 C5F665 59 pop ecx 11 00 C5F666 75 E6 jne ` anonymous namespace ' :: TestV + 1 Ah ( 0 C5F64Eh)

Finally, if you find yourself in a situation where you have to use indices and you know the array size might change, consider making the element's size a power of two, so that the compiler can replace the idiv instruction with a shift.
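A sketch of the idea (Foo16 and elementCount are illustrative names): padding the 13-byte element up to 16 bytes makes sizeof a power of two, so the division hidden in the pointer subtraction can compile to a shift right by 4 instead of an idiv:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative: pad the 13-byte payload to 16 bytes.
struct Foo16
{
    char ch[13];
    char pad[3]; // 13 + 3 = 16, a power of two
};

// size() on a pointer-pair vector boils down to this subtraction;
// with sizeof(Foo16) == 16 the byte distance is divided by shifting,
// not by an integer division.
std::size_t elementCount(const Foo16* begin, const Foo16* end)
{
    return static_cast<std::size_t>(end - begin);
}
```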

[Digression: GCC is actually smart enough to replace division by a constant with multiplication by a “magic number” (see Hacker’s Delight). See GCC Explorer for details (I still can’t get over how awesome this tool is).]
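As a worked example of the trick (div13 is an illustrative name; the constant comes from the standard Granlund-Montgomery / Hacker's Delight recipe), dividing by 13 can be done with a 64-bit multiply and a shift:

```cpp
#include <cassert>
#include <cstdint>

// Division by the constant 13 rewritten as multiply + shift.
// 1321528399 == ceil(2^34 / 13); since 13 * 1321528399 - 2^34 == 3 <= 2^2,
// the result is exact for every 32-bit unsigned n.
uint32_t div13(uint32_t n)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(n) * 1321528399u) >> 34);
}
```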

A broader lesson from all this is that we should try to avoid converting between indices & pointers if at all possible. I learned it the hard way when optimizing my insert() method. Here’s the first, naive implementation (I’ll assume we’re inserting a POD type):

iterator insert(iterator it, const T& val)
{
    const size_type index = (size_type)(it - m_begin);
    const size_type prevSize = size();
    const size_type toMove = prevSize - index;
    if (m_end == m_capacityEnd)
    {
        grow();
        it = m_begin + index;
    }
    rde::construct(m_begin + prevSize);

    if (toMove > 0)
    {
        //rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
        // Assuming PODs, this ^^^^ will result in a call to:
        memmove(result, first, n * sizeof(T));
    }
    *it = val;
    ++m_end;
    return it;
}

It results in this monstrosity:

0130F4FC 8B 45 08          mov    eax,dword ptr [ebp+8]
0130F4FF 53                push   ebx
0130F500 56                push   esi
0130F501 8B D9             mov    ebx,ecx
0130F503 8B 33             mov    esi,dword ptr [ebx]
0130F505 2B C6             sub    eax,esi
0130F507 57                push   edi
0130F508 99                cdq
0130F509 6A 24             push   24h
0130F50B 59                pop    ecx
0130F50C F7 F9             idiv   eax,ecx    (1)
         const size_type prevSize = size();
0130F50E 8B 4B 04          mov    ecx,dword ptr [ebx+4]
0130F511 6A 24             push   24h
0130F513 8B F8             mov    edi,eax
0130F515 8B C1             mov    eax,ecx
0130F517 2B C6             sub    eax,esi
0130F519 99                cdq
0130F51A 5E                pop    esi
0130F51B F7 FE             idiv   eax,esi    (2)
0130F51D 8B F0             mov    esi,eax
         const size_type toMove = prevSize - index;
0130F51F 2B F7             sub    esi,edi
         if (m_end == m_capacityEnd)
0130F521 3B 4B 08          cmp    ecx,dword ptr [ebx+8]
0130F524 75 0F             jne    rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+3Ch (130F535h)
         {
             grow();
0130F526 8B CB             mov    ecx,ebx
0130F528 E8 9E 40 FC FF    call   rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (12D35CBh)
             it = m_begin + index;
0130F52D 6B FF 24          imul   edi,edi,24h    (1 m*)
0130F530 03 3B             add    edi,dword ptr [ebx]
0130F532 89 7D 08          mov    dword ptr [it],edi
         }
         rde::construct(m_begin + prevSize);

         if (toMove > 0)
0130F535 85 F6             test   esi,esi
0130F537 7E 15             jle    rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+55h (130F54Eh)
         {
             rde::internal::move_n(it, toMove, it + 1, int_to_type<has_trivial_copy<T>::value>());
0130F539 8B 45 08          mov    eax,dword ptr [it]
0130F53C 6B F6 24          imul   esi,esi,24h    (1/2 m)
0130F53F 56                push   esi
0130F540 50                push   eax
0130F541 83 C0 24          add    eax,24h
0130F544 50                push   eax
0130F545 FF 15 C0 57 36 01 call   dword ptr [__imp__memmove (13657C0h)]
0130F54B 83 C4 0C          add    esp,0Ch
         }
         *it = val;
0130F54E 8B 75 0C          mov    esi,dword ptr [val]
0130F551 8B 7D 08          mov    edi,dword ptr [it]
         return it;
0130F554 8B 45 08          mov    eax,dword ptr [it]
0130F557 6A 09             push   9
0130F559 59                pop    ecx
0130F55A F3 A5             rep movs dword ptr es:[edi],dword ptr [esi]
0130F55C 83 43 04 24       add    dword ptr [ebx+4],24h

2 divs, 2 muls in the worst case (1 of each on average)… Some of these conversions are hard to avoid, but in some cases we convert back & forth (e.g. we first divide to get ‘toMove’, but later need to obtain the number of bytes, so we have to multiply it again).

Here’s the same function (POD case only, branch resolved during compilation), but this time we’re actively trying to avoid conversions:

iterator insert(iterator it, const T& val)
{
    if (m_end == m_capacityEnd)
    {
        const size_type index = (size_type)(it - m_begin);
        grow();
        it = m_begin + index;
    }
    rde::construct(m_end);

    if (m_end > it) // only showing POD case (resolved at compile time)
    {
        const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
        Sys::MemMove(it + 1, it, n);
    }
    *it = val;
    ++m_end;
    return it;
}

0086F434 8B D9             mov    ebx,ecx
         if (m_end == m_capacityEnd)
0086F436 8B 43 04          mov    eax,dword ptr [ebx+4]
0086F439 56                push   esi
0086F43A 57                push   edi
0086F43B 3B 43 08          cmp    eax,dword ptr [ebx+8]
0086F43E 75 20             jne    rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+30h (86F460h)
         {
             const size_type index = (size_type)(it - m_begin);
0086F440 8B 45 08          mov    eax,dword ptr [it]
0086F443 2B 03             sub    eax,dword ptr [ebx]
0086F445 6A 24             push   24h
0086F447 99                cdq
0086F448 59                pop    ecx
0086F449 F7 F9             idiv   eax,ecx
             grow();
0086F44B 8B CB             mov    ecx,ebx
0086F44D 8B F0             mov    esi,eax
0086F44F E8 77 41 FC FF    call   rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::grow (8335CBh)
             it = m_begin + index;
0086F454 6B F6 24          imul   esi,esi,24h
0086F457 03 33             add    esi,dword ptr [ebx]
0086F459 8B FE             mov    edi,esi
0086F45B 89 7D 08          mov    dword ptr [it],edi
0086F45E EB 03             jmp    rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+33h (86F463h)
0086F460 8B 7D 08          mov    edi,dword ptr [it]
         }
         rde::construct(m_end);

         if (m_end > it) // only showing POD case (resolved at compile time)
0086F463 8B 43 04          mov    eax,dword ptr [ebx+4]
0086F466 3B C7             cmp    eax,edi
0086F468 76 11             jbe    rde::vector<`anonymous namespace'::MyStruct,rde::allocator,rde::standard_vector_storage<`anonymous namespace'::MyStruct,rde::allocator> >::insert+4Bh (86F47Bh)
         {
             const size_t n = reinterpret_cast<uintptr_t>(m_end) - reinterpret_cast<uintptr_t>(it);
0086F46A 2B C7             sub    eax,edi
             Sys::MemMove(it + 1, it, n);
0086F46C 50                push   eax
0086F46D 8D 47 24          lea    eax,[edi+24h]
0086F470 57                push   edi
0086F471 50                push   eax
0086F472 FF 15 C0 57 8C 00 call   dword ptr [__imp__memmove (8C57C0h)]
0086F478 83 C4 0C          add    esp,0Ch
         }
         *it = val;
0086F47B 8B 75 0C          mov    esi,dword ptr [val]
         return it;
0086F47E 8B 45 08          mov    eax,dword ptr [it]
0086F481 6A 09             push   9
0086F483 59                pop    ecx
0086F484 F3 A5             rep movs dword ptr es:[edi],dword ptr [esi]
0086F486 83 43 04 24       add    dword ptr [ebx+4],24h

Down to 1 div/1 mul in the worst case, and no divisions/multiplications in the typical case.

Obviously, these are all micro-optimizations, so it’s hard to expect spectacular gains. Still, it’s good to keep them in mind and remember how tiny changes in high-level code can result in vastly different assembly being generated.

Peng 2013-04-19 15:53:39

Hi,

I cannot understand one of copy_n methods, could u shed some light on it?

”
template
void copy_n(const T* first, size_t n, T* result, int_to_type)
{
    const T* last = first + n;
    //while (first != last)
    //    *result++ = *first++;
    switch (n & 0x3)
    {
    case 0:
        while (first != last)
        {
            *result++ = *first++;
    case 3: *result++ = *first++;
    case 2: *result++ = *first++;
    case 1: *result++ = *first++;
        }
    }
}
“

Peng 2013-04-19 15:57:26

To be precise, I cannot understand why there’s a while loop in case 0 that includes cases 1/2/3.

admin 2013-04-19 17:01:59