Compiler: Visual C++ 2010

Operating System: Windows 7 32-bit

Tested machine CPU: Intel Core i3

Download: preVSpost (demo project)

A recent comment by the Visual C++ team on twitter.com reminded me of a hot topic in the C++ programming world: the long-running discussion about using pre- versus post-increment operators, especially for iterators. I myself witnessed such a discussion, which started from a FAQ I wrote on www.codexpert.ro.

The reason for preferring pre-increment operators is simple: every post-increment operation needs a temporary object.

The Visual C++ STL implementation looks similar to the following code:

_Myt_iter operator++(int)
{	// postincrement
	_Myt_iter _Tmp = *this;
	++*this;
	return (_Tmp);
}

For the pre-increment operator implementation, this temporary object is no longer needed:

_Myt_iter& operator++()
{	// preincrement
	++(*(_Mybase_iter *)this);
	return (*this);
}
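To make the cost difference concrete, here is a minimal sketch of the two forms in user code (the container and values are just an example, not code from the demo project):

#include <list>

void Demo()
{
	std::list<int> lst(10, 1);                 // ten elements, value 1
	std::list<int>::iterator it = lst.begin();

	++it;   // pre-increment: modifies 'it' in place, no temporary
	it++;   // post-increment: a temporary copy of 'it' is constructed,
	        // 'it' is incremented, and the copy is returned (then discarded)
}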

In the discussion I mentioned above, somebody came up with a dummy application and tried to prove that things have changed thanks to newer compiler optimizations (this code exists in the attached file, too). That sample is too simple and too far from real code. Real code usually has many more lines that eat CPU time, even when you compile with /O2 settings.

Based on that VC++ team's tweet, which pointed to viva64.com's research, I decided to create my own benchmark for single-core and multi-core architectures. For those who don't know, Viva64 is a company specialized in static code analysis.

Starting from their project, I extended the tests to other STL containers: std::vector, std::list, std::map, and std::unordered_map (the VC++ 2010 hash table implementation).
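As a rough sketch of the setup, each container receives the same random values (identifiers like kElemCount are my own; the demo project may name things differently):

#include <vector>
#include <list>
#include <map>
#include <unordered_map>
#include <cstdlib>

void Populate(std::vector<size_t> &vec, std::list<size_t> &lst,
              std::map<size_t, size_t> &mp,
              std::unordered_map<size_t, size_t> &hmp)
{
	const size_t kElemCount = 100000;
	for (size_t i = 0; i != kElemCount; ++i)
	{
		const size_t value = std::rand();   // same value goes to every container
		vec.push_back(value);
		lst.push_back(value);
		mp[i] = value;
		hmp[i] = value;
	}
}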

For the multi-core tests I used Microsoft's new technology, the Parallel Patterns Library.

1. How the tests were made

1.1. Code stuff

To measure execution time I used the same timer as the Viva64 team (with a few changes). Each container instance was populated with 100,000 elements of the same random data. The effective computing function was repeated 10 times; inside it, some template functions are called 300 times. The single-core computing function contains loops like this:

for (size_t i = 0; i != Count; ++i)
{
	x += FooPre(arr);
}

// where FooPre looks like
template <class T>
size_t FooPre(const T &arr)
{
	size_t sum = 0;
	for (auto it = arr.begin(); it != arr.end(); ++it)
		sum += *it;
	return sum;
}
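The post-increment counterpart presumably differs only in the loop's increment (FooPost is my naming, mirroring FooPre above):

template <class T>
size_t FooPost(const T &arr)
{
	size_t sum = 0;
	for (auto it = arr.begin(); it != arr.end(); it++)   // post-increment
		sum += *it;
	return sum;
}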

For the parallel computation, the first simple for loop changes to:

parallel_for(size_t(0), Count, [&cnt, &arr](size_t i)
{
	cnt.local() += FooPre(arr);
});

where cnt is an instance of the combinable class, and the sum of the partially computed elements is obtained by calling the combine() method:

cnt.combine(plus<size_t>());

As you can see, the parallel_for function uses one of the new C++ standard features: a lambda function. This lambda, together with the combinable class, implements the so-called parallel aggregation pattern and helps you avoid the shared-resource issues common in multithreaded code. The work is executed as independent tasks. The reason this approach is fast is that there is very little need for synchronization: calculating the per-task local results uses no shared variables and therefore requires no locks, and the combine operation is a separate sequential step that does not require locks either.
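Putting the fragments together, a self-contained sketch of the pattern could look like this (Count and the container contents are placeholders, not the benchmark's actual data):

#include <ppl.h>
#include <vector>
#include <functional>
#include <iostream>

template <class T>
size_t FooPre(const T &arr)
{
	size_t sum = 0;
	for (auto it = arr.begin(); it != arr.end(); ++it)
		sum += *it;
	return sum;
}

int main()
{
	using namespace Concurrency;            // PPL namespace in VC++ 2010

	const size_t Count = 300;
	std::vector<size_t> arr(100000, 1);     // placeholder data

	combinable<size_t> cnt;                 // one private accumulator per task

	parallel_for(size_t(0), Count, [&cnt, &arr](size_t i)
	{
		cnt.local() += FooPre(arr);         // task-local sum, no locks needed
	});

	// Sequential, lock-free reduction of the per-task results:
	std::cout << cnt.combine(std::plus<size_t>()) << std::endl;
	return 0;
}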

1.2. Test results

The tests were run on an Intel Core i3 machine (4 cores) running 32-bit Windows 7. I tested debug and release builds for both single-core and multi-core computation. The test application was built with VC++ 2010, one of the first C++11-compliant compilers.
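For reference, a minimal Windows high-resolution timer of the kind mentioned in section 1.1 can be sketched as follows (my own wrapper around QueryPerformanceCounter, not the exact code taken from the Viva64 project):

#include <windows.h>

class HiResTimer
{
	LARGE_INTEGER m_start;
public:
	void Start() { QueryPerformanceCounter(&m_start); }

	double ElapsedSeconds() const   // seconds since Start()
	{
		LARGE_INTEGER now, freq;
		QueryPerformanceCounter(&now);
		QueryPerformanceFrequency(&freq);
		return double(now.QuadPart - m_start.QuadPart) / double(freq.QuadPart);
	}
};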

On the charts below, the OX axis represents the repeated execution runs, and the OY axis represents time in seconds.

1.2.1. Single-core computation

Debug



Release



1.2.2. Multi-core computation

As you know, multi-core programming is the future. For C++ programmers, Microsoft proposes a very interesting library called the Parallel Patterns Library.

The overall goal is to decompose the problem into independent tasks that do not share data, while providing a sufficient number of tasks to occupy the number of cores available.

This is how my Task Manager looks when the demo application runs in parallel mode.



Isn’t it nice compared to single-core use? 🙂

Debug



Release



1.2.3. Speedup

Speedup is a performance metric that measures the efficiency of a parallel algorithm compared to its serial counterpart.
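The usual definition is speedup = T(serial) / T(parallel): for example, a run that takes 12 seconds serially and 4 seconds in parallel has a speedup of 12 / 4 = 3. On a 4-core machine the ideal (linear) speedup would be 4.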

Debug



Release

