Intro

We’ll be creating a framework that attempts to provide a generic, cross platform (standard compliant), non dependent to third party libraries solution to the benchmarking problem. The end product will look like this :

benchmark<time_type, clock_type> bm; bm. run( "Experiment Name" , sampleSize , { /* code to time */ } , "factor name" , factors. begin(), factors. end() ); bm. serialize( "benchmark name" , "output file name" , mode);

The main features of this design are :

Parametrizable time and clock type per benchmark.

and per benchmark. Multiple experiments can run on the same benchmark (hence we can group and compare results as we wish).

Adjustable sampling size (number of times each experiment will run).

(number of times each experiment will run). A factor can be passed to the code under benchmark (simply put, when we graph the results, the factor will be the variable along the x axis).

can be passed to the code under benchmark (simply put, when we graph the results, the factor will be the variable along the x axis). The benchmarks are printable and serializable in a “Python friendly” form.

As a complementary tool, a Python script will be developed, to read the output of a serialization and make graphs using matplotlib.

Finally, it will become apparent that the framework is mostly concerned (without excluding the possibility of expanding) with code runtime performance. The reason why this post isn’t titled “Timing C++ code” is because a sampling procedure and postprocessing of the results is incorporated as well.

Know why

If you think something is correct, prove it with a test. If you think something is fast, time it. If you think one algorithm is better than another, find some way to make a measurement to support your argument. Do not depend only upon your “common sense.” Do not depend only on theory. It is good to use theory to help guide you to an optimum solution. But without measuring, you have no way of knowing (nor convincing others) of the correctness of your theory. Without measurement, you have a faith-based system, not a scientific theory. [3]

Know how – techniques

The dominant technique used here is instrumentation. The way we’ll be applying it is by mandating code under timing be called through a measuring structure. A simple version of this emerged after several design iterations :

template<typename TimeT = std::chrono::milliseconds> struct measure { template<typename F, typename ...Args> static typename TimeT::rep execution( F func, Args&&... args) { auto start = std::chrono::system_clock::now(); func(std::forward<Args>(args)...); auto duration = std::chrono::duration_cast< TimeT>( std::chrono::system_clock::now() - start); return duration.count(); } };

Things to note here are :

variadic templates are used to forward as many arguments as needed to the callable object (function, function pointer, functor, lambda or a block of code surrounded by a lambda).

The code we time won’t be rearranged by the optimizer and will always lie between those start / end calls to now() , so we can guarantee our timing will be valid .

, so we can guarantee our . There’s minimum overhead by this calling layer; in other words if passing the callable and the arguments (by && ) to the instrumentation structure disturbs the results, then there’s no point in the measurement to begin with.

This has a very clean use pattern and minimum code pollution when it comes to timing :

auto executionTime = measure<>::execution(f0);

Know how – std::chrono

C++11 introduced “a flexible collection of types that track time with varying degrees of precision”. In what would be a brief introduction to this collections, we can identify three workhorses : clocks, durations and time points.

A clock is a class that contains time information. Things to know about clocks:

We call now() to get the current time of a clock

to get the current time of a clock There are three types of clocks: steady_clock , system_clock and high_resolution_clock .

, and . A clock’s tick period is the number of times it ticks per second (specified by std::ratio )

Durations are what we would call time types. The related class template std::chrono::duration<> can “create” time types in its instantiations. To give an example, this is how we define milliseconds :

std::chrono::duration< double, std::ratio<1,1000> >

Finally a time point is the return type of clock::now() . It is implemented as if it stores a value of type Duration indicating the time interval from the start of the Clock ‘s epoch.

Implementation

A draft description of the entity design is shown in the diagram below :

I may be providing more details on the implementation techniques on another post, but for now I’ll justify (for anyone reading the code) the usage of type erasure with the fact that I had to use function templates when implementing the experiments’ functionality, and those can’t be virtual.

Using the toolset

We’ll be using the toolset to fill in the mising graphs in Bjarne’s presentation. Therein Mr. Stroustrup proposes the following experiment :

Generate N random integers and insert them into a sequence so that each is inserted in its proper position in the numerical order.

Remove elements one at a time by picking a random position in the sequence and removing the element there.

For which N is it better to use a linked list than a vector (or an array) to represent the sequence ?

This is all it takes (assuming the correct headers are included) :

template<class Cont> void CommonUse(int nElems) { Cont cont; random_device rd; mt19937 eng{ rd() }; uniform_int_distribution<int> distr(0, nElems); for (int i = 0, val; i < nElems; i++) { val = distr(eng); cont.insert(lower_bound( begin(cont), end(cont), val), val); } while (nElems) { auto it = std::begin(cont); advance(it, distr(eng) % nElems); cont.erase(it); --nElems; } } int main() { bmk::benchmark<> bm; // 1. Run an experiment for vectors -------------- bm.run("vector", 10, CommonUse< vector<int> >, "number of elements", { 10, 100, 1000, 10000, 100000, 1000000 }); // 2. Run an experiment for lists ---------------- bm.run("list", 10, CommonUse< list<int> >, "number of elements", { 10, 100, 1000, 10000, 100000, 1000000 }); // 3. Record the results ------------------------- bm.serialize("Why you should avoid Linked Lists", "vector_vs_list.txt"); }

Here we average the results of 10 samples per experiment and use an initializer list to provide the factors. Factors must be the same type as the argument to the lambda and can also be given as an iterator delimited range ( begin / end ). If you think an operation should be excluded from the timing you can surround it with a timeout’s .tic() / .toc() as long as the timeout has the same clock and time type as the experiment :

bmk::timeout_ptr<> to = make_unique < bmk::timeout<> >(); to->tic(); val = distr(eng); // time excluded from the experiment to->toc(); /* rest of the experiment */ return to;

Experiments that return timeouts (actually timeout unique pointers) are handled differently (thank you expression SFINAE) and have the total duration stored in timeout subtracted from the experiment’s time (hopefully you won’t use timeout’s for trivial tasks that will take 0 nanoseconds anyway).

If we were to open “vector_vs_list.txt” (just the results stored as a Python dictionary) with the visualizer python script this is what we get :

Yes I only actually tried up to 10,000 elements; it takes waaaay too long when I add the last too factors as well (you can run the code yourself if you don’t believe me).

If we were to run factorless experiments then the sequence size couldn’t be an input. So let’s choose for example’s sake the size 10,000; the visualization we get is a boxplot per experiment :

Code Kata

Add the option to have each experiment run in its own thread (I’ll be posting a take on this but it’s good to get some ideas first).

I kind of cheated (not that it changes the results) in favor of vector. Can you spot where ?

Post scriptum

tl;dr ? You can check the code in action here (it’s coliru so I’m only console printing the results).

References