The first thing that baffled my colleagues and myself is that the implementation of the standard library is the second worst. How is it that such a common function does not have its fastest implementation in the standard library?

An obvious answer, as noted in the talk, is that the standard library’s version handles much more cases: any base from 2 to 36, hexadecimal and octal prefix detection and number formatted according to the installed C locale.

Another point is that the standard library targets a wide range of devices and an algorithm that performs well on a device may be the worst on another. The authors of the library are thus facing a dilemma: should they write an optimal version for each existing hardware or should they write a single version that perform well enough everywhere? I personally root for the latter, letting the people needing a faster procedure to write it for the hardware they target.

Take a look look at the glibc implementation of strtol and tell me what you think.

Reproducing the benchmark

The results presented by Alexandrescu are intriguing. Is there really so much variation in performance among the implementations of such a common task as converting a string to an integer? There is only one way to be sure: reproduce the benchmark.

You can find on my GitHub page a C++ implementation of this benchmark, which I used to compare the various implementations of this algorithm and to produce the following measures.

The implementations are:

boost::lexical_cast: a call to a string to integer function provided by Boost,

std::strtoul: a call to the string to integer function from the standard C++ library,

strtoul: a call to the string to integer function from the standard C library,

naive: the naive one which loops through the inputs and multiplies the result,

recursive: a recursive algorithm which gets the value of the first n-1 characters to compute the value of the first n characters,

table-pow: an implementation of the sum of powers of ten, where the powers come from a precalculated table, with the unsigned trick for the call to enforce() (i.e. ILP+unsigned in Alexandrescu’s benchmark),

unrolled-4: an implementation similar to table-pow where the loop is unrolled to process four digits at each iteration (i.e. ILP+unsigned+unrolling in Alexandrescu’s benchmark).

The benchmark works by generating the string representation of ten random numbers for each length of one to twenty digits included. Then these two hundred strings are processed one million times by each implementation. For each string length, the minimum cumulated duration of the million runs is kept. The implementations are compared relatively to std::strtoul.

The program is compiled with GCC 5.3.1 with the arguments -DNDEBUG and -O3. The measures are taken on a MacBook Pro running Ubuntu 16.04. (Intel Core i5–3210M CPU @ 2.50GHz, 8 Go).

Relative performance of the string to int functions.

The astonishing thing in this graph is the performance of the naive implementation: contrary to Alexandrescu’s experiment, this is one of the top algorithms in this benchmark. Also, against all expectations, the recursive implementation performs very well too. Actually, all implementations except boost::lexical_cast have approximately the same performance and are faster than the standard library by far. I see a single explanation for these results: the compiler has been greatly improved since the talk. From the author himself, the original benchmark was compiled with g++ 4.9, that makes three major releases between our experiments.

The performance of these functions may vary greatly depending on the compiler, the way the code is built, the OS… For example, the above results have been obtained with a classical incremental build. What would happen if the whole benchmark was in a single file (i.e. compiled as a unity build)?

Relative string to int function performance when compiled as a unity build.

In the unity build, the unrolled implementation outperforms the others for medium to large input strings. Also, all other implementations perform better by a factor of approximately 1.3 to 2.1. What if we change the OS, the hardware or the compiler?

Top to bottom: incremental builds, unity builds. Left to right: 64 bits clang (Ubuntu),64 bits clang (OSX), then on Ubuntu: 64 bits g++, 32 bits clang, 32 bits g++.

Link to the full-size graphs.

The middle column above contains the graphs previously discussed. On their right are the results of the benchmark on an Eeepc with an Intel Atom N455 at 1.66GHz running a 32 bits Ubuntu server 16.04, both with g++ and clang. On these configurations, using std::strtoul is a wise decision, especially on incremental builds. On the unity builds the best algorithms are the naive one and table-pow.

With clang on Ubuntu (the leftmost column), unrolled-4 is the best, followed by table-pow and naive. Notice how the recursive and naive implementations behave bad compared to the g++ results on this platform.

When switching to OSX, the relative performance of the implementations has the same order than on Ubuntu. An interesting point is the fact that for the small to average inputs, the performance gap with the standard library is near twice the values obtained on the Ubuntu build.

It is worth noticing that the unrolled-4 implementation depends heavily on the architecture. The value 4 was chosen by Alexandrescu after having tested different unrolled levels because he found out that it wast the most efficient on his computer. I did not try any other unrolling levels but I suppose that the result would be different for this algorithm

Let the compiler do its job

We have seen in the above benchmarks that a procedure’s performance may vary greatly from one architecture or implementation to another. To my surprise, we also found that the compiler may do a very good job on a naive implementation, up to a point where most of the optimizations only slightly improve the performance.

The first part of Alexandrescu’s talk is a call to measure before doing anything. I can only confirm the importance of measuring. After seeing the talk, one can be tempted to implement the version with the table of powers of tens and the unrolled loop and one could be wrong. The best solution here is the naive implementation: easy to write, easy to read, and near optimal performance.

So my advice here is that if you need performance on a procedure, then go naive first. Before anything else, write the algorithm in the most simple terms. It is only when this version has been implemented and measured against the reference implementation that you can plug and measure your extra improvements.