It's been a while since I've done a benchmark of different compilers on C++ code. Since I've recently released the version 1.1 of my ETL project (an optimized matrix/vector computation library with expression templates), I've decided to use it as the base of my benchmark. It's a C++14 library with a lot of templates. I'm going to compile the full test suite (124 test cases). This is done directly on the last release (1.1) code. I'm going to compile once in debug mode and once in release_debug (release plus debug symbols and assertions) and record the times for each compiler. The tests were compiled with support for every option in ETL to account to maximal compilation time. Each compilation was made using four threads (make -j4). I'm also going to test a few of the benchmarks to see the difference in runtime performance between the code generated by each compiler. The benchmark will be compiled in release mode and its compilation time recorded as well.

I'm going to test the following compilers:

GCC-4.9.4

GCC-5.4.0

GCC-6.3.0

GCC-7.1.0

clang-3.9.1

clang-4.0.1

zapcc-1.0 (commercial, based on clang-5.0 trunk)

All have been installed directly using Portage (Gentoo package manager) except for clang-4.0.1 that has been installed from sources and zapcc since it does not have a Gentoo package. Since clang package on Gentoo does not support multislotting, I had to install one version from source and the other from the package manager. This is also the reason I'm testing less versions of clang, simply less practical.

For the purpose of these tests, the exact same options have been used throughout all the compilers. Normally, I use different options for clang than for GCC (mainly more aggressive vectorization options on clang). This may not lead to the best performance for each compiler, but allows for comparison between the results with defaults optimization level. Here are the main options used:

In debug mode: -g

In release_debug mode: -g -O2

In release mode: -g -O3 -DNDEBUG -fomit-frame-pointer

In each case, a lot of warnings are enabled and the ETL options are the same.

All the results have been gathered on a Gentoo machine running on Intel Core i7-2600 (Sandy Bridge...) @3.4GHz with 4 cores and 8 threads, 12Go of RAM and a SSD. I do my best to isolate as much as possible the benchmark from perturbations and that my benchmark code is quite sound, it may well be that some results are not totally accurate. Moreover, some of the benchmarks are using multithreading, which may add some noise and unpredictability. When I was not sure about the results, I ran the benchmarks several time to confirm them and overall I'm confident of the results.

Compilation Time Let's start with the results of the performance of the compilers themselves: Compiler Debug Release_Debug Benchmark g++-4.9.4 402s 616s 100s g++-5.4.0 403s 642s 95s g++-6.3.0 399s 683s 102s g++-7.1.0 371s 650s 105s clang++-3.9.1 380s 807s 106s clang++-4.0.1 260s 718s 92s zapcc++-1.0 221s 649s 108s Note: For Release_Debug and Benchmark, I only use three threads with zapcc, because 12Go of RAM is not enough memory for four threads. There are some very significant differences between the different compilers. Overall, clang-4.0.1 is by far the fastest free compiler for Debug mode. When the tests are compiled with optimizations however, clang is falling behind. It's quite impressive how clang-4.0.1 manages to be so much faster than clang-3.9.1 both in debug mode and release mode. Really great work by the clang team here! With these optimizations, clang-4.0.1 is almost on par with gcc-7.1 in release mode. For GCC, it seems that the cost of optimization has been going up quite significantly. However, GCC 7.1 seems to have made optimization faster and standard compilation much faster as well. If we take into account zapcc, it's the fastest compiler on debug mode, but it's slower than several gcc versions on release mode. Overall, I'm quite impressed by the performance of clang-4.0.1 which seems really fast! I'll definitely make more tests with this new version of the compiler in the near future. It's also good to see that g++-7.1 also did make the build faster than gcc-6.3. However, the fastest gcc version for optimization is still gcc-4.9.4 which is already an old branch with low C++ standard support.

Runtime Performance Let's now take a look at the quality of the generated code. For some of the benchmarks, I've included two versions of the algorithm. std is the most simple algorithm (the naive one) and vec is the hand-crafted vectorized and optimized implementation. All the tests were done on single-precision floating points. Dot product The first benchmark that is run is to compute the dot product between two vectors. Let's look first at the naive version: dot (std) 100 500 1000 10000 100000 1000000 2000000 3000000 4000000 5000000 10000000 g++-4.9.4 64.96ns 97.12ns 126.07ns 1.89us 25.91us 326.49us 1.24ms 1.92ms 2.55ms 3.22ms 6.36ms g++-5.4.0 72.96ns 101.62ns 127.89ns 1.90us 23.39us 357.63us 1.23ms 1.91ms 2.57ms 3.20ms 6.32ms g++-6.3.0 73.31ns 102.88ns 130.16ns 1.89us 24.314us 339.13us 1.47ms 2.16ms 2.95ms 3.70ms 6.69ms g++-7.1.0 70.20ns 104.09ns 130.98ns 1.90us 23.96us 281.47us 1.24ms 1.93ms 2.58ms 3.19ms 6.33ms clang++-3.9.1 64.69ns 98.69ns 128.60ns 1.89us 23.33us 272.71us 1.24ms 1.91ms 2.56ms 3.19ms 6.37ms clang++-4.0.1 60.31ns 96.34ns 128.90ns 1.89us 22.87us 270.21us 1.23ms 1.91ms 2.55ms 3.18ms 6.35ms zapcc++-1.0 61.14ns 96.92ns 125.95ns 1.89us 23.84us 285.80us 1.24ms 1.92ms 2.55ms 3.16ms 6.34ms The differences are not very significant between the different compilers. The clang-based compilers seem to be the compilers producing the fastest code. Interestingly, there seem to have been a big regression in gcc-6.3 for large containers, but that has been fixed in gcc-7.1. dot (vec) 100 500 1000 10000 100000 1000000 2000000 3000000 4000000 5000000 10000000 g++-4.9.4 48.34ns 80.53ns 114.97ns 1.72us 22.79us 354.20us 1.24ms 1.89ms 2.52ms 3.19ms 6.55ms g++-5.4.0 47.16ns 77.70ns 113.66ns 1.72us 22.71us 363.86us 1.24ms 1.89ms 2.52ms 3.19ms 6.56ms g++-6.3.0 46.39ns 77.67ns 116.28ns 1.74us 23.39us 452.44us 1.45ms 2.26ms 2.87ms 3.49ms 7.52ms g++-7.1.0 49.70ns 80.40ns 115.77ns 1.71us 22.46us 355.16us 1.21ms 1.85ms 2.49ms 3.14ms 6.47ms clang++-3.9.1 46.13ns 78.01ns 114.70ns 1.66us 22.82us 359.42us 1.24ms 1.88ms 2.53ms 3.16ms 6.50ms clang++-4.0.1 45.59ns 74.90ns 111.29ns 1.57us 22.47us 351.31us 1.23ms 1.85ms 2.49ms 3.12ms 6.45ms zapcc++-1.0 45.11ns 75.04ns 111.28ns 1.59us 22.46us 357.32us 1.25ms 1.89ms 2.53ms 3.15ms 6.47ms If we look at the optimized version, the differences are even slower. Again, the clang-based compilers are producing the fastest executables, but are closely followed by gcc, except for gcc-6.3 in which we can still see the same regression as before. Logistic Sigmoid The next test is to check the performance of the sigmoid operation. In that case, the evaluator of the library will try to use parallelization and vectorization to compute it. Let's see how the different compilers fare: sigmoid 10 100 1000 10000 100000 1000000 g++-4.9.4 8.16us 5.23us 6.33us 29.56us 259.72us 2.78ms g++-5.4.0 7.07us 5.08us 6.39us 29.44us 266.27us 2.96ms g++-6.3.0 7.13us 5.32us 6.45us 28.99us 261.81us 2.86ms g++-7.1.0 7.03us 5.09us 6.24us 28.61us 252.78us 2.71ms clang++-3.9.1 7.30us 5.25us 6.57us 30.24us 256.75us 1.99ms clang++-4.0.1 7.47us 5.14us 5.77us 26.03us 235.87us 1.81ms zapcc++-1.0 7.51us 5.26us 6.48us 28.86us 258.31us 1.95ms Interestingly, we can see that gcc-7.1 is the fastest for small vectors while clang-4.0 is the best for producing code for larger vectors. However, except for the biggest vector size, the difference is not really significantly. Apparently, there is a regression in zapcc (or clang-5.0) since it's slower than clang-4.0 at the same level as clang-3.9. y = alpha * x + y (axpy) The third benchmark is the well-known axpy (y = alpha * x + y). This is entirely resolved by expressions templates in the library, no specific algorithm is used. Let's see the results: saxpy 10 100 1000 10000 100000 1000000 g++-4.9.4 38.1ns 61.6ns 374ns 3.65us 40.8us 518us g++-5.4.0 35.0ns 58.1ns 383ns 3.87us 43.2us 479us g++-6.3.0 34.3ns 59.4ns 371ns 3.57us 40.4us 452us g++-7.1.0 34.8ns 59.7ns 399ns 3.78us 43.1us 547us clang++-3.9.1 32.3ns 53.8ns 297ns 3.21us 38.3us 466us clang++-4.0.1 32.4ns 59.8ns 296ns 3.31us 38.2us 475us zapcc++-1.0 32.0ns 54.0ns 333ns 3.32us 38.7us 447us Even on the biggest vector, this is a very fast operation, once vectorized and parallelized. At this speed, some of the differences observed may not be highly significant. Again clang-based versions are the fastest versions on this code, but by a small margin. There also seems to be a slight regression in gcc-7.1, but again quite small. Matrix Matrix multiplication (GEMM) The next benchmark is testing the performance of a Matrix-Matrix Multiplication, an operation known as GEMM in the BLAS nomenclature. In that case, we test both the naive and the optimized vectorized implementation. To save some horizontal space, I've split the tables in two. sgemm (std) 10 20 40 60 80 100 g++-4.9.4 7.04us 50.15us 356.42us 1.18ms 3.41ms 5.56ms g++-5.4.0 8.14us 74.77us 513.64us 1.72ms 4.05ms 7.92ms g++-6.3.0 8.03us 64.78us 504.41us 1.69ms 4.02ms 7.87ms g++-7.1.0 7.95us 65.00us 508.84us 1.69ms 4.02ms 7.84ms clang++-3.9.1 3.58us 28.59us 222.36us 0.73ms 1.77us 3.41ms clang++-4.0.1 4.00us 25.47us 190.56us 0.61ms 1.45us 2.80ms zapcc++-1.0 4.00us 25.38us 189.98us 0.60ms 1.43us 2.81ms sgemm (std) 200 300 400 500 600 700 800 900 1000 1200 g++-4.9.4 44.16ms 148.88ms 455.81ms 687.96ms 1.47s 1.98s 2.81s 4.00s 5.91s 9.52s g++-5.4.0 63.17ms 213.01ms 504.83ms 984.90ms 1.70s 2.70s 4.03s 5.74s 7.87s 14.905 g++-6.3.0 64.04ms 212.12ms 502.95ms 981.74ms 1.69s 2.69s 4.13s 5.85s 8.10s 14.08s g++-7.1.0 62.57ms 210.72ms 499.68ms 974.94ms 1.68s 2.67s 3.99s 5.68s 7.85s 13.49s clang++-3.9.1 27.48ms 90.85ms 219.34ms 419.53ms 0.72s 1.18s 1.90s 2.44s 3.36s 5.84s clang++-4.0.1 22.01ms 73.90ms 175.02ms 340.70ms 0.58s 0.93s 1.40s 1.98s 2.79s 4.69s zapcc++-1.0 22.33ms 75.80ms 181.27ms 359.13ms 0.63s 1.02s 1.52s 2.24s 3.21s 5.62s This time, the differences between the different compilers are very significant. The clang compilers are leading the way by a large margin here, with clang-4.0 being the fastest of them (by another nice margin). Indeed, clang-4.0.1 is producing code that is, on average, about twice faster than the code generated by the best GCC compiler. Very interestingly as well, we can see a huge regression starting from GCC-5.4 and that is still here in GCC-7.1. Indeed, the best GCC version, in the tested versions, is again GCC-4.9.4. Clang is really doing an excellent job of compiling the GEMM code. sgemm (vec) 10 20 40 60 80 100 g++-4.9.4 264.27ns 0.95us 3.28us 14.77us 23.50us 60.37us g++-5.4.0 271.41ns 0.99us 3.31us 14.811us 24.116us 61.00us g++-6.3.0 279.72ns 1.02us 3.27us 15.39us 24.29us 61.99us g++-7.1.0 273.74ns 0.96us 3.81us 15.55us 31.35us 71.11us clang++-3.9.1 296.67ns 1.34us 4.18us 19.93us 33.15us 82.60us clang++-4.0.1 322.68ns 1.38us 4.17us 20.19us 34.17us 83.64us zapcc++-1.0 307.49ns 1.41us 4.10us 19.72us 33.72us 84.80us sgemm (vec) 200 300 400 500 600 700 800 900 1000 1200 g++-4.9.4 369.52us 1.62ms 2.91ms 7.17ms 11.74ms 22.91ms 34.82ms 51.67ms 64.36ms 111.15ms g++-5.4.0 387.54us 1.60ms 2.97ms 7.36ms 12.11ms 24.37ms 35.37ms 52.27ms 65.72ms 112.74ms g++-6.3.0 384.43us 1.74ms 3.12ms 7.16ms 12.44ms 24.15ms 34.87ms 52.59ms 70.074ms 119.22ms g++-7.1.0 458.05us 1.81ms 3.44ms 7.86ms 13.43ms 24.70ms 36.54ms 53.47ms 66.87ms 117.25ms clang++-3.9.1 494.52us 1.96ms 4.80ms 8.88ms 18.20ms 29.37ms 41.24ms 60.72ms 72.28ms 123.75ms clang++-4.0.1 511.24us 2.04ms 4.11ms 9.46ms 15.34ms 27.23ms 38.27ms 58.14ms 72.78ms 128.60ms zapcc++-1.0 492.28us 2.03ms 3.90ms 9.00ms 14.31ms 25.72ms 37.09ms 55.79ms 67.88ms 119.92ms As for the optimized version, it seems that the two families are reversed. Indeed, GCC is doing a better job than clang here, and although the margin is not as big as before, it's still significant. We can still observe a small regression in GCC versions because the 4.9 version is again the fastest. As for clang versions, it seems that clang-5.0 (used in zapcc) has had some performance improvements for this case. For this case of matrix-matrix multiplication, it's very impressive that the differences in the non-optimized code are so significant. And it's also impressive that each family of compilers has its own strength, clang being seemingly much better at handling unoptimized code while GCC is better at handling vectorized code. Convolution (2D) The last benchmark that I considered is the case of the valid convolution on 2D images. The code is quite similar to the GEMM code but more complicated to optimized due to cache locality. sconv2_valid (std) 100x50 105x50 110x55 115x55 120x60 125x60 130x65 135x65 140x70 g++-4.9.4 27.93ms 33.68ms 40.62ms 48.23ms 57.27ms 67.02ms 78.45ms 92.53ms 105.08ms g++-5.4.0 37.60ms 44.94ms 54.24ms 64.45ms 76.63ms 89.75ms 105.08ms 121.66ms 140.95ms g++-6.3.0 37.10ms 44.99ms 54.34ms 64.54ms 76.54ms 89.87ms 105.35ms 121.94ms 141.20ms g++-7.1.0 37.55ms 45.08ms 54.39ms 64.48ms 76.51ms 92.02ms 106.16ms 125.67ms 143.57ms clang++-3.9.1 15.42ms 18.59ms 22.21ms 26.40ms 31.03ms 36.26ms 42.35ms 48.87ms 56.29ms clang++-4.0.1 15.48ms 18.67ms 22.34ms 26.50ms 31.27ms 36.58ms 42.61ms 49.33ms 56.80ms zapcc++-1.0 15.29ms 18.37ms 22.00ms 26.10ms 30.75ms 35.95ms 41.85ms 48.42ms 55.74ms In that case, we can observe the same as for the GEMM. The clang-based versions are much producing significantly faster code than the GCC versions. Moreover, we can also observe the same large regression starting from GCC-5.4. sconv2_valid (vec) 100x50 105x50 110x55 115x55 120x60 125x60 130x65 135x65 140x70 g++-4.9.4 878.32us 1.07ms 1.20ms 1.68ms 2.04ms 2.06ms 2.54ms 3.20ms 4.14ms g++-5.4.0 853.73us 1.03ms 1.15ms 1.36ms 1.76ms 2.05ms 2.44ms 2.91ms 3.13ms g++-6.3.0 847.95us 1.02ms 1.14ms 1.35ms 1.74ms 1.98ms 2.43ms 2.90ms 3.12ms g++-7.1.0 795.82us 0.93ms 1.05ms 1.24ms 1.60ms 1.77ms 2.20ms 2.69ms 2.81ms clang++-3.9.1 782.46us 0.93ms 1.05ms 1.26ms 1.60ms 1.84ms 2.21ms 2.65ms 2.84ms clang++-4.0.1 767.58us 0.92ms 1.04ms 1.25ms 1.59ms 1.83ms 2.20ms 2.62ms 2.83ms zapcc++-1.0 782.49us 0.94ms 1.06ms 1.27ms 1.62ms 1.83ms 2.24ms 2.65ms 2.85ms This time, clang manages to produce excellent results. Indeed, all the produced executables are significantly faster than the versions produced by GCC, except for GCC-7.1 which is producing similar results. The other versions of GCC are falling behind it seems. It seems that it was only for the GEMM that clang was having a lot of troubles handling the optimized code.