[cfe-dev] libc++ Performance (compared to libstdc++)

Hi everyone, I was chatting with Marshall offline last week, and I mentioned that several of my users had noted general performance regressions switching from libstdc++ to libc++. Marshall said that he's heard similar things, but has received few specific reports. He did recall looking at the problem which I believe is described here (http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/), which is still a problem. I'll certainly admit that I'd not investigated most of these in detail (with the exception of a std::complex -ffast-math issue, http://reviews.llvm.org/D18639). We do have a few performance-related libc++ bugs open: https://llvm.org/bugs/show_bug.cgi?id=21192 - Reading from stdin is 1-2 orders of magnitude slower than using libstdc++ [I just tested this myself and updated the bug report]. https://llvm.org/bugs/show_bug.cgi?id=19708 - std::find is significantly slower than libstdc++. https://llvm.org/bugs/show_bug.cgi?id=20837 - libc++'s std::sort is O(N^2) in the worst case (instead of O(N*ln(N))). https://llvm.org/bugs/show_bug.cgi?id=26886 - libc++'s std::stable_sort also has a worst-case complexity issue. https://llvm.org/bugs/show_bug.cgi?id=15456 - A faster implementation of std::function is possible https://llvm.org/bugs/show_bug.cgi?id=16747 and https://llvm.org/bugs/show_bug.cgi?id=21275 - Our unordered_multimap insert is much slower than libstdc++'s. In PR16747, Howard interestingly explains libc++ has this problem because of an additional (i.e. not-required-by-the-standard) guarantee that libc++ provides regarding member ordering. but very few are related to containers. Baptiste Wicht has a benchmark covering use of several common standard algorithms with vectors, lists and dqueues (https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp) which he used for his post http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html, and I've compiled this using LLVM/Clang/libc++ r271873 @ -O3, using both libc++ and libstdc++ 4.8.5, and run on an Intel Xeon E5-2699 v3 @ 2.30GHz running Linux 3.10.0. If you try this yourself, note that even on a fast machine the benchmark takes several hours to run. $ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu $ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-llvm Of the 248 tests, libc++ was faster by at least 5% in 58 of the tests and libstdc++ was faster by at least 5% in 94 of the tests. libc++ was faster by at least 20% in 14 of the tests and libstdc++ was faster by at least 20% in 64 of the tests. The real problem, however, comes from the extremums. libc++ is never more than 65% faster than libstdc++: destruction___Trivial_128_ list -0.65 destruction___Trivial_4096_ vector -0.40 destruction___Trivial_1024_ list -0.38 destruction___Trivial_1024_ vector -0.37 random_remove___NonTrivialArray_32_ vector -0.3 but libc++ is sometimes over 10x slower than libstdc++: fill_back___NonTrivialStringMovable list_inserter 9.96 fill_back___NonTrivialStringMovable vector_reserve 10.21 fill_back___NonTrivialStringMovableNoExcept vector_reserve 10.82 fill_back___NonTrivialStringMovableNoExcept vector_inserter 11.15 fill_back___NonTrivialStringMovable vector_inserter 11.93 I've attached the full list. A second benchmark, http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp (https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/), modified only to repeat each test 30 instead of 7 times, and compiled as before: uint32_t std::set erase: -0.37 std::string std::set erase: -0.30 std::string std::set insertion: -0.23 std::string std::unordered_set erase: -0.16 std::string std::unordered_set iterate: -0.15 std::string std::set lookup: -0.15 uint32_t std::set insertion: -0.13 uint32_t std::unordered_set iterate: -0.072 std::string std::set iterate: -0.062 uint32_t std::set iterate: -0.054 uint32_t std::set lookup: -0.015 uint32_t std::unordered_set erase: 0.085 std::string std::unordered_set insertion: 0.22 std::string std::unordered_set lookup: 0.30 uint32_t std::unordered_set insertion: 0.51 uint32_t std::unordered_set lookup: 0.61 In this benchmark, libc++ beats libstdc++ by more than 5% in 10 tests, and libstdc++ beats libc++ by more than 5% in 5 tests. Again, however, libc++'s downside is larger, being up to 61% slower (in the 'uint32_t std::unordered_set lookup' test) than libstdc++. libstdc++ loses only by 37% to libc++, at most, in the 'uint32_t std::set erase' test. Also, I can easily imagine that users are more-likely to notice a performance difference in lookup than in erase. To pick another benchmark, I compiled and ran the one from http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ - and this must be good because the post ends with, "And remember, if Chandler Carruth and Mike Acton give you advice about data structures, listen to them. ;)". I modified the benchmark only by adding constexpr to min() and max() of XorshiftRNG to make it compile with libc++. This benchmarks many configurations and takes nearly an hour to run. I'll summarize the results I'll say that libc++ is almost always slower than libstdc++, and that as the element size and/or the number of elements increases it gets worse. Here are the relative timing differences; these are tests for unordered_map: Fill for 8-byte elements, 32-byte elements, 128-byte elements, 1K-byte elements, 4K-byte elements: 100000 -0.022 0.026 0.10 0.057 0.058 200000 -0.028 0.0041 0.12 0.083 0.057 300000 -0.022 -0.023 0.018 0.040 0.035 400000 -0.010 -0.0094 -0.077 0.048 0.049 500000 0.22 0.28 0.18 0.17 0.094 600000 -0.064 -0.081 -0.11 0.0033 0.020 700000 -0.059 -0.043 -0.089 -0.0047 0.027 800000 -0.037 -0.053 -0.072 0.0092 0.035 900000 0.36 0.30 0.20 0.15 0.099 1000000 0.31 0.23 0.17 0.13 0.098 The first column is the number of elements; negative numbers mean libc++ is faster. For 1000000 4K elements, libstdc++ is faster by 9.8%. For 1000000 8-byte elements, libstdc++ is faster by nearly 32%. Pre-sized fill: 100000 0.024 0.020 0.12 0.091 0.042 200000 0.015 0.016 0.16 0.041 0.084 300000 0.00038 0.033 0.17 0.090 0.090 400000 0.070 0.041 0.069 0.084 0.094 500000 0.069 0.025 0.061 0.12 0.10 600000 -0.0096 0.023 0.016 0.074 0.072 700000 0.060 0.0013 0.036 0.094 0.085 800000 0.029 -0.0048 0.035 0.083 0.075 900000 -0.011 0.0025 -0.037 0.078 0.085 1000000 0.0022 0.019 0.011 0.084 0.081 Time for 100K lookups: 100000 0.12 0.13 0.089 0.11 0.071 200000 0.13 0.12 0.10 0.11 0.072 300000 0.098 0.12 0.053 0.085 0.080 400000 0.18 0.11 0.088 0.059 0.030 500000 0.12 0.080 0.072 0.075 0.033 600000 0.095 0.10 0.076 0.063 0.017 700000 0.17 0.097 0.12 0.083 0.043 800000 0.16 0.13 0.11 0.092 0.050 900000 0.094 0.086 0.056 0.041 -0.0047 1000000 0.14 0.11 0.092 0.073 0.016 Time for 100K failed lookups: 100000 0.15 0.15 0.11 0.087 0.077 200000 0.14 0.11 0.083 0.074 0.059 300000 0.10 0.091 0.090 0.12 0.097 400000 0.10 0.12 0.061 0.11 0.099 500000 0.19 0.20 0.12 0.12 0.12 600000 0.18 0.18 0.12 0.11 0.095 700000 0.082 0.096 0.068 0.079 0.070 800000 0.20 0.19 0.14 0.13 0.12 900000 0.20 0.17 0.10 0.11 0.097 1000000 0.25 0.22 0.17 0.14 0.13 Time to remove half the elements: 100000 0.15 0.15 0.078 0.066 0.066 200000 0.12 0.12 0.039 0.055 0.060 300000 0.12 0.070 0.024 0.038 0.095 400000 0.16 0.16 0.15 0.13 0.13 500000 0.10 0.12 0.12 0.13 0.13 600000 0.077 0.015 0.036 0.026 0.049 700000 0.086 0.16 0.16 0.17 0.14 800000 0.076 0.044 0.051 0.043 0.064 900000 0.11 0.11 0.11 0.12 0.13 1000000 0.10 0.090 0.061 0.081 0.10 Many of our users have code that is sensitive to the performance of standard containers and algorithms, and this preliminary benchmarking lends support to the anecdotes that libc++ is slower than libstdc++. Worryingly, the extremes of these differences are pretty large. Obviously application impact can't be judged by some benchmarks I happened to find on the internet, but this is something we, as a community, should look at more closely. Thanks again, Hal -- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: psums.txt URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160701/7ca8fff9/attachment.txt>