Retrospective

(part 1) - a naïve (gmtime_r based) way of creating a formatted timestamp string from Unix time in microseconds

(part 2) - performance improvements to the naïve solution:

- avoided unnecessary allocations by not creating std::string objects
- replaced gmtime_r with the modern date library
- experimented (and failed) with replacing the strftime and snprintf calls with Boost Karma based formatting

(part 3) - replaced the output formatting with a custom number to string conversion, which resulted in a massive speed-up

Comparing the fastest solution from part 3 with the initial naïve solution, we can observe almost an order of magnitude of speed-up. The question is whether we can make any further improvements. The answer is, as usual: it depends.

EDIT: The most recent code samples and the reviewed benchmark code are available here.

It depends…

If we state that our task is simply to convert a number of microseconds since the epoch into a string representation, then it will probably be difficult to achieve any more significant speed-ups.

However, in many cases we do not want to convert arbitrary timestamps to strings. What if there is a predictable pattern in the input data? For example:

a system that is producing current timestamps (e.g. to be sent to an external system as a string)

parsing a log file with time expressed as a number of microseconds

In these cases the year/month/day component of the date is unlikely to change very often. So what if we cached it and recalculated it only when really needed? Let's take the fastest example we have so far and try it out:

```cpp
const char * BENCHMARK_NOINLINE heuristic_format_date_not_counting_lookup(std::chrono::microseconds micros)
{
    thread_local char buffer[32] = "20160926T18:12:32.123456Z\0";
    using namespace std::chrono;
    using namespace date;
    thread_local sys_days previous_dp;
    time_point<system_clock, microseconds> tp{micros};
    bool ok = true;
    const auto dp = floor<days>(tp);
    if(dp != previous_dp)
    {
        // The date component changed: recompute year/month/day and
        // rewrite only that part of the cached buffer.
        previous_dp = dp;
        const auto ymd = year_month_day(dp);
        ok &= utoa_lookup_faster4(buffer + YEAR_OFFSET, int(ymd.year()));
        ok &= utoa_lookup_faster2(buffer + MONTH_OFFSET, unsigned(ymd.month()));
        ok &= utoa_lookup_faster2(buffer + DAY_OFFSET, unsigned(ymd.day()));
    }
    const auto tod = make_time(tp - previous_dp);
    const auto usec = micros % 1000000;
    ok &= utoa_lookup_faster2(buffer + HOURS_OFFSET, unsigned(tod.hours().count()));
    ok &= utoa_lookup_faster2(buffer + MINUTES_OFFSET, unsigned(tod.minutes().count()));
    ok &= utoa_lookup_faster2(buffer + SECONDS_OFFSET, unsigned(tod.seconds().count()));
    ok &= utoa_lookup_faster6(buffer + USEC_OFFSET, unsigned(usec.count()));
    if(BOOST_UNLIKELY(!ok))
        buffer[0] = '\0';
    return buffer;
}
```
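The `*_OFFSET` constants come from part 3 and index into the fixed-width `YYYYMMDDTHH:MM:SS.ffffffZ` template. Their definitions are not repeated in this part; the following values are reconstructed from the template string in the buffer above:

```cpp
#include <cstddef>

// Offsets into "20160926T18:12:32.123456Z" (reconstructed from the
// template; the original constants are introduced in part 3):
constexpr std::size_t YEAR_OFFSET    = 0;   // "2016"
constexpr std::size_t MONTH_OFFSET   = 4;   // "09"
constexpr std::size_t DAY_OFFSET     = 6;   // "26"      ('T' at 8)
constexpr std::size_t HOURS_OFFSET   = 9;   // "18"      (':' at 11)
constexpr std::size_t MINUTES_OFFSET = 12;  // "12"      (':' at 14)
constexpr std::size_t SECONDS_OFFSET = 15;  // "32"      ('.' at 17)
constexpr std::size_t USEC_OFFSET    = 18;  // "123456"  ('Z' at 24)
```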

Is it any better under predictable input data? As it turns out, it is much better (columns: real time, CPU time):

```
bm_to_string_naive_strftime_snprintf/real_time/threads:1_mean           942 ns   942 ns
bm_to_string_naive_strftime_snprintf/real_time/threads:4_mean           272 ns  1070 ns
bm_format_naive_strftime_snprintf/real_time/threads:1_mean              762 ns   762 ns
bm_format_naive_strftime_snprintf/real_time/threads:4_mean              223 ns   873 ns
bm_formate_date/real_time/threads:1_mean                                689 ns   689 ns
bm_formate_date/real_time/threads:4_mean                                173 ns   691 ns
bm_format_date_karma1/real_time/threads:1_mean                         1041 ns  1041 ns
bm_format_date_karma1/real_time/threads:4_mean                          267 ns  1069 ns
bm_format_date_karma2/real_time/threads:1_mean                          988 ns   988 ns
bm_format_date_karma2/real_time/threads:4_mean                          245 ns   980 ns
bm_format_date_counting_lookup/real_time/threads:1_mean                 156 ns   156 ns
bm_format_date_counting_lookup/real_time/threads:4_mean                  41 ns   162 ns
bm_format_date_not_counting_lookup/real_time/threads:1_mean             100 ns   100 ns
bm_format_date_not_counting_lookup/real_time/threads:4_mean              25 ns    98 ns
bm_heuristic_format_date_not_counting_lookup/real_time/threads:1_mean    58 ns    58 ns
bm_heuristic_format_date_not_counting_lookup/real_time/threads:4_mean    15 ns    60 ns
```

Predictable vs random load

Given all the methods so far let’s compare how they perform using:

stable load (where date component seldom/never changes)

random load (where date component changes at almost every single call)

somewhat random load (where we can expect date changes, but for each date we’ll have a train of samples)

Predictable load

This is what we have been doing up until now. The following benchmark function produces a very stable load:

```cpp
void bm_fun(benchmark::State & state)
{
    auto start = 1474913552000000LL;
    while(state.KeepRunning())
    {
        auto res = fun(++start);
        // Sanity check that also keeps the optimizer from eliding the call.
        if(BOOST_UNLIKELY(res[8] != 'T' || res[23] != (start % 10 + '0')))
        {
            std::abort();
        }
    }
}
```
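As an aside, the threads:1 / threads:4 rows with the _mean suffix in the results suggest a Google Benchmark registration roughly like the following. This is an assumption (the repetition count in particular is a guess); the exact setup is in the linked benchmark code:

```cpp
// Assumed registration matching the reported threads:1 / threads:4
// "_mean" aggregate rows:
BENCHMARK(bm_fun)->Threads(1)->Threads(4)
                 ->Repetitions(10)
                 ->ReportAggregatesOnly(true)
                 ->UseRealTime();
BENCHMARK_MAIN();
```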

Random load

We generate some random (but valid) input data up front and cycle through it in the benchmark (SAMPLES is a power of two, so the index can wrap around with a cheap bitwise mask instead of a modulo):

```cpp
constexpr auto RANGE_FROM = 0LL;
constexpr auto RANGE_TO = 2840140800000000LL;
constexpr static auto SAMPLES = 1 << 20;

const static std::array<uint64_t, SAMPLES> RANDOM_INPUT = [](){
    std::random_device rnd;
    std::mt19937 generator(rnd());
    std::uniform_int_distribution<int64_t> source(RANGE_FROM, RANGE_TO);
    std::array<uint64_t, SAMPLES> result;
    for(auto i = 0; i != SAMPLES; ++i)
    {
        result[i] = source(generator);
    }
    return result;
}();

void bm_rnd_fun(benchmark::State & state)
{
    size_t i = 0;
    while(state.KeepRunning())
    {
        const auto ts = RANDOM_INPUT[i];
        auto res = fun(ts);
        if(BOOST_UNLIKELY(res[8] != 'T' || static_cast<unsigned char>(res[23]) != (ts % 10 + '0')))
        {
            std::abort();
        }
        i = (i + 1) & (SAMPLES - 1);
    }
}
```

Somewhat random load

Without trying to be too sophisticated, I'm going to use data generated by a random walk:

```cpp
constexpr static auto START_TIME = 1474913552000000LL;

const static std::array<uint64_t, SAMPLES> RANDOM_WALK_INPUT = [](){
    std::random_device rnd;
    std::mt19937 generator(rnd());
    std::normal_distribution<> source;
    std::array<uint64_t, SAMPLES> result;
    for(auto i = 0; i != SAMPLES; ++i)
    {
        // Steps with a standard deviation of two hours, in microseconds.
        result[i] = 1000000LL * 3600 * 2 * source(generator);
    }
    result[0] = START_TIME;
    std::partial_sum(result.begin(), result.end(), result.begin());
    return result;
}();
```
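The random-walk benchmark loop itself is not shown here; presumably it mirrors bm_rnd_fun, just reading from RANDOM_WALK_INPUT. A sketch under that assumption:

```cpp
// Assumed shape of the random-walk benchmark: identical to bm_rnd_fun,
// but cycling through RANDOM_WALK_INPUT instead of RANDOM_INPUT.
void bm_rndwalk_fun(benchmark::State & state)
{
    size_t i = 0;
    while(state.KeepRunning())
    {
        const auto ts = RANDOM_WALK_INPUT[i];
        auto res = fun(ts);
        if(BOOST_UNLIKELY(res[8] != 'T' || static_cast<unsigned char>(res[23]) != (ts % 10 + '0')))
        {
            std::abort();
        }
        i = (i + 1) & (SAMPLES - 1);
    }
}
```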

Complete results

Naïve approach

The only visible difference is between single-threaded and multi-threaded execution, due to internal locks; the (ir)regularity of the input data does not have much effect:

```
bm_to_string_naive_strftime_snprintf/real_time/threads:1_mean          912 ns   912 ns
bm_to_string_naive_strftime_snprintf/real_time/threads:4_mean          263 ns  1034 ns
bm_rnd_to_string_naive_strftime_snprintf/real_time/threads:1_mean      912 ns   911 ns
bm_rnd_to_string_naive_strftime_snprintf/real_time/threads:4_mean      268 ns  1056 ns
bm_rndwalk_to_string_naive_strftime_snprintf/real_time/threads:1_mean  908 ns   908 ns
bm_rndwalk_to_string_naive_strftime_snprintf/real_time/threads:4_mean  272 ns  1067 ns
bm_format_naive_strftime_snprintf/real_time/threads:1_mean             747 ns   747 ns
bm_format_naive_strftime_snprintf/real_time/threads:4_mean             220 ns   868 ns
bm_rnd_format_naive_strftime_snprintf/real_time/threads:1_mean         799 ns   799 ns
bm_rnd_format_naive_strftime_snprintf/real_time/threads:4_mean         240 ns   941 ns
bm_rndwalk_format_naive_strftime_snprintf/real_time/threads:1_mean     770 ns   770 ns
bm_rndwalk_format_naive_strftime_snprintf/real_time/threads:4_mean     233 ns   915 ns
```

date and Boost Karma

Nothing unexpected to see here either:

```
bm_formate_date/real_time/threads:1_mean                 679 ns   679 ns
bm_formate_date/real_time/threads:4_mean                 175 ns   700 ns
bm_rnd_formate_date/real_time/threads:1_mean             713 ns   713 ns
bm_rnd_formate_date/real_time/threads:4_mean             182 ns   730 ns
bm_rndwalk_formate_date/real_time/threads:1_mean         700 ns   700 ns
bm_rndwalk_formate_date/real_time/threads:4_mean         179 ns   717 ns
bm_format_date_karma1/real_time/threads:1_mean          1062 ns  1062 ns
bm_format_date_karma1/real_time/threads:4_mean           268 ns  1074 ns
bm_rnd_format_date_karma1/real_time/threads:1_mean      1065 ns  1065 ns
bm_rnd_format_date_karma1/real_time/threads:4_mean       269 ns  1077 ns
bm_rndwalk_format_date_karma1/real_time/threads:1_mean  1064 ns  1064 ns
bm_rndwalk_format_date_karma1/real_time/threads:4_mean   269 ns  1076 ns
bm_format_date_karma2/real_time/threads:1_mean           980 ns   980 ns
bm_format_date_karma2/real_time/threads:4_mean           248 ns   991 ns
bm_rnd_format_date_karma2/real_time/threads:1_mean       985 ns   985 ns
bm_rnd_format_date_karma2/real_time/threads:4_mean       250 ns  1001 ns
bm_rndwalk_format_date_karma2/real_time/threads:1_mean   987 ns   987 ns
bm_rndwalk_format_date_karma2/real_time/threads:4_mean   250 ns  1000 ns
```

Custom number to string conversion

The lookup-based solutions from part 3 show only a small sensitivity to the input pattern:

```
bm_format_date_counting_lookup/real_time/threads:1_mean              156 ns  156 ns
bm_format_date_counting_lookup/real_time/threads:4_mean               40 ns  160 ns
bm_rnd_format_date_counting_lookup/real_time/threads:1_mean          186 ns  186 ns
bm_rnd_format_date_counting_lookup/real_time/threads:4_mean           47 ns  190 ns
bm_rndwalk_format_date_counting_lookup/real_time/threads:1_mean      169 ns  169 ns
bm_rndwalk_format_date_counting_lookup/real_time/threads:4_mean       43 ns  174 ns
bm_format_date_not_counting_lookup/real_time/threads:1_mean           94 ns   94 ns
bm_format_date_not_counting_lookup/real_time/threads:4_mean           24 ns   96 ns
bm_rnd_format_date_not_counting_lookup/real_time/threads:1_mean       96 ns   96 ns
bm_rnd_format_date_not_counting_lookup/real_time/threads:4_mean       25 ns   98 ns
bm_rndwalk_format_date_not_counting_lookup/real_time/threads:1_mean   96 ns   96 ns
bm_rndwalk_format_date_not_counting_lookup/real_time/threads:4_mean   25 ns   98 ns
```

Heuristic date caching

We get a significant performance improvement for predictable and somewhat random inputs. For completely random input the code is on par with the solution without the heuristic:

```
bm_heuristic_format_date_counting_lookup/real_time/threads:1_mean               87 ns   87 ns
bm_heuristic_format_date_counting_lookup/real_time/threads:4_mean               22 ns   89 ns
bm_rnd_heuristic_format_date_counting_lookup/real_time/threads:1_mean          185 ns  185 ns
bm_rnd_heuristic_format_date_counting_lookup/real_time/threads:4_mean           47 ns  189 ns
bm_rndwalk_heuristic_format_date_counting_lookup/real_time/threads:1_mean      110 ns  110 ns
bm_rndwalk_heuristic_format_date_counting_lookup/real_time/threads:4_mean       28 ns  113 ns
bm_heuristic_format_date_not_counting_lookup/real_time/threads:1_mean           58 ns   58 ns
bm_heuristic_format_date_not_counting_lookup/real_time/threads:4_mean           15 ns   59 ns
bm_rnd_heuristic_format_date_not_counting_lookup/real_time/threads:1_mean       99 ns   99 ns
bm_rnd_heuristic_format_date_not_counting_lookup/real_time/threads:4_mean       25 ns  102 ns
bm_rndwalk_heuristic_format_date_not_counting_lookup/real_time/threads:1_mean   63 ns   63 ns
bm_rndwalk_heuristic_format_date_not_counting_lookup/real_time/threads:4_mean   16 ns   65 ns
```

Conclusions

In these four short notes I went over the micro-optimisation of a simple piece of code formatting a timestamp as a string. Using various techniques I managed to obtain a ~20x improvement over the initial naïve baseline. Was it worth the effort? It depends.

Do you need this code on the critical path of your application at all? If you're writing data for internal clients for later consumption, you might want to consider writing just the raw timestamps, which is faster.

First focus on correctness: get the right result using simple code. Put that into your application and check whether it is a potential performance bottleneck. If you're counting nanoseconds on the critical path, then it probably is.
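For illustration, the simple-first version can be as small as this: a minimal sketch shaped after the part 1 naïve gmtime_r + strftime + snprintf approach. The function name and buffer sizes here are my own; the exact original is in the linked repository:

```cpp
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Correct and simple, if slow: split into seconds and microseconds,
// format the seconds with strftime, append the fraction with snprintf.
std::string to_string_naive(std::uint64_t micros)
{
    const std::time_t secs = micros / 1000000;
    std::tm tm_utc;
    gmtime_r(&secs, &tm_utc);  // POSIX, thread-safe variant of gmtime
    char date_time[32];
    std::strftime(date_time, sizeof(date_time), "%Y%m%dT%H:%M:%S", &tm_utc);
    char buffer[48];
    std::snprintf(buffer, sizeof(buffer), "%s.%06lluZ", date_time,
                  static_cast<unsigned long long>(micros % 1000000));
    return buffer;
}
```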

Profile, measure, tweak. Or, phrased in terms of active learning: Plan -> Do -> Reflect -> Conceptualise -> Plan -> …

Still, you should benchmark this code in the context of your own program. It may turn out that the best solution presented here does not perform as expected due to particular input patterns, instruction/data cache effects and other factors.