Improving the Performance of Standard Library Functions

Kirsten

July 30th, 2019

In Visual Studio 2019 version 16.2 we improved the codegen of several standard library functions. Guided by your feedback on Developer Community ("Inlining std::lldiv" and "Improved codegen for std::fmin, std::fmax, std::round, std::trunc") we focused on the variants of standard division (std::div, std::ldiv, std::lldiv) and std::isnan.

Originally, every invocation of std::isnan or a variation of std::div compiled to a function call into the standard library rather than to inline assembly instructions, regardless of the compiler optimization flags passed. Since these standard library function definitions live inside of the runtime, their definitions are opaque to the compiler and therefore not candidates for inlining and optimization. Furthermore, the overhead of calling std::div or std::isnan is greater than the actual cost of the operation itself. On most platforms, std::div can be computed in a single instruction that returns both quotient and remainder, while std::isnan requires only a comparison and a condition flag check. Inlining these calls both removes the function call overhead and allows further optimizations to kick in, since the compiler has the additional context of the calling function.

To support inline assembly code generation we added a number of different functions as compiler intrinsics (also known as builtins) for std::isnan, std::div, and friends. Registering an intrinsic effectively “teaches” the meaning of that function to the compiler and results in greater control over the code generated. We went with a codegen solution rather than a library change to avoid altering library headers.

Optimizing std::div and Friends

The MSVC compiler has pre-existing support for optimizing bare division and remainder operations. Therefore, to feed calls to std::div into this existing compiler infrastructure, we recognize std::div as a compiler intrinsic and then transform the inputs of our recognized call into the canonical format the compiler is expecting for division and remainder operations.
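The canonical form the compiler uses internally is not shown in this post. As a source-level illustration only, the sketch below shows why the transformation is possible: std::ldiv yields the same pair as the bare `/` and `%` operators. The helper names ldiv_by_operators and matches_std_ldiv are hypothetical and are not part of MSVC or the standard library; the sketch assumes a non-zero denominator and no overflow (for example LONG_MIN divided by -1).

```cpp
#include <cassert>   // assert, used for the precondition below
#include <cstdlib>   // std::ldiv, std::ldiv_t

// Hypothetical helper: builds the std::ldiv result from the bare operators,
// i.e. the kind of division/remainder pair the optimizer already understands.
std::ldiv_t ldiv_by_operators(long numer, long denom) {
    assert(denom != 0);           // dividing by zero is undefined, as with std::ldiv
    std::ldiv_t result;
    result.quot = numer / denom;  // truncates toward zero
    result.rem  = numer % denom;  // sign follows the dividend
    return result;
}

// Hypothetical helper: true when the library call and the operator form agree.
bool matches_std_ldiv(long numer, long denom) {
    const std::ldiv_t lib = std::ldiv(numer, denom);
    const std::ldiv_t ops = ldiv_by_operators(numer, denom);
    return lib.quot == ops.quot && lib.rem == ops.rem;
}
```

Because both forms compute the same values, a call such as std::ldiv(a, b) can in principle be lowered to the same division instruction that `a / b` and `a % b` would share.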

Optimizing std::isnan

Replacing calls to std::isnan was the more complicated of the two categories of functions we targeted due to the conflicting requirements of the C and C++ standards. According to the C standard, isnan and the function it wraps, fpclassify, are required to be implemented as macros, while C++ requires both operations to be implemented as function overloads. Below is a diagram of the call structure in C++ vs C, with functions that are required to be implemented as function overloads inside bolded blue boxes, functions that are required to be implemented as macros in dashed purple boxes, and those without a requirement in green.

To get a unified solution for C and C++ code, we had to bypass both std::isnan and std::fpclassify to look at the functions std::fpclassify wraps. We chose the green functions (the _fdtest, _dtest, _ldtest, _fdclass, _dclass, and _ldclass functions listed in the table below) to register as compiler intrinsics since they lack any implementation requirements. Additionally, since each overload in either standard accomplishes the same task, we can transform instances of one standard’s intrinsics (we chose the C++ standard) into instances of our added C intrinsics. The table below demonstrates the results of the unification process, which reduces the intrinsics we operate on after the initial pass from six to three.

| Function | Intrinsic function | Intrinsic After Unification |
|----------|--------------------|-----------------------------|
| _fdtest  | IV__FDTEST         | IV__FDCLASS                 |
| _dtest   | IV__DTEST          | IV__DCLASS                  |
| _ldtest  | IV__LDTEST         | IV__LDCLASS                 |
| _fdclass | IV__FDCLASS        | IV__FDCLASS                 |
| _dclass  | IV__DCLASS         | IV__DCLASS                  |
| _ldclass | IV__LDCLASS        | IV__LDCLASS                 |

Back to std::isnan. The reasoning behind adding six new compiler intrinsics was to improve the code generation for std::isnan. However, referencing the diagram from before, while we’ve transformed the functions that std::fpclassify calls into, that hasn’t actually changed any of the codegen for std::isnan. In the _dclass case for example, we’ve recognized all _dclass calls as intrinsics, but as we’re not changing the code generated for _dclass, the code emitted is still the same call to _dclass that we started out with.

The last step required to recognize std::isnan as an intrinsic and therefore enable more efficient code generation involves pattern matching. Checking if a float/double/etc. is a NaN looks something like this:

isnan(double x) { return FP_NAN == IV__DCLASS(x); }

Here FP_NAN is a constant defined in both the C and C++ standards. Now that it’s easy to identify calls to _dclass and friends, the optimizer was extended to recognize the above pattern (a call to one of the three unified intrinsics followed by a comparison to the FP_NAN constant) and transform it into the new std::isnan intrinsic, IV_ISNAN.
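The IV__ names above are internal to the compiler and cannot be written in user code. As a rough source-level analogue, the sketch below expresses the same "classify, then compare against FP_NAN" pattern with std::fpclassify, next to the self-comparison that a NaN check ultimately reduces to. The helper names are hypothetical, and the self-comparison assumes IEEE 754 semantics (it may not hold under fast-math style floating-point options).

```cpp
#include <cassert>  // assert, used in the usage example below
#include <cmath>    // std::fpclassify, std::isnan, FP_NAN
#include <limits>   // std::numeric_limits

// Hypothetical helper: the classify-then-compare pattern described above,
// written with the public std::fpclassify rather than an internal intrinsic.
bool isnan_by_classify(double x) {
    return std::fpclassify(x) == FP_NAN;
}

// Hypothetical helper: a NaN is the only value that compares unequal to itself.
// Assumes IEEE 754 semantics for double.
bool isnan_by_self_compare(double x) {
    return x != x;
}

// Usage example: all three forms agree on NaN and on ordinary values.
void check_isnan_forms() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const double inf = std::numeric_limits<double>::infinity();
    assert(std::isnan(nan) && isnan_by_classify(nan) && isnan_by_self_compare(nan));
    assert(!std::isnan(1.0) && !isnan_by_classify(1.0) && !isnan_by_self_compare(1.0));
    assert(!std::isnan(inf) && !isnan_by_classify(inf) && !isnan_by_self_compare(inf));
}
```

The self-comparison form corresponds to comparing a register with itself and reading the resulting condition flag, which is why a single comparison suffices once the pattern is recognized.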

Results

To better illustrate the codegen differences, below are samples of the different x64 code generation for both std::isnan and std::div.

std::isnan(double)

Reference:

    lea rcx, QWORD PTR _X$[rsp]
    movsd QWORD PTR _X$[rsp], xmm0
    call _dtest
    cmp ax, 2
    sete al

With Intrinsics:

    ucomisd xmm0, xmm0
    setp al
    movzx eax, al

std::div(long, long)

Reference:

    mov edx, ebx
    mov ecx, edi
    call ldiv

With Intrinsics:

    mov eax, esi
    mov rcx, rbx
    cdq
    idiv edi

How much does effectively inlining these calls impact performance? When benchmarked by calling each operation on every member of a 1×10^9-element array with unknown inputs, the following improvements can be observed:

| std::isnan(double) | Reference | With Intrinsics | % Improvement |
|--------------------|-----------|-----------------|---------------|
| Avg (s)            | 4.03      | 1.25            | 69%           |
| Std Dev (s)        | 0.05      | 0.04            |               |

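| std::div(long, long) | Reference | With Intrinsics | % Improvement |
|----------------------|-----------|-----------------|---------------|
| Avg (s)              | 6.52      | 6.06            | 7%            |
| Std Dev (s)          | 0.11      | 0.04            |               |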

You can view the benchmark source here on Compiler Explorer. Each benchmark was run six times on an Intel Xeon CPU v3 @3.50GHz, with the first warm-up run thrown out.
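The linked source is the authoritative benchmark. Purely to show the general shape of such a measurement, here is a minimal sketch that times std::isnan over an array and averages the runs after discarding the first one. Every name in it (make_input, count_nans, average_after_warmup, TimedResult) is hypothetical, the input pattern is an assumption, and the array size is left to the caller rather than fixed at the size used above.

```cpp
#include <cassert>   // assert, used for preconditions
#include <chrono>    // std::chrono::steady_clock
#include <cmath>     // std::isnan
#include <cstddef>   // std::size_t
#include <limits>    // std::numeric_limits
#include <vector>    // std::vector

// Hypothetical input builder: n doubles where every nan_stride-th element is NaN.
std::vector<double> make_input(std::size_t n, std::size_t nan_stride) {
    assert(nan_stride > 0);
    std::vector<double> values(n, 1.0);
    for (std::size_t i = 0; i < n; i += nan_stride) {
        values[i] = std::numeric_limits<double>::quiet_NaN();
    }
    return values;
}

// The operation under test: call std::isnan on every element.
std::size_t count_nans(const std::vector<double>& values) {
    std::size_t count = 0;
    for (double v : values) {
        if (std::isnan(v)) ++count;
    }
    return count;
}

struct TimedResult {
    double mean_seconds;    // mean wall-clock time of the runs that were kept
    std::size_t nan_count;  // result of the last run, kept so the work is observable
};

// Runs the operation `runs` times, discards the first (warm-up) run,
// and averages the remaining timings.
TimedResult average_after_warmup(const std::vector<double>& values, int runs) {
    assert(runs >= 2);
    double total = 0.0;
    std::size_t count = 0;
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        count = count_nans(values);
        const auto stop = std::chrono::steady_clock::now();
        if (r > 0) total += std::chrono::duration<double>(stop - start).count();
    }
    return TimedResult{total / (runs - 1), count};
}
```

A call such as average_after_warmup(make_input(n, 10), 6) would mirror the "six runs, first thrown out" procedure; comparing the result across toolsets is what produces numbers like those in the tables above.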

Conclusion

The aforementioned optimizations will be enabled transparently with an upgrade to the 16.2 toolset in codebases compiled under /O2. Otherwise, make sure you’re explicitly using /Oi to enable intrinsic support.