This article is also available in Russian.

Let’s benchmark toupper implementations, because that’s a thing you do on Tuesdays.

Actually, I don’t really care about toupper much at all, but I was writing a different post and needed a peg to hang my narrative hat on, and hey toupper seems like a nice harmless benchmark. Despite my effort to choose something which should be totally straightforward and not sidetrack me, this weird thing popped out.

This will be a pretty short one - the longer, original post on the original, maybe far more interesting, topic is coming soon. The source is available on github if you want to follow along.

So let’s take a look at three implementations which perform toupper over an array of char : that is, which take an input array and modifies it in-place so any lowercase characters are converted to uppercase.

The first simply calls the C standard library toupper function in a C-style loop:

void toupper_rawloop ( char * buf , size_t size ) { for ( size_t i = 0 ; i < size ; i ++ ) { buf [ i ] = toupper ( buf [ i ]); } }

The second uses the more modern approach of using to std::transform to replace the raw loop:

void toupper_transform ( char * buf , size_t size ) { std :: transform ( buf , buf + size , buf , :: toupper ); }

Finally, the third one is our bespoke ASCII-specific version that checks if the character lies in the range a - z and remaps it by subtracting 32 if so:

void toupper_branch ( char * buf , size_t size ) { for ( size_t i = 0 ; i < size ; i ++ ) { char c = buf [ i ]; if ( c >= 'a' && c <= 'z' ) { buf [ i ] = c - 32 ; } } }

Seems straightforward enough, right?

Let’s benchmark these on my Skylake i7-6700HQ laptop, with the default gcc 5.5. Here’s a JPSP :

Let’s get three observations that aren’t really part of the story out of the way.

First, the pattern for the branchy algorithm (green). It’s the only one that varies much at all with input size - the other two are basically flat. This turns out to be just a benchmarking artifact. I use random ASCII input, so primary determinant of performance our branchy algorithm is branch prediction. For small input sizes, the branch predictor learns the entire input sequence across iterations of the benchmark and so mispredictions are low and performance is high, just like this. As sequence size grows the predictor memorizes less and less of the sequence until it flatlines when it mispredicts every time there is an uppercase character (0.27 mispredicts per character).

The second thing is this green blob of much slower samples from the toupper_branch in the upper left:

This wasn’t a one time artifact, I saw those guys hanging out up there across several runs. They don’t reproduce if you run that particular size alone however, only when the script runs to collect input across all size values. They don’t always show up. I didn’t look into it further but my best guess is some of unfortunate collision or aliasing effect perhaps in the branch predictor or in the 4k physical to virtual page mapping (VA space randomization was off, however).

The third not interesting thing is the bimodal behavior of toupper_rawloop – the blue dots form two distinct lines, at just above 2 cycles per char and a faster line at 1.5 cycles per char. All performance counters that I checked were the same between the two “modes”. The fast mode, which runs at 1.57 chars/cycle is basically bottlenecked on the load ports: there are 1.54 uops/cycle going to both port 2 and port 3, so those ports are 98% occupied. The slower mode I can’t explain.

While I was investigating it, the fast mode suddenly stopped appearing and I was stuck in slow mode. Maybe my CPU saw what I was up to and downloaded a microcode update in the background to remove the inconsistency, but I still have the SVG to prove it (for now).

So what’s the interesting thing?

The interesting thing is that the raw loop version runs 3x to 4x faster than the std::transform version: 1.5 to 2 cycles per character versus just above 7 cycles per character.

What’s up with that? Are my standard algorithms letting me down? Does std::transform have some fatal flaw?

Not really. Well, not at all.

It turns out these results occur when the functions are compiled in separate files. If you put them into the same file, suddenly the performance is is the same: they both run slowly.

No, it’s not an alignment thing.

It gets weirder too: the fast raw loop version, compiled in a separate file, slows down if you simply include the <algorithm> header. That’s right - including that header, which is never used and generates no code in the object file, slows down the raw loop by 3 to 4 times. Conversely, the std::transform version speeds up to full speed if you copy and paste the std::transform implementation out of <algorithm> and stop including that file.

It gets even weirder (this is the last “it gets weirder”, I promise): including <algorithm> doesn’t always do this. The slowdown happens if <algorithm> is included before <ctype.h> , but not if you swap them around:

Slow:

#include <algorithm> #include <ctype.h>

Fast:

#include <ctype.h> #include <algorithm>

In fact, in my case, this performance anomaly was triggered (in a different project) when clang-format sorted my headers, moving <algorithm> to the top where it belonged (hence the clickbait title).

Of course, we were going to end up mired in assembly at some point. Let’s not postpone the pain any longer.

Here are are the fast and slow versions of the functions, with the core loops annotated:

<algorithm> included first:

toupper_rawloop ( ch ar * , unsigned long ): push rbp push rbx lea rbp , [ rdi + rsi ] sub rsp , 8 test rsi , rsi je .L1 mov rbx , rdi .L5: movsx edi , BYTE PTR [ rbx ] ; load a char from *buf add rbx , 1 ; buf++ call toupper ; call toupper(c) mov BYTE PTR [ rbx - 1 ], al ; save the result to buf[-1] cmp rbp , rbx ; check for buf == buf_end jne .L5 ; .L1: add rsp , 8 pop rbx pop rbp ret

With <algorithm> second:

toupper_rawloop ( ch ar * , unsigned long ): test rsi , rsi je .L7 push rbp push rbx mov rbp , rsi mov rbx , rdi sub rsp , 8 call __ctype_toupper_loc lea rsi , [ rbx + rbp ] mov rdi , rbx .L4: movsx rcx , BYTE PTR [ rdi ] ; load a char from buf mov rdx , QWORD PTR [ rax ] ; load the toupper table address (pointed to by __ctype_toupper_loc) add rdi , 1 ; buf++ mov edx , DWORD PTR [ rdx + rcx * 4 ] ; look up the toupper result by indexing into table with the char mov BYTE PTR [ rdi - 1 ], dl ; store the result cmp rsi , rdi ; check buf == end_buf jne .L4 ; add rsp , 8 pop rbx pop rbp .L7: rep ret

The key difference is the slow version simply calls toupper in the loop, while the fast version has no function calls at all, just a table lookup - the body of std::toupper has been inlined.

Examining the glibc source, we find the implementation of toupper :

__extern_inline int __NTH ( toupper ( int __c )) // __NTH is a macro that indicates the function doesn't throw { return __c >= - 128 && __c < 256 ? ( * __ctype_toupper_loc ())[ __c ] : __c ; }

We see that toupper is implemented as an extern inline function that first checks that the range of the char fits within a byte and then looks up the character in the table returned by __ctype_toupper_loc() . That function returns a thread-local pointer (a const int ** ), which in turn points to a lookup table which given a character returns the uppercase version.

So now the assembly makes sense: the fast version of the algorithm inlines the toupper body, but it can’t inline the __ctype_toupper_loc() call; however, this call is declared __attribute__((const)) which means that its return value depends only on the arguments (and here there are no arguments) and so the compiler knows it returns the same value every time and so can be hoisted out of the loop, so the loop body has only a few loads associated with the lookup table, the store of the new value to the buffer, and loop control.

The slow version, on the other hand, leaves the toupper() inside the loop. The loop itself is one instruction shorted, but of course you need to run all the code inside toupper as well. Here’s what that looks like on my system:

lea edx ,[ rdi + 0x80 ] ; edx = rdi + 0x80 movsxd rax , edi ; zero extend c cmp edx , 0x17f ; check that c is in -128 to 255 ja 2 a ; if not, we're done mov rdx , QWORD PTR [ rip + 0x395f30 ] ; lookup TLS index mov rdx , QWORD PTR fs :[ rdx ] ; access TLS at index mov rdx , QWORD PTR [ rdx ] ; dereference TLS pointer mov rdx , QWORD PTR [ rdx + 0x48 ] ; load current toupper lookup table mov eax , DWORD PTR [ rdx + rax * 4 + 0x200 ] ; lookup c in LUT 2 a: ret

Since it’s a standalone function call, it has to do more work. There are no less than five chained (pointer-chasing) memory accesses. Only two of those accesses remained in the fast loop, because the rest were hoisted up and out of the loop. The input to output latency of this function is probably close to 25 cycles, so out measured throughput of ~7 cycles means that the CPU was able to run several copies in parallel, not too terrible all things considered.

Why does this happen?

Through a long series of includes, C++ files like <algorithm> include <bits/os_defines.h> which has this line:

// This keeps isanum, et al from being propagated as macros. #define __NO_CTYPE 1

When <ctype.h> is ultimately included, this prevents the block containing the extern inline definition of toupper from being included:

#if !defined __NO_CTYPE # ifdef __isctype_f __isctype_f ( alnum ) // lots more like this __isctype_f ( xdigit ) # elif defined __isctype # define isalnum(c) __isctype((c), _ISalnum) # define isalpha(c) __isctype((c), _ISalpha) // more like this # endif // the stuff we care about # ifdef __USE_EXTERN_INLINES __extern_inline int __NTH ( tolower ( int __c )) { return __c >= - 128 && __c < 256 ? ( * __ctype_tolower_loc ())[ __c ] : __c ; } __extern_inline int __NTH ( toupper ( int __c )) { return __c >= - 128 && __c < 256 ? ( * __ctype_toupper_loc ())[ __c ] : __c ; } # endif // here's where tolower is defined as a macro # if __GNUC__ >= 2 && defined __OPTIMIZE__ && !defined __cplusplus # define tolower(c) __tobody (c, tolower, *__ctype_tolower_loc (), (c)) # define toupper(c) __tobody (c, toupper, *__ctype_toupper_loc (), (c)) # endif /* Optimizing gcc */ #endif /* Not __NO_CTYPE. */

Note when including <ctype.h> C++ toupper is never defined as a macro - at most it is extern inline - the macro definitions below that are guarded by !defined __cplusplus so they’ll never take effect.

So I’m not sure if the extern inline bodies of tolower and toupper are intended to be excluded by __NO_CTYPE in this case, but that’s what happens and this has a significant performance impact in this toy loop. As a corollary, if you include <cctype> rather than <ctype.h> (the C++ version of the C header which puts functions in the std:: namespace) you also get the slow behavior because <cctype> ultimately includes <bits/os_defines.h> .

Does this matter? Nah.

toupper is broken for serious multilingual use and, if you only care about ASCII you can write your own faster function. If you care about proper text handling, you are probably using UTF-8 and you’ll have to use something like ICU to do locale-aware text handling, or wait for C++ to get Unicode support (you might be waiting a while). It’s only interesting in clickbait sense of “clang-format can cause a 4x performance regression”.

Does this happen on all libc versions? Mostly yes, but it gets complicated.

The above results apply directly to gcc 5.5 and glibc 2.23, since that’s what I used, but things get weirder on newer versions (starting around glibc 2.27). On newer versions, the issue as described above occurs between <algorithm> and <ctype.h> , but additionally <stdlib.h> enters the picture: if you include <stdlib.h> before <ctype.h> you’ll enter slow mode, which doesn’t happen on earlier versions. So apparently <stdlib.h> also ends up defining __NO_CTYPE at some point on these newer versions. At least here we can’t blame clang-format sorting – it might fix the issue for by sorting headers (in files where you aren’t including any other problematic header).

I have filed a bug against libc, so this specific issue might be fixed in the future, but no doubt the scourge of include order will be with us more or less forever.

I don’t have a comments system, but I’m working on it (read: occasionally complaining out loud that comments on a static site are hard).

In the meantime you can discuss this on Hacker News or lobste.rs.

Thanks

Thanks to HN user ufo who pointed out you don’t need a lambda to adapt std::toupper for use in std::transform , and Jonathan Müller who subsequently pointed out that in fact you do want a lambda.

If you liked this post, check out the homepage for others you might enjoy.



