The standard C function for comparing two strings of the same length, memcmp, can be implemented naïvely as follows:

For each byte in the two strings: From both strings, load the byte at the position currently under consideration and compare their values. If they are not equal, return some value matching the sign of their difference (as unsigned bytes). If no differing bytes are discovered, return 0.
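The procedure just described could be written roughly as follows. This is only a sketch for illustration: the name naive_memcmp is made up here, and real C library implementations are considerably more elaborate.

```c
#include <stddef.h>

/* Byte-at-a-time comparison (hypothetical name, illustration only). */
int naive_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i]) {
            /* Early exit: the number of loop iterations, and therefore
               the running time, depends on the length of the shared
               prefix of the two inputs. */
            return (int)pa[i] - (int)pb[i];
        }
    }
    return 0; /* no differing byte found */
}
```

The early return inside the loop is the part that matters for the rest of the discussion.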

From time to time, there are reports that this implementation results in a timing oracle because the execution speed of memcmp depends on the length of the shared prefix. For a concrete example, the following piece of code from OpenVPN was fixed as CVE-2013-2061:

    /* Compare locally computed HMAC with packet HMAC */
    if (memcmp (local_hmac, BPTR (buf), hmac_len))
      CRYPT_ERROR ("packet HMAC authentication failed");

The idea is that an attacker would try some arbitrary HMAC values, cycling through values 0, 1, 2, …, 255 in the first byte, until they observe an HMAC check that is slightly slower than the others: memcmp has to look at the second byte to see whether the strings are equal, while for the other values, it only has to look at the first byte to see that the strings are different. After finding the slower byte value, the attacker treats it as the first byte of the correct HMAC value, and cycles through 0, 1, 2, …, 255 for the second byte. Again, one of the values will be slightly slower, and the attacker treats it as the correct value of the second byte. At least in theory, this procedure can be repeated for each of the bytes in the string, eventually recovering the correct HMAC value and defeating parts of the OpenVPN encryption scheme.
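To make the byte-by-byte search concrete, here is a toy model of it. It rests on a strong assumption: instead of measuring time, it reads the length of the shared prefix directly, which amounts to a noise-free oracle that a real attacker does not have. The names shared_prefix_len and recover_secret are hypothetical.

```c
#include <stddef.h>

/* Stand-in for the timing signal: how many leading bytes agree.
   An early-exit comparison runs longer the larger this value is. */
size_t shared_prefix_len(const unsigned char *a, const unsigned char *b,
                         size_t n)
{
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}

/* Idealized recovery loop. The caller must zero-initialize guess.
   For each position, try all 256 values and keep the one for which
   the comparison gets past that position. */
void recover_secret(const unsigned char *secret, unsigned char *guess,
                    size_t n)
{
    for (size_t pos = 0; pos < n; pos++) {
        for (int v = 0; v < 256; v++) {
            guess[pos] = (unsigned char)v;
            if (shared_prefix_len(secret, guess, n) > pos)
                break; /* the "slower" candidate: keep it */
        }
    }
}
```

In this model the search needs at most 256 oracle queries per byte, so at most 256 × n in total, rather than 256^n for a blind search. For the last byte, a real attacker would see the check succeed rather than take longer.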

For passwords, a similar attacking procedure can be used, as long as they are stored unhashed or hashed on the client side. For passwords and HMACs, the timing differences are minuscule, and it is an open question whether they will be observable to an attacker (whether locally on the system, on the same hypervisor, or over the network) in a given deployment scenario.

The traditional approach to address these concerns involves a memcmp replacement which computes a value that depends on all characters from both strings. For CVE-2013-2061, OpenVPN chose the cumulative bit-wise OR of the XOR of the two input bytes. If this value is zero, the two input strings are equal. Because this value tells us nothing about the ordering of the strings, it is not a full memcmp replacement, but the HMAC comparison does not need ordering information.
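A minimal sketch of such an OR-of-XOR comparison is shown below. It is not the OpenVPN code; the name ct_memneq is invented for this example. Note also that the C standard makes no timing guarantees, so a compiler is in principle free to transform the loop, and production code usually takes extra precautions.

```c
#include <stddef.h>

/* Returns 0 if the two buffers are equal, 1 otherwise.
   Hypothetical name; illustration of the OR-of-XOR technique only. */
int ct_memneq(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    unsigned char acc = 0;

    /* No early exit: every byte pair is always processed. */
    for (size_t i = 0; i < n; i++)
        acc |= (unsigned char)(pa[i] ^ pb[i]);

    /* acc is zero only if every XOR was zero, i.e. all bytes matched. */
    return acc != 0;
}
```

With a helper like this, the HMAC check above would test ct_memneq (local_hmac, BPTR (buf), hmac_len) instead of calling memcmp.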

Obviously, we would prefer to come up with a way to compute the ordering in constant time as well, and apply this change to the memcmp implementation, so that all applications can benefit from it. But always processing both input strings completely will likely result in measurable performance regressions for some applications, so this is a hard sell for a general-purpose implementation in a C run-time library.

But it turns out that we can do better: the timing oracle is particularly useful because the attacker can enumerate all the possible initial bytes, from 0 to 255. If we change the memcmp implementation to consider multiple bytes at a time, say 32-bit or 64-bit words, then the attacker would have to enumerate up to 4,294,967,296 or 18,446,744,073,709,551,616 values. (Even the 32-bit case is currently fairly safe because the timing difference can only be observed statistically, so multiple attempts are needed for each of the 4,294,967,296 candidates.) For performance reasons, most memcmp implementations already attempt to work at larger granularity than individual bytes, so this approach is not controversial at all.

But we still need a constant-time way to extract the ordering information (for the full memcmp case). There are various approaches:

1. Just convert the value to big endian. On big-endian machines, this is trivially a no-op. On x86, the BSWAP and, on some implementations, MOVBE instructions can be used.

2. XOR the input words, count the zero bits before the first difference (using BSF on x86, where the first byte in memory is the least significant byte of the loaded word), use that to shift both words into a suitable register position (so that bytes which come after the differing byte are masked away), and compute the difference.

3. Load two input words. Use a magic instruction sequence that computes, for each pair of bytes at the same position in the input words, a single bit that is set when the bytes are equal and cleared otherwise. Count the number of leading ones in that value, use that count as a string index to load the two differing bytes, and compute their difference.
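The big-endian conversion approach can be sketched in portable-ish C as follows. This is not the GNU C library code. It assumes GCC or Clang (for __builtin_bswap64 and the __BYTE_ORDER__ macro), and the names load_be64 and wordwise_memcmp are hypothetical. Whether the final comparison compiles to branch-free code depends on the compiler and target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes so that the first byte in memory becomes the most
   significant byte of the result. Assumes GCC or Clang. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v); /* unaligned-safe load */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap64(v); /* typically BSWAP or MOVBE on x86-64 */
#endif
    return v;
}

/* memcmp-like comparison that only exits at 64-bit word granularity. */
int wordwise_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    while (n >= 8) {
        uint64_t wa = load_be64(pa);
        uint64_t wb = load_be64(pb);
        if (wa != wb) {
            /* In big-endian form, unsigned integer order equals
               lexicographic byte order, so no per-byte search is needed. */
            return (wa > wb) - (wa < wb);
        }
        pa += 8;
        pb += 8;
        n -= 8;
    }

    /* Tail of fewer than 8 bytes: pack all remaining bytes into
       big-endian words without an early exit, then compare once. */
    uint64_t ta = 0, tb = 0;
    for (size_t i = 0; i < n; i++) {
        ta = (ta << 8) | pa[i];
        tb = (tb << 8) | pb[i];
    }
    return (ta > tb) - (ta < tb);
}
```

The function returns only -1, 0, or 1, which is allowed because memcmp callers may rely only on the sign of the result. The position of the differing byte within a word does not affect which instructions are executed, so the timing reveals the shared prefix length only in units of whole words.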

All these procedures appear fairly elaborate, so one has to wonder what the costs are. We implemented the first approach, based on BSWAP, on top of the GNU C library implementation of memcmp that is targeted at the current line of x86 CPUs which have fast unaligned loads. It turned out that the cost was mostly negative. For example, sorting a random permutation of /usr/share/dict/words using qsort was about ten per cent faster than before (this figure includes qsort overhead). In retrospect, this is not surprising: in essence, we vectorized the difference extraction code, replacing a complicated sequence of jumps with branch-free code. Especially when sorting, the place of the difference is data-dependent, so the branches are difficult to predict, which explains why the previous, branch-based code was quite a bit slower in the qsort-based benchmark. In its defense, the branch-based code is slightly faster for a few shared prefix lengths when all branches are correctly predicted, but this scenario should be relatively rare in practice. So in this particular case (targeted at current-line Intel 64 CPUs), the new memcmp implementation is an overall win. It is refreshing to see a case where addressing a security vulnerability makes your program go faster.