Back in the days when I was playing with SSE instructions, I was trying to optimize every workload I could think of. One of these was converting thousands of IPv4 strings to 32-bit numbers for further processing. This article shows one way to optimize such a task, and how the SSE instruction set can be used to get the best out of your $1000 Intel CPU :)

All the code used in this article can be downloaded here: https://github.com/quarkslab/ip_conv_sse.

Notes:

We only support this format, and not things like "0x25BB2F46" (which inet_aton supports, for instance).

It is as simple as this:

The Python 3 module is written in plain Python, hence the big difference. So we compare ourselves to the fastest of these methods, that is, libleeloo.

Here are the results running on a Core i7-3520M for 1 million strings:

We took the libleeloo ( http://github.com/quarkslab/libleeloo ) implementation and the Python 3 ipaddress module ( http://code.google.com/p/ipaddress-py/source/browse/ipaddress.py ). The benchmark consists in converting a list of one million IPv4 addresses stored in memory. Our goal is to maximize the number of converted strings per second. This metric makes sense as the algorithm here is of linear complexity.

The first thing to do when optimizing such an algorithm is to have a reference benchmark. We could have taken eglibc's inet_aton, but, as said previously, it does a lot of checks and work to support different formats, and thus it is not a good starting point.

In our case, for the first part of the IP address, this gives ('3'-'0')*10 + ('7'-'0')*1.

To convert an address to a 32-bit number, we need to convert each substring (between dots) to an actual number and compose these numbers to get the final result.

So we try to use SSE instructions to perform these transformations.

We did some profiling of the libleeloo code using gprof. According to the results, the atoi3 function is the most computation-intensive part, that is, the conversion of strings into integers.

One can assume that the subtraction and multiplication operations used to convert strings to numbers are the most computation-intensive ones. There is also a lot of branching to look for the dots, which can impact performance.

The long road to optimization

The main idea is that one IPv4 string can, at most, exactly fit into a SSE register. Indeed, 255.255.255.255 is 15 characters wide, 16 with the terminating null byte, thus 128 bits, which is the size of a SSE register!

So, we assume that we are able to have a SSE register filled like this:

SSE register = {'3', '7', '.', '1', '8', '7', '.', '4', '7', '.', '7', '0', 0, 0, 0, 0}

So, let's see what SSE instructions might be of use to us. In order to do this, Intel provides us with a really nice web application you can access here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Note that this used to be a Java application (which no longer seems to be downloadable, but you can still get a copy here: http://files.geekou.info/intel_intrinsics.tar.bz2), and even earlier a PDF that stopped at SSE4.2.

Back to the web application, we select all flavors of SSE. Then, we need to find:

a subtraction operation: in the "Arithmetic" category, we can find _mm_sub_epi8, which subtracts two SSE registers pairwise as 8-bit integers. Sounds like our candidate for the subtraction of '0'.

a bitwise-or operation: _mm_or_si128 in the "Logical" category seems fine.

something to find out where the dots are. The "Compare" category seems a good choice, and _mm_cmpeq_epi8 our candidate. It generates a "mask" vector whose bytes are equal to 0xFF where the dots are, and 0x00 otherwise. That is, res = _mm_cmpeq_epi8(a, b) is semantically the same as:

  for i from 0 to 15:
      res[i] = (a[i] == b[i]) ? 0xFF : 0x00

a multiply operation. This is a bit trickier, because the product of two 8-bit integers might end up as a 16-bit integer.

Multiplications

As the different multiplication instructions available will determine how our algorithm works, let's detail the available choices a bit. The SSE instruction set provides several ways to do multiplication. Basically, the different kinds of instructions are, for res = a * b (with res, a and b 128-bit SSE registers):

multiply 16-bit numbers to 32-bit (_mm_mullo_epi16):

  res[0:32[   = a[0:16[  * b[0:16[
  res[32:64[  = a[16:32[ * b[16:32[
  res[64:96[  = a[32:48[ * b[32:48[
  res[96:128[ = a[48:64[ * b[48:64[

same with 32-bit numbers to 64-bit (_mm_mullo_epi32):

same with the upper parts of the a and b registers (_mm_mulhi_epi16 / _mm_mulhi_epi32):

some mix of multiplications and additions, like this (_mm_maddubs_epi16):

  res[0:16[  = a[0:8[   * b[0:8[   + a[8:16[  * b[8:16[
  res[16:32[ = a[16:24[ * b[16:24[ + a[24:32[ * b[24:32[
  res[32:48[ = a[32:40[ * b[32:40[ + a[40:48[ * b[40:48[
  res[48:64[ = a[48:56[ * b[48:56[ + a[56:64[ * b[56:64[
  etc.

Let's imagine we manage to get this vector (out of '37.187.47.70'; X means any 8-bit value):

  SSE ip_int = {3, 7, X, 1, 8, 7, X, 4, 7, X, 7, 0, X, X, X, X}

What we would like to do is multiply it by this one:

  SSE mul = {10, 1, 0, 100, 10, 1, 0, 10, 1, 0, 10, 1, 0, 0, 0, 0}

and get this:

  SSE res = {3*10, 7*1, X*0, 1*100, 8*10, 7*1, X*0, 4*10, 7*1, X*0, 7*10, 0*1, X*0, X*0, X*0, X*0}

and then do some kind of horizontal sum to obtain this:

  SSE ip = {3*10  + 7*1  + X*0,
            1*100 + 8*10 + 7*1 + X*0,
            4*10  + 7*1  + X*0,
            7*10  + 0*1  + X*0}

But SSE does not provide a way to multiply SSE registers by packed 8-bit numbers. The minimum is, as shown above, 16-bit. Fortunately for us, _mm_maddubs_epi16 is the one that will help us. The issue, though, is that it only works on 8 blocks of two 8-bit numbers each, that is, it requires our vectors to be organised like this:

  SSE ip_int = {0,   3,  7, X, 1,   8,  7, X, 0,   4,  7, X, 0,   7,  0, X}  (1)
  SSE mul    = {100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0}  (2)

With res = _mm_maddubs_epi16(ip_int, mul), we have:

  SSE ip_res = {0*100 + 3*10, 7*1 + X*0,
                1*100 + 8*10, 7*1 + X*0,
                0*100 + 4*10, 7*1 + X*0,
                0*100 + 7*10, 0*1 + X*0}

Each number here is 16-bit wide.
Thus, we need one last horizontal 16-bit addition and shifting, that is, one instruction that does:

  res[0:32[   = (a[0:16[   + a[16:32[)  << 24
  res[32:64[  = (a[32:48[  + a[48:64[)  << 16
  res[64:96[  = (a[64:80[  + a[80:96[)  << 8
  res[96:128[ = (a[96:112[ + a[112:128[)

There is no instruction that does exactly that, and _mm_madd_epi16 is the closest to what we are looking for:

  SSE res = {a[0:16[   * b[0:16[   + a[16:32[   * b[16:32[,
             a[32:48[  * b[32:48[  + a[48:64[   * b[48:64[,
             a[64:80[  * b[64:80[  + a[80:96[   * b[80:96[,
             a[96:112[ * b[96:112[ + a[112:128[ * b[112:128[}

Thus, by setting the b vector of 8 16-bit integers as:

  b = {2**8, 2**8, 1, 1, 2**8, 2**8, 1, 1}  (3)

we end up with, for res = _mm_madd_epi16(ip_res, b):

  res[0:32[   = (a[0:16[   + a[16:32[)  << 8
  res[32:64[  = (a[32:48[  + a[48:64[)
  res[64:96[  = (a[64:80[  + a[80:96[)  << 8
  res[96:128[ = (a[96:112[ + a[112:128[)

The final computation will just need one 16-bit shift and four or operations. We need to keep this in mind, as this is the key to the vectorized algorithm, but we still have to manage to get (1) and (2).

The vectorized algorithm

So, starting with our IP string in a SSE register and the above instructions, it seems natural to try something like this:

find out where the dots are in the IPv4 string

subtract '0'

manage to have (1) and (2) from this

use _mm_maddubs_epi16 and _mm_madd_epi16 to get four 32-bit numbers representing the four parts of the IP

construct the final number

Let's describe this a little bit, step by step.

First steps

So, first things first: the subtraction is not that hard to perform in SSE. We just need to create a vector where '0' bytes are at the right positions. So, we search for the dots and ending zeros, using the compare and or instructions:

  __m128i mask_dots = (sse_str == '.' | sse_str == 0)
                    = _mm_or_si128(
                          _mm_cmpeq_epi8(sse_str, _mm_set1_epi8('.')),
                          _mm_cmpeq_epi8(sse_str, _mm_setzero_si128()));

In our case, this gives:

  SSE ip        = {'3', '7', '.', '1', '8', '7', '.', '4', '7', '.', '7', '0', 0, 0, 0, 0}
  SSE mask_dots = {0x00, 0x00, 0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF}

Then, we use the andnot instruction to create the '0' register:

  SSE mask_zeros = ~mask_dots & {'0', '0', ..., '0'}
                 = _mm_andnot_si128(mask_dots, _mm_set1_epi8('0'))
                 = {'0', '0', 0x00, '0', '0', '0', 0x00, '0', '0', 0x00, '0', '0', 0x00, 0x00, 0x00, 0x00}

And, finally, subtract this from our original SSE "string" register:

  SSE ip_int = sse_ip - mask_zeros
             = _mm_sub_epi8(sse_ip, mask_zeros);

The shifting dance

Alright, so now we need to create (1) and (2). (2) is fairly easy, as it is a constant value:

  SSE mul_ten = _mm_set_epi8(0,1,10,100, 0,1,10,100, 0,1,10,100, 0,1,10,100);

(one will notice that the _mm_set_XX intrinsics take their arguments the "little-endian" way)

(1) requires more work. The goal is to make sure that each number of the original string is aligned on 4 bytes. In our case, we need to transform ip_int from:

  SSE ip_int = {3, 7, X, 1, 8, 7, X, 4, 7, X, 7, 0, X, X, X, X}

to:

  SSE ip_int = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 7, 0, X}
                <-------->  <-------->  <-------->  <------->
                1st block      2nd          3rd        4th

We need to play with the SSE shift instructions to add the zeros needed to build the four blocks. The _mm_shuffle_epi8 instruction permutes the different bytes of a register according to another one. A shift of one byte to the right can be expressed like this:

  right_shift(reg) = _mm_shuffle_epi8(reg,
      _mm_set_epi8(14,13,12,11,10,9,8,7,6,5,4,3,2,1,0,0x80));

(0x80 indicates that the byte at this index will be null)

For instance:

  right_shift(ip_int) = {0, 3, 7, X, 1, 8, 7, X, 4, 7, X, 7, 0, X, X, X}
                         <-------->
                         1st block

As we can see, this gives us our first block.
We can "move" this block into a fresh blank register, and go on:

  SSE mask_32bit = {0xFFFFFFFF, 0 (32-bit wide), 0 (32-bit wide), 0 (32-bit wide)}
  SSE ip_final   = ip_int & mask_32bit  = _mm_and_si128(ip_int, mask_32bit)
  ip_int         = ~mask_32bit & ip_int = _mm_andnot_si128(mask_32bit, ip_int)

Here, we end up with:

  ip_final = {0, 3, 7, X, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
  ip_int   = {0, 0, 0, 0, 1, 8, 7, X, 4, 7, X, 7, 0, X, X, X}

For the following block, we do not need any shifting, so we simply "move" it the same way:

  SSE mask_32bit = {0 (32-bit wide), 0xFFFFFFFF, 0 (32-bit wide), 0 (32-bit wide)}
  ip_final = ip_final | (ip_int & mask_32bit)
           = _mm_or_si128(ip_final, _mm_and_si128(ip_int, mask_32bit))
  ip_int   = ~mask_32bit & ip_int = _mm_andnot_si128(mask_32bit, ip_int)

resulting in:

  ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 0, 0, 0, 0, 0, 0, 0}
  ip_int   = {0, 0, 0, 0, 0, 0, 0, 0, 4, 7, X, 7, 0, X, X, X}

The third block needs one shift to the right, so let's do it:

  SSE mask_32bit = {0 (32-bit wide), 0 (32-bit wide), 0xFFFFFFFF, 0 (32-bit wide)}
  ip_int   = right_shift(ip_int)
  ip_final = ip_final | (ip_int & mask_32bit)
           = _mm_or_si128(ip_final, _mm_and_si128(ip_int, mask_32bit))
  ip_int   = ~mask_32bit & ip_int = _mm_andnot_si128(mask_32bit, ip_int)

resulting in:

  ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 0, 0, 0}
  ip_int   = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, X, X}

And the same for the last block:

  ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 7, 0, X}

This is the vector we wanted in (1)! What's left to compute are the shift "counts" we used. As you have now understood, they depend on the number of digits in each block. We need to find a way to compute them thanks to SSE instructions and what we've got so far, that is, a SSE "mask" register that indicates the positions of the dots and null bytes.
In our example:

  SSE mask_dots = {0x00, 0x00, 0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF}

What we can do is start from this vector to compute the shift counts:

first, shift this vector one byte to the left:

  SSE mask_shifts_count = left_shift(mask_dots)
                        = {0x00, 0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0x00}

create a vector of 2s according to this mask:

  SSE shifts_count = mask_shifts_count & {2, 2, ..., 2}
                   = _mm_and_si128(mask_shifts_count, {2, 2, ..., 2})
                   = {0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2, 2, 2, 2, 0}

Now, do the same one byte further, making sure we do not erase a shift count already set:

  mask_shifts_count = ~mask_shifts_count & left_shift(mask_shifts_count)
                    = {0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00}

  shifts_count = _mm_or_si128(shifts_count, _mm_and_si128(mask_shifts_count, {1, 1, ..., 1}))
               = {1, 2, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2, 0}
                  <-------->  <-------->  <-------->  <------->

So now, we shift our shifts_count vector the same way we will shift the ip_int vector. That is: take the first 8-bit integer of shifts_count (here 1). It means that the first block needs a one-byte shift to the right. We apply the same shift to shifts_count itself and end up with:

  shifts_count = right_shift(shifts_count)
               = {0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2}
                  <-------->  <-------->  <-------->  <------->

then, we take the 4th 8-bit number of shifts_count, and do the same. In our example, this is the shift count of our second block (0), so shifts_count does not change.

go on with the 8th byte, so a shift count of 1, and get:

  shifts_count = right_shift(shifts_count)
               = {0, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2}
                  <-------->  <-------->  <-------->  <------->

and, finally, for the last block, the 12th byte gives us the last shift count, that is 1! Using this last shift count and the same operations as above, we finally manage to create (1):

  ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 7, 0, X}