The TCP/IP Checksum

The TCP/IP checksum is used to detect corruption of data over a TCP or IPv4 connection. If a bit is flipped, a byte is mangled, or some other badness happens to a packet, then it is highly likely that the receiver of that broken packet will notice the problem due to a checksum mismatch. This provides an end-to-end check that the data stream is correct.

IPv4 uses the checksum to detect corruption of packet headers, i.e. the source, destination, and other meta-data. The TCP protocol includes an extra checksum that protects the packet "payload" as well as the header, in addition to the header checksum of IP. IPv6 does not include a header checksum at all. This is due to the modern assumption that low-level transport protocols typically carry some form of error correction, together with the fact that higher-level protocols (like TCP) do as well. This makes IPv6 packet headers "leaner" than they might otherwise be (provided the size of the flow-label isn't altered).

The algorithm for the TCP and IPv4 checksums is identical. The data is processed a word (16 bits, two bytes) at a time. (An odd number of bytes just has a trailing zero byte added to make the total even.) The words are added using ones-complement arithmetic, with the location holding the checksum set to zero. Once this long chain of additions is complete, the result is negated in ones-complement arithmetic by taking the binary not, and stored in the checksum field. If the same computation is then run over a packet containing a correct checksum, the accumulated sum (before the final inversion) comes out as binary all-ones, which is how a receiver verifies a packet.

The algorithm above has many interesting properties. The first is that since it is based on addition, it is both commutative and associative. This means that we can use some other calculation that effectively performs the additions in a different order. By re-arranging the additions we might be able to increase performance.

The second property is more subtle: the result is endian-independent. A ones-complement addition can be performed on twos-complement machines (most of them) by doing a normal twos-complement addition, followed by adding any carry back in at the least-significant position. This means that we effectively have a carry from the first byte into the second, and from the second byte into the first. This symmetry means that no matter what endian order we internally impose on those two bytes, the calculation gives the same result.

A final interesting property (not really used in this article) is that since the checksum is based on addition, we can alter a packet and update the checksum cheaply. We just need to do a ones-complement subtraction of the differences in the words we change. For example, this means that when a router updates the time-to-live (TTL) of a packet, it doesn't have to redo the whole checksum calculation.

Evaluating the checksum can be a bottleneck in network-heavy environments. This is why many network interfaces offer the ability to offload the checksum calculation to hardware. However, many network stacks still do the checksum in software, which is why optimizing its calculation can be important. This article explores how that might be done.

TCP/IP Checksum Algorithms

C mandates wrap-around (in effect, twos-complement) arithmetic for its unsigned integers, but doesn't specify the representation of signed integers, so we can't use ones-complement arithmetic directly. (This is for portability: hardware might use either representation.) C also gives no access to the carry flag, or to an add-with-carry instruction.
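As an aside, the carry-folding and incremental-update properties described above are easy to express in portable C. The following sketch is illustrative only (the helper names are mine, not code from this article); the update function is essentially the method described in RFC 1624:

/* Ones-complement 16-bit addition on a twos-complement machine:
   do a normal addition, then add any carry back in at the
   least-significant position. */
static unsigned short ones_add16(unsigned short a, unsigned short b)
{
    unsigned sum = (unsigned) a + b;
    return (sum & 0xFFFF) + (sum >> 16);
}

/* Incremental update: when one 16-bit word of a packet changes, patch
   the stored checksum instead of recomputing it. This is how a router
   can update the TTL cheaply. (Hypothetical helper, for illustration.) */
static unsigned short checksum_update(unsigned short check,
                                      unsigned short old_word,
                                      unsigned short new_word)
{
    /* ~check is the raw sum; subtract the old word, add the new,
       then re-invert. */
    return ~ones_add16(ones_add16(~check, ~old_word), new_word);
}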
C may not expose the carry flag, but if we use integer types wider than a 16-bit word, we can accumulate many carries, and then, due to the commutativity and associativity of addition, fold them all in at the end. (Note that we don't have to worry about overflow of the wider type because the size of packets is limited.) Some C code that does that looks like:

unsigned short checksum1(const char *buf, unsigned size)
{
    unsigned sum = 0;
    int i;

    /* Accumulate checksum (i + 1 < size is safe even for an empty
       buffer, despite size being unsigned) */
    for (i = 0; i + 1 < size; i += 2)
    {
        unsigned short word16 = *(unsigned short *) &buf[i];
        sum += word16;
    }

    /* Handle odd-sized case */
    if (size & 1)
    {
        unsigned short word16 = (unsigned char) buf[i];
        sum += word16;
    }

    /* Fold to get the ones-complement result */
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);

    /* Invert to get the negative in ones-complement arithmetic */
    return ~sum;
}

We can run the above 2^24 times as a benchmark, and time the result. Obviously, the times will vary depending on the size of the buffer used. We choose to look at 64-byte, 1023-byte and 1024-byte packets. The 64-byte case is indicative of the overhead for small packets. 1024 bytes tells us how performance behaves in the large-packet limit. 1023 bytes helps diagnose the effect of odd-sized packets on the different variants of the checksum algorithms. Doing that, the results are:

Size (bytes)    64      1023    1024
Time (s)        0.88    10.99   11.03

The obvious algorithm above handles 16 bits at a time. Usually, if you process more data at a time, performance is better. Since we have a 64-bit machine, we would like to modify the algorithm to handle chunks of that size. The problem there is the carries. Fortunately, it isn't hard to detect an (unsigned) overflow: if a+b < a, then a carry occurred. Thus the main loop is easily modified to add an extra one to the total when needed. Finally, we need to "fold" the result down from 64 to 16 bits at the end. Such an algorithm looks like:

unsigned short checksum2(const char *buf, unsigned size)
{
    unsigned long long sum = 0;
    const unsigned long long *b = (unsigned long long *) buf;
    unsigned t1, t2;
    unsigned short t3, t4;

    /* Main loop - 8 bytes at a time */
    while (size >= sizeof(unsigned long long))
    {
        unsigned long long s = *b++;
        sum += s;
        if (sum < s) sum++;
        size -= 8;
    }

    /* Handle tail less than 8 bytes long */
    buf = (const char *) b;
    if (size & 4)
    {
        unsigned s = *(unsigned *) buf;
        sum += s;
        if (sum < s) sum++;
        buf += 4;
    }

    if (size & 2)
    {
        unsigned short s = *(unsigned short *) buf;
        sum += s;
        if (sum < s) sum++;
        buf += 2;
    }

    if (size)
    {
        unsigned char s = *(unsigned char *) buf;
        sum += s;
        if (sum < s) sum++;
    }

    /* Fold down to 16 bits */
    t1 = sum;
    t2 = sum >> 32;
    t1 += t2;
    if (t1 < t2) t1++;
    t3 = t1;
    t4 = t1 >> 16;
    t3 += t4;
    if (t3 < t4) t3++;

    return ~t3;
}

As one might expect, this does a fair bit better:

Size (bytes)    64      1023    1024
Time (s)        0.29    2.90    2.93

It isn't quite the four-times speedup that four-times larger chunks might suggest, but it's close. The problem now is to do even better. However, since C doesn't have access to the carry flag, it becomes increasingly difficult to describe what we want the machine to do. Instead, we will make the jump to assembly language, which gives us the freedom to try any technique we wish.
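For reference, the article doesn't show its timing harness; something like the following sketch (the buffer setup, iteration count placement, and volatile sink are my assumptions) reproduces the methodology:

#include <stdio.h>
#include <time.h>

unsigned short checksum1(const char *buf, unsigned size);

int main(void)
{
    static char buf[1024];          /* also run with sizes 64 and 1023 */
    volatile unsigned short sink;   /* keep the calls from being optimized away */
    unsigned i;
    clock_t start = clock();

    /* 2^24 checksum calls, as in the timings above */
    for (i = 0; i < (1u << 24); i++)
        sink = checksum1(buf, sizeof buf);

    (void) sink;
    printf("%.2f s\n", (double) (clock() - start) / CLOCKS_PER_SEC);
    return 0;
}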
A direct hand translation of checksum2 into assembly language might look like the following:

.globl checksum3
.type checksum3,@function
.align 16
checksum3:
        xor %eax, %eax
        cmp $8, %esi
        jl 2f

        # The main loop
1:      add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi
        sub $8, %esi
        cmp $8, %esi
        jge 1b

        # Handle the tail
2:      test $4, %esi
        je 3f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

3:      test $2, %esi
        je 4f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

4:      test $1, %esi
        je 5f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

        # Fold down to 16 bits
5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax

        # Invert to get the final checksum
        not %ax
        retq
.size checksum3, .-checksum3

The main loop is quite simple. It uses the %rax register to accumulate the result, using the add instruction to do so. Carries are handled by the adc instruction afterwards. The rest of the loop just updates the pointer into the buffer, and counts down the number of iterations to do. Handling the tail, and the final fold down to the 16-bit result, are also simplified by the ability to use the adc instruction explicitly.

Since the above is a faithful translation of the C code, you wouldn't expect much of a difference in performance. That is the case, with the results being:

Size (bytes)    64      1023    1024
Time (s)        0.28    2.88    2.91

There is only a smidgen of a difference, due to the intrinsically lower overhead of assembly language. To go faster we need to use more of the flexibility available.

One of the most obvious things to do is to optimize the main loop some more. If we take some care, we can keep the carry flag "live" throughout. This means we can get rid of the adc-with-zero, and turn it into an adc of the next thing to sum in the next iteration. However, the current code corrupts the carry flag with the other instructions in the loop. So we need to be creative and use the lea instruction to update the pointer, and the dec instruction to count down the loop. Neither of those affects the carry flag.

.globl checksum4
.type checksum4,@function
.align 16
checksum4:
        mov %esi, %ecx
        xor %eax, %eax

        # Divide by 8 to get the total number of iterations
        shr $3, %ecx
        je 2f

        # Clear the carry before starting the loop
        clc

        # The new smaller main loop
1:      adc (%rdi), %rax
        lea 8(%rdi), %rdi
        dec %ecx
        jne 1b

        # Fold in the final carry
        adc $0, %rax

2:      test $4, %esi
        je 3f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

3:      test $2, %esi
        je 4f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

4:      test $1, %esi
        je 5f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum4, .-checksum4

The above shrinks the hot inner loop enormously, and is obviously going to be much faster. However, reality isn't so kind:

Size (bytes)    64      1023    1024
Time (s)        0.27    2.87    2.90

Ouch. The transformation had little to no effect at all. It turns out that on this machine adc is a very fast instruction. The extra adc instruction inside the loop overlapped in execution with the other operations, so removing it didn't speed anything up much at all. However, if that's the case, we can use that knowledge to make some improvements.
If we unroll the loop once, and make two additions in it, we can perhaps double the speed:

.globl checksum5
.type checksum5,@function
.align 16
checksum5:
        mov %esi, %ecx
        xor %eax, %eax

        # Now handle 16 bytes at a time
        shr $4, %ecx
        je 2f
        clc

        # The main loop now uses two 64-bit additions
1:      adc (%rdi), %rax
        adc 8(%rdi), %rax
        lea 16(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

        # We need to handle anything up to 15 tail bytes
2:      test $8, %esi
        je 3f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi

3:      test $4, %esi
        je 4f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

4:      test $2, %esi
        je 5f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

5:      test $1, %esi
        je 6f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

        # Since we still accumulate into 64 bits, this doesn't change
6:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum5, .-checksum5

This time, our suspicions are confirmed, and we get a significant performance increase:

Size (bytes)    64      1023    1024
Time (s)        0.20    1.54    1.56

Obviously, if unrolling once is good, unrolling even more must be better. If we unroll so that there are four additions in the inner loop, we get:

.globl checksum6
.type checksum6,@function
.align 16
checksum6:
        mov %esi, %ecx
        xor %eax, %eax

        # 32 bytes at a time
        shr $5, %ecx
        je 2f
        clc

        # Let's make sure our quite-large loop is now aligned
        .align 16

        # Four 64-bit adds per iteration
1:      adc (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        lea 32(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

        # Handle the 31 or fewer bytes remaining
2:      test $16, %esi
        je 3f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc $0, %rax
        add $16, %rdi

3:      test $8, %esi
        je 4f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi

4:      test $4, %esi
        je 5f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

5:      test $2, %esi
        je 6f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

6:      test $1, %esi
        je 7f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

7:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum6, .-checksum6

Again, we get more improvement. Not quite as much as last time, but enough to make it worth it:

Size (bytes)    64      1023    1024
Time (s)        0.18    1.10    1.08

Unrolling further doesn't seem to help. To increase the speed we will need to think of something else to change. As we unroll, we increase the length of the dependent instruction chain in the loop. If we can somehow break this chain into smaller pieces, then the CPU might be able to exploit more inter-instruction parallelism. Unfortunately, there is only a single carry flag. However, we can combine what we have with something like the algorithm in the original C code. Using lea instructions to accumulate into a wide register, we can get extra additions into the loop without many dependencies.
Such code looks like:

.globl checksum7
.type checksum7,@function
.align 16
checksum7:
        mov %esi, %ecx
        xor %eax, %eax
        shr $5, %ecx
        je 2f

        # Use %r8 to accumulate as well (the xor also clears the carry flag)
        xor %r8, %r8
        .align 16

1:      movl 24(%rdi), %edx

        # Add 32 bits into %r8
        lea (%rdx,%r8), %r8

        # Three adc's now into %rax instead of four
        adc (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        movl 28(%rdi), %edx

        # Add another 32 bits into %r8
        lea (%rdx,%r8), %r8
        lea 32(%rdi), %rdi
        dec %ecx
        jne 1b

        # Merge %r8 with the other results in %rax
        adc %r8, %rax
        adc $0, %rax

2:      test $16, %esi
        je 3f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc $0, %rax
        add $16, %rdi

3:      test $8, %esi
        je 4f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi

4:      test $4, %esi
        je 5f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

5:      test $2, %esi
        je 6f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

6:      test $1, %esi
        je 7f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

7:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum7, .-checksum7

So does this save us any cycles? Unfortunately, it doesn't:

Size (bytes)    64      1023    1024
Time (s)        0.22    1.10    1.15

If anything, it is slightly worse than the previous case. Sometimes being tricky helps, and sometimes it doesn't. This is a case of the latter. We will need to try something different to improve.

Vectorization

One obvious optimization technique for cpu-bound problems is vectorization. If we can use the power of the SSE instructions then we can do more work in parallel, and hopefully increase performance. However, there is one drawback: the set of SSE instructions is anything but orthogonal. There is a high likelihood that the exact instruction you want simply isn't implemented. Instead, we have to make do with what exists, and re-design our algorithm around that.

In this case, the limiting factor is that the vector instructions have no concept of the carry flag (or of ones-complement arithmetic), so we can't use the best-performing technique above. However, we can still use the method of accumulating results into a double-length register. In this case, we can add 32-bit chunks to the two 64-bit halves of a 128-bit xmm register. In fact, we can go even further, and accumulate into two such registers simultaneously, further increasing parallelism. Such code looks like:

.globl checksum8
.type checksum8,@function
.align 16
checksum8:
        mov %esi, %ecx
        xor %eax, %eax
        shr $5, %ecx
        je 2f

        # %xmm7 will hold zeros so that the unpack instructions work nicely
        xorpd %xmm7, %xmm7

        # The two accumulation registers
        xorpd %xmm0, %xmm0
        xorpd %xmm2, %xmm2
        clc
        .align 16

1:      movupd (%rdi), %xmm1
        movupd 16(%rdi), %xmm3

        # Add the upper halves of each 16-byte chunk with adc instructions
        adc 8(%rdi), %rax
        adc 24(%rdi), %rax

        # Distribute the lower 8 bytes into two 4-byte chunks
        punpckldq %xmm7, %xmm1
        punpckldq %xmm7, %xmm3
        lea 32(%rdi), %rdi

        # Accumulate into the two xmm registers
        paddq %xmm1, %xmm0
        paddq %xmm3, %xmm2
        dec %ecx
        jne 1b

        # Merge the %xmm register results
        paddq %xmm2, %xmm0

        # Extract the information from %xmm0 by using the red zone
        movapd %xmm0, -24(%rsp)
        adc -24(%rsp), %rax
        adc -16(%rsp), %rax
        adc $0, %rax

        # Handle the remaining possible tail of 31 bytes like before
2:      test $16, %esi
        je 3f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc $0, %rax
        add $16, %rdi

3:      test $8, %esi
        je 4f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi

4:      test $4, %esi
        je 5f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

5:      test $2, %esi
        je 6f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

6:      test $1, %esi
        je 7f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

7:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum8, .-checksum8

The above carefully overlaps use of the vector instructions with normal 64-bit arithmetic. (It is possible to use a purely vectorized algorithm, but that turns out to be slower.) However, the results are not encouraging:

Size (bytes)    64      1023    1024
Time (s)        0.28    2.27    2.29

This is much worse than not using vectorization at all. The problem here is that we are using three relatively expensive instructions (a vector load, an unpack, and an add) to do the work of a single 64-bit adc. The fact that the vector instructions can execute with more parallelism doesn't make up for their intrinsic inefficiency. We can unroll more, and do 64 bytes at a time:

.globl checksum9
.type checksum9,@function
.align 16
checksum9:
        mov %esi, %ecx
        xor %eax, %eax
        shr $6, %ecx
        je 2f

        # Just accumulate into %xmm0 this time
        xorpd %xmm0, %xmm0
        xorpd %xmm7, %xmm7
        clc
        .align 16

1:      adc 8(%rdi), %rax
        adc 24(%rdi), %rax

        # Use three xmm loads though
        movupd (%rdi), %xmm1
        movupd 16(%rdi), %xmm2
        movupd 32(%rdi), %xmm3
        adc 40(%rdi), %rax
        adc 48(%rdi), %rax

        # Three unpacks
        punpckldq %xmm7, %xmm1
        punpckldq %xmm7, %xmm2
        punpckldq %xmm7, %xmm3
        adc 56(%rdi), %rax
        lea 64(%rdi), %rdi

        # Three semi-independent adds
        paddq %xmm1, %xmm2
        paddq %xmm3, %xmm0
        paddq %xmm2, %xmm0
        dec %ecx
        jne 1b

        movapd %xmm0, -24(%rsp)
        adc -24(%rsp), %rax
        adc -16(%rsp), %rax
        adc $0, %rax

2:      test $32, %esi
        je 3f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        adc $0, %rax
        add $32, %rdi

3:      test $16, %esi
        je 4f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc $0, %rax
        add $16, %rdi

4:      test $8, %esi
        je 5f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi

5:      test $4, %esi
        je 6f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

6:      test $2, %esi
        je 7f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

7:      test $1, %esi
        je 8f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

8:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum9, .-checksum9

This does a little better:

Size (bytes)    64      1023    1024
Time (s)        0.26    1.70    1.75

However, we still haven't caught up to the unvectorized code. We need to vectorize in a smarter way. The problem is that we are still using three "expensive" vector instructions to handle 64 bits of data. There is, however, a way to use four vector instructions to work on 128 bits of data, which might be up to 50% faster. The trick is to emulate the second version of the C code, which open-codes the add-with-carry operation by using a comparison. There are plenty of comparison operations available, but none does exactly what we want. Basically, we would like a vectorized unsigned integer comparison, but all the integer comparison instructions work on signed integers. We can get around this by noticing that if a and b are unsigned integers, then the signed comparison (a - msb) < (b - msb) will produce the correct result.
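In scalar C, that trick looks like the following (a throwaway sketch; the vector code below applies the same idea lane-by-lane, 32 bits at a time):

/* Unsigned comparison using only a signed one: flipping the msb maps
   the unsigned range monotonically onto the signed range, so a signed
   compare then gives the unsigned answer. (The casts are
   implementation-defined in C, but behave as expected on the
   twos-complement machines discussed here.) */
static int unsigned_less(unsigned a, unsigned b)
{
    return (int) (a ^ 0x80000000u) < (int) (b ^ 0x80000000u);
}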
Basically, we are shifting zero down to INT_MIN. Since the msb (most significant bit) can be toggled with an addition, a subtraction, or an xor instruction, we have quite some flexibility here. We choose to keep the accumulator in the form a + msb. Then, when we add a partial result b, we get (a + b + msb). Thus, if we compare (a + b + msb) against (a + msb), we can detect a carry. Since the SSE comparison instructions set their "true" value to all-ones, which equals -1 in twos-complement arithmetic, we can subtract the comparison result to, in effect, add the carry as needed. Finally, when done, we can xor out the msb's to get the accumulated ones-complement results. So some lateral thinking gives:

.align 16
# Mask of most significant bits
mask:   .quad 0x8000000080000000
        .quad 0x8000000080000000

.globl checksum10
.type checksum10,@function
.align 16
checksum10:
        mov %esi, %ecx
        xor %eax, %eax
        shr $6, %ecx
        je 2f
        movapd mask, %xmm0
        movapd mask, %xmm2
        clc
        .align 16

1:      adc 32(%rdi), %rax

        # Take copies of the two accumulators for the carry comparison
        movapd %xmm0, %xmm1
        movapd %xmm2, %xmm3
        adc 40(%rdi), %rax

        # Accumulate 32 bytes of data in %xmm0 and %xmm2
        paddd (%rdi), %xmm0
        paddd 16(%rdi), %xmm2
        adc 48(%rdi), %rax

        # pcmpgtq is in SSE4.2, not SSE2, which limits us to 32-bit accumulators
        pcmpgtd %xmm0, %xmm1
        pcmpgtd %xmm2, %xmm3
        adc 56(%rdi), %rax

        # Add in the carries (each compare result is 0 or -1 per lane)
        psubd %xmm1, %xmm0
        psubd %xmm3, %xmm2
        lea 64(%rdi), %rdi
        dec %ecx
        jne 1b

        # Use the same algorithm to combine %xmm0 and %xmm2 when done
        xorpd mask, %xmm2
        movapd %xmm0, %xmm1
        paddd %xmm2, %xmm0
        pcmpgtd %xmm0, %xmm1
        psubd %xmm1, %xmm0
        xorpd mask, %xmm0

        # Finally, extract the results from %xmm0
        movapd %xmm0, -24(%rsp)
        adc -24(%rsp), %rax
        adc -16(%rsp), %rax
        adc $0, %rax

2:      test $32, %esi
        je 3f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        adc $0, %rax
        add $32, %rdi

3:      test $16, %esi
        je 4f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc $0, %rax
        add $16, %rdi

4:      test $8, %esi
        je 5f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi

5:      test $4, %esi
        je 6f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

6:      test $2, %esi
        je 7f
        movzwq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

7:      test $1, %esi
        je 8f
        movzbq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax

8:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
.size checksum10, .-checksum10

This improved algorithm does better than the previous vectorized code:

Size (bytes)    64      1023    1024
Time (s)        0.26    1.28    1.21

Again, it still isn't fast enough. Perhaps the problem is that we now have extra overhead after the main loop ends, in order to combine the two xmm register results.
Let's just accumulate into one register:

.globl checksum11
.type checksum11,@function
.align 16
checksum11:
        mov %esi, %ecx
        xor %eax, %eax
        shr $6, %ecx
        je 2f

        # Just use %xmm0 to accumulate
        movapd mask, %xmm0
        clc
        .align 16

1:      adc 0(%rdi), %rax
        adc 8(%rdi), %rax

        # A single load-and-add
        movapd %xmm0, %xmm1
        paddd 16(%rdi), %xmm0
        adc 32(%rdi), %rax
        adc 40(%rdi), %rax

        # A single compare
        pcmpgtd %xmm0, %xmm1
        adc 48(%rdi), %rax
        adc 56(%rdi), %rax

        # Handle the carries
        psubd %xmm1, %xmm0
        lea 64(%rdi), %rdi
        dec %ecx
        jne 1b

        # Remove the msb's now that we are done
        xorpd mask, %xmm0
        movapd %xmm0, -24(%rsp)
        adc -24(%rsp), %rax
        adc -16(%rsp), %rax
        adc $0, %rax

2:      mov %esi, %ecx
        shr $3, %ecx
        and $7, %ecx        # Also clears the carry flag
        je 4f

3:      adc (%rdi), %rax
        lea 8(%rdi), %rdi
        dec %ecx
        jne 3b

4:      adc $0, %rax
        test $7, %edi
        jne 7f              # Not aligned
        mov %esi, %ecx
        and $7, %ecx
        jne 6f              # Some odd bytes to do

        # Fold down to 16 bits
5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq

        # Add in the last bytes via a masked 8-byte read
6:      lea (,%ecx,8), %ecx
        mov $-1, %rdx
        neg %ecx
        shr %cl, %rdx
        and (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        jmp 5b

7:      test $4, %esi
        je 8f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

8:      test $2, %esi
        je 9f
        movzwq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

9:      test $1, %esi
        je 5b
        movzbq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        jmp 5b
.size checksum11, .-checksum11

This does a little better than before:

Size (bytes)    64      1023    1024
Time (s)        0.170   1.19    1.11

We are catching up on the unvectorized code. However, there is something rather important that we have ignored up until now. The SSE instructions used above require their memory operands to be 16-byte aligned. If that is not the case, then they will fail with a hardware exception. We need to make sure that the buffer is properly aligned. Sometimes that is trivial: just use an aligned allocator, or an aligned attribute for a static or automatic buffer. However, the use-case for the TCP checksum is networking, and the data from packets can be unaligned, so we really need to handle that annoying case.

If the buffer is mis-aligned, but still evenly aligned, then we can use a similar algorithm. Just handle the beginning bytes until we are 16-byte aligned (or run out of bytes to checksum), initialize the accumulator with that value, and continue as normal with the vectorized code. However, if we are at an odd alignment, we run into trouble: we are left with a partial 16-bit word to handle. Here, the fact that the checksum is endian-neutral saves us. We can accumulate part of the checksum in the "wrong" endian form, and then correct for that later. Thus the weakness that SSE instructions don't handle mis-aligned data severely complicates the code. However, we can still get something to work:

.globl checksum12
.type checksum12,@function
.align 16
checksum12:
        xor %eax, %eax
        test $0xf, %edi
        jne 7f              # Not aligned?
0:      mov %esi, %ecx
        shr $6, %ecx
        je 2f
        movapd mask, %xmm0
        clc

        # The main loop assumes it has 16-byte alignment
1:      adc 0(%rdi), %rax
        adc 8(%rdi), %rax
        movapd %xmm0, %xmm1
        paddd 16(%rdi), %xmm0
        adc 32(%rdi), %rax
        adc 40(%rdi), %rax
        pcmpgtd %xmm0, %xmm1
        adc 48(%rdi), %rax
        adc 56(%rdi), %rax
        psubd %xmm1, %xmm0
        lea 64(%rdi), %rdi
        dec %ecx
        jne 1b

        xorpd mask, %xmm0
        movq %xmm0, %rdx
        movhlps %xmm0, %xmm0
        adc %rdx, %rax
        movq %xmm0, %rdx
        adc %rdx, %rax
        adc $0, %rax

2:      mov %esi, %ecx
        shr $3, %ecx
        and $7, %ecx        # Also clears the carry flag
        je 4f

3:      adc (%rdi), %rax
        lea 8(%rdi), %rdi
        dec %ecx
        jne 3b

4:      adc $0, %rax
        mov %esi, %ecx
        and $7, %ecx
        jne 6f              # Some odd bytes to do

        # Fold down to 16 bits
5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq

        # Add in the last bytes
6:      lea (,%ecx,8), %ecx
        mov $-1, %rdx
        neg %ecx
        shr %cl, %rdx
        and (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        jmp 5b

7:      test $1, %edi
        jne 11f             # Odd-aligned is annoying
        mov %edi, %ecx
        neg %ecx
        and $0xf, %ecx
        cmp %ecx, %esi
        cmovl %esi, %ecx    # Work to do is min(size, alignment)
        sub %ecx, %esi
        test $8, %ecx
        je 8f
        movq (%rdi), %rax
        add $8, %rdi

8:      test $4, %ecx
        je 9f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

9:      test $2, %ecx
        je 10f
        movzwq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

10:     test $1, %ecx
        je 0b
        movzbq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $1, %rdi
        test %esi, %esi     # Nothing left to do?
        je 5b
        jmp 0b              # Continue with the rest

        # Odd alignment: do the first byte, recurse, then fix up the result
11:     movzbw (%rdi), %r8w
        inc %rdi
        dec %esi
        je 12f              # Only one byte to do?
        push %rbp           # Needed for 16-byte stack alignment
        call checksum12
        pop %rbp
        not %ax
        xchg %al, %ah
        add %r8w, %ax
        adc $0, %ax
        not %ax
        retq

12:     mov %r8w, %ax
        not %ax
        retq
.size checksum12, .-checksum12

The cost of checking for alignment is dwarfed by the other work the function needs to do, and the resulting performance doesn't change from the previous (incorrect) method:

Size (bytes)    64      1023    1024
Time (s)        0.170   1.20    1.11

Unfortunately, on this machine, it doesn't look like you can do much better using the vector instructions. However, the above does exploit a trick that we haven't used with the non-vectorized code: by enforcing aligned accesses in the inner loop, we can clean up the code that deals with the odd-sized tail.
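Before moving on, the odd-alignment fix-up is the subtlest part of checksum12, so here is a scalar C model of it (a sketch only: checksum_aligned is a hypothetical stand-in for the evenly-aligned fast path, and the recursion mirrors the assembly's call to itself):

unsigned short checksum_aligned(const char *buf, unsigned size);

/* Model of the odd-alignment correction; assumes size >= 2 (the
   assembly handles the one-byte case separately). */
unsigned short checksum_odd(const char *buf, unsigned size)
{
    unsigned sum;

    /* Raw ones-complement sum of everything past the first byte.
       Relative to the original buffer, each of those bytes sits in
       the "wrong" half of its 16-bit word. */
    unsigned short rest = ~checksum_aligned(buf + 1, size - 1);

    /* The endian-independence of the checksum means a single byte
       swap corrects all of them at once. */
    rest = (rest << 8) | (rest >> 8);

    /* Add the first byte back in, fold the carry, and invert. */
    sum = rest + (unsigned char) buf[0];
    sum = (sum & 0xFFFF) + (sum >> 16);
    return ~sum;
}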
Applying the same alignment-enforcing approach to the plain adc-based loop looks like:

.globl checksum13
.type checksum13,@function
.align 16
checksum13:
        xor %eax, %eax
        test $7, %edi
        jne 7f

0:      mov %esi, %ecx
        shr $6, %ecx
        je 2f
        clc
        .align 16

1:      adc 0(%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        adc 32(%rdi), %rax
        adc 40(%rdi), %rax
        adc 48(%rdi), %rax
        adc 56(%rdi), %rax
        lea 64(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

2:      mov %esi, %ecx
        shr $3, %ecx
        and $7, %ecx        # Also clears the carry flag
        je 4f

3:      adc (%rdi), %rax
        lea 8(%rdi), %rdi
        dec %ecx
        jne 3b

4:      adc $0, %rax
        and $7, %esi
        jne 6f              # Some odd bytes to do

        # Fold down to 16 bits
5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq

        # Add in the last bytes (an aligned 8-byte read cannot cross a
        # page boundary, so the over-read is safe)
6:      lea (,%esi,8), %ecx
        mov $-1, %rdx
        neg %ecx
        shr %cl, %rdx
        and (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        jmp 5b

7:      test $1, %edi
        jne 11f             # Odd-aligned is annoying
        mov %edi, %ecx
        neg %ecx
        and $0x7, %ecx
        cmp %ecx, %esi
        cmovl %esi, %ecx    # Work to do is min(size, alignment)
        sub %ecx, %esi
        test $4, %ecx
        je 8f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

8:      test $2, %ecx
        je 9f
        movzwq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

9:      test $1, %ecx
        je 10f
        movzbq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $1, %rdi

10:     test %esi, %esi     # Nothing left to do?
        je 5b
        jmp 0b              # Continue with the rest

11:     movzbw (%rdi), %r8w
        inc %rdi
        dec %esi
        je 12f              # Only one byte to do?
        call checksum13
        not %ax
        xchg %al, %ah
        add %r8w, %ax
        adc $0, %ax
        not %ax
        retq

12:     mov %r8w, %ax
        not %ax
        retq
.size checksum13, .-checksum13

This gives:

Size (bytes)    64      1023    1024
Time (s)        0.128   1.13    1.05

This is faster than before, except for the odd-length case, which is slightly slower. If we could only make those cases more like the ones that are a multiple of 64 bytes long (the unrolled loop step size), then we might be able to increase speed a little further. One way to do this is via the assembly equivalent of Duff's device: we can jump into the middle of the loop when we start it. Unfortunately, we still need to handle lengths that are not multiples of the 64-bit register size in a special way, but we needed to do that anyway. Normally, you would construct a jump table, and then use it to branch to the correct spot in the inner loop. However, here, nearly all the interesting instructions have the same four-byte encoding (the first adc in the loop is a byte shorter, which the jump-point arithmetic accounts for). This means we can do some arithmetic to emulate the equivalent of a computed goto, and avoid a relatively expensive, unpredictable memory lookup.
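Duff's device is usually shown in C, where a switch jumps into the middle of an unrolled loop. A generic sketch (plain 64-bit sums only, ignoring the carry handling the real code needs) looks like:

/* Classic Duff's device: enter the 8-way unrolled loop mid-way, so no
   separate tail loop is needed. Illustrative only. */
static unsigned long long sum_words(const unsigned long long *p, unsigned nwords)
{
    unsigned long long sum = 0;
    unsigned n = (nwords + 7) / 8;

    if (!nwords) return 0;

    switch (nwords % 8)
    {
    case 0: do { sum += *p++;
    case 7:      sum += *p++;
    case 6:      sum += *p++;
    case 5:      sum += *p++;
    case 4:      sum += *p++;
    case 3:      sum += *p++;
    case 2:      sum += *p++;
    case 1:      sum += *p++;
            } while (--n);
    }

    return sum;
}

A compiler turns the switch above into a jump table; the assembly below instead computes the entry point with a single lea, since the target offsets are known.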
The resulting assembly looks like:

.globl checksum14
.type checksum14,@function
.align 16
checksum14:
        xor %eax, %eax
        test $7, %edi
        jne 5f

0:      mov %esi, %edx
        mov %esi, %ecx
        shr $3, %edx
        je 2f
        shr $6, %ecx
        and $7, %edx

        # A multiple of 64 means we do full executions of the loop
        je 1f
        add $1, %ecx

        # We need to adjust the pointer, because we have a partial loop
        lea -64(%rdi,%rdx,8), %rdi

        # Compute the jump point
        neg %edx
        and $7, %edx
        lea 1f-1(,%rdx,4), %rdx
        jmp *%rdx

        .align 16
1:      adc (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        adc 32(%rdi), %rax
        adc 40(%rdi), %rax
        adc 48(%rdi), %rax
        adc 56(%rdi), %rax
        lea 64(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

2:      and $7, %esi
        jne 4f              # Some odd bytes to do

        # Fold down to 16 bits
3:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq

        # Add in the last bytes
4:      lea (,%esi,8), %ecx
        mov $-1, %rdx
        neg %ecx
        shr %cl, %rdx
        and (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        jmp 3b

5:      test $1, %edi
        jne 9f              # Odd-aligned is annoying
        mov %edi, %ecx
        neg %ecx
        and $0x7, %ecx
        cmp %ecx, %esi
        cmovl %esi, %ecx    # Work to do is min(size, alignment)
        sub %ecx, %esi
        test $4, %ecx
        je 6f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

6:      test $2, %ecx
        je 7f
        movzwq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

7:      test $1, %ecx
        je 8f
        movzbq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $1, %rdi

8:      test %esi, %esi     # Nothing left to do?
        je 3b
        jmp 0b              # Continue with the rest

9:      movzbw (%rdi), %r8w
        inc %rdi
        dec %esi
        je 10f              # Only one byte to do?
        call checksum14
        not %ax
        xchg %al, %ah
        add %r8w, %ax
        adc $0, %ax
        not %ax
        retq

10:     mov %r8w, %ax
        not %ax
        retq
.size checksum14, .-checksum14

This gives:

Size (bytes)    64      1023    1024
Time (s)        0.118   1.10    1.04

The above is very good: the best code so far on this machine. The only thing left to try is shrinking the inner loop. The current code processes 64 bytes at a time, like the vectorized version did. However, the old adc-based algorithm found that 32 bytes at a time was best. Let's take one last look at what that might do:

.globl checksum15
.type checksum15,@function
.align 16
checksum15:
        xor %eax, %eax
        test $7, %edi
        jne 5f

0:      mov %esi, %edx
        mov %esi, %ecx
        shr $3, %edx
        je 2f
        shr $5, %ecx
        and $3, %edx
        je 1f
        add $1, %ecx
        lea -32(%rdi,%rdx,8), %rdi
        neg %edx
        and $3, %edx
        lea 1f-1(,%rdx,4), %rdx
        jmp *%rdx

        # Now only 32 bytes at a time
        .align 16
1:      adc (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        lea 32(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

2:      and $7, %esi
        jne 4f              # Some odd bytes to do

        # Fold down to 16 bits
3:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq

        # Add in the last bytes
4:      lea (,%esi,8), %ecx
        mov $-1, %rdx
        neg %ecx
        shr %cl, %rdx
        and (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        jmp 3b

5:      test $1, %edi
        jne 9f              # Odd-aligned is annoying
        mov %edi, %ecx
        neg %ecx
        and $0xf, %ecx
        cmp %ecx, %esi
        cmovl %esi, %ecx    # Work to do is min(size, alignment)
        sub %ecx, %esi
        test $8, %ecx
        je 6f
        movq (%rdi), %rax
        add $8, %rdi

6:      test $4, %ecx
        je 7f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi

7:      test $2, %ecx
        je 8f
        movzwq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi

8:      test $1, %ecx
        je 0b
        movzbq (%rdi), %rdx
        add %rdx, %rax
        adc $0, %rax
        add $1, %rdi
        test %esi, %esi     # Nothing left to do?
        je 3b
        jmp 0b              # Continue with the rest

9:      movzbw (%rdi), %r8w
        inc %rdi
        dec %esi
        je 10f              # Only one byte to do?
        call checksum15
        not %ax
        xchg %al, %ah
        add %r8w, %ax
        adc $0, %ax
        not %ax
        retq

10:     mov %r8w, %ax
        not %ax
        retq
.size checksum15, .-checksum15

The results don't change all that much:

Size (bytes)    64      1023    1024
Time (s)        0.128   1.10    1.04

The only major difference is that the small 64-byte packet time has gone up, due to it now requiring two loop iterations instead of one. It is still faster than the earlier unvectorized code, but we may want to keep the 64-byte unrolled version to avoid this slight speed loss.

Summary

After trying many, many optimization tricks, we have found a method of calculating the TCP/IP checksum up to an order of magnitude faster than the original C code. Most of the gains came from computing on larger chunks of data (64 bits rather than 16 bits at a time). More gains came from unrolling the inner loop and controlling the use of the carry flag. Small gains were obtained by controlling alignment, and by breaking slow paths out of the main control flow. Unfortunately, vectorization was not a win on this machine for this problem. The non-orthogonality of the SSE instruction set can hurt quite a bit when the operation you need isn't supported.

However, this isn't the full story. On more recent machines, the adc instruction that is key to the unvectorized algorithm is much slower. In addition, the vector instructions are much faster, and there are more of them, doing new things. The combination of these changes could mean that one of the "slower" functions above is actually better on such hardware. If vectorization does prove better on your machine, you may want to explore adding the Duff's-device trick to the best vectorized version. You may also want to explore using instructions added after SSE2, or perhaps even the 256-bit AVX extensions.

As hardware changes over time, so does the optimal version of an algorithm. Most of the time, it isn't worth hyper-optimizing something. In the cases where you absolutely need the performance edge, though, it can be highly profitable.
