Optimized SIMD Cross-Product

TL;DR: use the function cross_improved(...), listed below, and do not compile it with FMA (-mfma).

A cross-product is a useful operation, particularly in graphics. The computation is this:

\[ \vec{x} \times \vec{y} := \begin{bmatrix} x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2\\ x_0 y_1 - x_1 y_0 \end{bmatrix} \]
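For reference, the definition above transcribes directly into scalar code. (This is a baseline sketch; the name cross_scalar is mine, not from any library.)

```cpp
#include <array>

// Scalar reference implementation of the cross-product formula above
// (0-indexed components); useful as a correctness baseline for the
// SIMD versions that follow.
std::array<float,3> cross_scalar(std::array<float,3> const& x,
                                 std::array<float,3> const& y) {
    return {
        x[1]*y[2] - x[2]*y[1],
        x[2]*y[0] - x[0]*y[2],
        x[0]*y[1] - x[1]*y[0]
    };
}
```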

3D vectors are typically stored (at least on-chip) in a 4-lane hardware vector register, with the fourth component ignored. On typical (x86-64) desktops, these are the SSE registers xmm0 through xmm15. The simple and common implementation (such as here) then becomes:
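Getting a 3D vector into and out of such a register might look like the following sketch (the helper names load_vec3 / store_vec3 are mine; the fourth lane is zeroed here, though any value would do since it's ignored):

```cpp
#include <xmmintrin.h>

// Pack a 3D vector into the low three lanes of an `xmm` register.
// Note _mm_set_ps takes arguments high-lane-first.
inline __m128 load_vec3(float const v[3]) {
    return _mm_set_ps(0.0f, v[2], v[1], v[0]); // lanes: [v0, v1, v2, 0]
}

// Unpack the low three lanes back out.
inline void store_vec3(float out[3], __m128 v) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);
    out[0] = tmp[0]; out[1] = tmp[1]; out[2] = tmp[2];
}
```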

```cpp
inline __m128 cross_simple(__m128 const& x, __m128 const& y) {
    __m128 tmp0 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,0,2,1));
    __m128 tmp1 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,1,0,2));
    __m128 tmp2 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,1,0,2));
    __m128 tmp3 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,0,2,1));
    return _mm_sub_ps(
        _mm_mul_ps(tmp0,tmp1),
        _mm_mul_ps(tmp2,tmp3)
    );
}
```

This starts with four shuffle instructions, then two multiplies, then a subtract. Not bad, but can we do better?
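To make the shuffles concrete: _MM_SHUFFLE takes its four lane selectors high-to-low, so _MM_SHUFFLE(3,0,2,1) (the pattern used throughout) rotates the three vector lanes while leaving the unused fourth lane in place. A minimal sketch (rotate_xyz is my name for it):

```cpp
#include <xmmintrin.h>

// _MM_SHUFFLE(3,0,2,1) selects source lanes (1,2,0,3) for destination
// lanes (0,1,2,3): [a0,a1,a2,a3] becomes [a1,a2,a0,a3]. Lane 3, the
// ignored fourth component, stays put.
inline __m128 rotate_xyz(__m128 a) {
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(3,0,2,1));
}
```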

To see how, observe that the purpose of the shuffles is to arrange the temporary vectors for the subsequent multiplication. Clearly, e.g. tmp0 and tmp1 must be shuffled with respect to each other, but shouldn't this only require one shuffle?

Indeed, but then the resulting vector will be in the wrong order! This is fine though: we can just add a single shuffle back at the end to fix it. Hence:

```cpp
inline __m128 cross_improved(__m128 const& x, __m128 const& y) {
    __m128 tmp0 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,0,2,1));
    __m128 tmp1 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,0,2,1));
    tmp0 = _mm_mul_ps(tmp0,x);
    tmp1 = _mm_mul_ps(tmp1,y);
    __m128 tmp2 = _mm_sub_ps(tmp0,tmp1);
    return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}
```

\[ T(\vec{x},\vec{y}) := \begin{bmatrix} x_0 y_1 - x_1 y_0\\ x_1 y_2 - x_2 y_1\\ x_2 y_0 - x_0 y_2 \end{bmatrix},\qquad \vec{x} \times \vec{y} = \begin{bmatrix} T(\vec{x},\vec{y})_1\\ T(\vec{x},\vec{y})_2\\ T(\vec{x},\vec{y})_0 \end{bmatrix} \]

We can also use a fused multiply-subtract instruction to combine one of the multiplications with the subtraction, if the CPU supports it (feature "FMA", i.e. FMA3, introduced with Haswell alongside AVX2; note that plain AVX, as on Sandy Bridge / Ivy Bridge, does not imply FMA support). Note: compile with -mfma.

```cpp
inline __m128 cross_improvedfma(__m128 const& x, __m128 const& y) {
    __m128 tmp0 = _mm_shuffle_ps(y,y,_MM_SHUFFLE(3,0,2,1));
    __m128 tmp1 = _mm_shuffle_ps(x,x,_MM_SHUFFLE(3,0,2,1));
    tmp1 = _mm_mul_ps(tmp1,y);
    __m128 tmp2 = _mm_fmsub_ps( tmp0,x, tmp1 );
    return _mm_shuffle_ps(tmp2,tmp2,_MM_SHUFFLE(3,0,2,1));
}
```

How much faster is this in practice? Well, it's insufficient to look at the C++ code. It's insufficient even to count the instructions in the disassembly. We have to look at the timing of the instructions' micro-ops. Although I initially tried to do this by hand, it turns out to be nearly impossible, due to the complex interactions between various pieces of the CPU's microarchitecture, which often aren't even documented anywhere.

Fortunately, Intel comes to our rescue with the Intel Architecture Code Analyzer (IACA), which, with some finagling, can produce fairly accurate static timing analysis. The current version (3.0, as of this writing) inexplicably doesn't support latency analysis. I manually parsed some results out of its throughput analysis, but I'm not confident in them. Instead, here are the timings from the latency analysis in version 2.1, for Haswell:

| Compiler Flags   | Version                 | Haswell Latency (Cycles) |
|------------------|-------------------------|--------------------------|
| -mavx (AVX)      | cross_simple(...)       | 12                       |
| -mavx (AVX)      | cross_improved(...)     | 11                       |
| -mfma (FMA, AVX) | cross_simple(...)       | 12                       |
| -mfma (FMA, AVX) | cross_improved(...)     | 13                       |
| -mfma (FMA, AVX) | cross_improvedfma(...)  | 13 (same as cross_improved(...)) |

What's interesting here is that adding FMA instructions actually hurts the improved version (even though the improved version without FMA is the best of any algorithm, with or without FMA). It's unclear exactly why this happens. I should probably look into the details of the traces some more at some point, but I suspect the FMA instruction's complexity is the culprit.

We conclude that you should use cross_improved(...) and compile that module without FMA. That said, when the function is inlined, the compiler may be able to interleave surrounding instructions with the FMA instructions, so it's unclear whether FMA would actually hurt in practice, while cross_simple(...), for all its simplicity, is still actually a pretty good choice. In any case, the difference is at most a few cycles.

For completeness, the key pieces of the disassembly code look like:

With -mavx (AVX):

cross_simple(...):

```asm
;`x` loaded into `xmm0`
;`y` loaded into `xmm1`
vshufps xmm2, xmm0, xmm0, 201
vshufps xmm3, xmm1, xmm1, 210
vshufps xmm0, xmm0, xmm0, 210
vshufps xmm1, xmm1, xmm1, 201
vmulps  xmm2, xmm2, xmm3
vmulps  xmm0, xmm0, xmm1
vsubps  xmm0, xmm2, xmm0
;x⨯y returned in `xmm0`
```

cross_improved(...):

```asm
;`x` loaded into `xmm0`
;`y` loaded into `xmm1`
vshufps xmm3, xmm1, xmm1, 201
vshufps xmm2, xmm0, xmm0, 201
vmulps  xmm0, xmm0, xmm3
vmulps  xmm1, xmm1, xmm2
vsubps  xmm0, xmm0, xmm1
vshufps xmm0, xmm0, xmm0, 201
;x⨯y returned in `xmm0`
```

cross_improvedfma(...): N/A (requires -mfma).

With -mfma (FMA, AVX):

cross_simple(...):

```asm
;`x` loaded into `xmm0`
;`y` loaded into `xmm1`
vshufps xmm2, xmm0, xmm0, 201
vshufps xmm3, xmm1, xmm1, 210
vmulps  xmm2, xmm2, xmm3
vshufps xmm1, xmm1, xmm1, 201
vshufps xmm0, xmm0, xmm0, 210
vfnmadd132ps xmm0, xmm2, xmm1
;x⨯y returned in `xmm0`
```

cross_improved(...):

```asm
;`x` loaded into `xmm0`
;`y` loaded into `xmm1`
vshufps xmm2, xmm1, xmm1, 201
vshufps xmm3, xmm0, xmm0, 201
vmulps  xmm1, xmm1, xmm3
vfmsub132ps xmm0, xmm1, xmm2
vshufps xmm0, xmm0, xmm0, 201
;x⨯y returned in `xmm0`
```

cross_improvedfma(...) (note that this is functionally identical to the cross_improved(...) block above):

```asm
;`x` loaded into `xmm2`
;`y` loaded into `xmm1`
vshufps xmm0, xmm1, xmm1, 201
vshufps xmm3, xmm2, xmm2, 201
vmulps  xmm1, xmm1, xmm3
vfmsub132ps xmm0, xmm1, xmm2
vshufps xmm0, xmm0, xmm0, 201
;x⨯y returned in `xmm0`
```

It seems several others have discovered similar solutions: