Historical wisdom says that if you're writing a 3D vector maths library, you generally shouldn't base it around a SIMD data type. You'd be very tempted to, because obviously a SIMD type can do 4 adds for the price of one. In practice this has always been a mixed blessing. While you can indeed do 4 adds, the new type doesn't usually behave nicely when it comes to mixing with the rest of the codebase.

In order to get data in and out of the SIMD registers, or to pass the registers around to different functions, you'll usually find that you'll spend more instructions actually doing loads and stores than you saved by switching to SIMD.

Intel's SSE implementation is a particular troublemaker. On 32-bit x86, the x87 floating point registers are separate from the SSE registers. This means that if you want to move a float into a __m128, the compiler has to store it out to memory, and then reload it. This completely nerfs any chance you had of getting good maths performance.

And so the advice usually given is to forget it: just stick with a simple struct with 3 floats in. SIMD optimization is best saved for a few special cases where you can write tight loops over specially organized data, such as CPU vertex skinning, or a cloth simulation perhaps. But don't use it for your regular 3D math library, because it'll probably end up slower.

Yet facts are not written in stone. What was true 10 years ago may not be true today. Advances in compiler technology mean we have to re-evaluate our beliefs every few years. I'm going to mention 3 particular advances here that change this:

Scalar SSE by default

When I said earlier that Intel's FP registers are separate from the SSE registers, that was missing part of the picture. You see, it turns out that the SSE registers aren't just for SIMD: you can do regular scalar floating point arithmetic with them too, at no cost. This means you never need to transfer floats between the x87 and SSE registers - you just keep them in SSE registers the whole time! Most compilers have been able to take advantage of this for many years now, but it may require turning on an extra option which some people don't realize is there. On Visual Studio, enable /arch:SSE2. Do it on both release and debug builds. (It's not required on 64-bit, as it's already the default there.)

SSA form

In the past 10 years most of the major compiler vendors have moved to SSA (Static Single Assignment) representations for their intermediate trees, which means that data flow is tracked far better throughout the whole compiler. Compilers now take any chance they can get to avoid redundant data transfer.

vectorcall

This is the key. The default calling convention for a function is "cdecl" (or "fastcall" on 64-bit). These conventions don't really care much for SSE registers; if you call a function that uses SSE registers, here's what happens:

The SSE register gets written out onto the stack

The address of the SSE register gets written out onto the stack

The function gets called

The address of the SSE register gets read off the stack

The SSE register gets read off the stack

Yeerugh. That's horrible. That's 4 extra steps per function argument. Surely there's something better we can do? In 2013 Microsoft introduced the Vector Calling Convention to address this. The idea of vectorcall is that SIMD values stay in SSE registers the whole time - they never need to go back and forth to memory (except of course when you actually finish with them). You can go through your codebase and tag every function with the new __vectorcall keyword if you want, but that's kinda stupid. Enable /Gv in the compiler options and it'll magically take hold everywhere.
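For illustration, here's roughly what the explicit per-function form looks like. The VM_CALL macro is a hypothetical wrapper of my own, purely so this sketch also compiles outside MSVC, where __vectorcall doesn't exist:

```cpp
#include <xmmintrin.h>

// __vectorcall is MSVC-specific, so for this sketch we wrap it in a
// hypothetical VM_CALL macro that expands to nothing on other compilers.
#if defined(_MSC_VER)
#define VM_CALL __vectorcall
#else
#define VM_CALL
#endif

// With /Gv enabled you wouldn't need the annotation at all; this just
// shows the explicit form. Under vectorcall both the __m128 arguments
// and the return value travel in XMM registers, never through memory.
__m128 VM_CALL addScaled(__m128 a, __m128 b, float s)
{
    // a + b*s, computed entirely in registers.
    return _mm_add_ps(a, _mm_mul_ps(b, _mm_set1_ps(s)));
}
```

(On 64-bit Linux the standard ABI already passes __m128 arguments in registers, so this is mostly a Windows concern.)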

Note: If you're using C++, /Gv does not apply by default to member functions; you'll have to annotate those manually. This doesn't matter for the ones in the float3 type itself, because they're force-inlined.

So, how do we put all this together and write our new maths library with it in mind? I've written a small sample application which you can download at the end. First I'll walk you through each section (that-there literate programming I've heard so much about), and then we'll inspect the generated assembly to see how well it does.

You'll note that we aim to follow HLSL syntax whenever possible. HLSL is the one true syntax. There's no place in this world for things like QVector3D::dotProduct() any more. Unfortunately the __m128 type prevents us from using the HLSL swizzle/access syntax, which sucks. It's not the end of the world though, we'll live.

Compiler options

I'm only going to cover Visual Studio here, but the same principles apply to gcc/clang. Note that these options should be applied to both release and debug builds.

Set inline expansion to "Only __inline" (/Ob1). Don't use automatic inlining (see below).

Enable intrinsic functions (/Oi).

Enable the enhanced instruction set (/arch:SSE2). Don't go higher than this as you'll limit your customer base. Literally everyone has SSE2 however. (NOTE: not needed on 64-bit, as it is already the default there)

Set the floating point model to 'Fast' (/fp:fast).

Set the default calling convention to __vectorcall (/Gv).

And of course, for release builds you'll want to enable optimization (!). I only mention that because occasionally I see new programmers who don't know about it.

There's a bunch more stuff you can tweak but that's your business. This is the minimum.

Headers

The most important header is <xmmintrin.h>, which provides the __m128 SSE type and the _mm_blah() functions which operate on it. You'll also need <math.h> for the regular stuff like sqrt(), and <stdint.h>. When designing libraries, it's important not to pull in headers that you don't need. Forward-declare types when you can.

Many game programmers are used to defining their own integer types, as C++ never used to do it for you. Well, we have <stdint.h> now, so use it. There's no place any more for your own Uint32 types. It's time to move on.

Also note the use of #pragma once. All C++ compilers you care about now support this (and even most of the ones you don't), so you can skip the include guards.

I realise I've strayed off the topic of maths here, but sometimes when you explain things it's good to color over the edges a little.

#pragma once
#include <stdint.h>
#include <math.h>
#include <xmmintrin.h>

Setup

Inlining is critical for a maths library. Not just for performance, but... OK, this is going to be hard to explain. It's about the essence of what's being generated. Many people think that the disassembly of some code is just an artifact - a byproduct of the compilation that you might occasionally take a peep at, if the fancy takes you. Yet the quality of the disassembly is important in its own right. The ability to step and inspect at that low level is critical.

Arbitrary inlining in general can make the disassembly hard to read, so don't rely on the compiler's own heuristics (i.e. guesses) as to whether to inline something. If you want it inlined, mark it as such, otherwise don't. On Visual Studio, set the /Ob1 option to achieve this, and then use __forceinline for anything you want inlined. Do this even in debug builds.

For tiny functions, inlining actually provides a clearer disassembly, and cleaner code flow. If your function is more than 4 lines however, you probably shouldn't have inlined it. Calls are cheap, especially now that we're using our lovely vectorcall to avoid all the vector-passing-baggage we used to have.

We'll also define a few helper macros here. We could include <float.h> and <limits.h> to get INT_MAX/FLT_MAX, but seeing as it's just a few lines we'll define them ourselves.

#define VM_INLINE   __forceinline
#define M_PI        3.14159265358979323846f
#define DEG2RAD(_a) ((_a)*M_PI/180.0f)
#define RAD2DEG(_a) ((_a)*180.0f/M_PI)
#define INT_MIN     (-2147483647 - 1)
#define INT_MAX     2147483647
#define FLT_MAX     3.402823466e+38F

Our Vector Type

First we'll just define a little helper for rearranging values.

Ideally we'd just use HLSL swizzles, but that's not an option.

// Shuffle helpers.
// Examples: SHUFFLE3(v, 0,1,2) leaves the vector unchanged.
//           SHUFFLE3(v, 0,0,0) splats the X coord out.
#define SHUFFLE3(V, X,Y,Z) float3(_mm_shuffle_ps((V).m, (V).m, _MM_SHUFFLE(Z,Z,Y,X)))

Now let's define our vector type. We'll just cover float3 here, but in a production library you'd also want float2, float4, and possibly int2-4 types too (although integer types probably don't need to be SIMD).

Always make constructors explicit (except for the default constructor). It'll save you many headaches down the line. Try to be generous in what you provide - remember the goal is to make it easy for the users. I haven't listed it, but you'll probably also want one that takes a float2+Z value and joins them.

struct float3
{
    // Constructors.
    VM_INLINE float3() {}
    VM_INLINE explicit float3(const float *p) { m = _mm_set_ps(p[2], p[2], p[1], p[0]); }
    VM_INLINE explicit float3(float x, float y, float z) { m = _mm_set_ps(z, z, y, x); }
    VM_INLINE explicit float3(__m128 v) { m = v; }

As I mentioned, the __m128 type doesn't provide any decent direct access to the X/Y/Z components, so we'll need wrappers. I really hate this, but what can you do. It's a small evil so we'll have to just live with it. You might be tempted to try defining the float3 struct as a union instead, but that'll just screw with the compiler's data flow analysis.

    VM_INLINE float x() const { return _mm_cvtss_f32(m); }
    VM_INLINE float y() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1))); }
    VM_INLINE float z() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2))); }

Let's make up for the lack of HLSL swizzles by defining some of the common ones manually:

    VM_INLINE float3 yzx() const { return SHUFFLE3(*this, 1, 2, 0); }
    VM_INLINE float3 zxy() const { return SHUFFLE3(*this, 2, 0, 1); }

Occasionally we might want to store data out to 3 unaligned floats, so let's add something for that.

    VM_INLINE void store(float *p) const { p[0] = x(); p[1] = y(); p[2] = z(); }

Some maths libraries provide writeable accessors using references (e.g. x() = 4;). This is a really stupid idea. References are just going to cause more loads/stores, and if your vector type has that kind of direct access anyway then you should have just exposed the X/Y/Z variables as a simple C struct.

But we do still occasionally need to write to the components. It's actually not as common as you might think - you should almost always prefer to use a float3() constructor to construct a new vector, rather than taking an existing one and poking it. Still, let's define some write access functions.

Note that _mm_set_ss is effectively a no-op here (i.e. it's just a cast; because the float already lives in an SSE register, it doesn't generate any actual code). So setX() boils down to a single instruction.

    VM_INLINE void setX(float x)
    {
        m = _mm_move_ss(m, _mm_set_ss(x));
    }
    VM_INLINE void setY(float y)
    {
        // Put y into lane 0, shuffle it up into lane 1, then restore the old x.
        __m128 t = _mm_move_ss(m, _mm_set_ss(y));
        t = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 2, 0, 0));
        m = _mm_move_ss(t, m);
    }
    VM_INLINE void setZ(float z)
    {
        // Same trick, but shuffle the new value into lane 2 instead.
        __m128 t = _mm_move_ss(m, _mm_set_ss(z));
        t = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 0, 1, 0));
        m = _mm_move_ss(t, m);
    }

Some maths libraries use index notation for all accesses (e.g. float f = v[1];). Don't do this. Use .x(), that's what it's for. The compiler will have a very hard time trying to treat the SIMD type as an array. However there are occasions when you actually do need indexing access, where the index isn't known at compile time. So let's add some access functions. They'll be slow, as they may cause a memory spill, but there we are.

    VM_INLINE float operator[](size_t i) const { return m.m128_f32[i]; }
    VM_INLINE float &operator[](size_t i) { return m.m128_f32[i]; }

And finally, our actual storage. We just embed the SSE type directly.

    __m128 m;
};

Important: There's a little subtlety here. In order for __vectorcall to work properly, we have to design our float3 type around this single __m128 member. If you try and mess with it, try and wrap it in a union, or try and stick anything else in this struct, you'll break the whole thing.

We'll also want one extra constructor here - one for constructing a float3 from 3 integers. It's not essential but it can save a lot of manually casting things about. It needs a different name so it doesn't get confused with the regular float3 constructor.

VM_INLINE float3 float3i(int x, int y, int z) { return float3((float)x, (float)y, (float)z); }

Phew! OK that's the first part done. In a sane world that would have just been a built-in primitive in the language, like HLSL has.

Static Construction

Occasionally you need to declare data globally. For float3's, you can just use a global constructor (e.g. float3 g_value(1,2,3); ). However some advanced usages might need a little more, so we declare a static construction helper here:

#define VCONST extern const __declspec(selectany)
struct vconstu
{
    union { uint32_t u[4]; __m128 v; };
    VM_INLINE operator __m128() const { return v; }
};
VCONST vconstu vsignbits = { 0x80000000, 0x80000000, 0x80000000, 0x80000000 };

This construct allows us to globally define SIMD data types that we can read as either integers or floating point. This is great for when you need to manually fiddle with the IEEE representation of floats. You can ignore this whole part for now though.

Operators

One of the nicest things about HLSL is that it separates the types from the functions that operate on them. Let's move on to our operators.

typedef float3 bool3;

Normally when you compare two values, you get a bool. SIMD comparison operators don't return a single value however, so we need a bool3 type. For today's lesson I'm just going to make that a float3 too, but if you go nuts with this then you might want to actually implement it as its own type.

Let's define our binary operators:

VM_INLINE float3 operator+ (float3 a, float3 b) { a.m = _mm_add_ps(a.m, b.m); return a; }
VM_INLINE float3 operator- (float3 a, float3 b) { a.m = _mm_sub_ps(a.m, b.m); return a; }
VM_INLINE float3 operator* (float3 a, float3 b) { a.m = _mm_mul_ps(a.m, b.m); return a; }
VM_INLINE float3 operator/ (float3 a, float3 b) { a.m = _mm_div_ps(a.m, b.m); return a; }
VM_INLINE float3 operator* (float3 a, float b) { a.m = _mm_mul_ps(a.m, _mm_set1_ps(b)); return a; }
VM_INLINE float3 operator/ (float3 a, float b) { a.m = _mm_div_ps(a.m, _mm_set1_ps(b)); return a; }
VM_INLINE float3 operator* (float a, float3 b) { b.m = _mm_mul_ps(_mm_set1_ps(a), b.m); return b; }
VM_INLINE float3 operator/ (float a, float3 b) { b.m = _mm_div_ps(_mm_set1_ps(a), b.m); return b; }
VM_INLINE float3 &operator+= (float3 &a, float3 b) { a = a + b; return a; }
VM_INLINE float3 &operator-= (float3 &a, float3 b) { a = a - b; return a; }
VM_INLINE float3 &operator*= (float3 &a, float3 b) { a = a * b; return a; }
VM_INLINE float3 &operator/= (float3 &a, float3 b) { a = a / b; return a; }
VM_INLINE float3 &operator*= (float3 &a, float b) { a = a * b; return a; }
VM_INLINE float3 &operator/= (float3 &a, float b) { a = a / b; return a; }
VM_INLINE bool3 operator== (float3 a, float3 b) { a.m = _mm_cmpeq_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator!= (float3 a, float3 b) { a.m = _mm_cmpneq_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator< (float3 a, float3 b) { a.m = _mm_cmplt_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator> (float3 a, float3 b) { a.m = _mm_cmpgt_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator<= (float3 a, float3 b) { a.m = _mm_cmple_ps(a.m, b.m); return a; }
VM_INLINE bool3 operator>= (float3 a, float3 b) { a.m = _mm_cmpge_ps(a.m, b.m); return a; }
VM_INLINE float3 min(float3 a, float3 b) { a.m = _mm_min_ps(a.m, b.m); return a; }
VM_INLINE float3 max(float3 a, float3 b) { a.m = _mm_max_ps(a.m, b.m); return a; }

Notice that they're all inline. You never, ever, want to actually step into any of these functions.

There's also the unary operators:

VM_INLINE float3 operator- (float3 a) { return float3(_mm_setzero_ps()) - a; }
VM_INLINE float3 abs(float3 v) { v.m = _mm_andnot_ps(vsignbits, v.m); return v; }

Note the clever trick for abs(): SSE has no abs instruction, so instead we just mask off the sign bits ourselves. Isn't bit-twiddling great?
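If the trick seems opaque, here's the same idea spelled out in scalar code (just an illustration, not part of the library): an IEEE-754 float keeps its sign in the top bit, so clearing that bit yields the absolute value.

```cpp
#include <stdint.h>
#include <string.h>

// Scalar version of the abs() bit trick: clear the sign bit of the
// float's IEEE-754 representation.
float absViaMask(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));   // type-pun safely via memcpy
    bits &= 0x7FFFFFFFu;               // clear the sign bit
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```

The SIMD version does exactly this, four lanes at a time, using vsignbits as the mask.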

Helpers

That's the core of the work done. We need a few nice helper functions too.

Horizontal min/max is very useful:

VM_INLINE float hmin(float3 v)
{
    v = min(v, SHUFFLE3(v, 1, 0, 2));
    return min(v, SHUFFLE3(v, 2, 0, 1)).x();
}
VM_INLINE float hmax(float3 v)
{
    v = max(v, SHUFFLE3(v, 1, 0, 2));
    return max(v, SHUFFLE3(v, 2, 0, 1)).x();
}
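To make the reduction easier to follow, here's a standalone sketch of hmin written directly against __m128, with the shuffle masks SHUFFLE3 would expand to (the same two-shuffle fold, just without the float3 wrapper):

```cpp
#include <xmmintrin.h>

// Standalone sketch of hmin(): fold Y and then Z into lane 0 with two
// shuffles. _MM_SHUFFLE(2,2,0,1) is what SHUFFLE3(v, 1,0,2) expands to,
// and _MM_SHUFFLE(1,1,0,2) is SHUFFLE3(v, 2,0,1).
float hmin3(__m128 v)
{
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 0, 1))); // lane 0 = min(x, y)
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 0, 2))); // lane 0 = min(x, y, z)
    return _mm_cvtss_f32(v);
}
```

Note that the unused W lane never contaminates lane 0, so whatever junk is in it doesn't matter.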

No 3D maths library would be complete without the vector cross product. This is specific to float3; float2/4 don't have this.

VM_INLINE float3 cross(float3 a, float3 b)
{
    // x <- a.y*b.z - a.z*b.y
    // y <- a.z*b.x - a.x*b.z
    // z <- a.x*b.y - a.y*b.x
    // We can save a shuffle by grouping it in this wacky order:
    return (a.zxy()*b - a*b.zxy()).zxy();
}
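If the grouping looks suspicious, it's easy to verify the algebra with plain scalar code (a throwaway check, not part of the library):

```cpp
// Verify that the shuffle-saving grouping used by cross() really is the
// standard cross product: (a.zxy*b - a*b.zxy).zxy == a x b.
struct V3 { float x, y, z; };  // plain struct, just for this verification

V3 crossGrouped(V3 a, V3 b)
{
    // t = a.zxy*b - a*b.zxy, computed componentwise
    V3 t = { a.z*b.x - a.x*b.z,
             a.x*b.y - a.y*b.x,
             a.y*b.z - a.z*b.y };
    // result = t.zxy
    V3 r = { t.z, t.x, t.y };
    return r;
}

V3 crossTextbook(V3 a, V3 b)
{
    V3 r = { a.y*b.z - a.z*b.y,
             a.z*b.x - a.x*b.z,
             a.x*b.y - a.y*b.x };
    return r;
}
```

Writing out the components shows the two agree term for term.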

I mentioned earlier that comparisons don't return a single value, they return a vector of booleans. How do we get this into a form we can actually work with?

The first step is to be able to construct an integer mask from the sign bits of each result:

// Returns a 3-bit code where bit0..bit2 is X..Z
VM_INLINE unsigned mask(float3 v) { return _mm_movemask_ps(v.m) & 7; }

This mask can sometimes be useful on its own, but we can also use it to write some simple comparison helpers:

// Once we have a comparison, we can branch based on its results:
VM_INLINE bool any(bool3 v) { return mask(v) != 0; }
VM_INLINE bool all(bool3 v) { return mask(v) == 7; }
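Here's a standalone sketch of what mask() boils down to, using the raw intrinsics (illustration only): a comparison produces all-ones lanes where true, movemask gathers each lane's top bit into an integer, and &7 keeps only the X/Y/Z bits.

```cpp
#include <xmmintrin.h>

// Build a 3-bit "a < b" mask: bit i is set where lane i compared true.
unsigned lessMask3(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmplt_ps(a, b)) & 7;
}
```

With this in hand, any() is just mask != 0 and all() is mask == 7.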

And finally we can fill out the rest of the HLSL standard library. I haven't listed all of them, there's a few more exotic ones you may or may not want, but this should get you started:

VM_INLINE float3 clamp(float3 t, float3 a, float3 b) { return min(max(t, a), b); }
VM_INLINE float sum(float3 v) { return v.x() + v.y() + v.z(); }
VM_INLINE float dot(float3 a, float3 b) { return sum(a*b); }
VM_INLINE float length(float3 v) { return sqrtf(dot(v, v)); }
VM_INLINE float lengthSq(float3 v) { return dot(v, v); }
VM_INLINE float3 normalize(float3 v) { return v * (1.0f / length(v)); }
VM_INLINE float3 lerp(float3 a, float3 b, float t) { return a + (b - a)*t; }
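These are just the familiar scalar formulas lifted onto the SIMD type. In plain scalar terms (a throwaway cross-check, not part of the library):

```cpp
#include <math.h>

// Scalar equivalents of dot() and length(): dot is the sum of the
// componentwise products, and length is sqrt(dot(v, v)).
float dot3(const float a[3], const float b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

float length3(const float v[3])
{
    return sqrtf(dot3(v, v));
}
```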

Well that's that. A simple SIMD float3 library for you. It might seem like a lot, but that's really only ~150 lines of code. I've left a matrix library as an exercise to the reader, but once you have float3/float4 in place, the rest can be based off that.

Note that the matrix library only needs to be written in terms of float3/float4 - you don't need to write any SSE intrinsics there. Also remember to only define your matrix struct as containing a simple array of float3's, otherwise you'll break vectorcall.

The Test Case

For our test, we'll try a simple non-toy example. Here's the code to intersect a ray and an AABB:

bool intersectRayBox(float3 rayOrg, float3 invDir, float3 bbmin, float3 bbmax, float &hitT)
{
    float3 d0 = (bbmin - rayOrg) * invDir;
    float3 d1 = (bbmax - rayOrg) * invDir;

    float3 v0 = min(d0, d1);
    float3 v1 = max(d0, d1);

    float tmin = hmax(v0);
    float tmax = hmin(v1);

    bool hit = (tmax >= 0) && (tmax >= tmin) && (tmin <= hitT);
    if (hit)
        hitT = tmin;
    return hit;
}

And here's the code which calls it:

float hitT = FLT_MAX;
bool hit = intersectRayBox(testorg, float3(1, 1, 1) / testdir, testbbmin, testbbmax, hitT);
printf("hit %i at t=%f\n", hit, hitT);

Notice how although we're using intrinsics, our code remains very clean and readable. Unlike trying to write directly with SSE intrinsics, we haven't compromised at all on our design.

So what assembly does this generate? In the demo I've provided two different implementations. The first is a simple non-SSE version, using a simple struct vec3 { float x, y, z; }; type. Let's see how it holds up. Note that I've passed the parameters by reference in this case, to avoid as many redundant copies as possible.
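For reference, a scalar version along these lines would look something like this (a hypothetical sketch; the actual code in the demo download may differ slightly):

```cpp
// Hypothetical sketch of a plain non-SSE slab test, for comparison.
struct vec3 { float x, y, z; };

static float minf(float a, float b) { return a < b ? a : b; }
static float maxf(float a, float b) { return a > b ? a : b; }

bool intersectRayBoxFP(const vec3 &rayOrg, const vec3 &invDir,
                       const vec3 &bbmin, const vec3 &bbmax, float &hitT)
{
    // Per-axis entry/exit distances.
    float d0x = (bbmin.x - rayOrg.x) * invDir.x, d1x = (bbmax.x - rayOrg.x) * invDir.x;
    float d0y = (bbmin.y - rayOrg.y) * invDir.y, d1y = (bbmax.y - rayOrg.y) * invDir.y;
    float d0z = (bbmin.z - rayOrg.z) * invDir.z, d1z = (bbmax.z - rayOrg.z) * invDir.z;

    // Intersect the three slab intervals.
    float tmin = maxf(maxf(minf(d0x, d1x), minf(d0y, d1y)), minf(d0z, d1z));
    float tmax = minf(minf(maxf(d0x, d1x), maxf(d0y, d1y)), maxf(d0z, d1z));

    bool hit = (tmax >= 0) && (tmax >= tmin) && (tmin <= hitT);
    if (hit)
        hitT = tmin;
    return hit;
}
```

Same algorithm, same branchless min/max structure; the difference is purely in what the compiler can do with it.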

Traditional non-SSE implementation (ray-box code)

intersectRayBoxFP:
    sub esp, 18h
    mov eax, dword ptr [ecx+8]
    movq xmm0, mmword ptr [ecx]
    mov dword ptr [esp+14h], eax
    mov eax, dword ptr [esp+1Ch]
    movq mmword ptr [esp+0Ch], xmm0
    movq xmm0, mmword ptr [eax]
    mov eax, dword ptr [eax+8]
    mov dword ptr [esp+8], eax
    movss xmm3, dword ptr [esp+8]
    subss xmm3, dword ptr [esp+14h]
    mov eax, dword ptr [edx+8]
    movq mmword ptr [esp], xmm0
    movss xmm1, dword ptr [esp]
    subss xmm1, dword ptr [esp+0Ch]
    movss xmm2, dword ptr [esp+4]
    subss xmm2, dword ptr [esp+10h]
    movq xmm0, mmword ptr [edx]
    mov dword ptr [esp+14h], eax
    mov eax, dword ptr [ecx+8]
    movss xmm5, dword ptr [esp+14h]
    movq mmword ptr [esp+0Ch], xmm0
    movq xmm0, mmword ptr [ecx]
    movss xmm4, dword ptr [esp+0Ch]
    movss xmm6, dword ptr [esp+10h]
    mov dword ptr [esp+8], eax
    mov eax, dword ptr [esp+20h]
    movq mmword ptr [esp], xmm0
    mulss xmm4, xmm1
    movq xmm0, mmword ptr [eax]
    mov eax, dword ptr [eax+8]
    movq mmword ptr [esp+0Ch], xmm0
    movss xmm1, dword ptr [esp+0Ch]
    subss xmm1, dword ptr [esp]
    movq xmm0, mmword ptr [edx]
    mulss xmm6, xmm2
    mov dword ptr [esp+14h], eax
    movss xmm2, dword ptr [esp+10h]
    subss xmm2, dword ptr [esp+4]
    mov eax, dword ptr [edx+8]
    movq mmword ptr [esp+0Ch], xmm0
    movss xmm7, dword ptr [esp+0Ch]
    movss xmm0, dword ptr [esp+10h]
    mulss xmm5, xmm3
    movss xmm3, dword ptr [esp+14h]
    subss xmm3, dword ptr [esp+8]
    mulss xmm7, xmm1
    mov dword ptr [esp+14h], eax
    mulss xmm0, xmm2
    comiss xmm7, xmm4
    movss xmm2, dword ptr [esp+14h]
    movss dword ptr [esp+1Ch], xmm7
    mulss xmm2, xmm3
    jbe intersectRayBoxFP+101h (0AA11A1h)
    movaps xmm1, xmm4
    jmp intersectRayBoxFP+104h (0AA11A4h)
    movaps xmm1, xmm7
    comiss xmm0, xmm6
    jbe intersectRayBoxFP+10Eh (0AA11AEh)
    movaps xmm7, xmm6
    jmp intersectRayBoxFP+111h (0AA11B1h)
    movaps xmm7, xmm0
    comiss xmm2, xmm5
    jbe intersectRayBoxFP+11Bh (0AA11BBh)
    movaps xmm3, xmm5
    jmp intersectRayBoxFP+11Eh (0AA11BEh)
    movaps xmm3, xmm2
    comiss xmm4, dword ptr [esp+1Ch]
    ja intersectRayBoxFP+12Bh (0AA11CBh)
    movss xmm4, dword ptr [esp+1Ch]
    comiss xmm6, xmm0
    ja intersectRayBoxFP+133h (0AA11D3h)
    movaps xmm6, xmm0
    comiss xmm5, xmm2
    ja intersectRayBoxFP+13Bh (0AA11DBh)
    movaps xmm5, xmm2
    comiss xmm1, xmm7
    ja intersectRayBoxFP+143h (0AA11E3h)
    movaps xmm1, xmm7
    comiss xmm1, xmm3
    ja intersectRayBoxFP+14Bh (0AA11EBh)
    movaps xmm1, xmm3
    comiss xmm6, xmm4
    ja intersectRayBoxFP+153h (0AA11F3h)
    movaps xmm4, xmm6
    comiss xmm5, xmm4
    ja intersectRayBoxFP+15Bh (0AA11FBh)
    movaps xmm4, xmm5
    comiss xmm4, dword ptr ds:[0AAD198h]
    jb intersectRayBoxFP+182h (0AA1222h)
    comiss xmm4, xmm1
    jb intersectRayBoxFP+182h (0AA1222h)
    mov ecx, dword ptr [esp+24h]
    movss xmm0, dword ptr [ecx]
    comiss xmm0, xmm1
    jb intersectRayBoxFP+182h (0AA1222h)
    mov al, 1
    movss dword ptr [ecx], xmm1
    add esp, 18h
    ret 0Ch
    xor al, al
    add esp, 18h
    ret 0Ch

Bleurgh. That's over 100 instructions. There's a ridiculous amount of comparison instructions towards the end, which is mostly due to our naive implementation of min/max. I've included a variation which uses the SSE scalar min/max intrinsics to help, but honestly it's not much better.

Let's also take a look at the code which calls it:

Traditional non-SSE implementation (usage)

testFP:
    sub esp, 1Ch
    movq xmm0, mmword ptr ds:[1A200Ch]
    lea edx, [esp+4]
    movss xmm1, dword ptr ds:[19D19Ch]
    mov ecx, 1A2000h
    mov eax, dword ptr ds:[001A2014h]
    movaps xmm2, xmm1
    movq mmword ptr [esp+4], xmm0
    movaps xmm0, xmm1
    divss xmm2, dword ptr [esp+4]
    mov dword ptr [esp+0Ch], eax
    mov dword ptr [esp], 7F7FFFFFh
    divss xmm1, dword ptr [esp+0Ch]
    divss xmm0, dword ptr [esp+8]
    movss dword ptr [esp+18h], xmm1
    mov eax, dword ptr [esp+18h]
    mov dword ptr [esp+0Ch], eax
    lea eax, [esp]
    push eax
    push 1A2024h
    unpcklps xmm2, xmm0
    push 1A2018h
    movq mmword ptr [esp+10h], xmm2
    call intersectRayBoxFP (01910A0h)
    movss xmm0, dword ptr [esp]
    sub esp, 8
    cvtps2pd xmm0, xmm0
    movzx eax, al
    movsd mmword ptr [esp], xmm0
    push eax
    push 19D188h
    call printf (0191352h)
    add esp, 2Ch
    ret

That's about 30 instructions to do the setup, then about 5 or so to do the printf. Not great.

Our New Results

Now let's try the same test with the shiny new SSE type. We're using exactly the same test code, the only difference is we pass our vectors in by value, not by reference. This is important for vectorcall to do its magic.

Vectorcall SSE implementation (ray-box code)

intersectRayBox:
    subps xmm2, xmm0
    subps xmm3, xmm0
    mulps xmm2, xmm1
    mulps xmm3, xmm1
    movaps xmm1, xmm2
    minps xmm1, xmm3
    maxps xmm2, xmm3
    movaps xmm0, xmm1
    shufps xmm0, xmm1, 0A1h
    maxps xmm1, xmm0
    movaps xmm0, xmm1
    shufps xmm0, xmm1, 52h
    maxps xmm1, xmm0
    movaps xmm0, xmm2
    shufps xmm0, xmm2, 0A1h
    minps xmm2, xmm0
    movaps xmm0, xmm2
    shufps xmm0, xmm2, 52h
    minps xmm2, xmm0
    comiss xmm2, dword ptr ds:[3CD198h]
    jb intersectRayBox+5Bh (03C109Bh)
    comiss xmm2, xmm1
    jb intersectRayBox+5Bh (03C109Bh)
    movss xmm0, dword ptr [ecx]
    comiss xmm0, xmm1
    jb intersectRayBox+5Bh (03C109Bh)
    mov al, 1
    movss dword ptr [ecx], xmm1
    ret
    xor al, al
    ret

Wowzers! Is that it? That's 32 instructions!

It's... it's literally perfect. There's not a single memory load/store anywhere in there, not even in the function prologue. Well OK, there's one to store the final 't' value out, and there's one to load a constant zero value. But that's pretty darn good. I'd say this easily rivals any hand-written assembler.

Let's take a look at the code that calls it:

Vectorcall SSE implementation (usage)

testSSE:
    push ecx
    movaps xmm1, xmmword ptr ds:[3CD1C0h]
    lea ecx, [esp]
    divps xmm1, xmmword ptr ds:[3D30F0h]
    mov dword ptr [esp], 7F7FFFFFh
    movaps xmm0, xmmword ptr ds:[3D3100h]
    movaps xmm2, xmmword ptr ds:[3D30E0h]
    movaps xmm3, xmmword ptr ds:[3D3110h]
    call intersectRayBox (03C1040h)
    movss xmm0, dword ptr [esp]
    sub esp, 8
    cvtps2pd xmm0, xmm0
    movzx eax, al
    movsd mmword ptr [esp], xmm0
    push eax
    push 3CD188h
    call printf (03C1352h)
    add esp, 14h
    ret

It's like night and day. See how much shorter it is? Again there's no fat here, no unneeded instructions that should have been trimmed off. It's about 15 instructions for the body, and 5 to call the printf.

I'll tease you with one more little example. Let's say you have a function which returns a matrix, and a second function which accepts a matrix as a parameter.

Normally, you might choose to not actually return the matrix, instead passing it by reference/pointer and having the first function fill that in. Then you pass the reference into the second function. Matrices are pretty big, so you really want to avoid the copies.

void calcMatrix(float4x4 &output);
void updateThing(const float4x4 &mtx);

With vectorcall, you can actually just return the matrix as a return value, matching the HLSL style:

float4x4 calcMatrix();
void updateThing(float4x4 mtx);

And here's the kind of assembly you'd get from doing that:

call calcMatrix (5614AC10h)
call updateThing (5614AA50h)

There's no copies at all! It all just vanishes! The results from calcMatrix stay in the XMM0-XMM3 registers, and as we call updateThing they're already in the right place. We don't have to do any work at all, not even passing the address of the matrix. It literally does not get any better than that.

Caveats

There is one annoying thing I have to mention. When you enable a non-standard calling convention globally, any DLLs you link against need to be aware of it. All the built-in Windows DLLs are, but many third party libraries are not. So if you want to use something like, for example, Lua, then you may need to go through the library's header file and stick a "__cdecl" in front of everything.

The alternative is to manually add __vectorcall to every function in your codebase, but that's silly. Don't clutter your code, let the compiler clutter things for you.

If you're linking libraries statically, none of this affects you.

Wrap Up

Well there you have it ladies and gentlemen. If you're after a modern, fast, beautiful vector maths library, here is the way to do it.

In the past, trying to use SSE to accelerate a vector maths library was usually a waste of both your own time and the CPU's. But time does not stand still, and yesterday's rules may be different from today's.

So my advice today is to give vectorcall a try. But remember, while the advice from 10 years ago may not apply today, please also bear in mind that in 10 years time this article will be wrong too.