Keep in mind that the simple class presented here is not the best solution to using SIMD. Compilers don't like the idea of wrapping a fundamental type in a class and often do unrequired things in such cases. The abstraction layer does not suffer this problem but you have to code in a C like manner to use it. From a practical point of view though, if you are not familiar with SIMD coding, the wrapper class provides a nice speed boost without a lot of hassles involved.

Terminology

Goals

SIMD Instruction Sets

Vectorization

Initial Work Environment

Alignment Issues

////////////////////////////////////////////////////////////////////////// // Per compiler macro's. #if defined( COMPILER_MSVC ) || defined( COMPILER_INTEL ) # define BEGIN_ALIGNED( n, a ) __declspec( align( a ) ) n # define END_ALIGNED( n, a ) #elif defined( COMPILER_GNU ) || defined( COMPILER_CLANG ) # define BEGIN_ALIGNED( n, a ) n # define END_ALIGNED( n, a ) __attribute__( (aligned( a ) ) ) #endif

SIMD Intrinsic Includes

The Intrinsic SIMD Abstraction Layer

Doing The Real Work

Include/Math/Simd Include/Math/Simd/Reference Include/Math/Simd/Sse Include/Math/Simd/Neon

Math::Simd struct Reference; Math::Simd struct Sse; Math::Simd struct Sse3; Math::Simd struct Sse4; Math::Simd struct Avx; Math::Simd struct Neon;

Simd_t::Vec4f_t mine = {1.0f, 2.0f, 3.0f};

namespace Math { namespace Simd { struct Reference { static Vec4f_t Create( float x, float y, float z ) { Vec4f_t result = {x, y, z, 0.0f}; return result; } } } }

class BEGIN_ALIGNED( Vector3fv, 16 ) { private: Simd_t::Vec4f_t mData; } END_ALIGNED( Vector3fv, 16 );

template< Simd = Simd_t > class BEGIN_ALIGNED( Vector3f_Simd, 16 ) { public: typedef typename Simd::Vec4f_t ValueType_t; private: ValueType_t mData; } END_ALIGNED( Vector3f_Simd, 16 ); typedef Vector3f_Simd< Simd_t > Vector3fv;

template< Simd = Simd_t > class BEGIN_ALIGNED( Vector3f_Simd, 16 ) { public: typedef typename Simd::Vec4f_t ValueType_t; Vector3f_Simd() {} Vector3f_Simd( const Vector3f_Simd& rhs ) : mData( rhs.mData ) {} Vector3f_Simd( float x, float y, float z ) : mData( Simd::Create( x, y, z ) {} private: ValueType_t mData; } END_ALIGNED( Vector3f_Simd, 16 ); typedef Vector3f_Simd< Simd_t > Vector3fv;

The First SIMD Instruction

__m128 _mm_add_ps( __m128 lhs, __m128 rhs );

float32x4_t vaddq_f32( float32x4_t lhs, float32x4_t rhs );

// in Math::Simd::Reference: static Vec4f_t Add( const Vec4f_t& lhs, const Vec4f_t& rhs ) { Vec4f_t result = { lhs.v[ 0 ] + rhs.v[ 0 ], lhs.v[ 1 ] + rhs.v[ 1 ], lhs.v[ 2 ] + rhs.v[ 2 ], lhs.v[ 3 ] + rhs.v[ 3 ] }; return result; }

// In Math::Simd::Sse: static Vec4f_t Add( Vec4f_t lhs, Vec4f_t rhs ) { return _mm_add_ps( lhs, rhs ); }

// In Math::Simd::Neon: static Vec4f_t Add( Vec4f_t lhs, Vec4f_t rhs ) { return vaddq_f32( lhs, rhs ); }

Vector3f_Simd operator + ( const Vector3f_Simd& rhs ) const {return Simd::Add( mData, rhs );}

The Dot Product

static float Dot3( const Vec4f_t& lhs, const Vec4f_t& rhs ) { return ( lhs.v[ 0 ]*rhs.v[ 0 ] + lhs.v[ 1 ]*rhs.v[ 1 ] + lhs.v[ 2 ]*rhs.v[ 2 ] ); }

static float Dot3( const Vec4f_t& lhs, const Vec4f_t& rhs ) { static const int swap1 = _MM_SHUFFLE( 3, 3, 3, 2 ); static const int swap2 = _MM_SHUFFLE( 3, 3, 3, 1 ); float sresult; __m128 result = _mm_mul_ps( lhs, rhs ); __m128 part1 = _mm_shuffle_ps( result, result, swap1 ); __m128 part2 = _mm_shuffle_ps( result, result, swap2 ); result = _mm_add_ps( result, part1 ); result = _mm_add_ps( result, part2 ); _mm_store_ss( &sresult, result ); return sresult; }

static float Dot3( const Vec4f_t& lhs, const Vec4f_t& rhs ) { float sresult; __m128 result = _mm_dp_ps( lhs, rhs, 0x77 ); _mm_store_ss( &sresult, result ); return sresult; }

Conclusion

Note On The Results

Math and math types are the glue which holds a game together: collision, physics, movement and a multitude of other game elements are basically just large math problems to be solved. A well created math library is something you begin using and tend to forget about as you go, a poor library is something you continually stumble over. There are many tutorials on basic math, library setup and architecture, there are also many tutorials about SIMD support in general. What is rarely covered is creating a library in such a manner as to allow a scalable and multi-targeted SIMD backend with no major changes for the user. In this article we will cover the multiple levels of Intel SSE and, as an option for Xcode, additionally target ARM Neon for iDevices.This article is built using the environment from prior articles and as such you may want to read those. It is not required reading as the work done here is focused on the Intel SSE instruction sets and some minor examples of ARM Neon for comparisons. Only minor details of the build environment will need to be understood to take advantage of the presented material. If you are interested; the prior article series can be found here: Article Unlike prior articles, this will simply present the reader with the final library which is hosted as a live project on Google Code. ARM Neon is discussed as a possible target but at this time the library is incomplete in that area, it will get progressively better in the future so check for updates. The code is hosted at: code . The directory "MathLibrary" corresponds to this article.For the rest of the article there is a bit of vocabulary in use which needs to be understood or things can become confusing. First off, what is SIMD? The terse definition is "Single Instruction Multiple Data". Used as such, it is referring to a subclass of instructions which the target CPU supplies. This does not mean it is specifically Intel SSE, Power PC Altivec (VMX) or ARM Neon instructions, it is a general term to refer to the class of instructions on a given target CPU. This article will be focused on the Intel SSE instruction set as it provides the most complicated variation, though when referring to other instruction sets, the specific name of the target will be given as: Neon, or VMX as examples. Additionally, as SIMD refers to a set of instructions, the term will not be used when describing the process of converting normal math code to use the instructions. Instead the generic term for applying SIMD instructions is 'vectorization' which is appropriate as SIMD instructions are also often referred to as vector processing. Understanding the differences in terminology will be crucial to the remaining article.The goal of the presented math library is to show multiple targets of SIMD instructions both based on specific CPU targets and levels of support given a target CPU. For instance, there are at least eight different variations of SSE on Intel CPUs at this time. Each level of SSE becomes significantly more powerful and corrects for missing items from prior versions. For instance, SSE 1 only supports vertical concepts of manipulation. In other words an SSE 1 based library loses considerable performance during the vectorization process due to an inability to occasionally work horizontally. SSE 3 added horizontal manipulations, specifically add and subtract, because of how useful they are in certain contexts such as performing dot products. Solving how to write a library which can support SSE 1 and SSE 3 (and higher) variations in a maintainable manner is one of the goals. A secondary goal is to perform the systematic vectorization on various types yet still allow best fit selection of the underlying implementation. Using the Steam Hardware Survey, it is possible to conclude that targeting SSE 3 is quite viable given that 99.5% of all surveyed hardware in use supports that level of SSE. This is a simple solution and very reasonable, but what happens if you target your personal machine, enable SSE 4.2 and find that it is a very large performance gain for you? You might go look at the survey again but this time you find support is around 50% and would make the game less broadly acceptable. There are several solutions to this which we wish to support. You can compile your game twice, one for SSE 3 and the other for SSE 4.2, create a little launcher app which chooses which version to load. Another variation is to compile the game itself for SSE 3 and then move all the really heavy math portions into a shared library which can be recompiled for variations of the instruction sets which you load at runtime based on the availability of instructions. While the details of each method are beyond the scope here, supporting them is intended to be covered. The final goal is to expose the SIMD instruction sets in a manner where systematic optimization using vectorization is possible. Simply vectorizing a 3D vector or a matrix is going to give notable gains, but applying vectorization to entire algorithms can make that seem trivial in comparison. Exposing the actual SIMD instructions in a manner which allows their use without using specific knowledge of the target is going to be possible. Of course this is quite a large topic and only the generic idea of how to approach it is covered, but the work is ready for expansion and usage in such a case.SIMD instructions come in many forms. If you have been using Intel CPUs for a while, you have probably watched the evolution from MMX to today's latest and greatest AVX instruction sets. Today, if you target hand held devices (iDevices specifically in this article), the ARM Neon instruction set is a relatively recent addition to the family of SIMD instruction sets. SIMD instruction sets are not new concepts, other processors have implemented such things for a while, the instructions are simply somewhat new to mainstream CPUs. In this article, we will focus on the subset of instructions intended to manipulate 4 floats at a time which is fairly common to most SIMD instruction sets. One of the difficulties with SIMD instructions is specifically caused by Intel and the way they implemented the original SSE instructions. The intention was not to support horizontal math and they instead tried to push reorganizing data in programs to match the new instruction data layout expectations. Sometimes this was possible to do, but often it is really not a viable approach. As such, on the Intel CPUs, you often waste much of the performance gain with swizzling instructions. This changed as of SSE 3 with the addition of horizontal adds and related instructions which greatly speed the handling of functionality such as dot product and magnitude calculations. Even better, as of SSE 4.1 a single instruction dot product is available which speeds things even more in some cases. Dealing with the multiple levels of SSE on Intel platforms is rather complicated and can lead to difficult-to-maintain code if done poorly. Targeting a completely different set of instructions for ARM Neon is even more difficult when maintaining a uniform interface. The similarities of the instruction sets help ease this difficulty but do not completely remove the problems.Starting the process of vectorization, we need to limit our scope to something fairly simple to begin with. At this point we are going to implement only a 3D vector class. There will be two implementations, not counting SIMD instruction set variations. One version is a normal C++ implementation which does not impose alignment requirements nor does it have a hidden 4th component. The other version is the vectorized variation with all the additional impositions added. Why two classes? Serialization is an area where vectorized types can get in the way. Unless you are writing a high performance binary serializer, it just doesn't make sense to deal with the imposition of alignment and the hidden component in such a system. Awareness of the types is of course important but different systems have different needs and forcing vectorization where it does not belong is not good practice. The names of the classes will follow a modified version of the Open GL conventions. The modification is simply a case of dropping the 'gl' prefix, since we use namespaces and modifying the postfix notations a little bit. So, the names will be Vector3f for a 3 component floating point standard C++ vector and Vector3fv for the vectorized version. The modified postfix naming just changes the meaning of 'v' from meaning a pointer to array of elements to mean vectorized. The concepts are similar and as a pointer type doesn't make sense in the case of a primitive, the naming change does not cause conflict. This is not a particularly original naming convention but it works for the purposes of this article.In prior articles I mentioned that I had a reason for duplicating the name of a library under the 'Include' directory but never got into the details. At this point we are about to use one of the reasons for the additional directory name. Instead of coding the Vector3f and Vector3fv classes side by side, we are going to split into two separate libraries for the time being: "Math" and "Math-SIMD". While this is not required in general, the idea is to leverage the layout such that in a team where the primary math library could already be in use, you don't want to introduce the vectorized types into the code base until they are ready. Of course you also want to be continually committing your work while you go, so moving the code to the separate library keeps it out of the way for others. But, you also don't want to have different include paths which would have to be fixed up later when you merge the libraries. As such, both libraries can reference "Math" files via: "#include "Math/xxx.hpp" style inclusion but obviously only when you explicitly include the 'Math-SIMD' library can you access the new functionality. While this is not required, it is a handy way to work in isolation for a while. Splitting libraries is a key feature of the build environment when discussing refactoring. When applied to large systems, you can move sections into temporary work locations, point to the original library and the new sub-library so nothing seems as if it changed. But, you can then make another library and rewrite a completely new version switching between implementations as needed. This variation of the usage is not covered here but it is a large benefit to the setup.One item to cover briefly is making sure any class which uses a vectorized math primitive can align itself in memory. Unfortunately the state of C++11 support is still in its infancy so it is required in this case to fall back on per-compiler style alignment specifications. Also, since this is a generic item needed through out the codebase, we will be placing it in the 'Core' library to be shared. If you look in 'Libraries/Core/Include/Core/Core.hpp' you will see the following subsection:The macro has to be split into two pieces since MSVC-style compilers and GNU-style compilers use different methods of specifying alignment. MSVC compilers use a pre-class name '__declspec' directive while GNU compilers use a postfix 'attribute' directive. When C++11 is better implemented, it will be possible to remove this bit of macro trash, but until then, this is the practical solution. There are several downsides to this solution. It does not cover new/delete and as such if you new up a vector it will not be guaranteed to fit the alignment requirements of the underlying vectorized type. Another problem case is putting vectors into a std::vector or other container, there is no guarantee the memory will be aligned correctly until full support of 'alignof' is pushed out to the compilers. Finally, and most annoying, because we are wrapping a fundamental intrinsic type, passing an item to functions by value is not supported and may also break alignment rules. We will be working around these problems in general but they are important to keep in mind.With the ARM Neon instruction set, life is fairly simple. There is only one include file required and no varying levels of support to be dealt with. But this does not mean everything is simple due to the development environment for iDevices. When targeting a real device the ARM Neon instruction set is of course supported, but when targeting the simulators they are not. Xcode supports switching targets from the device to the simulator in the IDE, as such we need to differentiate between the targets at compile time. This is fairly easy but does mean, on the simulator, we have to disable the Neon instruction set and fall back to non-SIMD code. (NOTE: You could switch to Intel SSE due to this being a simulator and not an emulator, but we'll just turn it off for now.) Intel SSE is a bit of a can of worms as it is broken into various levels. For ease, we don't follow the actual levels of SSE as they progressed but instead the header file levels which Intel supplies to compilers. The levels are thus broken down into MMX, 1, 2, 3, 3.1, 4.1, 4.2 and AVX. While the library will not actually supply unique code for all of these levels, there will be a structure in place to allow future specialization. Also, we list out MMX as a possible target, though it will not be used initially. The includes in highest to lowest order, with some notes: immintrin.h - AVX 256 bit SIMD instruction set. (Defines __m256.) wmmintrin.h - AVX AES instructions. nmmintrin.h - The Core 2 Duo 4.2 SIMD instruction set. smmintrin.h - The Core 2 Duo 4.1 SIMD instruction set. tmmintrin.h - SSSE 3 aka SSE 3.1 SIMD instruction set. pmmintrin.h - SSE 3 SIMD instruction set. emmintrin.h - SSE 2 SIMD instruction set. xmmintrin.h - SSE(1) SIMD instruction set. (Defines __m128.) mmintrin.h - MMX instruction set. (Defines __m64.) The modifications to compiler options, the specifically exposed CMake options and detection of the various levels to select appropriate includes can be found in the math library environment. Specifically dealing with the different includes and the Xcode oddities is handled in the file "Math-SIMD/Include/Math/Simd/Simd.hpp". It is a fairly ugly little header due to all the SSE options and the simulator difference, but it gets the job done properly.In order to support the different target processors we need to begin abstraction of the data types. At the same time, we are going to be implementing a fallback case for CPUs without SIMD. In fact, implementation of the reference support is a key feature of the library. What we are going to be writing is something similar to the SSEPlus library abstraction of the Intel intrinsics. (See: http://sseplus.sourceforge.net) We will only be dealing with the 4 element floating point types, so we don't really have that many intrinsics to abstract; basic math, load/store and some manipulation abilities are all we really need. There are exceptions we will be dealing with, but for now this is a good enough starting point.For the time being, we start the Vector3fv class as an empty class. We will be refactoring this to be the vectorized version once we get some preparations out of the way. The first step is defining how we will be working, as implied previously we are using the intrinsics and not going directly to assembly language. Beyond just being easier, using the intrinsics provides nearly as much performance gain as hand coding assembly in this case. In fact, with the compilers involved, quite often using the intrinsics produces faster code since register allocations are chosen to fit into surrounding code better. Having said that, the math primitives will not be the most cutting edge and blazing fast items in the world, just a good starting point for further optimization. Our next step is setting up the SIMD instruction set such that it is usable by the Vector3fv class (and other systems as desired) in a manner where it is an abstraction on top of the specific SIMD instructions. There are a number of items which can be used which, beyond naming, behave pretty much identically on all the different SIMD instruction sets. For instance, the ability to add, subtract, multiply and divide are pretty much common on each CPU and as such should be as light a wrapper as possible. We'll be following the same basic style of abstraction as presented in the Intel SSE intrinsics themselves, just at an extra level of abstraction. The primary header to figure out the instruction set selection is "Include/Math/Simd/Simd.hpp", by default it always allows use of the reference C++ implementation so even if you turn off all Simd, everything will continue to compile and function. After inclusion, a macro is defined named: "Simd_t" which is the primary abstraction into the build selected default instruction set. This is a macro since, as part of the goals, we want to be able to run multiple and simultaneous versions of the code based on different variations of the instruction sets, a typedef would not be able to be overridden by other code. The first thing we need in order to use the SIMD instruction sets is a type. The primary 4 element floating point type is defined as "Simd_t::Vec4f_t". The reference implementation defines this as simply a structure wrapped around a 4 element array of floats, the Intel SSE implementation defines the type as '__m128' and ARM Neon defines the type as 'float32x4_t'. The first rule is to treat these types as completely opaque black boxes, even if you know it is an '__m128', don't go using the structure underlying the type directly, always use the "Simd_t" exposed functions. Of course, a nice benefit to maintaining the reference library is that if you break the rules, either the reference version or whatever the active Simd instruction set implemented version will likely fail to compile. In order to structure and maintain the multiple abstractions we add the following structure to the project:Each new directory implements the specific instruction set and variations of each. So, 'Reference' and 'Neon' both contain single files which expose the SIMD abstraction and 'Sse' contains several different files. The content of the files end up defining structures in the namespaces as follows:Notice that for SSE we only support 1, 3, 4 and Avx. The reason is that, for the time being, there is no use implementing MMX, SSE2 or the inbetween versions since they add nothing significant to the math instructions. SSE2 may be included eventually so as to support double types but this is a future extension. Also, if you look in the supplied code, Avx derives from Sse4 which in turn derives from Sse3 etc. We use the simple hierarchy to hide progressively lesser capable variations of the instruction set. With all the structure out of the way, the method to use it needs to be discussed. Our first abstraction is going to be dealing with the need to initialize SIMD types. While all the current types we are targeting can be initialized with aggregate construction such as the following:This leaves a bit to be desired and may also not work on future instruction sets if specialized types are introduced. The undesirable feature of this is that the fourth component is left uninitialized and for purposes of the abstraction, we want the hidden component always to be 0.0f. In fact, when we add debugging, asserting that the hidden component is 0.0f will be part of the checks. So, we introduce our first abstraction function, which just happens to be the standard form for all future abstraction functions:The function simply creates a temporary, initializes it and returns it. Because this is such a standard form, most compilers will completely optimize the function out of release code and the cost of the abstraction remains zero. Visual Studio, Clang and GCC all deal with this form of optimization (often referred to as the Return Value Optimization) fairly well and only a few slower code generation mistakes result from this. With this function in hand, it is possible to initialize our Vector3fv. First, there is some preparation to perform. We need to modify the Vector3fv to be an aligned class or the optimized SIMD instructions will fail:There is still a problem though. We don't actually want to tie this class to the default data type based on build settings, we need to be able to change as desired. So, we will be renaming this class and templating it as follows with a default typedef to stand in as Vector3fv:With the general changes to the class behind us, let's use the SIMD abstraction and create the constructors:A key note here is that the default constructor does not initialize to zero. The type is intended to be used much like a fundamental float type and with the same rules. From a class purity point of view, yes this should default initialize to zero though. The general reason for this decision to act like a fundamental type is that compilers are still fairly weak at removing some unneeded temporary initializations and in the case of primitive math types it gets very expensive. The class of bugs caused by this is no different than missing float and int initializations though unfortunately the compiler will not always complain about such things. Debugging additions will be used to catch such errors later though. Another note is that the fundamental type exposed as Vec4f_t can be copy constructed like any fundamental type. This is critical to leverage as the compilers are able to leave types in registers longer while in use if they recognize copying and optimize appropriately. Gcc seems to be the weakest of the compilers in this area but still does a fair job of not flushing registers until really needed.At this point we will show the first SIMD instruction abstraction. It will also be the last individual instruction as the basics are the same for each. We'll look at more complicated functionality in the next section. We will expose an operation for "Add" in the SIMD abstraction and then implement the addition operator within the vector class. In SSE, the _mm_add_ps intrinsic (addps in assembly) is quite simple:For Neon, the intrinsic is:Basically the instructions are identical except in name and type. This makes it quite simple to implement things for all targets. So, let's start by implementing the reference C++ version:It is worth noting that we are passing by reference in this function but in the other abstractions we will be passing by value. This is a minor optimization applied to the reference implementation which may or may not be required. But in general it has no effect on usage as C++ will figure out the proper passing style be it reference or value. Also keep in mind that while the 3rd index is hidden, we still operate on it because that is how the SIMD instructions function and we are not targeting just Vector3 but eventually 2 and 4 dimension items also. Now the SSE implementation:The Neon implementation:And finally the operator:If you build and compile a test which tried to use this, you would currently get an error since you can not construct a Vector3f_Simd from the fundamental SIMD type. We will add the constructor for this, it is both required and proper. Of course it should be noted that any conversion operator or constructor should 'usually' be marked explicit (conversion operators are able to be marked as such in C++11) we want to encourage passing the fundamental type around as it will pass in a register and generally be faster than passing the wrapper class. So, in this rare occasion I leave the constructor and operator as non-explicit. As you can see, this pattern is very easy to implement, once the conversions are handled. We won't perform all the repetitions here as there is nothing notable to learn from such grunt work. Instead we move onto the more interesting bits.The dot product of two vectors is the first composite operation to be presented. It is also a case where the different levels of Intel SSE provide additional operations which speed the computation up significantly as higher levels of SSE are used. In order to maintain the abstraction and provide the benefits of the higher level instructions, we treat the entire dot product as a single intrinsic instead of attempting to implement it using the currently exposed intrinsic instructions. Without getting into the details, we'll just present the reference version of the operation to start with:The operation is basically a SIMD multiplication of all elements, though ignoring the hidden in this case, and then an addition of each resulting element. From a terminology point of view, this is generally called a vertical multiply followed by a horizontal addition. Unfortunately, the original Intel SSE did not support the horizontal operation which causes a signficant loss of performance. The Intel SSE version is thus considerably more complex:Without getting into details, after multiplying the individual components, we have to swap things around a bit in order to add them all together and then store the result. While generally faster than the reference version, it is not a major win due to the swizzle instructions. Thankfully, in steps the SSE 4 version though, at less than half the cycle count:By inserting these functions into the library of support files and structures, we have access to a per target optimized dot product. By expecting only that the container classes are properly aligned, the implementation of the vector type becomes identical to the non-vectorized version when finished. The level of abstraction is a bit fine grained but we are using assembly language instructions, which are about as fine grained as possible anyway, we are actually stepping back just a bit to contain complete concepts. Further extensions such as performing transformations on entire buffers can be done as single abstractions and optimized heavilly for each target, the speed increases compared to vectorizing just the Vector3 types is fairly impressive in comparison.With the example implementation and the overview, hopefully this has given you a method of using a SIMD math library in a manner which does not intrude on everyday work. While not a particularly long or detailed article, the overview of the design and architecture of the code should provide a simple to follow pattern which extends to nearly any specialized math processing which can be used in general purpose programming. The example library will be fleshed out and expanded on for the next article in which it will be used further. At that time I will also be introducing a simple testbed for the library which will become a benchmarking item for future work.While the next article will cover some of this, I figured I'd drop some results of the work: Using a simple brute force ray tracer with 1024x1024 3x SuperSampling on a simple scene. Normal : ~43500ms Reference : ~53400ms SSE : ~41400ms SSE 3 : ~33800ms SSE 4 : ~29300ms Secondary note: the performance differences also include enabling additional compiler options which speed up the overall program. The next article will use a better controlled timing method which will reflect only the Vector3fv optimizations.