Introduction

In my spare time, I’m working on a 2D game that relies on huge amount of entities being updated and displayed every frame, at 60FPS. This requires to have a permanent eye on overall performances.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms calling trigonometric functions per frame, essentially sin and cos to update the transformation matrix of the relevant entities (approximately 50K entities).

According to this document, x86 instructions fsin and fcos takes between 65-100 cycles to execute, so we just need to write some code to compute the sine function using less than that.

Solution

The solution that came to my mind was to use a look-up table (LUT) for sin and cos. Then, I thought to myself that this was an old-school trick that wasn’t much used anymore, but still, I decided to give it a try just to see.

Representing Angles

To avoid a float to integer conversion whenever I need to lookup sines, I quickly converted all my angles in the game to an unsigned 16-bits integer, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles gets naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply perform a logical and with 65535 to remap the values into this range.

LUT Size

At first, I started with a 16384 entries LUT because I wanted to minimize errors. The sin function looked like this:

FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] return SinTable[Index & 16383]; // Lookup angle }

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

LUT Steps

I immediately saw a problem with this approach when running the game though: as the LUT was 4 times smaller than the angles range, the resulting value would change only every 4 angle steps, creating a “stepcase” effect (exaggerated to better illustrate the problem):



(This graph was generated with the excellent Inigo Quilez’s Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value) { const tU16 Index = (Value >> 2); // Map the angle to [0..16383] const tFloat A = SinTable[Index & 16383]; // Lookup angle const tFloat B = SinTable[(Index + 1) & 16383]; // Lookup angle+1 const Weight = (Value & 3) / 4.0f; // Compute weight return Weight * (B - A) + A; // Linear interpolation }

With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

LUT Size Optimization

With interpolation enabled, I decided to make some tests to see how small the LUT could be without introducing too much errors. Here’s the results:

As we can see, the error is quite low even for a 64 entries LUT.

Further Optimizations

Here’s some other optimizations that I didn’t investigate. They could be interesting on target platforms that have low memory or if we want to minimize cache misses.

Using Half-Precision Floats

Using half-precision floats can indeed reduce the table size by 2 without sacrificing too much performances, depending on the platform. In fact, this is true for any other types of memory-hungry data, like animations, etc.

Minimal LUT Encoding

It is possible to only encode the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.

Using Smooth Interpolation

Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would requires a lot more cycles to execute.

Conclusion

Here’s the source code I used for this post, if anyone’s interested in using it:

template<tU32 SinTableSize>; struct TrigLookup { tFloat SinTable[SinTableSize]; TrigLookup() { CHECK(IsPowerOfTwo(SinTableSize)); for(tU32 i=0; i<SinTableSize; i++) { SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2); } } FORCEINLINE tFloat Lookup(tU16 Value) { const tU32 Divisor = (65536 / SinTableSize); const tU32 Index = Value / Divisor; const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)]; const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)]; const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor; return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA; } FORCEINLINE tFloat Sin(tU16 Value) { return Lookup(Value); } FORCEINLINE tFloat Cos(tU16 Value) { return Lookup(Value + 16384); } FORCEINLINE tFloat Tan(tU16 Value) { return Lookup(Value) / Lookup(Value + 16384); } };

And here’s the assembly dump of the Lookup function, which is roughly ~30 cycles on x86 and might be hand-optimized, if someone is not as lazy as me:

0126FA53 movzx ecx,si 0126FA56 mov eax,ecx 0126FA58 shr eax,8 0126FA5B mov edx,eax 0126FA5D inc eax 0126FA5E and edx,0FFh 0126FA64 and eax,0FFh 0126FA69 fld dword ptr [esp+edx*4+40h] 0126FA6D fld dword ptr [esp+eax*4+40h] 0126FA71 fsub st,st(1) 0126FA73 movzx eax,cl 0126FA76 mov dword ptr [esp+3Ch],eax 0126FA7A fild dword ptr [esp+3Ch] 0126FA7E fmul dword ptr [__real@3b800000 (12D4FC8h)] 0126FA84 fmulp st(1),st 0126FA86 faddp st(1),st