SIMD Optimization in Go

February 3rd, 2016

There's something beautiful about programming in assembly.

With the complexity of modern computers, it's easy to forget the relatively simple interface they provide to the programmer at the lowest level. Machine code is an almost impervious abstraction over the unavoidably physical transistors and capacitors that make up a computer. The abstractions are still human-made, but they are well-tested and reliable. The computers we program are machines underneath, and programming in assembly lets you see just how machiney they really are.

Assembly programming can be very slow and crash-filled compared to programming in a language like Go, but very occasionally it's a good idea or at least a lot of fun. The amazing assembly programming game TIS-100, with its 14-page instruction manual, captures the fun of programming in assembly without the mess of doing it on a real computer with a 3,883-page reference manual and an arsenal of build tools.

Why waste your time programming in assembly when there are perfectly good languages to program in? Your father built that language himself and now it's not good enough for you? He spent years unrolling those loops by hand before he got put out of a job by -funroll-loops and all those highfalutin optimizing compilers. Don't make the same mistake he did.

When the performance of some piece of your code actually does matter to a user, and you've already tried all the easier ways of making it faster, assembly could be a good place to go next. Although the compiler might be better at optimizing assembly than you, you know more about your particular case than the compiler can safely assume.

Writing Assembly in Go

The best place to start here is to write the simplest function possible. So here's one that adds two int64s together, called add .

package main

import "fmt"

func add(x, y int64) int64 {
	return x + y
}

func main() {
	fmt.Println(add(2, 3))
}

go get goroutines.com/asm-add-go

To implement this in assembly, you create a separate file called add_amd64.s where you put the assembly implementation. I'm using a MacBook for these examples, so the assembly will be for the AMD64 architecture.

// add.go
package main

import "fmt"

func add(x, y int64) int64

func main() {
	fmt.Println(add(2, 3))
}

// add_amd64.s
TEXT ·add(SB),NOSPLIT,$0
	MOVQ x+0(FP), BX
	MOVQ y+8(FP), BP
	ADDQ BP, BX
	MOVQ BX, ret+16(FP)
	RET

go get goroutines.com/asm-add

To run this example, use go get :

go get goroutines.com/asm-add
asm-add

The assembly syntax is...obscure at best. There is the official Go guide and what appears to be a somewhat ancient manual for the Plan 9 assembler, which give some hints as to how the Go assembly language works. The best references are existing Go assembly code and the compiled assembly versions of Go functions, which you can get with:

go tool compile -S <go file>

The most important things to know are the function declaration and the stack layout.

The magical incantation to start the function is TEXT ·add(SB),NOSPLIT,$0 . There is a Unicode dot character · separating the package name from the function name. In this case the package name is main, so the package is left off of the identifier and the name of the function is add . The NOSPLIT directive tells the linker to skip the stack-split check that normally prefixes a function, which is safe here because the function uses no stack space of its own. The $0 at the end is the size of the function's local stack frame, zero in this case because everything stays in registers; the size of the arguments can optionally follow after a dash, as in $0-24 . The (SB) after the function name means the symbol is addressed relative to the static base pseudo-register SB, which is how the assembler refers to global symbols like functions.

The stack layout puts every argument to the function in order on the stack starting at the address 0(FP), which means a zero-byte offset from the frame pointer pseudo-register FP, continuing on through all the arguments and then a slot for the return value. For func add(x, y int64) int64, that means x at 0(FP), y at 8(FP), and the return value at 16(FP).

Here's the assembly again for reference:

TEXT ·add(SB),NOSPLIT,$0
	MOVQ x+0(FP), BX
	MOVQ y+8(FP), BP
	ADDQ BP, BX
	MOVQ BX, ret+16(FP)
	RET

The assembly version of the add function is loading the variable x at memory address +0(FP) into the register BX . It then loads y at memory address +8(FP) into the register BP , adds BP to BX storing the result in BX , and finally copies BX to the memory address +16(FP) and returns from the function. The calling function, which put all the arguments onto the stack, will read the return value from where we left it.

Optimizing a Function with Assembly

We don't really need to write assembly to add two numbers together, so what could we actually use it for?

Let's say you have a bunch of vectors, and you want to multiply them by a transformation matrix. Maybe the vectors are points and you want to translate them through 3D space. We'll use 4D vectors and a 4x4 transformation matrix, with the vectors stored in a densely packed array. This is better than, say, an array of objects with a vector property on each, because a linear scan through densely packed memory is more cache friendly.

type V4 [4]float32
type M4 [16]float32

func M4MultiplyV4(m M4, v V4) V4 {
	return V4{
		v[0]*m[0] + v[1]*m[4] + v[2]*m[8] + v[3]*m[12],
		v[0]*m[1] + v[1]*m[5] + v[2]*m[9] + v[3]*m[13],
		v[0]*m[2] + v[1]*m[6] + v[2]*m[10] + v[3]*m[14],
		v[0]*m[3] + v[1]*m[7] + v[2]*m[11] + v[3]*m[15],
	}
}

func multiply(data []V4, m M4) {
	for i, v := range data {
		data[i] = M4MultiplyV4(m, v)
	}
}

go get goroutines.com/asm-process-vectors

This takes 140ms for 128MB of data on a 2012 MacBook Pro using Go 1.5.3. What's the fastest this implementation could be? A memory copy seems like a good baseline; it takes about 14ms.
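The benchmark harness itself isn't shown; here's a minimal sketch of how you might measure the same thing (my own harness, reusing the V4/M4 code from above; 128MB of data corresponds to 8 million 16-byte vectors):

```go
package main

import (
	"fmt"
	"time"
)

type V4 [4]float32
type M4 [16]float32

func M4MultiplyV4(m M4, v V4) V4 {
	return V4{
		v[0]*m[0] + v[1]*m[4] + v[2]*m[8] + v[3]*m[12],
		v[0]*m[1] + v[1]*m[5] + v[2]*m[9] + v[3]*m[13],
		v[0]*m[2] + v[1]*m[6] + v[2]*m[10] + v[3]*m[14],
		v[0]*m[3] + v[1]*m[7] + v[2]*m[11] + v[3]*m[15],
	}
}

func multiply(data []V4, m M4) {
	for i, v := range data {
		data[i] = M4MultiplyV4(m, v)
	}
}

func main() {
	// 128MB of vectors: each V4 is 16 bytes, so 8M of them
	data := make([]V4, 8*1024*1024)
	for i := range data {
		data[i] = V4{1, 2, 3, 1}
	}
	// identity transform, stored in the layout M4MultiplyV4 expects
	m := M4{
		1, 0, 0, 0,
		0, 1, 0, 0,
		0, 0, 1, 0,
		0, 0, 0, 1,
	}
	start := time.Now()
	multiply(data, m)
	fmt.Println("multiply took", time.Since(start))
}
```

For more stable numbers you'd want to use the testing package's benchmark support (go test -bench) rather than a single timed run.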

Here's a version of the function written in assembly using SIMD instructions to perform the multiplications, which lets us multiply 4 float32s in parallel:

// func multiply(data []V4, m M4)
//
// memory layout of the stack relative to FP
//  +0 data slice ptr
//  +8 data slice len
// +16 data slice cap
// +24 m[0]  | m[1]
// +32 m[2]  | m[3]
// +40 m[4]  | m[5]
// +48 m[6]  | m[7]
// +56 m[8]  | m[9]
// +64 m[10] | m[11]
// +72 m[12] | m[13]
// +80 m[14] | m[15]
TEXT ·multiply(SB),NOSPLIT,$0
	// data ptr
	MOVQ data+0(FP), CX
	// data len
	MOVQ data+8(FP), SI
	// index into data
	MOVQ $0, AX
	// return early if zero length
	CMPQ AX, SI
	JE END
	// load the matrix into 128-bit wide xmm registers
	// load [m[0], m[1], m[2], m[3]] into xmm0
	MOVUPS m+24(FP), X0
	// load [m[4], m[5], m[6], m[7]] into xmm1
	MOVUPS m+40(FP), X1
	// load [m[8], m[9], m[10], m[11]] into xmm2
	MOVUPS m+56(FP), X2
	// load [m[12], m[13], m[14], m[15]] into xmm3
	MOVUPS m+72(FP), X3
LOOP:
	// load each component of the vector into xmm registers
	// load data[i][0] (x) into xmm4
	MOVSS 0(CX), X4
	// load data[i][1] (y) into xmm5
	MOVSS 4(CX), X5
	// load data[i][2] (z) into xmm6
	MOVSS 8(CX), X6
	// load data[i][3] (w) into xmm7
	MOVSS 12(CX), X7
	// broadcast each component of the vector across its register
	// [0, 0, 0, x] => [x, x, x, x]
	SHUFPS $0, X4, X4
	// [0, 0, 0, y] => [y, y, y, y]
	SHUFPS $0, X5, X5
	// [0, 0, 0, z] => [z, z, z, z]
	SHUFPS $0, X6, X6
	// [0, 0, 0, w] => [w, w, w, w]
	SHUFPS $0, X7, X7
	// xmm4 = [m[0], m[1], m[2], m[3]] .* [x, x, x, x]
	MULPS X0, X4
	// xmm5 = [m[4], m[5], m[6], m[7]] .* [y, y, y, y]
	MULPS X1, X5
	// xmm6 = [m[8], m[9], m[10], m[11]] .* [z, z, z, z]
	MULPS X2, X6
	// xmm7 = [m[12], m[13], m[14], m[15]] .* [w, w, w, w]
	MULPS X3, X7
	// xmm4 = xmm4 + xmm5
	ADDPS X5, X4
	// xmm4 = xmm4 + xmm6
	ADDPS X6, X4
	// xmm4 = xmm4 + xmm7
	ADDPS X7, X4
	// data[i] = xmm4
	MOVNTPS X4, 0(CX)
	// data++
	ADDQ $16, CX
	// i++
	INCQ AX
	// loop while i < len(data)
	CMPQ AX, SI
	JLT LOOP
END:
	// since we used a non-temporal write (MOVNTPS),
	// make sure all writes are visible before we leave the function
	SFENCE
	RET

go get goroutines.com/asm-process-vectors-asm-simd

This one takes 18ms, so it's pretty close to the speed of the memory copy. A better optimization might be to run this sort of thing on the GPU instead of on the CPU because the GPU is really good at it. Here are the running times for the different programs, including a manually inlined Go version and a non-SIMD assembly implementation:

program    time    speedup
original   140ms   1x
inline     69ms    2x
assembly   43ms    3x
simd       17ms    8x
copy       15ms    9x
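The manually inlined Go version isn't shown above; here's a sketch of what it might look like (my reconstruction, not necessarily the original code). The matrix elements are read into locals once, and each vector is transformed in place without a function call or the copies that come with it:

```go
package main

import "fmt"

type V4 [4]float32
type M4 [16]float32

// multiplyInline is a hand-inlined version of multiply: the matrix is
// hoisted into locals before the loop and each vector is transformed
// in place without calling M4MultiplyV4.
func multiplyInline(data []V4, m M4) {
	m0, m1, m2, m3 := m[0], m[1], m[2], m[3]
	m4, m5, m6, m7 := m[4], m[5], m[6], m[7]
	m8, m9, m10, m11 := m[8], m[9], m[10], m[11]
	m12, m13, m14, m15 := m[12], m[13], m[14], m[15]
	for i := range data {
		x, y, z, w := data[i][0], data[i][1], data[i][2], data[i][3]
		data[i][0] = x*m0 + y*m4 + z*m8 + w*m12
		data[i][1] = x*m1 + y*m5 + z*m9 + w*m13
		data[i][2] = x*m2 + y*m6 + z*m10 + w*m14
		data[i][3] = x*m3 + y*m7 + z*m11 + w*m15
	}
}

func main() {
	data := []V4{{1, 1, 1, 1}}
	// translate by (5, 6, 7): with this layout the translation
	// lives in m[12..14] and applies when w == 1
	m := M4{
		1, 0, 0, 0,
		0, 1, 0, 0,
		0, 0, 1, 0,
		5, 6, 7, 1,
	}
	multiplyInline(data, m)
	fmt.Println(data[0]) // prints [6 7 8 1]
}
```

Go 1.5's compiler didn't inline M4MultiplyV4 on its own, which is why doing it by hand was worth a 2x speedup in the table above.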

When optimizing, we can pay the cost of some code complexity to make things easier for the machine. Assembly is a complicated way to do that, but sometimes it's the best method available.

Implementation Notes

I developed the assembly parts mostly in C and x64 assembly using Xcode and then ported the assembly to the Go format. Xcode has a nice debugger that lets you inspect CPU registers while the program is running. If you include an assembly .s file in an Xcode project, it will build it and link it into your executable.