Go allows authors to write functions in assembly if required. The Go half of such a function is a declaration without a body, known as a stub or forward declaration.

package asm

// Add returns the sum of a and b.
func Add(a int64, b int64) int64

Here we’re declaring Add, a function which takes two int64s and returns their sum. Add is a normal Go function declaration, except it is missing the function body.

If we were to try to compile this package, the compiler, justifiably, complains:

% go build examples/asm
./decl.go:4:6: missing function body

To satisfy the compiler we must supply a body for Add via assembly, which we do by adding a .s file in the same package.

TEXT ·Add(SB), $0-24        // 0 bytes of locals, 24 bytes of arguments and results
	MOVQ a+0(FP), AX    // load a
	ADDQ b+8(FP), AX    // AX += b
	MOVQ AX, ret+16(FP) // store the result
	RET
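With the declaration and the assembly in place, Add can be exercised like any other Go function. Here is a minimal test sketch, assuming the examples/asm import path from the build command above:

package asm_test

import (
	"testing"

	"examples/asm" // import path assumed from the build command above
)

func TestAdd(t *testing.T) {
	if got := asm.Add(2, 3); got != 5 {
		t.Fatalf("Add(2, 3) = %d, want 5", got)
	}
}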

Now we can build, test, and use our Add function just like normal Go code. But there’s a problem: assembly functions cannot be inlined.

This has long been a complaint from Go developers who want to use assembly either for performance or to access operations which are not exposed in the language. Some examples would be vector instructions, atomic instructions, and so on. Without the ability to inline assembly functions, writing these functions in Go can carry a relatively large overhead.

var Result int64

func BenchmarkAddNative(b *testing.B) {
	var r int64
	for i := 0; i < b.N; i++ {
		r = int64(i) + int64(i)
	}
	Result = r
}

func BenchmarkAddAsm(b *testing.B) {
	var r int64
	for i := 0; i < b.N; i++ {
		r = Add(int64(i), int64(i))
	}
	Result = r
}

BenchmarkAddNative-8    1000000000    0.300 ns/op
BenchmarkAddAsm-8        606165915    1.93 ns/op

Over the years there have been various proposals for an inline assembly syntax similar to gcc’s asm(...) directive. None have been accepted by the Go team. Instead, Go has added intrinsic functions.

An intrinsic function is Go code written in regular Go. These functions are known to the Go compiler, which contains replacements it can substitute during compilation. As of Go 1.13 the packages which the compiler knows about are:

math/bits

sync/atomic

The functions in these packages have baroque signatures, but this lets the compiler transparently replace the function call with comparable native instructions if your architecture supports a more efficient way of performing the operation.
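For example, math/bits.RotateLeft32 looks like a regular function call, but on amd64 the compiler typically replaces it with a single rotate instruction. A minimal sketch:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// Rotating the top bit around to the bottom. On amd64 this call
	// is typically replaced with a single ROLL instruction rather
	// than a function call.
	fmt.Println(bits.RotateLeft32(0x80000001, 1)) // prints 3
}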

For the remainder of this post we’ll study two different ways the Go compiler produces more efficient code using intrinsics.

Ones count

Population count, the number of 1 bits in a word, is an important cryptographic and compression primitive. Because this is an important operation most modern CPUs provide a native hardware implementation.

The math/bits package exposes support for this operation via the OnesCount series of functions. The various OnesCount functions are recognised by the compiler and, depending on the CPU architecture and the version of Go, will be replaced with the native hardware instruction.
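Counting the set bits in a word is then an ordinary-looking call. A minimal sketch:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// 0b10110 has three bits set (binary literals need Go 1.13+).
	fmt.Println(bits.OnesCount64(0b10110)) // prints 3
}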

To see how effective this can be, let’s compare three different ones count implementations. The first is Kernighan’s Algorithm.

func kernighan(x uint64) int {
	var count int
	for ; x > 0; x &= (x - 1) {
		count++
	}
	return count
}

This algorithm has a maximum loop count equal to the number of bits set: the more bits set, the more loops it will take.
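To see why, here’s a quick trace of the loop: each x &= (x - 1) clears the lowest set bit, so the loop body runs exactly once per set bit.

package main

import "fmt"

func main() {
	x := uint64(0b10110) // three bits set, so three iterations
	for ; x > 0; x &= x - 1 {
		fmt.Printf("%05b\n", x) // prints 10110, then 10100, then 10000
	}
}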

The second algorithm is taken from Hacker’s Delight via issue 14813.

func hackersdelight(x uint64) int {
	const m1 = 0x5555555555555555  // 0101...
	const m2 = 0x3333333333333333  // 00110011...
	const m4 = 0x0f0f0f0f0f0f0f0f  // 00001111...
	const h01 = 0x0101010101010101
	x -= (x >> 1) & m1             // put count of each 2 bits into those 2 bits
	x = (x & m2) + ((x >> 2) & m2) // put count of each 4 bits into those 4 bits
	x = (x + (x >> 4)) & m4        // put count of each 8 bits into those 8 bits
	return int((x * h01) >> 56)    // sum the byte counts into the top byte
}

Lots of clever bit twiddling allows this version to run in constant time and optimises very well if the input is a constant (the whole thing optimises away if the compiler can figure out the answer at compile time).

Let’s benchmark these implementations against math/bits.OnesCount64.

var Result int

func BenchmarkKernighan(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = kernighan(uint64(i))
	}
	Result = r
}

func BenchmarkPopcnt(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = hackersdelight(uint64(i))
	}
	Result = r
}

func BenchmarkMathBitsOnesCount64(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = bits.OnesCount64(uint64(i))
	}
	Result = r
}

To keep it fair, we’re feeding each function under test the same input: a sequence of integers from zero to b.N. This is fairer to Kernighan’s method, as its runtime increases with the number of one bits in the input argument.

BenchmarkKernighan-8              100000000    11.2 ns/op
BenchmarkPopcnt-8                 618312062    2.02 ns/op
BenchmarkMathBitsOnesCount64-8   1000000000    0.565 ns/op

The winner, by nearly 4x, is math/bits.OnesCount64, but is this really using a hardware instruction, or is the compiler just doing a better job of optimising this code? Let’s check the assembly:

% go test -c
% go tool objdump -s MathBitsOnesCount popcnt-intrinsic.test
TEXT examples/popcnt-intrinsic.BenchmarkMathBitsOnesCount64(SB) /examples/popcnt-intrinsic/popcnt_test.go
  popcnt_test.go:45  0x10f8610  65488b0c2530000000  MOVQ GS:0x30, CX
  popcnt_test.go:45  0x10f8619  483b6110            CMPQ 0x10(CX), SP
  popcnt_test.go:45  0x10f861d  7668                JBE 0x10f8687
  popcnt_test.go:45  0x10f861f  4883ec20            SUBQ $0x20, SP
  popcnt_test.go:45  0x10f8623  48896c2418          MOVQ BP, 0x18(SP)
  popcnt_test.go:45  0x10f8628  488d6c2418          LEAQ 0x18(SP), BP
  popcnt_test.go:47  0x10f862d  488b442428          MOVQ 0x28(SP), AX
  popcnt_test.go:47  0x10f8632  31c9                XORL CX, CX
  popcnt_test.go:47  0x10f8634  31d2                XORL DX, DX
  popcnt_test.go:47  0x10f8636  eb03                JMP 0x10f863b
  popcnt_test.go:47  0x10f8638  48ffc1              INCQ CX
  popcnt_test.go:47  0x10f863b  48398808010000      CMPQ CX, 0x108(AX)
  popcnt_test.go:47  0x10f8642  7e32                JLE 0x10f8676
  popcnt_test.go:48  0x10f8644  803d29d5150000      CMPB $0x0, runtime.x86HasPOPCNT(SB)
  popcnt_test.go:48  0x10f864b  740a                JE 0x10f8657
  popcnt_test.go:48  0x10f864d  4831d2              XORQ DX, DX
  popcnt_test.go:48  0x10f8650  f3480fb8d1          POPCNT CX, DX  // math/bits.OnesCount64
  popcnt_test.go:48  0x10f8655  ebe1                JMP 0x10f8638
  popcnt_test.go:47  0x10f8657  48894c2410          MOVQ CX, 0x10(SP)
  popcnt_test.go:48  0x10f865c  48890c24            MOVQ CX, 0(SP)
  popcnt_test.go:48  0x10f8660  e87b28f8ff          CALL math/bits.OnesCount64(SB)
  popcnt_test.go:48  0x10f8665  488b542408          MOVQ 0x8(SP), DX
  popcnt_test.go:47  0x10f866a  488b442428          MOVQ 0x28(SP), AX
  popcnt_test.go:47  0x10f866f  488b4c2410          MOVQ 0x10(SP), CX
  popcnt_test.go:48  0x10f8674  ebc2                JMP 0x10f8638
  popcnt_test.go:50  0x10f8676  48891563d51500      MOVQ DX, examples/popcnt-intrinsic.Result(SB)
  popcnt_test.go:51  0x10f867d  488b6c2418          MOVQ 0x18(SP), BP
  popcnt_test.go:51  0x10f8682  4883c420            ADDQ $0x20, SP
  popcnt_test.go:51  0x10f8686  c3                  RET
  popcnt_test.go:45  0x10f8687  e884eef5ff          CALL runtime.morestack_noctxt(SB)
  popcnt_test.go:45  0x10f868c  eb82                JMP examples/popcnt-intrinsic.BenchmarkMathBitsOnesCount64(SB)
  :-1                0x10f868e  cc                  INT $0x3
  :-1                0x10f868f  cc                  INT $0x3

There’s quite a bit going on here, but the key takeaway is that on line 48 (the line number taken from the source of the _test.go file) the program is using the x86 POPCNT instruction, as we hoped. This turns out to be faster than bit twiddling.

Of interest is the comparison two instructions prior to the POPCNT:

CMPB $0x0, runtime.x86HasPOPCNT(SB)

As not all Intel CPUs support POPCNT, the Go runtime records at startup whether the CPU has the necessary support and stores the result in runtime.x86HasPOPCNT. Each time through the benchmark loop the program checks whether the CPU has POPCNT support before it issues the POPCNT instruction.
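In Go terms, the emitted code behaves roughly like the sketch below. This is an illustration only: x86HasPOPCNT here is a plain variable standing in for runtime.x86HasPOPCNT, and popcntHW is a hypothetical stand-in for the raw POPCNT instruction.

package main

import "fmt"

// x86HasPOPCNT mimics the flag the runtime sets at startup after
// probing the CPU; the real flag is runtime.x86HasPOPCNT.
var x86HasPOPCNT = true

// popcntSoftware is the fallback taken when the instruction is
// unavailable (Kernighan's loop, for brevity).
func popcntSoftware(x uint64) int {
	var n int
	for ; x > 0; x &= x - 1 {
		n++
	}
	return n
}

// popcntHW is a hypothetical stand-in for the raw POPCNT instruction.
func popcntHW(x uint64) int { return popcntSoftware(x) }

// onesCount64 sketches the shape of the emitted code: test the feature
// flag, then either issue POPCNT or fall back to software.
func onesCount64(x uint64) int {
	if x86HasPOPCNT {
		return popcntHW(x)
	}
	return popcntSoftware(x)
}

func main() {
	fmt.Println(onesCount64(0b10110)) // prints 3
}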

The value of runtime.x86HasPOPCNT isn’t expected to change during the life of the program, so the result of the check is highly predictable, making the check relatively cheap.

Atomic counter

As well as generating more efficient code, intrinsic functions are just regular Go code, so the rules of inlining (including mid-stack inlining) apply equally to them.

Here’s an example of an atomic counter type. It’s got methods on types, method calls several layers deep, multiple packages, etc.

import (
	"sync/atomic"
)

type counter uint64

func (c *counter) get() uint64 {
	return atomic.LoadUint64((*uint64)(c))
}

func (c *counter) inc() uint64 {
	return atomic.AddUint64((*uint64)(c), 1)
}

func (c *counter) reset() uint64 {
	return atomic.SwapUint64((*uint64)(c), 0)
}

var c counter

func f() uint64 {
	c.inc()
	c.get()
	return c.reset()
}

You’d be forgiven for thinking this would have a lot of overhead. However, because of the interaction between inlining and compiler intrinsics, this code collapses down to efficient native code on most platforms.

TEXT main.f(SB) examples/counter/counter.go
  counter.go:23  0x10512e0  90              NOPL
  counter.go:29  0x10512e1  b801000000      MOVL $0x1, AX
  counter.go:13  0x10512e6  488d0d0bca0800  LEAQ main.c(SB), CX
  counter.go:13  0x10512ed  f0480fc101      LOCK XADDQ AX, 0(CX)  // c.inc
  counter.go:24  0x10512f2  90              NOPL
  counter.go:10  0x10512f3  488b05fec90800  MOVQ main.c(SB), AX   // c.get
  counter.go:25  0x10512fa  90              NOPL
  counter.go:16  0x10512fb  31c0            XORL AX, AX
  counter.go:16  0x10512fd  488701          XCHGQ AX, 0(CX)       // c.reset
  counter.go:16  0x1051300  c3              RET

By way of explanation: the first operation, counter.go:13 (c.inc), is a LOCKed XADDQ, which on x86 is an atomic increment. The second, counter.go:10 (c.get), is a regular load from memory, thanks to x86’s strong memory consistency model. The final operation, counter.go:16 (c.reset), is an atomic exchange of the address in CX with AX, which was zeroed on the previous line. This puts the value in AX, zero, into the address stored in CX; the value previously stored at (CX) is discarded.

Conclusion