A Foray Into Go Assembly Programming

This blog post started last August when I was integrating the Spectator PercentileTimer concept into the metrics library in Rend so we could get better cross-fleet latency percentiles.

As a part of doing this, I had to port the code that selects which bucket (counter) to increment inside the PercentileTimer distribution. A PercentileTimer is implemented as a large array of counters, each of which represents a bucket. A counter is incremented whenever an observation lands in its bucket. The Atlas backend can then use these (properly tagged) counters as a group to derive cross-fleet percentiles to within a couple of percent error. The bucketing scheme divides the range of int64 into powers of 4, which are then subdivided linearly by 3 for the set of final buckets. This code is fairly quick to run and compact, if a bit opaque at first.

Side note: optimizations

When I saw a divide by 3, I shuddered a little bit because I assumed it would be less efficient to do the division as a DIV instruction instead of as a shift operation like a divide by 4 or 2 would be. Little did I know that people had solved this problem before. It’s a common compiler optimization to replace division by an integer constant with multiplication by a precomputed reciprocal, the technique described by Granlund and Montgomery. In this case, a 32-bit divide by 3 is equivalent to multiplying by 0x55555556 and then taking the top half of the 64-bit product. Thanks, StackOverflow.
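To see the trick in action, here is a minimal sketch in Go. It is not taken from Rend or Spectator, and the function names div3Int32 and div3Hi64 are made up for illustration. The first function checks the 32-bit constant 0x55555556 on non-negative inputs only. The second uses math/bits to take the top half of a 64-bit multiply by 0x5555555555555556, the constant that shows up in the final assembly later in this post. The 64-bit form is only exercised here on powers of 4, which is all the bucket code feeds it.

```go
package main

import (
	"fmt"
	"math/bits"
)

// div3Int32 divides a non-negative 32-bit value by 3 with a multiply and a
// shift. Negative inputs would need an extra correction step, omitted here.
func div3Int32(n int32) int32 {
	return int32((int64(n) * 0x55555556) >> 32)
}

// div3Hi64 returns the high 64 bits of n * 0x5555555555555556, which is what
// MULQ leaves in DX in the assembly version.
func div3Hi64(n uint64) uint64 {
	hi, _ := bits.Mul64(n, 0x5555555555555556)
	return hi
}

func main() {
	for _, n := range []int32{0, 1, 2, 3, 100, 1<<31 - 1} {
		fmt.Printf("%d/3: multiply=%d divide=%d\n", n, div3Int32(n), n/3)
	}
	// Every power of 4 that the bucket code can produce as prevPowerOf4.
	for shift := uint(4); shift <= 62; shift += 2 {
		n := uint64(1) << shift
		if div3Hi64(n) != n/3 {
			fmt.Println("mismatch at shift", shift)
		}
	}
}
```

The multiply replaces a DIV, which is typically one of the slower integer instructions.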

The descent

At this point I was also looking for an excuse to program something in assembly with Go. I actually already had a couple of other files in Rend that had assembly in them, but they were borrowed from other places and not my own original work. I wanted to translate this bucket selection code into assembly to see how fast I could make it.

The first step was to translate it into Go code so I had something to compare my assembly code against. This was easy as it was pretty straightforward to change the Java code into Go. The only hangup was recreating the indexes array that is part of the static initialization in the Java code.

From there, it was time to create the function and just get it set up to be called. This is where the trickery begins. From here on this post is mostly a list of things that I ran into that I had to do some research to solve.

Go version

The entire Go version of the code is below for reference. This is the original code that I worked from throughout this process. In this code, lzcnt is another assembly function stolen from Rend that just returns the number of leading zeros of a uint64.

const numBuckets = 276

var powerOf4Index = []int{
	0, 3, 14, 23, 32, 41, 50, 59,
	68, 77, 86, 95, 104, 113, 122, 131,
	140, 149, 158, 167, 176, 185, 194, 203,
	212, 221, 230, 239, 248, 257, 266, 275,
}

func getBucket(n uint64) uint64 {
	if n <= 15 {
		return n
	}

	rshift := 64 - lzcnt(n) - 1
	lshift := rshift

	if lshift&1 == 1 {
		lshift--
	}

	prevPowerOf4 := (n >> rshift) << lshift
	delta := prevPowerOf4 / 3
	offset := int((n - prevPowerOf4) / delta)
	pos := offset + powerOf4Index[lshift/2]

	if pos >= numBuckets-1 {
		return numBuckets - 1
	}

	return uint64(pos + 1)
}
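To make the intermediate values concrete, here is a small sketch that prints them for a few inputs. It is an illustration rather than code from Rend: lzcnt is replaced by a stand-in built on math/bits (assuming the assembly helper simply counts leading zero bits), and decompose is a hypothetical helper that exposes the values getBucket computes on its way to a bucket index.

```go
package main

import (
	"fmt"
	"math/bits"
)

// lzcnt is a stand-in for the assembly helper mentioned above. Assumption:
// the real one returns the number of leading zero bits of n.
func lzcnt(n uint64) uint64 { return uint64(bits.LeadingZeros64(n)) }

// decompose returns the intermediate values getBucket computes for n > 15.
func decompose(n uint64) (prevPowerOf4, delta, offset uint64) {
	rshift := 64 - lzcnt(n) - 1 // index of the highest set bit
	lshift := rshift &^ 1       // round down to an even shift, i.e. a power of 4
	prevPowerOf4 = (n >> rshift) << lshift
	delta = prevPowerOf4 / 3
	offset = (n - prevPowerOf4) / delta
	return prevPowerOf4, delta, offset
}

func main() {
	for _, n := range []uint64{16, 32, 100, 1000} {
		p, d, o := decompose(n)
		fmt.Printf("n=%d prevPowerOf4=%d delta=%d offset=%d\n", n, p, d, o)
	}
}
```

For n=100, for example, the previous power of 4 is 64, each sub-bucket is 21 wide, and the observation lands at offset 1 within that power of 4.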

The setup

Declaring an assembly function is more complex than a standard function. There is the equivalent of a C function declaration in Go code and then the actual assembly implementation in another .s file.

There are a few things that are sticky when making a new function:

“Bridging” Go and ASM

This is something I had a little difficulty with because the method was not entirely clear at first.

In order to be able to compile the program, you need to create a Go version of the function declaration in a .go file alongside the .s file that contains the assembly implementation:

func getBucketASM(n uint64) uint64

The way to think about this, at least in my mind, is that this is your interface definition in Go and the implementation is in assembly in another file. Go code uses the interface.

The middot and (SB)

Function names in Go assembly files start with a middot character ( · ). The function declaration starts like this:

TEXT ·getBucketASM(SB)

TEXT means that what follows belongs in the text section of the binary (runnable code). Next comes the middot · and then the name of the function. The (SB) immediately after the name is required. SB stands for “static base” and is an artifact of the Plan9 assembly format. The real reason, from the Plan9 ASM documentation, is that functions and static data are addressed as offsets from the start of the program’s address space.

I literally copy-paste the middot every time. Who has a middot key?

To NOSPLIT or not to NOSPLIT

The asm doc says this about NOSPLIT :

NOSPLIT = 4 (For TEXT items.) Don’t insert the preamble to check if the stack must be split. The frame for the routine, plus anything it calls, must fit in the spare space at the top of the stack segment. Used to protect routines such as the stack splitting code itself.

In this function’s case, we can add NOSPLIT because the function doesn’t use any stack space at all beyond the arguments it receives. The annotation was probably not strictly necessary, but it’s fine to use in this case. At this point I’m still not sure how the “spare space” at the top of the stack works and I haven’t found a good bit of documentation to tell me.

How much stack space?

If your function requires more space than you have registers, you may need to spill on to the stack temporarily. In this case, you need to tell the compiler how much extra space you need in bytes. This function doesn’t need that many temporary variables, so it doesn’t spill out of the registers. That means we can use $0 as our stack space. The stack space is the last thing we need on the declaration line.

At this point we have one line done!

TEXT ·getBucketASM(SB), 4, $0

I have 4 written instead of NOSPLIT because I wasn’t quite doing things right. I’ll get to that in the next section.

Static array

In order to make the algorithm work, I needed to declare a static array of numbers that represent offsets into the array of counters. First, I derived a declaration from the standard library AES GCM code:

GLOBL powerOf4Index<>(SB), (NOPTR+RODATA), $256
> unexpected NOPTR evaluating expression

This didn’t work, however, because the NOPTR and RODATA symbols were undefined. I tried each on their own:

GLOBL powerOf4Index<>(SB), RODATA, $256
> illegal or missing addressing mode for symbol RODATA

GLOBL powerOf4Index<>(SB), NOPTR, $256
> illegal or missing addressing mode for symbol NOPTR

Again, same result. To be expected, because they weren’t defined before. I didn’t know this at the time, though, because I was flailing about in the dark. I tried it without the annotation at all:

GLOBL powerOf4Index<>(SB), $256
> missing Go type information for global symbol

Again, no dice. It needs something there to tell the compiler how to treat the data.

It took me a while to find, but the asm documentation on the official Go website was actually the most helpful here. For “some unknown reason,” my code was unable to compile with the mnemonics so I just replaced RODATA and NOPTR with the numbers that represent them:

GLOBL powerOf4Index<>(SB), (8+16), $256

Aha! These two flags tell the compiler to treat the “array” as read-only data that contains no pointers.

Of course hindsight is 20/20, meaning that after this entire exercise was over I found the proper header file to include to get these symbols. I didn’t figure out how to actually compile my code with this header file in place for this post, but the assembly files in the Go codebase all include it right at the top:

#include "textflag.h"

It’s also important to note how the data is laid out. Each DATA line is declaring a value for a given 8 byte chunk of the static data. The name of the “array” is first, followed by the offset (the type of which is defined in this picture from Plan9) and the size, then finally the value. After all of the DATA lines are complete, the GLOBL symbol powerOf4Index is declared along with some flags and its total size.

Now the Go array

var powerOf4Index = []int{
	0, 3, 14, 23, 32, 41, 50, 59,
	68, 77, 86, 95, 104, 113, 122, 131,
	140, 149, 158, 167, 176, 185, 194, 203,
	212, 221, 230, 239, 248, 257, 266, 275,
}

has become this block of assembly DATA declarations:

DATA powerOf4Index<>+0x00(SB)/8, $0 DATA powerOf4Index<>+0x08(SB)/8, $3 DATA powerOf4Index<>+0x10(SB)/8, $14 DATA powerOf4Index<>+0x18(SB)/8, $23 DATA powerOf4Index<>+0x20(SB)/8, $32 DATA powerOf4Index<>+0x28(SB)/8, $41 DATA powerOf4Index<>+0x30(SB)/8, $50 DATA powerOf4Index<>+0x38(SB)/8, $59 DATA powerOf4Index<>+0x40(SB)/8, $68 DATA powerOf4Index<>+0x48(SB)/8, $77 DATA powerOf4Index<>+0x50(SB)/8, $86 DATA powerOf4Index<>+0x58(SB)/8, $95 DATA powerOf4Index<>+0x60(SB)/8, $104 DATA powerOf4Index<>+0x68(SB)/8, $113 DATA powerOf4Index<>+0x70(SB)/8, $122 DATA powerOf4Index<>+0x78(SB)/8, $131 DATA powerOf4Index<>+0x80(SB)/8, $140 DATA powerOf4Index<>+0x88(SB)/8, $149 DATA powerOf4Index<>+0x90(SB)/8, $158 DATA powerOf4Index<>+0x98(SB)/8, $167 DATA powerOf4Index<>+0xa0(SB)/8, $176 DATA powerOf4Index<>+0xa8(SB)/8, $185 DATA powerOf4Index<>+0xb0(SB)/8, $194 DATA powerOf4Index<>+0xb8(SB)/8, $203 DATA powerOf4Index<>+0xc0(SB)/8, $212 DATA powerOf4Index<>+0xc8(SB)/8, $221 DATA powerOf4Index<>+0xd0(SB)/8, $230 DATA powerOf4Index<>+0xd8(SB)/8, $239 DATA powerOf4Index<>+0xe0(SB)/8, $248 DATA powerOf4Index<>+0xe8(SB)/8, $257 DATA powerOf4Index<>+0xf0(SB)/8, $266 DATA powerOf4Index<>+0xf8(SB)/8, $275 // RODATA == 8 // NOPTR == 16 GLOBL powerOf4Index<>(SB), (8+16), $256

If you properly import textflag.h then you could just change the declaration to be like it was originally:

GLOBL powerOf4Index<>(SB), (NOPTR+RODATA), $256

As for most of my struggles, careful reading of the Go ASM doc would have explained this to me.
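Typing 32 offsets by hand is easy to get wrong. As a hedged sketch (this generator was not part of the original work, and the helper names dataLine and render are made up for illustration), a few lines of Go can print the DATA block from the Go slice. The GLOBL line it emits uses the NOPTR and RODATA mnemonics, so it assumes textflag.h is included in the .s file.

```go
package main

import (
	"fmt"
	"strings"
)

var powerOf4Index = []int{
	0, 3, 14, 23, 32, 41, 50, 59,
	68, 77, 86, 95, 104, 113, 122, 131,
	140, 149, 158, 167, 176, 185, 194, 203,
	212, 221, 230, 239, 248, 257, 266, 275,
}

// dataLine formats the DATA directive for the i-th 8-byte entry.
func dataLine(i, v int) string {
	return fmt.Sprintf("DATA powerOf4Index<>+0x%02x(SB)/8, $%d", i*8, v)
}

// render emits one DATA line per value, then the GLOBL line with the total
// size in bytes (8 bytes per entry).
func render(values []int) string {
	var sb strings.Builder
	for i, v := range values {
		sb.WriteString(dataLine(i, v))
		sb.WriteByte('\n')
	}
	fmt.Fprintf(&sb, "GLOBL powerOf4Index<>(SB), (NOPTR+RODATA), $%d\n", len(values)*8)
	return sb.String()
}

func main() {
	fmt.Print(render(powerOf4Index))
}
```

With 32 entries of 8 bytes each, the size on the GLOBL line comes out to $256, matching the hand-written declaration.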

Troubleshooting the shifts

Illegal instruction

This is what I was faced with. Not much information there. I did manage to isolate the error to the most recent bit of code I had added, which had both a SHRQ and a SHLQ , which shift a quadword (64 bits) right and left, respectively. These can shift by a fixed amount or by a dynamic amount. I needed to use the dynamic amount in this case. The same mnemonic actually produces two different encodings at the binary level because the instructions for dynamic and static shift amounts are different.

I had written the code to use some arbitrary register because I hadn’t thought anything of it at the time. It turns out that the assembler was smart enough to recognize that the instruction was not actually encodable, which is what the error message meant in the first place.

I dug around a little bit, not really knowing exactly where to start. Eventually, I looked at the SSA code in the Go compiler to see what kind of logic they had around the instruction. Jackpot.

The SSA code showed me only CX can be used as a variable shift amount:

//...
{
	name:         "SHRQ",
	argLen:       2,
	resultInArg0: true,
	asm:          x86.ASHRQ,
	reg: regInfo{
		inputs: []inputInfo{
			{1, 2},     // CX
			{0, 65519}, // AX CX DX BX BP SI DI R8 R9 R10 R11 R12 R13 R14 R15
		},
		clobbers: 8589934592, // FLAGS
		outputs: []regMask{
			65519, // AX CX DX BX BP SI DI R8 R9 R10 R11 R12 R13 R14 R15
		},
	},
},
//...

After this I had a bit of an epiphany: I could have just looked at the Intel manuals the whole time. In the manual (page 4-582 Volume 2B) it shows only the CL register being usable as the shift amount for the dynamic versions of SHR and SHL .

Final code

I’ve reproduced the entire assembly version of the code here without comments for brevity. If you want to see the entire thing, you can take a look at the source files below. This code depends on the data declaration above.

TEXT ·getBucketASM(SB), 4, $0
    MOVQ x+0(FP), R8
    CMPQ R8, $16
    JC underSixteen

sixteenAndOver:
    BSRQ R8, BX
    SUBQ $63, BX
    NEGQ BX
    MOVQ $63, R10
    SUBQ BX, R10
    MOVQ R10, BX
    MOVL $1, CX
    ANDQ R10, CX
    JEQ powerOfFour
    SUBQ $1, R10

powerOfFour:
    MOVQ R8, R9
    MOVQ BX, CX
    SHRQ CX, R9
    MOVQ R10, CX
    SHLQ CX, R9
    MOVQ $0x5555555555555556, AX
    MULQ R9
    MOVQ DX, BX
    MOVQ R8, AX
    SUBQ R9, AX
    MOVQ $0, DX
    DIVQ BX
    SHLQ $2, R10
    LEAQ powerOf4Index<>(SB), DX
    MOVQ (DX)(R10*1), BX
    ADDQ BX, AX
    CMPQ AX, $275
    JGE bucketOverflow
    ADDQ $1, AX
    MOVQ AX, ret+8(FP)
    RET

bucketOverflow:
    MOVQ $275, ret+8(FP)
    RET

underSixteen:
    MOVQ R8, ret+8(FP)
    RET
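Since the listing has no comments, here is a rough Go rendering of two steps that are easy to misread. It is a reading aid under stated assumptions, not part of the original source, and highBit, shifts, and tableByteOffset are hypothetical names. BSRQ yields the index of the highest set bit, which already equals 64 - lzcnt(n) - 1, so the SUBQ/NEGQ/SUBQ sequence that follows appears to land back on that same value for rshift. SHLQ $2 turns lshift into a byte offset, because lshift is even and each table entry is 8 bytes wide: (lshift/2)*8 == lshift*4.

```go
package main

import (
	"fmt"
	"math/bits"
)

// highBit mirrors BSRQ for a non-zero input: the index of the highest set bit.
func highBit(n uint64) uint64 { return uint64(bits.Len64(n) - 1) }

// shifts mirrors the register setup before the two shift instructions:
// rshift ends up in BX, lshift (rounded down to even) in R10.
func shifts(n uint64) (rshift, lshift uint64) {
	rshift = highBit(n)
	lshift = rshift
	if lshift&1 == 1 {
		lshift--
	}
	return rshift, lshift
}

// tableByteOffset mirrors SHLQ $2, R10.
func tableByteOffset(lshift uint64) uint64 { return lshift << 2 }

func main() {
	for _, n := range []uint64{16, 32, 100, 1 << 62} {
		r, l := shifts(n)
		fmt.Printf("n=%d rshift=%d lshift=%d index=%d byteOffset=%d\n",
			n, r, l, l/2, tableByteOffset(l))
	}
}
```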

Benchmarks

So what did all this struggle get me?

I wrote up some benchmarks to compare three versions of the code:

1. The original Go version above
2. The same Go code above but with //go:noinline to try to control for inlining
3. The assembly version
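The benchmark source is not reproduced in this post, so the following is only a sketch of how such a comparison can be set up. bucketShift is a hypothetical stand-in for the function under test, and the benchmark names are made up. It uses testing.Benchmark so that it runs as an ordinary program. In a real _test.go file these would be BenchmarkXxx(b *testing.B) functions run with go test -bench.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// bucketShift is a hypothetical stand-in for the function under test.
func bucketShift(n uint64) uint64 { return uint64(63 - bits.LeadingZeros64(n)) }

// bucketShiftNoInline does the same work, but the directive below keeps the
// compiler from inlining it into the benchmark loop.
//
//go:noinline
func bucketShiftNoInline(n uint64) uint64 { return uint64(63 - bits.LeadingZeros64(n)) }

// sink keeps results alive so the loop bodies are not optimized away.
var sink uint64

func benchInlinable(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink += bucketShift(uint64(i) + 16)
	}
}

func benchNoInline(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink += bucketShiftNoInline(uint64(i) + 16)
	}
}

func main() {
	fmt.Printf("%-10s %v\n", "inlinable", testing.Benchmark(benchInlinable))
	fmt.Printf("%-10s %v\n", "noinline", testing.Benchmark(benchNoInline))
}
```

An assembly function is always a real call from Go's point of view, which is why a non-inlined Go variant makes for a fairer comparison.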

$ go test -run asdsdf -bench . -count 100 | tee -a benchdata
...
$ benchstat benchdata
name                 time/op
GetBucket-8          17.4ns ± 2%
GetBucketNoInline-8  17.4ns ± 2%
GetBucketASM-8       12.8ns ± 1%

The answer is 4.6ns.

If it took me 4 hours to write this code, it would have to run 3.13043478 × 10^12 times to be worthwhile. Luckily, this code would be run that many times in about a day or so in production at Netflix.
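The break-even figure is just the time spent divided by the time saved per call. As a quick sketch of that arithmetic (breakEvenCalls is a made-up helper name, and the 4 hours is the hypothetical from the sentence above):

```go
package main

import "fmt"

// breakEvenCalls returns how many calls it takes for a per-call saving, in
// nanoseconds, to add up to the hours spent writing the optimization.
func breakEvenCalls(hoursSpent, nsSavedPerCall float64) float64 {
	const nsPerHour = 3600 * 1e9
	return hoursSpent * nsPerHour / nsSavedPerCall
}

func main() {
	saved := 17.4 - 12.8 // ns/op from the benchstat output
	fmt.Printf("%.3e calls\n", breakEvenCalls(4, saved))
}
```

Four hours is 1.44 × 10^13 nanoseconds, and dividing by 4.6ns per call gives roughly 3.13 × 10^12 calls.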

However, I didn’t use it.

I have a couple reasons:

1. Maintenance. If this code needs to be changed in the future by me or anyone else, I (or they) would need to brush up on assembly on x86, Go assembly quirks, etc. in order to do so. There’s quite a lot of overhead in that.
2. The time difference wasn’t worth optimizing in this case. I only did it for fun. The latency of this code is in the low dozens of microseconds, so saving 15 or 20 nanoseconds per request is not useful. There are bigger fish to fry that won’t introduce the programmer overhead of assembly code.

For those of you wondering, I did actually benchmark the ported Go code before starting this whole process.

I know this may sound rather disappointing, doing all this work without putting it into production, but it was a fantastic exercise in learning Go assembly idiosyncrasies. Hopefully this chronicle of my struggles helps you overcome some small hurdle in your assembly ventures.

Please let me know what you think about this article in the comments or @sgmansfield on twitter.

Files

If you would like to peruse the files that made up this work in their entirety, you can find them at the GitHub repository for this blog.
