Like many people with backgrounds in higher-level languages like JavaScript and Ruby, one thing that really attracted me to Rust was the ability to get “closer to the metal”. While Rust offers plenty of high-level abstractions, it certainly makes you think a bit more about lower-level concerns like memory allocation than JavaScript or Ruby do. But of course, you can always go deeper, and learning more about the abstraction layer underneath Rust can be a great way to really understand what makes Rust tick.

In this series, we'll explore the world of assembly language from the perspective of a Rust developer. We'll treat the compiler as a black box and see what kind of assembly instructions get produced from standard, run-of-the-mill Rust code. Doing this should get us a bit closer to understanding what's actually happening on our machine (though, of course, the stack goes even deeper than the assembly language abstraction layer).

The Setup

Assembly Language Variety

As assembly has a close relationship to the actual machine code for a particular computer architecture and is therefore not a platform-agnostic abstraction, we have to choose which variety we'll be exploring. For us this will be x86-64 assembly, since x86-64 is the architecture that you're most likely to find on desktop and server computers today. Hopefully at some point in this series we'll take what we've learned and see if we can apply it to another machine architecture.

There are different syntax flavors for assembly code for a given computer architecture, allowing us to write and read different assembly language syntax which assembles to the same machine instructions to be passed to the CPU. For our purposes, we'll be looking at the Intel x86-64 assembly syntax, which is often compared to the AT&T syntax. While Wikipedia says that the Intel syntax is more common in the Windows world while AT&T is more common in Unix circles, in my limited experience I've seen the Intel syntax used more often even in Unix contexts.

Godbolt

To explore assembly code, we can use a plethora of tools, but one that I find the most convenient for rapid exploration is Matt Godbolt's Compiler Explorer. This tool allows us to write Rust code, and have it compile automatically and show us the relevant assembly code complete with color coordinated highlighting indicating which parts of our code produce which parts of the assembly output. The compiler explorer uses Intel syntax by default.

What You Need to Know

I'd really love this to be as accessible as possible, but I do assume some background knowledge. You should have a passing, high-level familiarity with the following concepts:

The stack: a growable stack data structure that contains stack frames, each of which holds the local variables for one function call and gets “automatically” cleaned up when the function returns.

Registers: very small (64 bits on a 64-bit machine) memory storage on the CPU where data can be manipulated.

Memory: each process gets its own memory space that contains static data, the code being executed, the stack, and some space for dynamically allocated memory known as the heap. Memory can be thought of as a long array of bytes that starts at index (better known as address) 0 and goes all the way to address 2^64 - 1.

Basic Rust: we're only writing three lines of Rust but it still helps to have familiarity with Rust, C, or C++.

Ok, now that we're all on the same page, let's get started:

INC - debug

In this post, we're going to explore a very simple Rust library that provides one function inc which takes in a u8 , adds one to it wrapping around if it goes beyond 255 , and then returns the result:

    pub fn inc(n: u8) -> u8 {
        n.wrapping_add(1)
    }

If you're not familiar with wrapping_add , it simply wraps the number around when it overflows unlike + which panics on overflow in debug mode ( + and wrapping_add behave the same in release mode).
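To make the wrapping behavior concrete, here is a small sketch that exercises inc at and away from the boundary. The main function and the use of checked_add are additions for illustration only; checked_add is the standard library method that reports an overflow as None rather than wrapping.

```rust
/// The function from this post: add one, wrapping around past 255.
pub fn inc(n: u8) -> u8 {
    n.wrapping_add(1)
}

fn main() {
    // Away from the boundary, wrapping_add behaves like ordinary addition.
    println!("inc(41)  = {}", inc(41));
    // At the top of the u8 range the result wraps around to 0.
    println!("inc(255) = {}", inc(255));
    // checked_add reports the overflow instead of wrapping or panicking.
    println!("255u8.checked_add(1) = {:?}", 255u8.checked_add(1));
}
```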

Go to the compiler explorer, make sure you select Rust from the language drop down menu (as C++ is the default), and copy in the Rust program to the panel on the left. For this post, we'll be using Rust version 1.40.0. If you use a different version of the compiler it's possible you may see different results.

On the right hand side of the screen, you should see the following:

    core::num::<impl u8>::wrapping_add:
            sub     rsp, 2
            add     dil, sil
            mov     byte ptr [rsp + 1], dil
            mov     al, byte ptr [rsp + 1]
            mov     byte ptr [rsp], al
            mov     al, byte ptr [rsp]
            add     rsp, 2
            ret

    example::inc:
            push    rax
            movzx   edi, dil
            mov     esi, 1
            call    core::num::<impl u8>::wrapping_add
            mov     byte ptr [rsp + 7], al
            mov     al, byte ptr [rsp + 7]
            pop     rcx
            ret

This is quite a bit of assembly code just to add 1 to a number! Don't worry, we'll see later on that we can easily turn this code into just two instructions. In the meantime, this assembly has lots of interesting bits to it.

Let's explore this by first looking at the code underneath the example::inc . The example::inc: text is what's known as a label. The label labels a piece of memory - in this case our inc function. We could use the label in our assembly code as a way to refer to the location in memory where our example::inc function sits.

The Function Prologue

The first instruction in our example::inc function is push rax, which pushes whatever value is in the rax register on to the stack. In more precise terms, it subtracts 8 from rsp (the stack pointer register, which always contains the location of the top of the stack) and then copies the value in the rax register to the location now indicated by rsp.

rsp and rax are 64 bit registers known as “general purpose registers” but this is a bit of a misnomer since as we've seen rsp has the special purpose of pointing to the top of the stack. You should take a sec to read about the different registers on an x86-64 machine and how they “contain” smaller versions of themselves inside of them (e.g., rax “contains” a 32-bit register named eax , a 16-bit register named ax , and two 8-bit registers named ah and al ).
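One way to picture how the smaller registers sit inside rax is to model them as truncating views of a 64-bit value. The sketch below is only a model in Rust, not how registers are accessed in practice; the helper names eax, ax, al, and ah are hypothetical functions named after the registers they imitate.

```rust
/// Low 32 bits of `rax`.
fn eax(rax: u64) -> u32 {
    rax as u32
}

/// Low 16 bits of `rax`.
fn ax(rax: u64) -> u16 {
    rax as u16
}

/// Low 8 bits of `rax`.
fn al(rax: u64) -> u8 {
    rax as u8
}

/// Bits 8 through 15 of `rax` (the high byte of `ax`).
fn ah(rax: u64) -> u8 {
    (rax >> 8) as u8
}

fn main() {
    // An arbitrary value chosen so every byte is distinguishable.
    let rax: u64 = 0x1122_3344_5566_7788;
    println!("rax = {:#x}", rax);
    println!("eax = {:#x}", eax(rax));
    println!("ax  = {:#x}", ax(rax));
    println!("ah  = {:#x}", ah(rax));
    println!("al  = {:#x}", al(rax));
}
```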

So why does push subtract 8 from rsp ? For historical reasons the stack grows downward meaning the top of the stack is at a lower memory address than the bottom of the stack. If you want to grow the stack, you need to move the top to an even lower address by subtracting from it. The reason 8 is subtracted is because this is the size in bytes of the rax (8 bytes is 64 bits) - so we're moving the stack pointer just beyond the value we just pushed on to the stack.
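The downward-growing stack can be sketched as a toy model: a byte array plus an index standing in for rsp. Everything here (the Stack type, its 64-byte size, the method names) is hypothetical and exists only to illustrate the pointer arithmetic. A push first moves the stack pointer down by 8 and then writes the value at the new top; a pop does the reverse.

```rust
/// A toy stack: 64 bytes of memory and a stack pointer index.
struct Stack {
    memory: [u8; 64],
    rsp: usize,
}

impl Stack {
    fn new() -> Self {
        // The stack pointer starts at the high end (the bottom of the stack).
        Stack { memory: [0; 64], rsp: 64 }
    }

    /// Model of `push`: move the stack pointer down by 8, then store the value.
    fn push(&mut self, value: u64) {
        self.rsp -= 8;
        self.memory[self.rsp..self.rsp + 8].copy_from_slice(&value.to_le_bytes());
    }

    /// Model of `pop`: load the value at the top, then move the pointer up by 8.
    fn pop(&mut self) -> u64 {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.memory[self.rsp..self.rsp + 8]);
        self.rsp += 8;
        u64::from_le_bytes(bytes)
    }
}

fn main() {
    let mut stack = Stack::new();
    println!("rsp before push: {}", stack.rsp);
    stack.push(0xdead_beef);
    println!("rsp after push:  {}", stack.rsp);
    let value = stack.pop();
    println!("popped {:#x}, rsp is now {}", value, stack.rsp);
}
```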

But what's the purpose of all this? Well it turns out that we do this to uphold an important part of the function calling convention.

Aside: ABIs and Calling Conventions

An ABI (or application binary interface) is the binary interface between two binary modules. In other words if two pieces of actual machine code need to talk with each other, there's a whole host of things they need to agree upon in order to do so successfully. One such thing is a calling convention which is an agreed upon way for how functions are called. x86 assembly only has two instructions dedicated to functions: call for calling a function and ret for returning from a function. call pushes the next instruction's location on to the stack and ret pops that address off the stack and jumps to that location. But this isn't enough to handle all function calls. Where do the function arguments go? Where does the return value go? These need to be agreed upon so we can call functions with arguments and return values. We'll explore these questions in depth in this series.

While Rust may change which calling convention it uses between releases of the compiler, it needs to have a consistent way inside of a binary to call functions. It seems that as of Rust 1.40.0, Rust is using the System V ABI at least for its function calling convention. We'll be exploring what this actually entails in great depth over this series, so don't worry if this seems fuzzy. We simply need to know what the caller of a function and the called function itself need to do to allow functions to be called successfully.

One thing that the System V calling convention dictates is that the stack be 16-byte aligned whenever a function is called - meaning that the stack pointer (i.e., rsp) should be divisible by 16 just before the call instruction. Why this is, I'm not entirely sure, but it needs to be this way. If you have the answer, let me know! Since we're inside the example::inc function, we know that call was the last instruction executed. Because call pushes 8 bytes (i.e., a 64-bit address) on to the stack, the stack pointer cannot be 16-byte aligned at this point. To correct for this, we can either subtract 8 from rsp or we can push something else that's 8 bytes big on to the stack which will do this for us. Apparently, Rust and LLVM believe doing push is a better choice than subtracting, but I'm not really sure why.
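The alignment bookkeeping is just arithmetic on the stack pointer, which the sketch below walks through. The starting address is a made-up value chosen to be divisible by 16, and is_16_byte_aligned is a hypothetical helper, not something from the standard library.

```rust
/// Returns true when `rsp` is divisible by 16.
fn is_16_byte_aligned(rsp: u64) -> bool {
    rsp % 16 == 0
}

fn main() {
    // Made-up stack pointer value, aligned just before the `call`.
    let before_call: u64 = 0x7fff_ffff_e000;
    // `call` pushes the 8-byte return address.
    let on_entry = before_call - 8;
    // `push rax` moves the stack pointer down by another 8 bytes.
    let after_push = on_entry - 8;

    println!("before call aligned: {}", is_16_byte_aligned(before_call));
    println!("on entry aligned:    {}", is_16_byte_aligned(on_entry));
    println!("after push aligned:  {}", is_16_byte_aligned(after_push));
}
```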

It turns out that there's usually a bit of ceremony that a function must do when it's first called to make sure everything is in order and the actual function body can successfully take place. In the case of example::inc this was just one instruction, but other functions may need to do more, much of which we'll see later in this series. This ceremony is referred to as a function's prologue. As we'll see later there's usually also a function epilogue which cleans things up at the end of the function.

Aside: Naked Functions

As a side note: there's actually an experimental feature in Rust called “naked functions” which allows the programmer to tell the compiler to not include the function's prologue and epilogue.

Phew… that's a lot of explanation for one instruction! How long is this post going to be?! Well hopefully things should pick up a bit more from here.

Calling core::num::<impl u8>::wrapping_add

The next three instructions all have to do with calling the function wrapping_add :

    movzx   edi, dil                              ;; "copy" `dil` into `edi` and zero extend
    mov     esi, 1                                ;; copy 1 into `esi`
    call    core::num::<impl u8>::wrapping_add    ;; call `wrapping_add`

In order to call a function, we have to prepare the function arguments. In the System V calling convention, the registers rdi , rsi , rdx , rcx , r8 , and r9 (and their smaller variants) are used to store integer function arguments (with the stack being used for additional arguments after that).
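Since Rust's own calling convention is not guaranteed, one way to explicitly request the platform's C calling convention is to mark a function extern "C"; on x86-64 Linux and macOS that convention is System V. The function add_bytes below is a hypothetical example, and the register assignments in the comments describe what System V prescribes rather than anything the Rust code itself controls.

```rust
/// Hypothetical example function. `extern "C"` requests the platform's C
/// calling convention, which is System V on x86-64 Linux and macOS.
pub extern "C" fn add_bytes(a: u8, b: u8) -> u8 {
    // Under System V, `a` arrives in `dil` (the low byte of `rdi`) and
    // `b` arrives in `sil` (the low byte of `rsi`); the result goes back in `al`.
    a.wrapping_add(b)
}

fn main() {
    println!("add_bytes(255, 1) = {}", add_bytes(255, 1));
    println!("add_bytes(2, 3)   = {}", add_bytes(2, 3));
}
```

Pasting a function like this into the compiler explorer is a convenient way to check which registers the arguments show up in.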

The first instruction movzx copies the contents of the 8-bit register dil into edi . If you've read a bit about x86-64 registers, you may have noticed that dil is the 8-bit version of edi which is itself the 32-bit version of the 64-bit register rdi . As rdi is the first function argument register, edi must contain the first (and only) argument to the example::inc function.

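The movzx will “zero extend” dil and keep it in the edi register. “Zero extension” is the process by which the upper bits of the larger representation are filled with zeros, so the unsigned value stays the same. For example, when zero extending 0b1000_0001 to 16 bits, it will become 0b0000_0000_1000_0001. (This is different from “sign extension”, done by the related movsx instruction, which copies the most significant bit into the new upper bits and would produce 0b1111_1111_1000_0001.) I assume this is done so that the upper bits of edi are in a known state, even though wrapping_add only ever reads dil.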
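The difference between zero extension (what movzx performs) and sign extension (what movsx performs) can be reproduced with Rust's as casts: widening an unsigned integer zero extends, while widening a signed integer sign extends. The helper names zero_extend and sign_extend are hypothetical and only mirror the two instructions.

```rust
/// Widen like `movzx`: the new upper bits are filled with zeros.
fn zero_extend(b: u8) -> u16 {
    b as u16
}

/// Widen like `movsx`: the new upper bits copy the most significant bit.
fn sign_extend(b: u8) -> u16 {
    // Reinterpret as signed, widen (which sign extends), then view as unsigned.
    b as i8 as i16 as u16
}

fn main() {
    let b: u8 = 0b1000_0001;
    println!("zero extended: {:#018b}", zero_extend(b));
    println!("sign extended: {:#018b}", sign_extend(b));
}
```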

Next, 1 is copied into esi. Notice that we've now filled edi with the contents of dil (the argument to example::inc) and esi with 1. edi and esi are the (32-bit versions of the) first two function argument registers. We've set up the arguments to wrapping_add, which we're now ready to call using call. As we learned above, call pushes the address of the next instruction on to the stack and jumps to the label provided - in our case, wrapping_add.

The wrapping_add Function

Now we enter the wrapping_add function:

    sub     rsp, 2                     ;; make room on stack
    add     dil, sil                   ;; do the addition
    mov     byte ptr [rsp + 1], dil    ;; copy answer to stack - 1
    mov     al, byte ptr [rsp + 1]     ;; copy that value back to `al`
    mov     byte ptr [rsp], al         ;; copy `al` to top of stack
    mov     al, byte ptr [rsp]         ;; copy that back to `al`
    add     rsp, 2                     ;; restore the stack pointer
    ret                                ;; jump back

The first thing we'll do is the function's prologue: sub rsp, 2 which will subtract 2 from rsp (the stack pointer) and store this new value in rsp . We're going to use 2 bytes of the stack in this function, so we're making room.

Next, the reason our function was called in the first place happens: the two function arguments dil and sil get added together.

What happens after this is a bit strange, and I'm not sure why this code got generated. With mov byte ptr [rsp + 1], dil, dil gets copied to the location rsp + 1 (i.e., one below the top byte of the stack). Remember the stack grows down so adding 1 will get us 1 position below the top position of the stack. Then with mov al, byte ptr [rsp + 1], we turn around and copy that byte into al (one of the 8-bit registers inside of rax). Then, strangely, we do the same dance again, this time at the top of the stack. We've essentially used 4 instructions to copy the value from dil into al. Why this code was generated this way, I'm not sure, though I suspect it's because the compiler/LLVM need additional passes to eliminate the code and in debug mode they skip this.

At any rate, System V calling convention dictates that return values are found in rax . Since the calculation of our addition is now found in rax 's 8-bit variant al , we're done!

Finally, in the epilogue, we restore rsp back to what it was before the prologue by adding 2 to it, and then calling ret which will pop the return address off the stack and jump to it. If this function seemed a bit wasteful, it was, but it's over!

Finishing Up

We have all the tools in our toolbox to understand the rest of the example::inc function:

    mov     byte ptr [rsp + 7], al    ;; move return value to 8th byte in stack
    mov     al, byte ptr [rsp + 7]    ;; move that value back to `al`
    pop     rcx                       ;; epilogue: pop top of stack
    ret                               ;; return

The call to wrapping_add ended with the result in al. For some reason (probably similar to what happened in wrapping_add) we copy al to the 8th byte of the stack and then immediately copy it back to al.

Finally, we must complete our epilogue and undo what we did in the prologue, namely pop off the top of the stack. I believe we pop into rcx because it's not being used. We could have popped off the stack into another unused register and things would still work. Finally we return!

We're done! 🎉 We just did a lot in order to add one to a number. Surely, we can do better, right? Turns out we can, by increasing the level of optimization.

INC - release

The Godbolt compiler explorer uses rustc directly and does not turn on any optimizations. If you're familiar with Rust, then you know that by default Rust doesn't do a lot of optimization. Usually with Cargo we would add the --release flag and our code would get optimized (at the expense of longer compilation times), but with rustc we have to pass a different flag -C opt-level=3 which tells rustc to apply the maximum level of optimization. In the compiler explorer, we can pass this flag in the “compiler options” box. Doing so, we should see dramatically different output:

    example::inc:
            lea     eax, [rdi + 1]
            ret

Wow! We now only have essentially 1 instruction (plus ret to return from our function). lea eax, [rdi + 1] does everything we need. lea stands for “load effective address”, and here it's been used in a way that's not really in line with its name. I believe the “normal” use of lea is to load an address into a destination register. Clearly rdi + 1 is not an address, but that's ok, it still gets the job done. It simply takes the contents of rdi which we know is the argument to our example::inc function, adds 1 to it, and then stores that into eax where our return value is expected to be.
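It may not be obvious that adding 1 to the full rdi register and keeping the result in eax gives the same answer as an 8-bit wrapping add. The sketch below models the instruction in Rust under the assumption that the caller only reads the low byte (al) of the result; inc_via_lea is a hypothetical name. Because only the low 8 bits are read back, any carry past bit 7 is simply discarded, which is exactly what wrapping means.

```rust
/// Model of `lea eax, [rdi + 1]` followed by the caller reading only `al`.
fn inc_via_lea(rdi: u64) -> u8 {
    // lea computes rdi + 1 and keeps the low 32 bits in eax.
    let eax = rdi.wrapping_add(1) as u32;
    // A u8 return value lives in al, the low 8 bits of eax.
    eax as u8
}

fn main() {
    // Compare the model against wrapping_add for every possible u8 input.
    let all_match = (0..=255u8).all(|n| inc_via_lea(n as u64) == n.wrapping_add(1));
    println!("model matches wrapping_add for all inputs: {}", all_match);
}
```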

We're done in 1 instruction. 🎉

Conclusion

That was a jam-packed first look at the x86-64 assembly that Rust produces in both debug and release mode. Hopefully you've learned some neat x86-64 instructions, a bit about the System V calling convention, and about the various ways that non-optimized Rust code does funny things. If you enjoyed this, please let me know!