Hello and a happy new year to everyone,

as some of you may be aware I gave a summary talk on inline assembly at the Rust Cologne Meetup in June 2017 (recording, slides). One reason for that was getting information to the Rust community to start a proper discussion on this (which I mostly failed to do, due to being preoccupied). The other reason was getting myself motivated to actually do the research, so I could come up with an RFC.

So this is a first draft of that RFC. It proposes an inline assembly syntax somewhat similar to what is available in gcc and clang, but in my opinion more readable and easier to remember.

Feedback and suggestions are very welcome.

Summary

Define a stable syntax for inline assembly, meant to be portable among various backends and architectures.

Motivation

In systems programming some tasks require dropping down to the assembly level. The primary reasons are performance, precise timing, and low-level hardware access. Using inline assembly for this is sometimes convenient, and sometimes necessary to avoid function call overhead.

The inline assembler syntax currently available in nightly Rust is very ad-hoc. It provides a thin wrapper over the inline assembly syntax available in LLVM IR. For stabilization a more user-friendly syntax that lends itself to implementation across various backends is preferable.

Guide-level explanation

Rust provides support for inline assembly via the asm! macro. It can be used to embed handwritten assembly in the assembly output generated by the compiler. Generally this should not be necessary, but might be where the required performance or timing cannot be otherwise achieved. Accessing low level hardware primitives, e.g. in kernel code, may also demand this functionality.

Let us start with the simplest possible example:

unsafe { asm!("nop"); }

This will insert a NOP (no operation) instruction into the assembly generated by the compiler. Note that all asm! invocations have to be inside an unsafe block, as they could insert arbitrary instructions and break various invariants. The instructions to be inserted are listed in the first argument of the asm! macro as a string literal.

Now inserting an instruction that does nothing is rather boring. Let us do something that actually acts on data:

let x: u32;
unsafe {
    asm!("movl $5, {}", out(reg) x);
}

This will write the value 5 into the u32 variable x. You can see that the string literal we use to specify instructions is actually a template string. It is governed by the same rules as Rust format strings. The arguments that are inserted into the template, however, look a bit different than you may be familiar with. First we need to specify if the variable is an input or an output of the inline assembly. In this case it is an output. We declared this by writing out. We also need to specify in what kind of location the assembly expects the variable. This is called a constraint specification. In this case we put it in an arbitrary general purpose register by specifying reg. We could also have said mem, telling the compiler the assembly expects a memory location for this argument. The compiler will choose an appropriate register or memory location to insert into the template and read the variable from there after the inline assembly.

Let us see another example that also uses an input:

let i: u32 = 3;
let o: u32;
unsafe {
    asm!("
        movl {0}, {1};
        addl {number}, {1};
        ", in(reg) i, out(reg) o, number = in(imm) 5);
}

This will add 5 to the input in variable i and write the result to variable o. The particular way this assembly does this is first copying the value from i to the output, and then adding 5 to it.

The example shows a few things:

First, we can see that inputs are declared by writing in instead of out.

Second, one of our input operands has a constraint specification we haven’t seen yet, imm. This tells the compiler to expand this argument to an immediate inside the assembly template. This is only possible for constants and literals.

Third, we can see that we can specify an argument number or name, as in any format string. For inline assembly templates this is particularly useful as arguments are often used more than once. For more complex inline assembly using this facility is generally recommended, as it improves readability and allows reordering instructions without changing the argument order.
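Since the template follows the format string rules, the substitution step can be imitated with the ordinary format! macro. The following is only a sketch of that step: the function name and the operand strings are made up for illustration, and the register names stand in for whatever locations the compiler would actually choose.

```rust
/// Expands the template of the example above the way a format string is
/// expanded. `input`, `output`, and `number` are hypothetical operand
/// strings; a real compiler would pick the registers itself.
fn expand_add_template(input: &str, output: &str, number: &str) -> String {
    format!(
        "movl {0}, {1}; addl {number}, {1};",
        input,
        output,
        number = number
    )
}
```

Calling expand_add_template("%ecx", "%eax", "$5") shows why indexed and named placeholders are handy: the output operand appears twice in the template but is only passed once.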

In some cases we need an argument to be both an input and an output:

let mut bytes: u32 = 0x01_02_03_04;
unsafe {
    asm!("bswap {}", inout(reg) bytes);
}
assert_eq!(bytes, 0x04_03_02_01);

This example uses the bswap instruction to swap the byte order of the bytes variable. We can see that inout is used to specify an argument that is both input and output. This is different from specifying an input and output separately in that it is guaranteed to assign both to the same register or memory location.
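For checking the expected result without any assembly, the same byte swap can be written with the standard library's u32::swap_bytes. This is only a portable reference for what the example computes, not part of the proposal; the function name is made up.

```rust
/// Portable reference for the bswap example: reverses the byte order of a
/// 32-bit value. `bswap_reference` is a hypothetical helper name.
fn bswap_reference(bytes: u32) -> u32 {
    bytes.swap_bytes()
}
```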

The Rust compiler is conservative with its allocation of operands. It is assumed that an out can be written at any time, and can therefore not share its location with any other argument. However, to guarantee optimal performance it is important to use as few registers as possible, so they won’t have to be saved and reloaded around the inline assembly block. To achieve this Rust provides a lateout specifier. This can be used on any output that is guaranteed to be written only after all inputs have been consumed. There is also an inlateout variant of this specifier.

Some instructions require that the operands be in a specific register. Therefore, Rust inline assembly provides some more specific constraint specifiers. While reg, mem, and imm will be available on any architecture, these are highly architecture specific. Usually a specifier for each register class and each register will be provided. E.g. for x86 the general purpose registers eax, ebx, ecx, edx, esp, ebp, esi, and edi, among others, can be addressed by their name.

unsafe { asm!("out {}, $0x64", in(eax) cmd); }

In this example we call the out instruction to output the content of the cmd variable to port 0x64. Since the out instruction only accepts eax (and its subregisters) as operand, we had to use the eax constraint specifier.

It is somewhat common that instructions have operands that are not explicitly listed in the assembly (template). Hence, unlike in regular formatting macros, we support excess arguments:

fn mul(a: u32, b: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        asm!("mul {}", in(reg) a, in(eax) b, lateout(eax) lo, lateout(edx) hi);
    }
    // Parentheses are required: `+` binds tighter than `<<` in Rust.
    ((hi as u64) << 32) + lo as u64
}

This uses the mul instruction to multiply two 32-bit inputs with a 64-bit result. The only explicit operand is a register that we fill from the variable a. The second, implicit operand is the eax register, which we fill from the variable b. The lower 32 bits of the result are stored in eax, from which we fill the variable lo. The higher 32 bits are stored in edx, from which we fill the variable hi.
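The way the two halves are recombined can be checked in plain Rust. The sketch below models the register contents after mul with ordinary 64-bit arithmetic; both function names are made up for illustration. Note that in Rust + binds tighter than <<, so the shift has to be parenthesized when the halves are put back together.

```rust
/// Models what `mul` leaves in the registers: the first element stands for
/// eax (lower half), the second for edx (upper half). Hypothetical helper.
fn mul_halves(a: u32, b: u32) -> (u32, u32) {
    let full = u64::from(a) * u64::from(b);
    let lo = full as u32; // truncates to the lower 32 bits
    let hi = (full >> 32) as u32; // upper 32 bits
    (lo, hi)
}

/// Recombines the two halves into the 64-bit product.
fn combine(lo: u32, hi: u32) -> u64 {
    (u64::from(hi) << 32) | u64::from(lo)
}
```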

In many cases inline assembly will modify state that is not given as output. Usually this is either because we have to use a scratch register in the assembly, or instructions modify state that we don’t need to further examine. This state is generally referred to as being “clobbered”. We need to tell the compiler about this since it may need to save and restore this state around the inline assembly block.

let ebx: u32;
let ecx: u32;
unsafe {
    asm!("
        movl $4, %eax;
        xorl %ecx, %ecx;
        cpuid;
        ", out(ebx) ebx, out(ecx) ecx, clobber(eax, edx));
}
println!(
    "L1 Cache: {}",
    ((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1) * ((ebx & 0xfff) + 1) * (ecx + 1)
);

We specify the clobbered state via a clobber argument following all inputs and outputs. In the example above we use the cpuid instruction to get the L1 cache size. This instruction writes to eax, ebx, ecx, and edx, but for the cache size we only care about the contents of ebx and ecx. Hence, we declare those as outputs, while declaring the other registers as clobbers.

Clobber specifications are generally architecture specific. The only clobber specification that is always available is mem, meaning memory that is not specified as output is being written. Other than that all architecture registers are usually available by name.
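The arithmetic in the println! call can be separated from the assembly and tested on its own. The sketch below assumes the field layout Intel documents for cpuid leaf 4 (ways, partitions, line size, and sets, each stored minus one); the function and variable names are made up, and the computation is widened to u64 so the product cannot overflow.

```rust
/// Computes the cache size in bytes from the ebx and ecx values returned by
/// cpuid leaf 4. Field names follow Intel's description of that leaf and are
/// an assumption of this sketch, not something the text above spells out.
fn cache_size_bytes(ebx: u32, ecx: u32) -> u64 {
    let ways = u64::from(ebx >> 22) + 1; // bits 31..22
    let partitions = u64::from((ebx >> 12) & 0x3ff) + 1; // bits 21..12
    let line_size = u64::from(ebx & 0xfff) + 1; // bits 11..0
    let sets = u64::from(ecx) + 1;
    ways * partitions * line_size * sets
}
```

For instance, 8 ways, 1 partition, 64-byte lines, and 64 sets yield 32 KiB.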

When we said earlier that the asm!("nop") statement would insert a nop instruction, that was actually not the whole truth. Rust’s asm! macro is designed to allow optimization. This is another reason inputs and outputs need to be known to the compiler. If outputs of the inline assembly block are never read, or there are no outputs, the inline assembly block may be optimized away. Also, if inputs don’t change across multiple invocations of an inline assembly block, the compiler may assume it always yields the same result, only executing it once.

In some cases this may not be what we want. For example we may want to clear the interrupt flag on an x86 system:

unsafe { asm!("cli", flags(volatile)); }

As you can see in the example we do this using the cli instruction. However, this instruction has no output. We only run it for the side-effect. To avoid deletion of this inline assembly block by the optimizer we specify the volatile flag.

Flags can be provided as an optional final argument to the asm! macro. For now the only generally available flag is volatile, which enforces that the inline assembly block is always executed. However, there may be other architecture specific flags. E.g. on x86 the intelsyntax flag is provided to switch from AT&T to Intel assembly syntax.

Reference-level explanation

Inline assembler is implemented as a macro asm!(). The first argument to this macro is a template used to build the final assembly. The following arguments specify input and output operands. When required, clobbers and flags are specified as the final two arguments.

The assembler template uses the same syntax as format strings. I.e. placeholders are specified by curly braces. The corresponding arguments are accessed in order, by index, or by name. Future revisions may also use the format_spec to specify what LLVM calls template argument modifiers. However, this initial proposal elides this, as it is not necessary for inline assembly to be useful.

The following ABNF specifies the general syntax:

dir_spec := "in" / "out" / "lateout" / "inout" / "inlateout"
constraint_spec := "reg" / "mem" / "imm" / <arch specific>
operand := [ident "="] dir_spec "(" constraint_spec ")" expr
clobber_spec := "mem" / <arch specific>
clobber := "clobber(" clobber_spec *("," clobber_spec) ")"
flag := "volatile" / <arch specific>
flags := "flags(" flag *("," flag) ")"
asm := "asm!(" format_string *("," operand) ["," clobber] ["," flags] ")"
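To make the operand rule of the grammar concrete, here is a minimal sketch of a data model that renders an operand back into the proposed syntax. The type and field names are made up for illustration and the constraint is kept as a plain string, so architecture specific names pass through unchanged.

```rust
use std::fmt;

/// The five direction specifications of the grammar.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dir { In, Out, LateOut, InOut, InLateOut }

/// One `operand` of the grammar; all names here are hypothetical.
struct Operand<'a> {
    name: Option<&'a str>, // the optional `ident "="` prefix
    dir: Dir,
    constraint: &'a str,   // "reg", "mem", "imm", or an arch specific name
    expr: &'a str,
}

impl fmt::Display for Dir {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str(match *self {
            Dir::In => "in",
            Dir::Out => "out",
            Dir::LateOut => "lateout",
            Dir::InOut => "inout",
            Dir::InLateOut => "inlateout",
        })
    }
}

impl<'a> fmt::Display for Operand<'a> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        if let Some(name) = self.name {
            write!(f, "{} = ", name)?;
        }
        write!(f, "{}({}) {}", self.dir, self.constraint, self.expr)
    }
}
```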

Direction specification

The direction specification indicates in what way the operand is being used by the generated assembly.

Five kinds of operands are supported:

in : input operand; may be read at any time; may not be written

out : output operand; may not be read; may be written at any time

lateout : output operand; may not be read; may only be written after all inputs were consumed

inout : input and output operand; may be read at any time; may be written at any time

inlateout : input and output operand; may be read at any time; may only be written after all inputs were consumed
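The five kinds differ in only three properties, which can be written down as a small table in code. This is a sketch with made-up method names; it only restates the rules listed above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dir { In, Out, LateOut, InOut, InLateOut }

impl Dir {
    /// The assembly may read the operand.
    fn is_read(self) -> bool {
        matches!(self, Dir::In | Dir::InOut | Dir::InLateOut)
    }

    /// The assembly may write the operand.
    fn is_written(self) -> bool {
        !matches!(self, Dir::In)
    }

    /// The write may happen before all inputs were consumed, so the location
    /// must not be reused for an unrelated input.
    fn written_early(self) -> bool {
        matches!(self, Dir::Out | Dir::InOut)
    }
}
```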



The expr given with an output must resolve to a mutable or uninitialized location.

Constraint specification

The constraint specification indicates which kind of operand the assembly template requires in the operand's position.

Across platforms three constraint specifications are supported:

reg : the operand is placed in a general purpose register

mem : the operand is placed in a memory location

imm : the operand is an immediate

All other constraint specifications are defined per architecture. It is suggested that one exist for at least each physical register and register class (e.g. floating point register, 128-bit vector register). Names should be descriptive rather than single letter abbreviations. I.e. prefer for example float over f and xmm_vector over x.

Clobber specification

The clobber specification is used to indicate what state is being modified apart from the outputs. The mem clobber specification is always available. It indicates that arbitrary memory is being modified.

All other clobber specifications are defined per architecture. It is suggested that one exist for at least each physical register.

Flags

Flags are used to further influence the behaviour of the inline assembly block. The only flag defined at this point in time is volatile. The volatile flag indicates that the inline assembly block may have side-effects not indicated by inputs, outputs, or clobbers (i.e. it may not be optimized away).

Other flags can be defined per architecture. An intelsyntax flag for the x86 architecture should be provided.

Mapping to LLVM IR

The direction specification maps to an LLVM constraint specification as follows (using a register operand as an example):

in(reg) => r

out(reg) => =&r (Rust’s outputs are early-clobber outputs in LLVM/GCC terminology)

inout(reg) => =&r,0 (an early-clobber output with an input tied to it, 0 here is a placeholder for the position of the output)

lateout(reg) => =r (Rust’s late outputs are regular outputs in LLVM/GCC terminology)

inlateout(reg) => =r,0 (cf. inout and lateout)

As written this RFC requires architectures to map from Rust constraint specifications to LLVM constraint codes. This is in part for better readability on Rust’s side and in part for independence of the backend:

reg is mapped to r

mem is mapped to m

a register name r1 is mapped to {r1}

additionally mappings for register classes are added as appropriate (cf. llvm-constraint)

For clobber specifications the following mappings apply:

mem is mapped to ~{memory}

a register name r1 is mapped to ~{r1} (cf. llvm-clobber)

The volatile flag is mapped to adding the sideeffect keyword to the LLVM asm statement. The intelsyntax flag is mapped to adding the inteldialect keyword to the LLVM asm statement.
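The mappings above are mechanical enough to sketch as code. The following is not how any compiler implements it, just a per-operand illustration with made-up names. It mirrors the notation used in the lists above (e.g. =&r,0 for an inout) and does not attempt to order a complete LLVM constraint list. The imm specification is left out because the text gives no mapping for it.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dir { In, Out, LateOut, InOut, InLateOut }

/// Constraint specifications covered by the mapping lists in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Spec<'a> { Reg, Mem, Named(&'a str) }

/// Maps a Rust constraint specification to an LLVM constraint code.
fn llvm_location(spec: Spec) -> String {
    match spec {
        Spec::Reg => "r".to_string(),
        Spec::Mem => "m".to_string(),
        Spec::Named(reg) => format!("{{{}}}", reg), // e.g. eax becomes {eax}
    }
}

/// Combines direction and location. `output_index` is the position of the
/// output that the input half of an inout/inlateout is tied to; it is
/// ignored for the other directions.
fn llvm_constraint(dir: Dir, spec: Spec, output_index: usize) -> String {
    let loc = llvm_location(spec);
    match dir {
        Dir::In => loc,
        Dir::Out => format!("=&{}", loc), // early-clobber output
        Dir::LateOut => format!("={}", loc), // regular output
        Dir::InOut => format!("=&{},{}", loc, output_index),
        Dir::InLateOut => format!("={},{}", loc, output_index),
    }
}

/// Maps a clobber specification to an LLVM clobber constraint.
fn llvm_clobber(name: &str) -> String {
    match name {
        "mem" => "~{memory}".to_string(),
        reg => format!("~{{{}}}", reg),
    }
}
```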

Drawbacks

Unfamiliarity

This RFC proposes a completely new inline assembly format. It is not possible to just copy examples of gcc-style inline assembly and re-use them. There is however a fairly trivial mapping between the gcc-style and this format that could be documented to alleviate this.

The clobber example above would look like this in gcc-style inline assembly:

int ebx, ecx;
asm (
    "mov $4, %%eax;"
    "xor %%ecx, %%ecx;"
    "cpuid;"
    "mov %%ebx, %0;"
    : "=r"(ebx), "=c"(ecx) // outputs
    : // inputs
    : "eax", "ebx", "edx" // clobbers
);
printf("L1 Cache: %i\n", ((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1) * ((ebx & 0xfff) + 1) * (ecx + 1));

Rationale and alternatives

Implement an embedded DSL

Both MSVC and D provide what is best described as an embedded DSL for inline assembly. It is generally close to the system assembler’s syntax, but augmented with the ability to directly access variables that are in scope.

// This is D code
int ebx, ecx;
asm {
    mov EAX, 4;
    xor ECX, ECX;
    cpuid;
    mov ebx, EBX;
    mov ecx, ECX;
}
writefln("L1 Cache: %s", ((ebx >> 22) + 1) * (((ebx >> 12) & 0x3ff) + 1) * ((ebx & 0xfff) + 1) * (ecx + 1));

// This is MSVC C++
int ebx_v, ecx_v;
__asm {
    mov eax, 4
    xor ecx, ecx
    cpuid
    mov ebx_v, ebx
    mov ecx_v, ecx
}
std::cout << "L1 Cache: " << ((ebx_v >> 22) + 1) * (((ebx_v >> 12) & 0x3ff) + 1) * ((ebx_v & 0xfff) + 1) * (ecx_v + 1) << '\n';

While this is very convenient on the user side in that it requires no specification of inputs, outputs, or clobbers, it puts a major burden on the implementation. The DSL needs to be implemented for each supported architecture, and full knowledge of the side-effect of every instruction is required.

This huge implementation overhead is likely one of the reasons MSVC only provides this capability for x86, while D at least provides it for x86 and x86_64. It should also be noted that the D reference implementation falls slightly short of supporting arbitrary assembly. E.g. the lack of access to the RIP register makes certain techniques for writing position independent code impossible.

As a stop-gap the LDC implementation of D provides an llvmasm feature that binds it closely to LLVM IR’s inline assembly.

The author believes it would be unfortunate to put Rust into a similar situation, making certain architectures a second-class citizen with respect to inline assembly.

Provide intrinsics for each instruction

In discussions it is often postulated that providing intrinsics is a better solution to the problems at hand. However, particularly where precise timing, and full control over the number of generated instructions is required intrinsics fall short.

Intrinsics are of course still useful and have their place for inserting specific instructions. E.g. making sure a loop uses vector instructions, rather than relying on auto-vectorization.

However, inline assembly is specifically designed for cases where more control is required. Also providing an intrinsic for every (potentially obscure) instruction that is needed e.g. during early system boot in kernel code is unlikely to scale.

Make the asm! macro return outputs

It has been suggested that the asm! macro could return its outputs like the LLVM statement does. The benefit is that it is clearer to see that variables are being modified. Particularly in the case of initialization it becomes more obvious what is happening. On the other hand, by necessity this splits the direction and constraint specification from the variable name, which makes this syntax overall harder to read.

fn mul(a: u32, b: u32) -> u64 {
    let (lo, hi) = unsafe {
        asm!("mul {}", in(reg) a, in(eax) b, lateout(eax), lateout(edx))
    };
    // Parentheses are required: `+` binds tighter than `<<` in Rust.
    ((hi as u64) << 32) + lo as u64
}

Unresolved questions

Clobbers

What actually can/has to be clobbered is somewhat unclear. The LLVM IR documentation claims that only explicit register constraints and ~{memory} are supported. Yet clang generates IR that has additional constraints. E.g. it will forward a cc (condition code) clobber from C inline assembly.

Flags