To provide perhaps a clearer example, on x86_64, compiled with the -O flag, the function

pub fn leet(a : i128) -> i128 { a + 1337 }

compiles to

example::leet: mov rdx, rsi mov rax, rdi add rax, 1337 adc rdx, 0 ret

(My original post had u128 rather than the i128 you asked about. The function compiles the same code either way, a good demonstration that signed and unsigned addition are the same on a modern CPU.)

The other listing produced unoptimized code. It’s safe to step through in a debugger, because it makes sure you can put a breakpoint anywhere and inspect the state of any variable at any line of the program. It’s slower and harder to read. The optimized version is much closer to the code that will actually run in production.

The parameter a of this function is passed in a pair of 64-bit registers, rsi:rdi. The result is returned in another pair of registers, rdx:rax. The first two lines of code initialize the sum to a .

The third line adds 1337 to the low word of the input. If this overflows, it carries the 1 in the CPU’s carry flag. The fourth line adds zero to the high word of the input—plus the 1 if it got carried.

You can think of this as simple addition of a one-digit number to a two-digit number

a b + 0 7 ______

but in base 18,446,744,073,709,551,616. You’re still adding the lowest “digit” first, possibly carrying a 1 to the next column, then adding the next digit plus the carry. Subtraction is very similar.

Multiplication must use the identity (2⁶⁴a + b)(2⁶⁴c + d) = 2¹²⁸ac + 2⁶⁴(ad+bc) + bd, where each of these multiplications returns the upper half of the product in one register and the lower half of the product in another. Some of those terms will be dropped, because bits above the 128th don’t fit into a u128 and are discarded. Even so, this takes a number of machine instructions. Division also takes several steps. For a signed value, multiplication and division would additionally need to convert the signs of the operands and the result. Those operations are not very efficient at all.

On other architectures, it gets easier or harder. RISC-V defines a 128-bit instruction-set extension, although to my knowledge no one has implemented it in silicon. Without this extension, the RISC-V architecture manual recommends a conditional branch: addi t0, t1, +imm; blt t0, t1, overflow