If you've ever looked at the disassembly output for C or C++ code, you've probably noticed that there are a lot of push/pop instructions. And if you pay close enough attention, you'll notice that the compiler prefers certain registers over others. In particular, compilers prefer pushing "old" registers like RBP (i.e. the ones available on 32-bit x86 CPUs) over "new" registers like R15 (which aren't available in 32-bit mode).

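The C calling conventions on x86 systems specify that callees need to save certain registers. There are a few different names for these kinds of registers, such as nonvolatile registers, callee-saved registers, and so on. The exact set depends on the platform: the chart below matches the Microsoft x64 convention, whereas the System V AMD64 ABI used on Linux and macOS treats RSI and RDI as volatile.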

If a function needs to use some registers, it's best to use the volatile registers, since they don't require additional push/pop instructions. However, there are only a few volatile registers. If a function needs additional registers, it will have to dip into the nonvolatile set. These nonvolatile registers must be pushed on function entry and popped on function exit. So that explains the first part: why these registers are pushed/popped at all.

But what about the second part: why does the compiler prefer pushing/popping the old registers RBX, RBP, RDI, RSI, and RSP over the new registers R12, R13, R14, and R15?

The answer lies in the historical legacy of how instruction encoding worked on 32-bit systems. First let's look at a chart showing the different general purpose registers, and their characteristics:

| Number | Register | Volatile? | Old/New |
|--------|----------|-----------|---------|
| 0 | RAX | Yes | Old |
| 1 | RCX | Yes | Old |
| 2 | RDX | Yes | Old |
| 3 | RBX | No | Old |
| 4 | RSP | No | Old |
| 5 | RBP | No | Old |
| 6 | RSI | No | Old |
| 7 | RDI | No | Old |
| 8 | R8 | Yes | New |
| 9 | R9 | Yes | New |
| 10 | R10 | Yes | New |
| 11 | R11 | Yes | New |
| 12 | R12 | No | New |
| 13 | R13 | No | New |
| 14 | R14 | No | New |
| 15 | R15 | No | New |

Since the designers of x86 knew that these registers were going to be pushed/popped all the time, they wanted to try to make the push/pop instructions really compact. So they reserved one-byte instruction encodings to push/pop every register. This is pretty unusual: there aren't too many instructions that can be encoded with a single byte. The one-byte instruction encodings are only used for the most common instructions.

To push a register, you take the number in the chart above and add it to 0x50. So if you want to push RSP, the instruction is 0x54, which is 0x50 + 4.

To pop a register, you take the number in the chart above and add it to 0x58. So if you want to pop RSP, the instruction is 0x5c, which is 0x58 + 4.
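To make the arithmetic concrete, here's a minimal Python sketch of the two rules above. The helper names `encode_push_old` and `encode_pop_old` are made up for illustration, and the sketch only covers the original eight registers.

```python
# Register numbers follow the chart above: a register's index in this list
# is its register number. Only the original eight registers are covered.
OLD_REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

PUSH_BASE = 0x50  # push opcodes occupy 0x50..0x57
POP_BASE = 0x58   # pop opcodes occupy 0x58..0x5f


def encode_push_old(reg: str) -> bytes:
    """Return the one-byte push encoding for one of the original registers."""
    return bytes([PUSH_BASE + OLD_REGISTERS.index(reg)])


def encode_pop_old(reg: str) -> bytes:
    """Return the one-byte pop encoding for one of the original registers."""
    return bytes([POP_BASE + OLD_REGISTERS.index(reg)])


print(encode_push_old("rsp").hex())  # 54
print(encode_pop_old("rsp").hex())   # 5c
```

Passing a name that isn't in the list raises `ValueError`, which is fine for a sketch: at this point in the story there simply is no encoding for anything past register 7.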

As you can see, they only reserved space for eight registers when pushing; the same is true when popping. This makes sense, because at the time there were only eight general purpose registers. However, this is a problem because no space was reserved for the higher numbers.

When they designed the 64-bit version of x86 they came up with a clever solution for this problem, aimed at keeping backwards compatibility with 32-bit systems. They added new instruction prefixes (called REX prefixes) to indicate that certain fields refer to the extended registers. The details of how this works are a bit complicated, but for a push or pop instruction the prefix 0x41 means that the register is one of the new ones, and eight is subtracted from the register number when encoding.

Here's an example. Suppose we want to push R9. The first byte of the instruction is 0x41. The second byte is 0x50 + (9 - 8) = 0x51. Thus the full encoding will be the two bytes 0x41 0x51.

Suppose we want to pop R14. The encoding will be 0x41 followed by 0x58 + (14 - 8) = 0x5e. Thus the fully encoded instruction will be the two bytes 0x41 0x5e.
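Here's a rough Python sketch of the whole scheme, covering all sixteen registers. The function names are hypothetical, and it only handles the plain 64-bit register push/pop forms discussed here, not the other push/pop variants.

```python
# A register's index in this list is its number in the chart above.
REGISTERS = [
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
]

PUSH_BASE = 0x50
POP_BASE = 0x58
EXTENDED_PREFIX = 0x41  # prefix byte that selects the "new" registers


def encode(opcode_base: int, reg: str) -> bytes:
    """Encode a push or pop of `reg`, adding the prefix only when needed."""
    number = REGISTERS.index(reg)
    if number < 8:
        # Original registers fit in the one-byte encoding.
        return bytes([opcode_base + number])
    # New registers: prefix byte, then the opcode with (number - 8).
    return bytes([EXTENDED_PREFIX, opcode_base + (number - 8)])


def encode_push(reg: str) -> bytes:
    return encode(PUSH_BASE, reg)


def encode_pop(reg: str) -> bytes:
    return encode(POP_BASE, reg)


print(encode_push("r9").hex(" "))   # 41 51
print(encode_pop("r14").hex(" "))   # 41 5e
print(encode_push("rbp").hex(" "))  # 55
```

The separator argument to `bytes.hex` needs Python 3.8 or newer.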

As you can see, the original eight registers have more compact encodings: they can each be pushed and popped with a single byte, whereas the new registers require two bytes to push/pop. This applies to certain other instructions too, not just push/pop. The actual time it takes to execute a push/pop is the same either way, so no CPU cycles are actually saved. But smaller instructions mean a slightly smaller executable, less data for the decoding pipeline to process, and instructions that are more likely to stay in cache. So if you can, it's better to use the old registers: you'll save a byte for each push/pop.
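To put a number on that, here's a small Python sketch that adds up the bytes a function spends saving and restoring a set of registers, assuming one push on entry and one pop on exit per register. The name `save_restore_bytes` is invented for this example.

```python
OLD = {"rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"}
NEW = {"r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"}


def save_restore_bytes(regs: list[str]) -> int:
    """Total bytes of push/pop instructions needed to save and restore `regs`."""
    total = 0
    for reg in regs:
        if reg in OLD:
            size = 1  # single-byte push/pop
        elif reg in NEW:
            size = 2  # prefix byte plus opcode byte
        else:
            raise ValueError(f"unknown register: {reg}")
        total += 2 * size  # one push on entry, one pop on exit
    return total


print(save_restore_bytes(["rbx", "rbp", "rsi"]))  # 6
print(save_restore_bytes(["r12", "r13", "r14"]))  # 12
```

Three saved registers cost 6 bytes with the old set and 12 with the new set, which is the one byte per push/pop described above.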