Branch Delay Slots are one of the awkward features of RISC architectures. RISC CPUs are pipelined by definition, so while the current instruction is in execution, the following instruction(s) will be in the pipeline already. If there is for example a conditional branch in the instruction stream, the CPU cannot know whether the next instruction is the one following the branch or the instruction at the target location until it has evaluated the branch. This would cause a bubble in the pipeline; therefore some RISC architectures have a branch delay slot: The instruction after the branch will always be executed, no matter whether the branch is taken or not.

So in practice, you can put the instruction that would be before the branch right after the branch, if this instruction is independent of the branch instruction, i.e. doesn’t access the same registers. Otherwise, you can fill it with a NOP. Out-of-order architectures can do this reordering at runtime, so there would be no need for a delay slot. Nevertheless, the delay slot is a feature of the architecture, not the implementation.

Some RISCs like PowerPC and ARM do not have a delay slot, but for example MIPS, SPARC, PA-RISC have it. But there are some variations: MIPS and PA-RISC have an annihilation/nullify/likely bit in the instruction, so the programmer can choose that the instruction in the delay slot only gets executed if the branch is taken.

Other CPUs, like SPARC, PA-RISC or the ill-fated Motorola M88K have optional delay slots: The programmer can set a bit in the opcode if he cannot come up with a good instruction for the delay slot, and save the wasted NOP in the program code this way – the CPU will put a bubble into the pipeline.

Now the interesting question is what happens if the branch and the delay instruction are not independent. What if the delay instruction writes r5 and the branch jumps to r5? What if it’s a branch-and-link, and the delay instruction modifies the link register? On MIPS, this is illegal, and undefined.

In practice, MIPS won’t halt and catch fire though. As you would expect from the design of a CPU pipeline, the CPU basically executes the branch and the delay instruction in order, as they are stored in the instruction stream, and it only delays the write to PC, i.e. the actual jump until after the delay instruction. So, for example, if you modify a register that the branch depends on, it will not influence the branch, but be in effect after the jump.

On the aforementioned Motorola M88K, this behaviour is documented, and GCC even makes use of it:

820: 7d ad 00 08 cmp r13,r13,0x08 ; compare 824: d4 6d 00 05 bb0.n 0x03,r13,0x834 ; cond. branch 828: 63 df 00 00 addu r30,r31,0 ; delay slot 82c: cc 00 00 7f bsr.n 0xa28 ; function call 830: 60 21 01 ac addu r1,r1,0x1ac ; delay slot: fix up link 834: 00 00 00 00 nop

The first three instructions are a compare/branch sequence. If the branch is taken, execution will continue at 0x834, otherwise at 0x82c, after the delay slot. The delay instruction is independent of the branch, nothing special here yet.

But now look at the following two instructions: 0x82c and 0x830 are not independent. r1 is the M88K’s link register, so the “bsr” writes the addres of the following instruction after the delay slot (0x834) into r1. The delay slot also writes into r1: It adds 0x1ac to it. These instructions are executed in order, but the actual branch to 0xa28 will only be done after the delay instruction.

So what these two instructions effectively do is call a function, but set up the return address to skip the next 0x1ac bytes (107 instructions) after return. If the conditional branch at 0x824 is taken, the code at 0x834 will be executed, otherwise, 0xa28 is called, and the “taken” case of the conditional branch is skipped. This trick can be used whenever you have C code with an “if” statement of which one case is a single function call:

if (...) { call_something(); } else { [...] }