Field Sort

I've discovered a new, fast technique for sorting sprites by Y-position on the Commodore 64. While it doesn't beat the fastest routine known, it strikes a good balance between speed and memory usage, and it makes use of several interesting low-level programming tricks. To see how it works, please come with me down the rabbit hole of bleeding-edge C64 programming!

Background

Many early consoles and home computers, including the Commodore 64, support sprites. This is a kind of hardware acceleration that allows you to put small, movable objects on the screen. Typically you only get to place a limited number of sprites on the screen simultaneously (eight on the C64), but as the raster beam progresses down the display, you can reuse the sprite hardware many times. This technique is called sprite multiplexing, and it is what allows games to have lots of enemies—and bullets!—moving all over the place.

A game might thus maintain a set of actors, things that are supposed to move around on the screen. In preparation for every video frame, these actors have to be sorted according to their current Y-position. Then, throughout the visible portion of the frame, the game software picks actors from the sorted list and loads the correct data (e.g. coordinates and a pointer to the pixels) into the hardware sprite registers just in time for the display generator to pick them up.

Over at CSDb, there was an interesting forum thread about how to optimise the sorting step. That is, given an array of actors, compute an array of indices (actor IDs), such that the Y-coordinates of the corresponding actors are ordered from lowest (top of screen) to highest (bottom of screen).

State of the art

Now, sorting is of course a problem that has been studied a lot. If you've taken one or two courses in computer science, you know that there are several different sorting algorithms. Perhaps you remember that the best ones have a time complexity of O(n log n). Those are the linearithmic comparison sorts, and for this particular application, you'd do well to forget all about them. Big-O notation is useful for describing how algorithms behave when n grows very large, but in a C64 game, the number of on-screen objects is quite limited.

So instead, we pick a realistic scenario, and try to minimise the worst-case execution time in clock cycles, and the amount of memory required. The benchmark scenario discussed in the forum thread involves 32 actors.

Back in the 1980s, game programmers would generally opt for sorting algorithms with a good average-case execution time. One example is Ocean sort, named after the game publisher that popularised it. This is a kind of incremental bubble sort: An array containing the sort order from the previous frame is traversed, and each element is checked to see if it is out of order with respect to its predecessor. If so, the element bubbles backwards through the array until this is no longer the case. Ocean sort is often very fast in practice, because game sprites tend to remain in roughly the same order from one frame to the next. When they don't, however, the execution time blows up and the timing of the game is thrown off, which is annoying for the player.
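The incremental idea can be modelled in a few lines of Python. This is only a sketch of the logic described above, not 6502 code and not Ocean's actual routine; the names `ocean_sort`, `order` and `ypos` are made up for illustration.

```python
def ocean_sort(order, ypos):
    """Re-sort `order` (actor IDs from the previous frame) in place by ypos.

    Returns the number of swaps, a rough stand-in for the extra work done.
    """
    swaps = 0
    for i in range(1, len(order)):
        j = i
        # Bubble the element backwards while it is out of order
        # with respect to its predecessor.
        while j > 0 and ypos[order[j - 1]] > ypos[order[j]]:
            order[j - 1], order[j] = order[j], order[j - 1]
            swaps += 1
            j -= 1
    return swaps
```

When one actor has overtaken its neighbour since the last frame, a single swap is enough. When the whole order is reversed, four actors already need six swaps, and the count grows quadratically with the number of actors.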

Demo coders have different constraints than game programmers. For instance, it is often viable to dedicate large areas of memory to unrolled loops. This is known as speedcode, and it is a highly efficient optimisation technique on platforms—such as the C64—that don't have an instruction cache (or any cache at all).

The visible portion of the normal C64 display (i.e. with closed borders) comprises 200 raster lines. Sprites have a fixed height of 21 lines, and it is desirable to be able to place sprites so that they're partially hidden by the border. Therefore, a sprite-sorting routine needs to cope with a maximum of 220 different Y-positions. With such a limited range, it becomes feasible to use integer sorting algorithms, such as counting sort.

A very straightforward integer sorting algorithm is bucket sort: One simply drops the 32 actors into an array of 220 buckets, one for each Y-position. Then the buckets are visited from top to bottom, and emptied of their actor IDs, which are thus encountered in the desired order. To allow multiple actors in the same bucket, some kind of linked structure is often used.
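As a point of reference, here is a minimal Python model of such a bucket sort. It assumes that the Y-positions have already been mapped to the range 0 to 219, and it uses Python lists where a 6502 routine would use some linked structure; `bucket_sort` and `NUM_BUCKETS` are illustrative names.

```python
NUM_BUCKETS = 220  # one bucket per possible Y-position


def bucket_sort(ypos):
    """Return actor IDs ordered by Y-position (top of screen first)."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for actor_id, y in enumerate(ypos):
        buckets[y].append(actor_id)   # drop the actor into its bucket
    order = []
    for bucket in buckets:            # visit the buckets from top to bottom
        order.extend(bucket)          # empty the bucket of its actor IDs
    return order
```

For instance, `bucket_sort([100, 3, 219, 3, 0])` yields `[4, 1, 3, 0, 2]`. Note that the cost of visiting the buckets depends on the number of Y-positions rather than on the number of actors.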

What caught my eye in the forum thread was that Color Bar and Christopher Jam were discussing a new type of bucket sort where the buckets themselves are blocks of speedcode. To keep it simple, I will describe Color Bar's original idea for how to represent a bucket (which has since been optimised further):

bucket
        bcc     nextbucket      ; branch always
        lda     #0
        pha
        lda     #0
        pha
        lda     #0
        pha
        lda     #0
        pha
        lda     #0
        pha
        lda     #0
        pha
        lda     #0
        pha
        lda     #0              ; <--

The idea is to string together 220 copies of the above routine, back to back, for a total of 6820 bytes. The carry flag is kept clear, so the branches are always taken. Initially, all buckets are empty, and each branch instruction points to the next bucket.

To add an actor to a bucket, the branch instruction is modified to point to the last lda #0 instruction (marked with a comment above), and the operand of this instruction is modified to contain the actor ID. When the bucket is executed, the actor ID is pushed to the stack, and finally (with the sty instruction) the branch offset is reset to its original value, which has been preloaded into the Y register. For each new actor with the same Y-coordinate, the target of the branch instruction is modified to point to the preceding lda, where the new actor ID is stored. Since the buckets are executed in order, what ends up on the stack is indeed a list of actor IDs sorted by Y-position.

Color Bar's code for filling the buckets is also heavily optimised, although I won't dive into that routine here. Suffice it to say that the branch offsets are precalculated and kept in a table, and if an attempt is made to add more than eight actors to the same bucket, the later ones will just overwrite each other in the last (first in the code) slot. That's fine because those actors must be dropped in a later stage of the sprite multiplexer anyway, since there are only eight sprite units in the hardware.

A taken branch costs three cycles (actually there is a penalty cycle when crossing a page boundary, but we'll ignore that for the present discussion), so visiting all the buckets takes 660 cycles, if they are empty. In the worst case, all actors end up in different buckets, which adds nine cycles for each of the 32 actors, for a total of 948 cycles for this part of the algorithm.
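The arithmetic behind these figures is easy to check. The constants below are taken from the paragraph above; only the variable names are made up.

```python
BUCKETS = 220
ACTORS = 32
TAKEN_BRANCH = 3   # cycles, ignoring page-crossing penalties
PER_ACTOR = 9      # extra cycles per occupied bucket, as stated above

empty_pass = BUCKETS * TAKEN_BRANCH            # every bucket empty
worst_case = empty_pass + ACTORS * PER_ACTOR   # every actor in its own bucket
print(empty_pass, worst_case)                  # prints: 660 948
```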

But the code for filling the buckets still dominates the execution time, with a further ~1400 cycles. In Color Bar's original proposal, this code existed in 220 specialised copies, one for each bucket. The total memory usage for buckets and bucket-filling routines was on the order of 16 kB, one fourth of the total memory of the C64. Christopher Jam improved execution time further by organising the filling routines sparsely in memory in a clever way, but this caused memory usage to skyrocket well above 32 kB. In the fastest version of this routine so far, the total execution time for 32 actors is a mere 1972 cycles (worst case). Alas, interesting as it may be, this memory-hungry approach is not feasible at all for game programming.

My solution, which I've dubbed field sort, is heavily inspired by Color Bar's executable buckets, but it fits in a modest 2 kB of memory. As we shall see, however, there are some restrictions on where things must be located in the address space.

Field sort

At the heart of the new technique is the field, a string of 220 iny instructions (opcode $c8), one for each bucket. Each of these will execute in two cycles, so we have a baseline cost of 440 cycles for visiting the buckets when they are empty. As we proceed through the field, the Y register is incremented to reflect the current bucket number (i.e. Y-coordinate).

We also maintain a separate 220-byte array called the link table. This array contains the actor ID of the first actor in each bucket. The link table is initially filled with negative numbers, to indicate empty buckets. To place an actor in a bucket, we store its (non-negative) ID in the link table, while copying the old value from the link table into an array of next-pointers indexed by actor ID. In this way, we build a linked list of all the actors that share a bucket. The list can be traversed by starting from the actor ID obtained from the link table, and following the next-pointers until a negative number is encountered, which marks the end.
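The data structure can be sketched in Python as follows. This models only the link table and the next-pointers, not the field itself. The array names `ylink` and `actorlink` follow the labels in the assembly listings further down, while `build_lists`, `bucket_contents` and the sentinel value -1 are assumptions made for the sake of the example.

```python
NUM_BUCKETS = 220
EMPTY = -1  # any negative number marks an empty bucket or the end of a list


def build_lists(ypos):
    ylink = [EMPTY] * NUM_BUCKETS         # first actor in each bucket
    actorlink = [EMPTY] * len(ypos)       # next-pointer for each actor
    for actor_id, y in enumerate(ypos):
        actorlink[actor_id] = ylink[y]    # the old head becomes our successor
        ylink[y] = actor_id               # we become the new head
    return ylink, actorlink


def bucket_contents(y, ylink, actorlink):
    ids = []
    actor_id = ylink[y]
    while actor_id >= 0:                  # a negative value ends the list
        ids.append(actor_id)
        actor_id = actorlink[actor_id]
    return ids
```

With `ypos = [10, 20, 10, 10]`, bucket 10 contains `[3, 2, 0]`: the most recently added actor sits at the head of the list, so actors that share a Y-coordinate come out in reverse order of insertion.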

In addition to updating the linked lists of actor IDs, we modify the target bucket inside the field. The iny instruction is replaced with a jmp instruction (opcode $4c). But wait! The jmp instruction has a two-byte operand, the target address of the jump. If we inspect the field with a disassembler, after putting actors in some of the buckets, it will look something like this:

.C:fe29  C8          INY
.C:fe2a  4C C8 C8    JMP $C8C8
.C:fe2d  C8          INY
.C:fe2e  4C C8 C8    JMP $C8C8
.C:fe31  C8          INY
.C:fe32  4C 4C C8    JMP $C84C
.C:fe35  C8          INY
.C:fe36  C8          INY
.C:fe37  C8          INY
.C:fe38  4C C8 C8    JMP $C8C8
.C:fe3b  C8          INY
.C:fe3c  C8          INY
.C:fe3d  C8          INY
.C:fe3e  C8          INY
.C:fe3f  C8          INY
.C:fe40  C8          INY
.C:fe41  C8          INY

Depending on whether the two subsequent buckets are empty or non-empty, the operand is going to be interpreted as either $4c4c, $4cc8, $c84c or $c8c8. At each of these four addresses is a copy of the bucket-emptying routine:
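To make the four cases concrete, here is a small Python model that decodes the jump target from the bytes of the field. The operand is little-endian, so the byte right after the jmp opcode is the low byte of the address. The function name `jump_target` and the short eight-byte stand-in for the field are assumptions for illustration.

```python
INY, JMP = 0xC8, 0x4C


def jump_target(field, bucket):
    """Address that a jmp placed at field[bucket] would jump to."""
    assert field[bucket] == JMP
    lo, hi = field[bucket + 1], field[bucket + 2]
    return (hi << 8) | lo


# Every combination of empty (iny) and occupied (jmp) neighbours:
targets = sorted((hi << 8) | lo for lo in (INY, JMP) for hi in (INY, JMP))

field = [INY] * 8
field[1] = JMP                 # followed by two empty buckets
field[4] = field[5] = JMP      # two occupied buckets in a row
```

Here `jump_target(field, 1)` is `0xC8C8`, `jump_target(field, 4)` is `0xC84C` and `jump_target(field, 5)` is `0xC8C8`, which matches the pattern in the disassembly above. The last bucket also needs two readable bytes after it, so the model assumes that the field is followed by some padding.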

        sty     endptr+1
        lax     ylink,y
emit
        pha
        lda     actorlink,x
        bpl     emit-1
        sta     inyfield,y
        sta     ylink,y
endptr
        jmp     inyfield

inyfield is the location of the field of iny instructions, which must be aligned on a page boundary. ylink is the link table, with references to the first actor in each bucket. actorlink is the array of next-pointers indexed by actor ID. The first thing the routine does is to self-modify the last instruction to jump back into the field, to where we came from. This is possible because the Y register reflects the current bucket number.

Next, the routine obtains the ID of the first actor in this bucket, from the ylink array. lax is one of the undocumented (or “illegal”) opcodes of the 6502 processor, and it loads both the A and X registers with the same value. The actor ID is then pushed to the stack, and the corresponding next-pointer is fetched. If there was only one actor in the bucket, this number will be negative, and the branch is not taken. Time for another bit of cleverness: The end-of-list sentinel can be any negative number. Well, the iny opcode ($c8) is a negative number. By using that value as the sentinel, it will be conveniently available in the A register at this point, saving us two cycles per used bucket. The $c8 is written back into both the link table and the field, and we jump right to it. After the jump, Y will be incremented and we will proceed down the field.

If there's more than one actor in the bucket, the value from the actorlink table won't be negative, and the branch is taken. But there is no instruction for loading A and X from an address indexed by X. We have to transfer the value from A to X somehow, but we don't want to spend unnecessary cycles on doing that in the first iteration of the loop. Here, the trick is to branch to the address one byte ahead of the emit label. This will execute the last byte of the lax instruction, i.e. the most significant byte of its operand, as an instruction. By locating ylink at address $aa00, we ensure that this byte doubles as the tax instruction ($aa), which neatly solves our problem.
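The control flow of the emptying pass can be modelled in Python, with memory bytes represented as integers from 0 to 255. This is a sketch of the logic only: it ignores timing, treats the four copies of the routine as one, and assumes actor IDs below $80. The helper `fill` is a plain stand-in for the bucket-filling step, and both function names are made up.

```python
INY, JMP = 0xC8, 0x4C
NUM_BUCKETS = 220


def fill(ypos):
    field = [INY] * NUM_BUCKETS
    ylink = [INY] * NUM_BUCKETS          # $c8 doubles as the empty marker
    actorlink = [INY] * len(ypos)
    for i, y in enumerate(ypos):
        field[y] = JMP                   # mark the bucket as occupied
        actorlink[i] = ylink[y]
        ylink[y] = i
    return field, ylink, actorlink


def walk_field(field, ylink, actorlink):
    """One pass over the field; returns actor IDs in the order pushed."""
    pushed = []
    for y in range(NUM_BUCKETS):         # y plays the role of the Y register
        if field[y] == JMP:              # occupied: run the emptying routine
            actor_id = ylink[y]          # lax ylink,y
            while True:
                pushed.append(actor_id)  # pha
                nxt = actorlink[actor_id]  # lda actorlink,x
                if nxt >= 0x80:          # negative byte: bpl is not taken
                    break
                actor_id = nxt           # the tax hidden in the lax operand
            field[y] = nxt               # sta inyfield,y restores the iny
            ylink[y] = nxt               # sta ylink,y empties the bucket
    return pushed
```

For `ypos = [30, 5, 30, 219, 0]` the pass pushes `[4, 1, 2, 0, 3]`, and afterwards both the field and the link table contain nothing but $c8 again, ready for the next frame.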

Finally, let's have a look at the bucket-filling part of the algorithm. This is an unrolled loop over the actors. With the opcode for jmp ( $4c ) prepared in the X register, the following snippet is executed for each actor i:

        ldy     ypos+i
        shx     inyfield,y
        lda     ylink,y
        sta     actorlink+i
        lda     #i
        sta     ylink,y

The actor's Y-coordinate is loaded into the Y register. The jmp opcode is stored into the field at this offset. Then the linked list is updated, which is straightforward. But notice that I've used the undocumented shx instruction to store the contents of the X register at a location indexed by the Y register. That's because the ordinary stx instruction isn't available in this addressing mode. But the shx instruction is generally regarded as unstable, because it has the following interesting properties: First of all, if adding the index causes a carry into the high byte of the address, the instruction will sometimes write to the wrong address. This won't happen here, because inyfield is page-aligned. Secondly, sometimes the value written is actually a bitwise-AND between the X register and the high byte of the target address plus one. To get around this peculiar side-effect, we have to ensure that the value that gets ANDed in (the high byte of the target address plus one) doesn't have a zero-bit where our X-value ($4c) has a one-bit. There are several options, but the simplest solution is to place the field at $fe00, since ANDing with $ff is harmless.
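The condition on the high byte can be written out as a small Python check. It models only the AND behaviour described above, with the incremented high byte wrapping around at eight bits; `shx_value` and `page_is_safe` are hypothetical helper names.

```python
JMP = 0x4C


def shx_value(x, base_high):
    """Value shx may store in the unstable case: X AND (high byte + 1)."""
    return x & ((base_high + 1) & 0xFF)


def page_is_safe(base_high, x=JMP):
    """True if the AND cannot clear any bit of x for a field on this page."""
    return shx_value(x, base_high) == x
```

For example, `page_is_safe(0xFE)` holds because $fe plus one is $ff, whereas `page_is_safe(0xC0)` does not, since $c1 AND $4c is $40. The check says nothing about the other constraints on where the field can be placed.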

To summarise the memory constraints, the location of inyfield has to be chosen from a handful of addresses, where $fe00 is one possibility. ylink has to be at $aa00. And the bucket-emptying routines occupy some memory on pages $4c and $c8.

If we place the ypos and actorlink arrays on the zero-page, the total worst-case execution time for 32 actors becomes 2208 cycles. To my knowledge, this is the fastest routine for sorting 32 sprites on the C64 in less than 2 kB of RAM.

Proof of concept

To demonstrate my technique, I released a small one-file demo called Field Sort (pictured at the top of this page).

Here is the CSDb page.

Posted Sunday 24-Sep-2017 23:00