Posted on December 14, 2013

A long time ago, when I was a college undergrad, I spent some time working on computer video games. This was in the 8-bit PC era, when the gaming hardware was almost impossibly slow by today’s standards.

It might not surprise you, then, to learn that game programmers of yore did all sorts of crazy things to make their games run at playable speeds. Crazy, crazy things.

This is a story about one of those things.

While I’ve done my best to recall the important details, I may have gotten some things wrong. If I have, please forgive me. It was a long time ago.

(Note: This post contains some small images as inline-data URLs that some feed readers apparently don’t handle well. If you can’t see the images in your feed, click through to the original article to see them.)

Update 2013-12-16: There have been lots of great comments about this story on Hacker News and Proggit, as well as here on the blog. Thanks for sharing, everyone!

The set-up

A friend of mine, a gifted programmer, was nearly finished with a new game. Somehow, he had squeezed into a 1980’s-era PC, pretty faithfully, what was at the time a graphically impressive coin-op game popular at the arcades.

The only problem was that his version of the game was unplayable. It was too slow, and the choppy motion broke the player’s immersion. It was after all a side-scroller.

My friend, who had been working on the game while also taking a full load of college courses, was starting to seem a little frazzled. Fearing that he had missed some easy optimization, he asked me to take a look at the code.

I did. But there was no easy optimization to be had.

The code was tight. Everywhere I looked, he had already done all the things I could think of. Loops had been unrolled. Unneeded draws had been eliminated. All evidence of waste had vanished.

And, with them, our hopes of an easy fix.

But what about a hard fix? A crazy fix?

Well, that was always a possibility.

Before I get to the crazy stuff, however, I should back up and explain how graphics hardware worked back then.

How graphics hardware worked back then

Here’s the quick version: The graphics hardware didn’t do jack squat.

On the PC in question, if you wanted to draw something onto the screen, you had to do that yourself, byte by byte. No texture mappers, no blitters. Just bytes. Bytes that you had to move yourself.

Fun.

My friend’s game spent most of its time redrawing the background. (Again: side-scroller.) For every frame, it had to draw nearly a screen’s worth of background tiles, shifted for the player’s position.

If my memory serves me, each tile was 28 by 28 pixels. Each pixel was one of 16 colors, taking half a byte to represent. Therefore, tiles were represented in memory as 28 contiguous rows of 14 bytes each. The first 14 bytes represented the first row, the second 14 bytes the second row, and so on.

The screen, however, was 320 pixels wide by 240 pixels high. In memory, then, a screen buffer was laid out as 240 contiguous rows of 160 bytes each.

So when copying a tile at address X onto a screen buffer location starting at address Y, you had to copy 28 rows. To copy each row, you had to copy 14 bytes. Fortunately, the processor for this game was a 6809, which had a few 16-bit index registers and delightful “auto-increment” addressing modes (sort of like applying the postfix ++ operator to pointers in C). That meant you could copy 4 pixels at a time while advancing both the X and Y registers in passing:

LDU ,X++ ; read a 2-byte word (= 4 pixels) from source tile STU ,Y++ ; write it to screen buffer

To copy a full row, you had to do that seven times, so you might sandwich those lines within a 7-count loop:

LDB #7 ; remaining words <- tile width in words @LOOP LDU ,X++ ; read a 2-byte word (= 4 pixels) from source tile STU ,Y++ ; write it to screen buffer DECB ; reduce remaining-word count BNE @LOOP ; loop while words remain

When you were done copying the row, you needed to advance the destination pointer Y so that it pointed to the starting address for the next row you would draw into the screen buffer. Since the screen buffer was 160 bytes wide and a tile was only 14 bytes wide, that meant you had to add their difference to Y:

LEAY 160-14,Y

And, just like that, you have copied a row onto the screen.

That’s one. To copy a full tile, you need to do the same thing 28 times. So, in turn, you sandwich that code within a 28-count loop.

Putting it all together, and giving names to the important numbers, you might end up with a subroutine like this:

;;; important constants SCRW = 160 ; screen-buffer width in bytes (= 320 4-bit pixels) TILW = 14 ; background-tile width in bytes (= 28 4-bit pixels) TILH = 28 ; background-tile height in rows WOFF = SCRW - TILW ; s-b offset from end of one tile row to start of next COPYTILE ;;; ;;; Copy a 28x28 background tile into a screen buffer. ;;; Arguments: ;;; X = starting address of background tile ;;; Y = starting address of destination in screen buffer ;;; LDA #TILH ; remaining rows <- tile height @COPYROW LDB #TILW/2 ; remaining words <- tile width in words @LOOP LDU ,X++ ; read a word (= 4 pixels) from source tile STU ,Y++ ; write it to screen buffer DECB ; reduce remaining-word count BNE @LOOP ; loop while words remain ;; LEAY WOFF,Y ; advance dst ptr to start of next dst row DECA ; reduce remaining-row count BNE @COPYROW ; loop while rows remain ;; RTS ; done! return to caller

And that code would be fine.

If you didn’t care about speed.

Caring about speed

Knowing that the game is probably going to spend most of its time running that code, you do what any good programmer would: start counting cycles. Here’s the inner loop again, with setup and finishing, annotated with cycle counts:

LDB #TILW/2 ; 2 cycles (set-up) @LOOP LDU ,X++ ; 8 STU ,Y++ ; 8 DECB ; 2 BNE @LOOP ; 3 ;; LEAY WOFF,Y ; 8 (finishing)

Looking at those counts within the loop, you’re not likely to miss that you’re burning 21 cycles to copy just 4 pixels. To copy a full row, then, works out to 2 cycles + (7 iterations) * (21 cycles/iteration) + 8 cycles = 157 cycles. Ouch.

But this isn’t your first time at the keyboard. You know what to do. Unroll that loop!

LDU ,X++ ; 8 cycles STU ,Y++ ; 8 LDU ,X++ ; 8 STU ,Y++ ; 8 LDU ,X++ ; 8 STU ,Y++ ; 8 LDU ,X++ ; 8 STU ,Y++ ; 8 LDU ,X++ ; 8 STU ,Y++ ; 8 LDU ,X++ ; 8 STU ,Y++ ; 8 LDU ,X++ ; 8 STU ,Y++ ; 8 LEAY WOFF,Y ; 8 (finishing)

Now, with the loop overhead reduced to zero – you’ve even eliminated the set-up – each row takes only 7 * (8 + 8) + 8 = 120 cycles. That’s a 30-percent speed-up. Pretty good.

And that’s where most programmers probably would have left it.

But not my friend.

He knew that those ++ operations were costly, 3 cycles apiece. And, with the loop unrolled, he also knew exactly where each word to be read or written was located with respect to X or Y. So he cleverly replaced those 3-cycle post-increments with exact offsets. They cost only 1 cycle apiece, and the 0 offset is actually free:

LDU ,X ; 5 cycles STU ,Y ; 5 LDU 2,X ; 6 STU 2,Y ; 6 LDU 4,X ; 6 STU 4,Y ; 6 LDU 6,X ; 6 STU 6,Y ; 6 LDU 8,X ; 6 STU 8,Y ; 6 LDU 10,X ; 6 STU 10,Y ; 6 LDU 12,X ; 6 STU 12,Y ; 6 LEAX TILW,X ; 8 (finishing) LEAY SCRW,Y ; 8 (finishing)

With his optimizations, the cycle count per row had been reduced to (5 + 5) + 6 * (6 + 6) + (8 + 8) = 98 cycles. Compared to the original code, that’s 60 percent faster:

original_speed = (1*row) / (157*cycle) optimized_speed = (1*row) / (98*cycle) speed_up = optimized_speed / original_speed = 157 / 98 = 1.60.

Putting it all together – and, again, I’m going by memory so the code might have been slightly different – the copy-tile subroutine looked something like this, and it copied a full tile, all 28 rows, in a lean 2893 cycles:

COPYTILE2 ;;; ;;; Copy a 28x28 screen tile into a screen buffer. ;;; Arguments: ;;; X = starting address of background tile ;;; Y = starting address of destination in screen buffer ;;; Execution time: ;;; 4 + 28 * (82 + 8 + 8 + 2 + 3) + 5 = 2893 cycles ;;; LDA #TILH ; initialize row count (4 cycles) ;; @COPY1 ;; unroll inner loop (copies one row of 28 pixels in 82 cycles) LDU ,X ; (1) read 4 pixels (5 cycles) STU ,Y ; write 4 pixels (5 cycles) LDU 2,X ; (2) (6 cycles) STU 2,Y ; (6 cycles) LDU 4,X ; (3) ... STU 4,Y ; ... LDU 6,X ; (4) STU 6,Y ; LDU 8,X ; (5) STU 8,Y ; LDU 10,X ; (6) STU 10,Y ; LDU 12,X ; (7) STU 12,Y ; ;; LEAX TILW,X ; advance src to start of next row (8 cycles) LEAY SCRW,Y ; advance dst to start of next row (8 cycles) DECA ; reduce remaining count by one (2 cycles) BNE @COPY1 ; loop while rows remain (3 cycles) ;; RTS ; done! return to caller (5 cycles)

This code, all in all, was 60% faster than the naive COPYTILE code we started out with.

But that wasn’t fast enough. Not nearly.

So when my friend showed me his code and asked if I could make it faster, I really wanted to help him. I really wanted to be able to answer yes.

But I had to answer no. I hated giving that answer. But, studying the code, I couldn’t see any way to speed it up.

And yet, the hook was set.

Later, I couldn’t get the problem out of my head. Maybe I had missed something. I had grown up on Apple II computers and their 6502 processors. My friend’s code, however, was for the 6809. Maybe it afforded optimizations that I didn’t know about.

With renewing optimism, I dialed into the university’s library to access the card catalog. (These were the days before the World Wide Web.) Through a VT220 terminal emulator, I searched for books about the 6809.

There was one. A 6809 microprocessor manual. It was in the engineering library. Luckily, I was an engineering student and had sign-out privileges.

Just like that, I was off to the library.

A crazy idea

When I got to the engineering library, I found the book right where it was supposed to be, looking a little beat up:

The MC6809-MC6809E Microprocessor Programming Manual.

Not bothering to find a chair, I flipped through the pages standing, looking for some oddball, 6809-specific instruction that might throw a lot of bytes and throw them fast. Page after page, though, turned up nothing.

Then I came upon PSHS. “Push registers on the hardware stack.”

On my beloved 6502, if you wanted to save registers on the stack, you had to do it one register at a time and, even then, by transferring them through the accumulator. It was slow and expensive. So I had learned to avoid the stack when speed mattered.

But on the 6809 you could save all of the registers – or any subset of them – with a single instruction. And, amazingly, it cost a mere 5 cycles, plus 1 cycle more per byte pushed.

Since the processor had three 16-bit general-purpose registers – D, X, and Y – I could load them up and then use a single PSHS instruction to write 6 bytes in just 11 cycles. The corresponding pull instruction, PULS , had the same low cost.

Further, the 6809 had two stack registers, S and U. I could use one as a source pointer and the other as a destination pointer. In theory, with a single PULS / PSHU pair, I could copy 6 bytes in 22 cycles.

That’s crazy, crazy fast.

Excitement mounting, I headed for the front desk and reached for my student ID. I was going to sign out that wonderful little book for further study.

A crazy plan

Walking back to the dorms, I formed my plan.

I would save the S and U registers somewhere, and then point S at the background tile and U at the screen buffer. Then I would pull from S and push to U, copying 6 bytes at a time, using D, X, and Y as go-betweens. To copy the 14 bytes that make up a row would require three such iterations, which unrolled would be about 60 cycles.

When I reached my room, I found a piece of paper and sketched it out:

PULS D,X,Y ; first 6 bytes 11 cycles PSHU D,X,Y ; 11 PULS D,X,Y ; second 6 bytes 11 PSHU D,X,Y ; 11 PULS D ; final 2 bytes 7 PSHU D ; 7 LEAU -WOFF,U ; advance dst ptr 8

Just 66 cycles, including the after-row adjustment to U that prepares it for the next row. (Note that the adjustment is now negative.) For comparison, the naive row-copy loop that we looked at earlier took 157 cycles to do the same thing. And my friend’s optimized code took 98. Already, this crazy idea was looking like a big win.

Still, that last PULS / PSHU pair! What a clunker. It was needed to handle the final two bytes per row, since the rows were 28 pixels = 14 bytes wide, and 6 doesn’t divide 14 evenly.

That darn remainder of 2 bytes!

If only the game had used 24-by-24 tiles instead… But it hadn’t, so I pored through the manual, looking for a way to lower the cost of that inevitable last pair.

And then, by surprise, I struck gold! It was another 6809 oddity, the DP register.

On the 6502 and most 8-bit processors of the era, the lowest 256 bytes of memory was called the zero page. The zero page was special because its memory locations had single-byte addresses and could be accessed with shorter and usually faster instructions.

The 6809’s designers took this idea one step further. They let you use the DP register to designate any page as the zero page, which they called the “direct page.”

But none of my byte-slinging instructions needed to use the direct page. That meant I could use the DP register as an additional one-byte go-between. Now I could copy 7 bytes per pull-push pair!

And 14 is evenly divisible by 7.

With this change, I could copy a whole 28-pixel row and advance to the next in just 5 instructions:

PULS D,X,Y,DP ; first 7 bytes 12 cycles PSHU D,X,Y,DP ; 12 PULS D,X,Y,DP ; second 7 bytes 12 PSHU D,X,Y,DP ; 12 LEAU -WOFF,U ; advance dst ptr 8

56 cycles! Fifty. Six. Cycles.

That bit of code made me feel just awesome. I had managed to harness every single available register on the machine for byte-slinging! D, X, Y, U, S, and even the oddball DP – they were all fully engaged.

I was feeling great about this solution.

Except for just one thing…

I’ll have a little more crazy, please

If you’re acquainted with stacks, you may have noticed a subtle flaw in my brilliant plan:

Pushes and pulls work in opposite directions.

See, I lied about that little block of code I just showed you. It didn’t actually copy one row from a background tile onto the screen. What it actually did was break the row into two 7-byte blocks – I’ll call them “septets” – and then draw them onto the screen swapped.

Recall that tile rows were laid out sensibly in memory as 14 contiguous bytes, like this:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 0 1 2 3 4 5 6 7 8 9 A B C D | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+

So when you pulled-and-pushed a row onto the screen in 7-byte septets, it would get horizontally split like this:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | 7 8 9 A B C D | 0 1 2 3 4 5 6 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The first septet was now second, and the second now first. Note, however, that the bytes within each septet were unchanged; they retained their original ordering.

Further, if you ran the row-copy code multiple times, the vertical stack of rows you copied would end up being upside down. That’s because the code uses pushes to write the rows onto the screen. Pushes move the stack pointer toward lower addresses, and lower addresses correspond to screen lines higher up on the display.

To visualize the havoc that this row-reversal and septet-swapping would wreak, let’s say that you had the following 28-by-28 tile, representing a key:

If you had used 28 applications of my row-copy code to draw it onto the screen, you would have ended up with this not-a-key:

So . . . that was a problem.

But this problem, too, had an elegant solution!

To review:

the rows were upside down

the septets within each row were swapped

but the bytes within each septet were unchanged

The breakthrough was to see “upside down” and “swapped” as two instances of the same thing. Reversal.

Reversal has this handy property: it’s an involution. Reverse something twice, and you get back the original. With this in mind, I reasoned that if the tiles were pre-processed to reverse their rows and then the septets within the rows, the tiles would end up looking fine when they hit the screen. The mangling that occurred during copying and pre-processing would, in effect, undo each other.

Working it out on paper, I also discovered that the pre-processing step could ignore the rows and just treat the tiles as one long, linear sequence of septets. To see why this is so, let’s consider a small tile of 2 rows of 2 septets each:

+---+---+ | a b | +---+---+ | c d | +---+---+

In memory, it would be laid out in row-major order, that is, as 4 sequential septets:

+---+---+---+---+ | a b | c d | +---+---+---+---+

So, reversing the rows and then swapping the septets within each row is the same as just reversing the septet order in memory:

+---+---+ +---+---+ +---+---+ | a b | | c d | | d c | +---+---+ =======> +---+---+ =======> +---+---+ | c d | reverse | a b | swap | b a | +---+---+ rows +---+---+ septets +---+---+ +---+---+---+---+ +---+---+---+---+ +---+---+---+---+ | a b | c d | | c d | a b | | d c | b a | +---+---+---+---+ +---+---+---+---+ +---+---+---+---+

So the solution to the mangling problem turned out to be a simple, one-time pre-processing step: Just break each tile into septets and reverse their order.

Knowing that the mangling problem could be solved, I was feeling good about my overall idea. I couldn’t see any other problems, so I sketched out a new copy-tile subroutine to present to my friend. Since the row-copy logic was now just 5 instructions, I used it 4 times, unrolling the sole remaining loop a bit. Now, instead of copying 28 single rows, I copied 7 quad-rows.

The code looked something like this:

COPYTILE3 ;;; ;;; Copy a 28x28 screen tile into a screen buffer. ;;; Arguments: ;;; X = starting address of *septet-reversed* background tile ;;; Y = starting address of destination in screen buffer ;;; Execution time: ;;; 34 + 7 * (224 + 7 + 3) + 7 + 10 = 1689 cycles ;;; ;; setup: 34 cycles PSHS U,DP ; save U and DP (8 cycles) STS >SSAVE ; save S (7 cycles) ;; LDA #TILH/4 ; initial quad-row count (2 cycles) STA >ROWCT ; (5 cycles) ;; LEAS ,X ; initialize src ptr (4 cycles) LEAU (TILH-1)*SCRW+TILW,Y ; initialize dst ptr (8 cycles) ;; @COPY1 ;; copy four rows of 28 pixels in 4 * (48 + 8) = 224 cycles PULS X,Y,D,DP PSHU X,Y,D,DP PULS X,Y,D,DP PSHU X,Y,D,DP LEAU -WOFF,U PULS X,Y,D,DP PSHU X,Y,D,DP PULS X,Y,D,DP PSHU X,Y,D,DP LEAU -WOFF,U PULS X,Y,D,DP PSHU X,Y,D,DP PULS X,Y,D,DP PSHU X,Y,D,DP LEAU -WOFF,U PULS X,Y,D,DP PSHU X,Y,D,DP PULS X,Y,D,DP PSHU X,Y,D,DP LEAU -WOFF,U ;; DEC >ROWCT ; reduce remaining quad-row count by one (7 cycles) BNE @COPY1 ; loop while quad-rows remain (3 cycles) ;; LDS >SSAVE ; restore S (7 cycles) PULS U,DP,PC ; restore regs and return to caller (10 cycles) SSAVE ZMD 1 ; stash for saving S while drawing ROWCT ZMB 1 ; var: remaining rows to copy

Adding up the cycles, the total was just 1689. I had shaved almost 1200 cycles off of my friend’s code. That’s a 70 percent speed-up!

Beaming, I went to show my friend.

The acid test

When I caught up with my friend and said I’d figured out how to make the copy-tile routine 70% faster, his face lit up. When I explained about the whole septets-and-mangling thing, though, his face unlit, doubts rising. But when I showed him the code, he got it. It clicked.

“Let’s try it out!” he said.

In about a half hour’s work, he had the code integrated into the game. After a rebuild and reboot, the game was loading.

The title screen came up.

He was into the game.

And . . .

Holy crap!

The game seemed amazingly fast. In truth, it was probably only a third to a half faster, but that was enough. Some important perceptual threshold had been crossed. The game-play was now smooth, natural. You could feel the difference.

We just sat there, playing the game and grinning. It was a good day.

But it would not last.

In for a penny . . .

A few days after the speed problem was behind us – or so we thought – the game’s finishing touches were applied. One of those finishing touches was sampled sound effects. This required feeding byte-sized morsels to the audio-output DAC a few thousand times each second. To schedule the feedings, my friend had turned on a hardware-clocked interrupt.

Worked like a charm. The sounds sounded great, and everything was perfect.

Except for one thing.

After playing into the game one night, we noticed that some of the tiles were getting corrupted. The more we played, the more corruption we saw.

Oh, no.

Something very bad was happening.

And then it hit us.

What happens if an interrupt goes off during the copy-tile routine?

The processor, to prepare for calling the interrupt handler, would push its current state onto the system stack. But, during the copy-tile routine, there was no system stack. The system-stack register S had been commandeered. And where was it pointing? Right into the memory buffer that held the reference tiles!

Oops.

We slumped. After all that effort, to think that our all-important speed-up was causing memory corruption…

Needing a break, we walked to an all-night diner near campus to eat and think. Over pancakes and bacon, scrapple and grilled stickies, we talked through the problem.

There were only two stack registers, and without commandeering both of them, the copy-tile routine wouldn’t be nearly as fast. There was no way, then, to return the S register to the system without losing our hard-won speed. Rock. But there was also no way that we could get reliable sound without using interrupts. Hard place.

One way or another, then, interrupts were going to be triggered. And, when they were, if copy-tile was running, whatever S was pointing to was going to get hammered.

“How can we prevent the corruption?” I asked.

We sat there, ignoring our food, the question hanging in the air.

Suddenly, my friend slapped the table. He had it.

“Don’t prevent it!” he said.

“What?” I asked.

“Don’t prevent the corruption,” he explained. “Let it happen. Just not to the reference tiles.”

It was simple, really. He continued:

“Just swap S and U in the copy-tile routine. U will point at the reference tiles, and S at the screen buffer. If an interrupt goes off, the corruption will happen where S is pointing – on the screen. The corruption will last only until we redraw the next frame.”

“That’s brilliant!” I said.

Eager to try his solution, we quickly finished our meal.

. . . In for a pound

Walking back to the dorms, though, it bothered us that players might see screen glitches, even if for just a frame. We both felt that there had to be some way to make the hack perfect, if only we could find it.

Later that night, we found it.

Again, it was simple once we saw it. All we had to do was change the tile ordering. Instead of placing the tiles onto the screen from top to bottom, left to right, we would go right to left, bottom to top. In other words, from high addresses to low addresses.

That way, when an interrupt went off and corrupted the memory locations immediately before the tile we were currently drawing, the corruption would be fixed when the very next tile was drawn. And since tile-drawing took place in a buffer that wasn’t displayed until it was fully drawn (and the vertical refresh had occurred), nobody would ever see the corruption.

It was the perfect hack!

To the old days!

And that’s the way things were. The challenge wasn’t overwhelming complexity, as it is today. The challenge was cramming your ideas into machines so slow, so limited that most ideas didn’t fit.

You had to bang your ideas around, twist them, turn them, searching for something, anything that would help you squeeze them into the machine. Sometimes you found it, and you got one step closer to realizing your ideas. Sometimes you didn’t.

But the search was always instructive.

In this case, the search yielded several small victories that, together, solved the problem. But when you consider all the things we had to do to make that darn game fast enough, it does seem crazy.

We started with a tile-copying subroutine whose core was a hand-tuned, unrolled loop of machine instructions. Then, to buy a 70-percent speed-up:

We replaced this subroutine with a very special manifestation of insanity that commandeered both stack pointers and used pulls and pushes, and every single available register, to draw tiles upside down and horizontally broken. Then we pre-processed the tiles so that drawing them would actually fix them. But then – dammit! – interrupts that occurred during drawing could corrupt the reference tiles. So, to protect the reference tiles, we corrupted the screen buffer instead. But that corruption would be visible. So we changed the tile-placement order to repair – on the fly – any corruption that might have occurred, before it could ever be displayed. And it all worked!

We did that. For 70 percent.

And it was so worth it.

Tell ’em if you got ’em

Anyway, I wanted to share that story with you for two reasons.

The first is that it’s a fun story. It’s one my earliest memories of struggling with a messy computing problem and, through dogged persistence, zeroing in on a solution that was both effective and elegant. The result was powerfully satisfying.

The struggle, I learned, is worth it.

The second reason is to encourage you to tell your stories. I know that, “back in the day,” most video games probably gave rise to many stories like this one. I’d love to hear them. But too many of them are lost, having faded out of memory before anyone thought to preserve them.

If you have a story, please don’t wait. Tell it. Every day you wait, it gets harder.

Tell it.