What does one do when certain global events forces them to stay inside*? Well assembly language optimization of course! Today I’m going to tell you about how I made the slowest routine in the game 25% faster. This is about to get very technical so break out your assemblers and hang on to your hat.

*For those of you reading this from the future, this is because of the Coronavirus pandemic of 2020

The biggest bottleneck in the Chuck Jones: Space Cop of the Future is, as it is in most VGA era MS-DOS games, drawing sprites. The reasons for this are twofold. The first is that the CGA, EGA and VGA cards used in MS-DOS PCs did not contain any hardware to handle the drawing of sprites, like dedicates consoles of the time did, as a result the CPU needs to transfer every byte to the screen all by itself. The second reason is video memory wait states, essentially this means that video memory cannot be accessed by the CPU and the video hardware at the same time. Because of this the CPU needs to wait while the video hardware access memory before it can draw any more pixels, unfortunately the timing of these accesses are essentially random and varies between cards. Essentially this just means video memory is very slow, and how slow it is depends on the video card.

One way to speed up drawing sprites is to use repeated x86 string instructions, this way memory accesses can happen as quickly as possible. Unfortunately using these instructions is not always possible when sprites can be partially transparent, because a check must be done before plotting each pixel. There are several ways to speed up transparent sprites as well, the first being compiled sprites and the second being RLE sprites. Neither of these are really viable in CJ:SCotF, compiled sprites take way too much memory given that the game has many unique sprites and my use of planar graphics in Mode X make RLE sprites inefficient. So I’m essentially stuck with making the basic algorithm faster in assembly language. So lets try that!

Some C style code for drawing transparent sprites might look like this, while I don’t suggest anyone actually use this code, it illustrates the idea pretty well.

while(numlines--) { while(numpixels--) { char pixel = sprite[image_index]; if(pixel != 0) //if the pixel is non transparent { screen[screen_index] = pixel; //draw it } //go to the next pixel screen_index++; image_index++; } //add the remaining distance to the end of the line screen_index += screen_diff; image_index += image_diff; }

A basic assembly language implementation

So lets look at a pretty reasonable assembly implementation

; bx = num rows ; dx = num colums ; ds:si = sprite data ; es:di = screen data cld rowLoop : mov cx, dx ; reload pixel counter lodsb ; load 1 pixel test al, al jnz storePixel ; if the pixel != 0 then goto store skipPixel: inc di ; increment the screen pointer dec cx jz finishRow ; if no more pixels then goto end loadPixel: lodsb ; load 1 pixel test al, al jz skipPixel ; if the pixel != 0 then goto skip storePixel: stosb ; store the pixel loop loadPixel ; if cx != 0 load next pixel finishRow: add si, ss:lineDiff add di, ss:screenDiff dec bx jnz rowLoop ; if bx != 0 goto next row

This has a few advantages over the C version, the code can be arranged so that the skipPixel case falls through into the next iteration of the loop. This guarantees that only one branch is taken per pixel, this is especially good because on x86 processors branches that are not taken are much less expensive than branches that are. Additionally this version uses very compact string instructions such as lodsb & stosb which can be loaded very quickly. Now of course we can do better.

Unrolling loops

Lets apply a simple optimization, namely unrolling loops. This way we can do two pixels per loop instead of one so that we have fewer branches

; bx = num rows ; dx = num colums ; ds:si = sprite data ; es:di = screen data rowLoop : mov cx, dx ; reload pixel counter shr cx, 1 ; divide by two jz lastPixel ; if theres nothing left do the last pixel lodsw ; load 2 pixels test al, al jnz storePixel1 ; if the first pixel != 0 then store it skipPixel1: inc di ; increment the screen pointer or al, ah ; al == 0, this sets al = ah & sets flags in one go jnz storePixel2 ; if the second pixel != 0 then store it skipPixel2: inc di ; increment the screen pointer dec cx jz lastPixel ; if no more pixels then goto end loadPixel: lodsw ; load 2 pixels test al, al jz skipPixel1 ; if the first pixel != 0 then skip it storePixel1: stosb ; store the first pixel test ah, ah jz skipPixel2 ; if the first pixel != 0 then skip it mov al, ah storePixel2: stosb ; store the second pixel loop loadPixel lastPixel: test dl, 1 ; if the number of pixels was even, otherwise do 1 more jz finishRow ; then jump to the end lodsb ; load the final pixel test al, al jz skipLast ; if the last pixel == 0 then skip it stosb ; store the last pixel dec di skipLast: inc di ; increment the screen pointer finishRow: add si, ss:lineDiff add di, ss:screenDiff dec bx jnz rowLoop ; if bx != 0 goto next row

How much better is this? Roughly ~22% better* then the standard assembly implementation above! That’s pretty good especially given that this code is a huge bottleneck in the game. I’ll admit that this isn’t a simple naive unroll, but it the unrolling allows us to use lodsw to load two pixels at a time which is a big win too.

However for even this code could be better. This code checks whether we have an even or odd number of pixels on each row, for tall and skinny sprites like characters this could be a significant waste of time given that we know whether the rows are even or odd outside of the row loop.

*timed on a 10 MHz 286 w/ VGA using the PIT counter for timing

Specialized routines

We can instead write 3 different routines. One for even widths, one for odd widths and one for a width of exactly one and then move these checks outside of the loop.

if(bytes_per_line == 1) { doTransparentBlitOne(source.h, screenOffset, bitmapPtr, lineDiff, screenDiff); } else if(bytes_per_line & 1) { doTransparentBlitOdd(source.h, screenOffset, bitmapPtr, bytes_per_line, lineDiff, screenDiff); } else { doTransparentBlitEven(source.h, screenOffset, bitmapPtr, bytes_per_line, lineDiff, screenDiff); }

The actual assembly code is a bit verbose so I won’t actually show it here. But with that optimization we get a further ~3% improvement on large sprites and even better with small sprites, definitely worth it!

Conclusion

Now I’m pretty sure that this could be faster, for one thing this version uses a lot of branches which is always bad, but right now I’ve down about all the optimizing I can handle. I actually did write a version without branching but it was actually slower than the initial implementation by about 30%! If anyone is interested I have the ‘final’ version of the code here, formatted as inline code for OpenWatcom C/C++. If anyone reading this comes up with some further improvements or alternative methods, I would be very interested to hear! Until next time, I promise you all one day, hopefully soon, I will actually finish this game…