"Low-level programming is good for the programmer's soul." -John Carmack

When the complexity of modern computing gets fatiguing, I look back to the simpler machines of my childhood. I grew up surrounded by various 8-bit and 16-bit machines (most of them already a few years outdated at the time), but my favorite of the bunch is the Nintendo Entertainment System. It's a small machine, even by '80s standards:

6502 CPU at 1.79 MHz

2 Kilobytes of RAM for the CPU

2 Kilobytes of RAM for the PPU (video)

This tiny RAM was augmented by relatively massive amounts of ROM (and sometimes additional RAM) in each cartridge. Typical software ranged in size from 40KB (Super Mario Bros.) to 384 KB (Super Mario Bros. 3), with a few outliers on each end.

When the mood strikes, I'll variously elect to work on my emulator or tinker with EPROM carts and code. The siren call of 6502 assembly language beckoned, so I dusted off the assembler I'd written in Lisp a few years prior. After a little time spent polishing the code, I wanted to write something cool to show it off; so far, I'd mostly written fairly boring experiments and tests such as the following:

I can do better than that.

The 6502 is a great processor to write an assembler for, with a compact and regularly-encoded instruction set. Simple hardware and no external constraints (as might be imposed by operating systems, library code, compilers, coworkers, etc.) make an ideal playground to do your own thing. My assembler is minimalistic, but not minimal: 570 lines at the core, split evenly between infrastructure and definitions of the 6502 architecture. It's embedded in CL, rather than operating as a standalone language processor, so typical assembly source code is really a series of lisp function calls, each emitting an instruction. I'll say a little more about the design.

The job of an assembler is simple enough: to translate a description of data and machine instructions to an output file (an executable, object file, or raw binary data) which can be loaded in memory on the target machine. Machine code is just another kind of data, for which mnemonic assembly language is syntactic sugar, and so knowledge of a particular instruction set can be considered a layer of drudgery built on top of a very simple foundation, with three essential tasks:

Accumulate an output vector. Remember the current assembly position, in terms of the target's address space. Maintain a mapping of symbolic names to locations so that self-referential structures can be assembled, including forward references.

My assembler bundles these responsibilities into an object I'll call the assembly context, implementing the following protocol:

(defgeneric context-emit (context vector) (:documentation "Emit a vector of bytes into the assembly context")) (defgeneric context-address (context) (:documentation "Returns current virtual address of the context")) (defgeneric (setf context-address) (address context) (:documentation "Set the current virtual address of the context")) (defgeneric context-find-label (context symbol) (:documentation "Returns the address of a label, or nil.")) (defgeneric context-set-label (context symbol &optional address) (:documentation "Set the address of a label. If not supplied, the current address is used.")) (defgeneric context-emit-instruction (context vector) (:documentation "Emit an instruction into the assembly context. This is a hint, for contexts which want to handle instructions specially (e.g. cycle counting).") (:method (context vector) (context-emit context vector))) (defgeneric link (context) (:documentation "Prepare and return final, assembled output."))

Handling forward references is the only tricky bit. Assemblers take various approaches to this. Using delayed evaluation (in the fashion of force / delay ) seemed the simplest way to me - if an expression (such as a call or branch target) involves a label that is not yet defined, a promise object is emitted into the output vector. When LINK is called on the context at the end of assembly, promises are forced to evaluate. If they still can't be resolved, the problems are collected and presented as an error.

For brevity, the context is dynamically bound to the variable *context* . Instruction emitters use this implicitly, as do a number of functions which provide a friendlier user interface to the assembler:

;;;; User interface: (defvar *context* nil "Current assembly context") (define-symbol-macro *origin* (context-address *context*)) (defun emit (bytes) (context-emit *context* bytes)) (defun db (&rest bytes) (dolist (byte bytes) (context-emit *context* (encode-byte byte)))) (defun dw (&rest words) (dolist (word words) (context-emit *context* (encode-word word)))) (defun advance-to (offset &optional (fill-byte #xFF)) (let ((delta (- offset (context-address *context*)))) (when (< offset 0) (error "Cannot advance to ~X, it is less than the current assembly address (~X)" offset (context-address *context*))) (context-emit *context* (make-array delta :initial-element fill-byte)))) (defun align (alignment &optional (fill-byte #xFF)) (advance-to (* alignment (ceiling (context-address *context*) alignment)) fill-byte)) (defun label (name &key (offset 0) (context *context*)) (assert (not (null context))) (delay name (offset) (+ offset (or (context-find-label context name) (error 'resolvable-condition :path (format nil "Label ~A is undefined" name)))))) (defun set-label (name &optional (context *context*)) (context-set-label context name) name)

You could define the simplest instruction emitters as follows:

(defun nop () (db #xEA)) (defun rti () (db #x40)) (defun rts () (db #x60)) ...

This leaves open the question of how to handle the more complicated instructions which take an operand and support multiple addressing modes. One possibility would be to indicate the addressing mode in the function name, yielding lda.imm , lda.zp , lda.mem , etc. I opted for a different approach, where each addressing mode corresponds to a class, and the instruction encoder can select the opcode according to operand type. Worst case, you could define a generic function per instruction mnemonic, with a method for each addressing mode. This would be easy enough, but there's a better way.

The 6502/65C02/65C816 Instruction Set Decoded describes how the 6502 instruction set can be partitioned into three groups. Within each group, addressing modes are expressed consistently by certain patterns of bits within the opcode. With a handful of supporting definitions, the assembler can exploit this knowledge to minimize the manual labor required. For instance, my assembler defines the first group of instructions as follows, with an extra defmethod to handle the lone exception to the pattern:

;;; Group 1: ;;; ORA AND EOR ADC STA LDA CMP SBC ;;; (zp,X) 01 21 41 61 81 A1 C1 E1 ;;; zp 05 25 45 65 85 A5 C5 E5 ;;; # 09 29 49 69 A9 C9 E9 ;;; abs 0D 2D 4D 6D 8D AD CD ED ;;; (zp),Y 11 31 51 71 91 B1 D1 F1 ;;; zp,X 15 35 55 75 95 B5 D5 F5 ;;; abs,Y 19 39 59 79 99 B9 D9 F9 ;;; abs,X 1D 3D 5D 7D 9D BD DD FD (defun group-1-addr-code (x) (typecase x (idxi #b000) ; (zero page,X) (zp #b001) ; zero page (imm #b010) ; #immediate (mem #b011) ; absolute (indi #b100) ; (zero page),Y (zpx #b101) ; zero page,X (aby #b110) ; absolute,Y (abx #b111) ; absolute,X (t (invalid-operand-error nil x)))) (defun group-1-asm (parameter opcode) (join-masks (join-masks (ash opcode 5) (ash (group-1-addr-code parameter) 2)) #b01)) (def6502 ORA group-1-asm #b000) (def6502 ANDA group-1-asm #b001) (def6502 EOR group-1-asm #b010) (def6502 ADC group-1-asm #b011) (def6502 STA group-1-asm #b100) (def6502 LDA group-1-asm #b101) (def6502 CMP group-1-asm #b110) (def6502 SBC group-1-asm #b111) (defmethod choose-opcode ((instruction (eql 'sta)) (operand imm)) ;; One exception: STA with immediate destination makes no sense. (invalid-operand-error instruction operand))

With an assembler in Lisp, it's easy to define higher level control structures. Here I define a conditional ASIF and a simple looping macro, AS/UNTIL (the AS prefix is chosen to distinguish them from Lisp control structures, just as the AND and OR instructions are renamed to ANDA and ORA above to avoid collision with the CL operators of the same name):

(defmethod condition-to-branch ((condition symbol)) (or (cdr (assoc condition '((:positive . bmi) (:negative . bpl) (:carry . bcc) (:no-carry . bcs) (:zero . bne) (:not-zero . beq) (:equal . bne) (:not-equal . beq) (:overflow . bvc) (:no-overflow . bvs)))) (error "Unknown condition ~A" condition))) (defun assemble-if (branch-compiler then-compiler &optional else-compiler) (let ((else-sym (gensym "ELSE")) (finally-sym (gensym "FINALLY"))) (funcall branch-compiler (rel else-sym)) (funcall then-compiler) (when else-compiler (jmp (mem (label finally-sym)))) (set-label else-sym) (when else-compiler (funcall else-compiler)) (set-label finally-sym))) (defmacro asif (condition &body statements) (let ((then statements) (else nil) (part (position :else statements))) (when part (setf then (subseq statements 0 part) else (subseq statements (1+ part) nil))) `(assemble-if ',(condition-to-branch condition) (lambda () ,@then) ,(and else `(lambda () ,@else))))) (defmacro as/until (condition &body body) (let ((sym (gensym))) `(with-label ,sym ,@body (funcall (condition-to-branch ',condition) (rel ',sym)))))

Another way to extend the assembler is by defining new types of contexts. One obvious thing you'd want is a local-context with its own symbol table, for defining nested scopes. Using this, you can define a cute procedure macro, defining its name in the surrounding context but enclosing the body in a local context:

(defmacro procedure (name &body body) `(progn (set-label ',name) (let ((*context* (make-instance 'local-context :parent *context*))) ,@body)))

A cooler trick is to define a cycle-counting-context : by specializing a method on context-emit-instruction , we can peek at each assembled instruction and tally up the number of cycles used in a block of straight-line code:

(defclass cycle-counting-context (delegate-code-vector delegate-symbol-lookup) ((cycle-count :initform 0 :accessor cycle-count :initarg :cycle-count) (precise-p :initform t :accessor precise-p :initarg :precise-p))) (defmethod context-note-cycles ((context cycle-counting-context) num-cycles) (incf (cycle-count context) num-cycles)) (defmethod context-emit-instruction ((context cycle-counting-context) vector) (multiple-value-bind (cycles variable) (opcode-cycles (aref vector 0)) (when variable (setf (precise-p context) nil)) (if cycles (context-note-cycles context cycles) (warn "Don't know number of cycles for opcode ~X" (aref vector 0))) (call-next-method))) (defmacro counting-cycles (&body body) `(let ((*context* (make-instance 'cycle-counting-context :parent *context*))) ,@body (cycle-count *context*)))

Coupled with a utility function emit-delay , this enables a macro called timed-section which can time a block of code and pad it out to consume a specific number of cycles, for precise cycle-timed loops. Here's a simple example from my test cart, using timed-section to produce a sawtooth wave through the 7-bit DAC, timing the loop by dividing the clock rate of the system by the target frequency (220 Hz) and the number of steps in the waveform (128):

(subprogram (sawtooth-220 "Sawtooth 220") (poke 0 +ppu-cr1+) ; Disable NMI (poke 0 +ppu-cr2+) ; Disable display (poke 0 +papu-control+) ; Silence audio (ldy (imm 0)) (timed-section ((round (/ +ntsc-clock-rate+ 220 128)) :loop t) (sty (mem +dmc-dac+)) (iny)))

I'll confess that emit-delay and/or counting-cycles aren't 100% accurate, as I discovered when trying to wrap raster effect kernels with them, but they work well enough for calibrating the pitch of various audio hacks - most notably music-demo.lisp, which streams a surprisingly high quality loop of music. I should improve these.

Most of my test programs were using character ROMs from commercial games (appropriately, as my first MMC3 devcart still had the original CHR ROM soldered to the board). Before uploading those to github, I wanted to strip the graphics out. Since virtually all of the tools for editing NES graphics run on Windows, I threw together some basic functions for converting between GIF files (using the SKIPPY library) and 2-bit NES characters. This initiated a sequence of events culminating in a nifty little graphics demo, which you can watch on Youtube.

With my newfound ability to convert graphics and display them on the NES, I searched out a graphic online I could turn into another simple example to put under the assembler's hacks directory. Typing the name of the first artist I could think of into Google, I settled on Mark Ryden's "Fur Girl" - I hope he doesn't mind. I wanted to run this on my real NES using the EPROM cart I made from an old Gyromite board a few weeks ago, which constrained me to 32 KB for program data - more than enough - but only 8 KB graphics.

8 KB is enough for half a screen of unique graphics, but the NES divides the ROM into two 4 KB pages, typically one for background and another for sprites. I elected to fill the height of the screen and tile the image horizontally, alternating the coloring. To use both pages of ROM for background graphics, I have to hit one of the PPU control registers ($2000) at the right time mid-frame to switch pages. One glaring design flaw in the NES is the lack of a dedicated scanline counter or horizontal blank interrupt. Some cartridges rectify this with additional hardware that cleverly monitors the PPU address lines, but the basic NROM board I used had no such hardware. There are other techniques you might try, but the most basic and broadly applicable solution is a delay loop after some reference point (such as the beginning of the vertical blank) that spins until the necessary time has elapsed.

On top of the scrolling background image, I called the 64 sprites of the NES into play. I believe the demoscene refers to these as "sinebobs". Each sprite has a separate X and Y angle, indexing into a sine table to determine its X/Y screen coordinates. These are initialized to form a circle, and different methods of incrementing the angle variables produce different patterns of motion on screen. For instance, the effect of the circle splitting into four pieces and recombining is achieved by the following code, which increments the X angle only for every other group of 16 sprites:

(procedure split-by-4 (jsr 'framestep) (ldx (imm 63)) (as/until :negative (inc (abx table-y)) (txa) (anda (imm 16)) (lsr) (lsr) (lsr) (clc) (adc (abx table-x)) (sta (abx table-x)) (dex)) (rts))

Looking at the girl's hair in the image, I thought it would be cool to try a wavy raster effect. This is done by reprogramming the scroll/address registers during the horizontal blanking period between scanlines, and is subject to the same timing constraints as changing the character ROM bank mid-frame. In fact, the timing for this must be tighter, because the visual glitch if you miss the timing window here is much more visible. I don't manage to get it 100% right, but in practice it looks pretty good. I do have a good excuse: nearly all of the processing each frame occurs before the mid-frame split point and must be written to execute in a constant number of CPU cycles. Even so, there are a number of factors that complicate CPU/PPU sychronization. Reading Consistent frame synchronization on the Nesdev wiki might make your head explode.

Timing all this caused a big problem for doing music: before coding the wavy effect, I'd brought in some music I'd written several years ago using the MML/MCK toolchain, calling the playback routine each frame after the screen split. At the time, there was nothing left to do after the split but wait for the vertical blank interrupt, so it was a perfect place to put the music and anything else that might take a variable number of CPU cycles. After adding the wavy effect, there was zero time left before the vertical blank. I ended up turning the screen off a few scanlines early (necessary, otherwise you'd notice the wave effect stop before reaching the bottom) and using that time to update the sprite positions and run the code controlling the overall sequence of patterns. There was no room there for the music.

There's a neat trick you can do on the NES to eyeball (literally) the amount of time a piece of code takes during the frame, by toggling the color emphasis bits in register $2001 before and after. Before removing the music, I timed the player in this way, illustrating its wildly variable execution time - from virtually none upward to perhaps 20 scanlines per frame. If it had taken only a few scanlines to execute, I'd have been okay turning the display off a little earlier to make time, but reserving enough time to accomodate the longest spikes would've seriously cut into the image.

There were two mid-frame delay loops (before and after the screen split) which could be shortened to provide plenty of time for music, if the player routine executes in a fixed number of machine cycles. It took a couple days of head scratching and feet dragging before I reduced my ambitions to nearly the simplest design imaginable - compile the music all the way down to a sequence of sound register writes. I defined a music 'frame' as sixteen address/value pairs (a reasonable upper bound), and with the assumption that particular frames would recur frequently, merged the duplicates and stored the song as an array of pointers, one per 60 Hz tick, to the sixteen memory writes performed for each frame.

I realize now that I could've written the music using any of the existing tools, easily converting it by modifying my emulator to log all the sound register writes during playback. I'd spent enough time initially considering more complicated player designs that the idea of having to invent my own tools had already stuck. No matter, I'd have probably done it this way regardless.

I built a miniature embedded language for composing the music based on the same ideas I used in my previous post playing with audio synthesis: the ability to repeat and combine audio sequentially and in parallel. This time around, rather than mixing audio samples, the basic building blocks combine lists of sound register writes:

(defun register (address value) (list value address)) (defun nop-write () (register #x0D 0)) ; Dummy write to unused register. (defun pad-list (list padding desired-length) (assert (<= (length list) desired-length)) (append list (loop repeat (- desired-length (length list)) collect padding))) (defun pad-frame (frame) (pad-list frame (nop-write) 16)) (defun para (&rest args) (apply #'mapcar #'append (mapcar (lambda (x) (pad-list x nil (reduce #'max args :key #'length))) args))) (defun seq (&rest args) (apply #'concatenate 'list args)) (defun repeat (n &rest args) (apply #'seq (mapcan #'copy-list (loop repeat n collect args)))) (defun segment (length list) (if (< (length list) length) (pad-list list nil length) (subseq list 0 length))) (defun measure (&rest args) (segment 128 (apply 'para args))) (defun rst (length) (segment length nil)) ; "Rest"

I then defined functions to control each sound channel. For instance, the NOISE function encodes its arguments as a set of writes to the noise channel registers, upon which I've defined several percussive sounds:

(defun noise (length duration period &key short loop (env t) (vol 15)) (check-type duration (integer 0 31)) (check-type vol (integer 0 15)) (check-type period (integer 0 15)) (segment length (list (list (register #xC (logior (if loop #x20 0) (if env 0 #x10) vol)) (register #xE (logior (if short #x80 0) period)) (register #xF (ash (translate-length duration) 3)))))) (defun kick (length) (noise length 8 15 :vol 1)) (defun snare (length &optional (variation 0)) (noise length 8 (+ 10 variation) :vol 1)) (defun hat (length &optional (variation 0)) (noise length 4 (+ variation 1) :vol 1))

To make certain it's clear, I'll give an example. Here I evaluate (snare 8) , meaning I want a snare sound that consumes 8 frames (or 8/60 = 0.13s) of time:

CL-USER> (write (dollhouse-demo::snare 8) :base 16) (((1 C) (A E) (48 F)) NIL NIL NIL NIL NIL NIL NIL)

The outermost list is of length 8: each element is a list of memory writes to be performed on a particular frame, in the form (value address) . Here, writes occur on the first frame, and the subsequent frames (empty lists) pad the sequence out to the desired length. In this way, rhythms can be built by concatenation. The address is taken relative to $4000, where the NES sound registers begin. Comparing with the list of noise channel registers, the first frame of writes can be interpreted as follows:

Store $1 to $400C: Set envelope length.

Store $A to $400E: Set shift register period.

Store $48 to $400F: Load note length counter and play.

I began building the song up as a series of local function definitions. The seq function combines its inputs sequentially. Here are three half-bar phrases, swagger , stagger , and jagger , which I combine in different orders to form all the drum patterns:

(swagger () (seq (kick 16) (hat 8) (hat 8) (snare 16) (hat 8) (hat 8 4))) (stagger () (seq (hat 8) (kick 8) (hat 8) (hat 8) (snare 16) (rst 16))) (jagger () (seq (shaker 8 15) (shaker 8 4) (shaker 8 8) (shaker 8 12) (shaker 8 15) (shaker 8 5) (shaker 8 11) (shaker 8 14)))

Along with a thump achieved by sweeping the pitch of the triangle channel, specific drum patterns are built up by combining pieces in series and parallel. The measure function combines its arguments in parallel, ensuring the resulting length is one measure (128 frames) exactly. The drum patterns used in the first half of the music are defined as follows:

(four-on-the-floor () (repeat 4 (thump 32 (et -24)))) (intro-beat () (measure (four-on-the-floor) (seq (swagger) (swagger)))) (intro-fill-1 () (measure (four-on-the-floor) (seq (swagger) (stagger)))) (intro-fill-2 () (measure (four-on-the-floor) (seq (jagger) (jagger))))

Finally, I've built up to defining the music. Here, the para (parallel) operator combines its two inputs: a four measure drum pattern, constructed in an AAAB pattern from intro-beat and intro-fill-1 , and a sequence of four arpeggiated chords, each one measure long. This defines the first four measures of the music:

(para (phrase-aaab (intro-beat) (intro-fill-1)) (seq (measure (fat-arp 128 '(0.00 0 3 7 11) :rate 4 :volume (volramp 8 -1/22))) (measure (fat-arp 128 '(0.00 0 2 5 8) :rate 4 :volume (volramp 8 -1/22))) (measure (fat-arp 128 '(0.00 -2 7 8 12) :rate 4 :volume (volramp 9 -1/20))) (measure (fat-arp 128 '(0.00 -1 2 3 7) :rate 4 :volume (volramp 10 -1/18)))))

My arpeggiator functions burn through a lot of space in the ROM, so I was only able to fit a little over 30 seconds of music (which is just fine with me, because I was getting impatient to finish this). You could compress the music substantially, but I wasn't willing to complicate the playback routine (including doing anything requiring conditionals, for which I didn't feel like balancing the timing of both branches). The short music loop is appropriate to the repetitive visuals. Everything came together quickly, and I'm pleased with the result.

One idea I'm disappointed didn't work out was to gradually shift the tuning of the music downward, so that at the end it could appear to modulate upward to a higher key, but actually return to where it started. I'd need a longer loop of music for this to work well. I found I could only shift the tuning by one semitone over a thirty second loop without it being distracting. I'm particularly disappointed because the implementation was cute. Since the music is constructed and concatenated piece by piece, the individual units have no knowledge of where in the final timeline they will appear. This is a problem, because if I'm shifting the tuning continually, pitch becomes an implicit function of time. Since the pitches get encoded into bitfields written to the hardware, it wasn't reasonable to just do a final pass over the fully assembled music, shifting the pitches down.

I solved this by abusing the assembler's delayed evaluation mechanism, modifying the function translating equal tempered pitch to frequency, returning a promise object rather than an actual number. This promise will only resolve if the *tuning-root* is bound in the dynamic environment. After assembling the music, a final pass walks each frame of music, binding the *tuning-root* and forcing the promises to resolve (with a hack to the delayed evaluation framework to stop it from memoizing results of promises, in case a passage is repeated). This was relatively unobtrusive - since CL doesn't transparently support delayed evaluation, you do have to thread support through code that could operate on a promise value - but that only touched the pitch translation and a few sound register functions. I'd like to revisit this idea. It's the sort of neat trick that's usually too much hassle to bother with when you're using conventional MIDI and DAW software. The remains of the idea still exist in the music-test.lisp sandbox, but I stripped it out of the final demo source code.

Anyway, this stuff is a lot of fun.

Links: