Parallelogram

Parallelogram is a demo running on the Commodore One extender board, which contains an Altera Cyclone III FPGA and an SDRAM chip. The logic design was made from scratch, including a homebrew CPU, FM synth and blitter with pixel shader support. The demo won the wild compo at Revision 2012.

Download

The demo also has a pouët page, of course.

Custom logic

The system is coded in Verilog and compiled using Altera's free toolset (Quartus Web edition). PLLs, multipliers and memory blocks are instantiated from within Quartus using so-called megafunctions, but the rest of the project consists of plain Verilog files edited with Vim. I used gtkwave to simulate parts of the system when things didn't work, and sometimes that was very helpful.

The overall architecture is illustrated in the presentation video around the 1 minute mark: The CPU is in control of execution, and accesses the external memory through a 16 KB cache. Since I have no control over the initial contents of the SDRAM chip, the demo must be stored somewhere on the FPGA. I opted for a solution where the cache is preloaded with the demo binary at boot, marked as dirty. As other memory gets accessed, the demo gets written "back" into the SDRAM. This limits the demo to 16 KB.
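The boot trick can be sketched in a few lines of Python. This is only a model, not the Verilog: the class name `PreloadedCache`, the one-word line size and the dictionary standing in for the SDRAM are all assumptions made for illustration.

```python
class PreloadedCache:
    """Direct-mapped write-back cache, one word per line (line size is an assumption)."""

    def __init__(self, size_words, image):
        self.size = size_words
        self.sdram = {}  # word address -> value; unknown SDRAM contents are simply absent
        # Preload: entry i holds word address i, and every entry starts out dirty.
        self.tag = [0] * size_words
        self.data = list(image) + [0] * (size_words - len(image))
        self.dirty = [True] * size_words

    def _lookup(self, addr):
        index, tag = addr % self.size, addr // self.size
        if self.tag[index] != tag:
            if self.dirty[index]:
                # Evict: the old word is written "back" to its home in SDRAM.
                self.sdram[self.tag[index] * self.size + index] = self.data[index]
            self.tag[index] = tag
            self.data[index] = self.sdram.get(addr, 0)
            self.dirty[index] = False
        return index

    def read(self, addr):
        return self.data[self._lookup(addr)]

    def write(self, addr, value):
        index = self._lookup(addr)
        self.data[index] = value
        self.dirty[index] = True


cache = PreloadedCache(8192, [0x1234, 0x5678])  # 8192 words = 16 KB
first = cache.read(0)   # hit, served from the preloaded image
cache.read(8192)        # same cache entry as address 0, so word 0 is evicted
```

After the second read, word 0 of the demo lives in the modelled SDRAM even though nothing ever stored it there explicitly, while word 1 is still only in the cache.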

Memory

The SDRAM has a 16-bit bus width, and this property permeates the entire design. Pixels are stored as a0rrrr0gggg0bbbb, where the a bit is a generic alpha bit that can be used freely by software. It conveniently coincides with the sign bit. The point of having zeroes between the fields is that it simplifies saturated addition of colours.
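To see why the zero bits help, here is one way such a saturated addition could work, sketched in Python. The helper names `pack` and `saturated_add` are made up for this example, and the hardware may well do it differently; the point is only that a carry out of a colour field lands in the zero bit above it, where it is easy to detect.

```python
FIELDS = 0x3DEF  # the rrrr, gggg and bbbb fields of a0rrrr0gggg0bbbb
GUARDS = 0x4210  # the zero bit sitting directly above each field


def pack(r, g, b, a=0):
    """Build a pixel word from 4-bit colour components and the alpha bit."""
    return (a << 15) | (r << 10) | (g << 5) | b


def saturated_add(p, q):
    """Add two pixels per component, clamping each component at 15.

    The alpha bit is ignored and comes out as zero in this sketch.
    """
    total = (p & FIELDS) + (q & FIELDS)    # carries end up in the guard bits
    overflow = total & GUARDS
    saturate = overflow - (overflow >> 4)  # each set guard bit becomes 1111 in the field below
    return (total | saturate) & FIELDS
```

For example, adding (12, 2, 15) and (9, 3, 1) overflows in red and blue, and both components clamp at 15 while green simply becomes 5.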

There's an embarrassing error in the text at the beginning of the demo, where it says that only 128 KB of external memory is used. In fact, the system uses 2 MB (1 megaword) of the SDRAM, which requires 20 address bits, but the CPU only has direct access to the first 128 KB because addresses are stored in 16-bit registers. Memory is treated as a rectangular grid of words, 2048 rows by 512 columns. The blitter uses row/column addressing, and has access to the entire 2 MB. Frame buffers are 320 by 240 pixels, and are stored as sub-rectangles occupying columns 0 through 319.
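The relationship between row/column addressing and linear word addresses can be written down directly. The following Python sketch uses made-up function names; the mapping itself (row times 512 plus column) follows from the grid dimensions given above.

```python
COLUMNS = 512
ROWS = 2048


def word_address(row, column):
    """Linear word address of a grid position (20 bits for the full 2 MB)."""
    return row * COLUMNS + column


def cpu_can_reach(row, column):
    """True if the address fits in a 16-bit CPU register."""
    return word_address(row, column) < (1 << 16)
```

The last word of the grid has address 0xFFFFF, and the CPU's 64 kilowords correspond to rows 0 through 127.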

Memory map

(Feel free to skip ahead if you're not interested in this much detail...)

    char in map = 8x16 pixels (words)

    C = cpu memory with preloaded contents
    c = unpacked executable
    f = upper half is 64-character 8x8 font
    s = $70 sine table
        $71 freq table
        $72 channel data
        $73 synth register copy
        $74 constant random table
        $75 raster bar table
        $7e stack
        $7f stack
    1 = video frame buffer 1
    2 = video frame buffer 2
    3 = video frame buffer 3
    w = workspace frame buffer (for post fx)
    d, u, v = free memory for effect data
              e.g. smoke (density, x-vel, y-vel), front and back, 256x242
    m = 32x32 texture map
    e = echo buffer
    . = kept zero at all times

     0                                       320     384          511  Row CPU Address
     ----------------------------------------------------------------
    |CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC| 000 0000
    |                                                                | 010 2000
    |cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 020 4000
    |cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 030 6000
    |cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc| 040 8000
    |                                                                | 050 a000
    |ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff| 060 c000
    |ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss| 070 e000
    |                                                                | 080
    |                                                                | 090
    |                                                            mmmm| 0a0
    |                                                            mmmm| 0b0
    |                                                                | 0c0
    |                                                                | 0d0
    |                                                                | 0e0
    |.........................................                      .| 0f0
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 100
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 110
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 120
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 130
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 140
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 150
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 160
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 170
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 180
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 190
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 1a0
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 1b0
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 1c0
    |1111111111111111111111111111111111111111.       eeeeeeee       .| 1d0
    |.........................................       eeeeeeee       .| 1e0
    |.........................................       eeeeeeee       .| 1f0
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 200
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 210
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 220
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 230
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 240
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 250
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 260
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 270
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 280
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 290
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 2a0
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 2b0
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 2c0
    |2222222222222222222222222222222222222222.       eeeeeeee       .| 2d0
    |.........................................       eeeeeeee       .| 2e0
    |.........................................       eeeeeeee       .| 2f0
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 300
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 310
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 320
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 330
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 340
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 350
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 360
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 370
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 380
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 390
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 3a0
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 3b0
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 3c0
    |3333333333333333333333333333333333333333.       eeeeeeee       .| 3d0
    |.........................................       eeeeeeee       .| 3e0
    |.........................................       eeeeeeee       .| 3f0
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 400
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 410
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 420
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 430
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 440
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 450
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 460
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 470
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 480
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 490
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4a0
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4b0
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4c0
    |wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww.       eeeeeeee       .| 4d0
    |.........................................       eeeeeeee       .| 4e0
    |                                                eeeeeeee        | 4f0
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 500
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 510
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 520
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 530
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 540
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 550
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 560
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 570
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 580
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 590
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5a0
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5b0
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5c0
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5d0
    |ddddddddddddddddddddddddddddddddDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD| 5e0
    |                                                                | 5f0
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 600
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 610
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 620
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 630
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 640
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 650
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 660
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 670
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 680
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 690
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6a0
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6b0
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6c0
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6d0
    |uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU| 6e0
    |                                                                | 6f0
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 700
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 710
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 720
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 730
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 740
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 750
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 760
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 770
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 780
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 790
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7a0
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7b0
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7c0
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7d0
    |vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV| 7e0
    |                                                                | 7f0
     ----------------------------------------------------------------

The cache is direct-mapped, which means that memory addresses where the low bits are identical will compete for the same cache entry. By placing data (e.g. textures) in columns 320 through 511, it will remain in the cache even when the frame buffer is accessed.

VGA

The VGA generator consists of a frontend and a backend. The frontend reads pixels directly from SDRAM and writes them to a FIFO. Since each rasterline is stored in a single SDRAM row, the entire rasterline can be read in one burst. Between the lines, the frontend backs off so other parts of the system can access the memory.

The backend runs in a separate clock domain. At vertical blanking, it sends an asynchronous signal back to the frontend to trigger a new frame, and then it reads 320*240 pixels from the FIFO. Each row is stored in a buffer and emitted twice, since the VGA signal has 480 rows.

The address of the frame buffer is CPU-controlled, and Parallelogram uses triple buffering.

CPU

The CPU was written from scratch. I considered using an existing design, but it was more fun to do it myself, and I was able to take advantage of the added flexibility. For instance, at one point the demo was slightly larger than 16 KB, but I could fix this by adding some new instructions and a new addressing mode in order to make the code compress better.

The CPU doesn't need to be particularly fast, because most of the work is done by the pixel shaders. Hence, it is implemented without pipelining. There are eight general purpose 16-bit registers. Other registers include a program counter, a stack pointer, a 32-bit product register (accessed as a high and a low half) and status bits (zero and carry). These are accessed using special instructions.

Starting at address 0, there are three vector instructions, which are typically relative jumps: Boot, UART and timer. The boot instruction is executed at boot. The UART instruction is executed (after pushing the program counter) whenever a byte appears on the debug UART; this was used to load new code into the running system during development. The timer instruction gets executed (after pushing the program counter) every 10 ms, and controls music playback.

This is what the instruction set looks like:

    Instructions

    Move immediate high (d <- c * 32)

            00 ccc ddd ccccc ccc    movih   d, c

    Arithmetic/Logic

            01 000 ddd 00000 sss    add     d, s
            01 000 ddd 1cccc-ccc    addi    d, c
            01 001 ddd 00000 sss    adc     d, s
            01 001 ddd 1cccc-ccc    adci    d, c
            01 010 ddd 00000 sss    sub     d, s
            01 010 ddd 1cccc-ccc    subi    d, c
            01 011 ddd 00000 sss    and     d, s
            01 011 ddd 1cccc-ccc    andi    d, c
            01 100 ddd 00000 sss    or      d, s
            01 100 ddd 1cccc-ccc    ori     d, c
            01 101 ddd 00000 sss    xor     d, s
            01 101 ddd 1cccc-ccc    xori    d, c
            01 110 ddd 00000 sss    cmp     d, s
            01 110 ddd 1cccc-ccc    cmpi    d, c
            01 111 ddd 00000 sss    mov     d, s
            01 111 ddd 1cccc-ccc    movi    d, c

    Branch (o = signed offset relative pc)

            10 0 0001 oooooo-ooo    bgt     label
            10 0 0011 oooooo-ooo    bne     label
            10 0 0101 oooooo-ooo    bcc,bge label
            10 0 1010 oooooo-ooo    bcs,blt label
            10 0 1100 oooooo-ooo    beq     label
            10 0 1110 oooooo-ooo    ble     label
            10 0 1111 oooooo-ooo    bal     label

    Subroutine call

            10 1 0001 oooooo-ooo    cgt     label
            10 1 0011 oooooo-ooo    cne     label
            10 1 0101 oooooo-ooo    ccc,cge label
            10 1 1010 oooooo-ooo    ccs,clt label
            10 1 1100 oooooo-ooo    ceq     label
            10 1 1110 oooooo-ooo    cle     label
            10 1 1111 oooooo-ooo    cal     label

    Memory

            11 000 ddd ooooo sss    ld      d, s+o
            11 001 ddd ooooo sss    st      s+o, d

    I/O

            11 010 ddd 00ppp 000    in      d, p
            11 011 ddd 00ppp 000    out     p, d

    Vector jump/call (e = entry in global vector table)

            11 100 000 0 eeeeeee    jv      e
            11 101 000 0 eeeeeee    cv      e

    Load effective address (o = unsigned offset relative pc)

            11 101 ddd 1 ooooooo    lea     d, label

    Miscellaneous

            11 111 ddd 00000 000    push    d
            11 111 ddd 00001 000    pop     d
            11 111 000 00010 000    nop
            11 111 ddd 00011 sss    mul     d, s    Store result in special product register
            11 111 ddd 00100 000    stsp    d       Store d into stack pointer
            11 111 ddd 00101 000    prod    d, s    Store s:d in product register
            11 111 ddd 00110 000    jr      d       Jump to address in register
            11 111 ddd 00111 000    cr      d       Call address in register
            11 111 000 01000 000    ret
            11 111 ddd 01001 000    wait    d       Wait for status bit (blitter done, vblank...)
            11 111 ddd 01010 000    send    d       Transmit on debug UART
            11 111 ddd 01011 000    ldsf    d       Load d from status flags
            11 111 ddd 01100 000    stsf    d       Store d into status flags
            11 111 ddd 01101 000    initv   d       Set global vector table address

    Input ports

            000     product, low half
            001     product, high half
            010     status flags (blitter done, vblank...)
            011     uart receive buffer
            100     frame counter (global time)
            101     benchmark timer

    Output ports

            000     blitter row
            001     blitter column
            010     blitter width
            011     blitter height + start
            100     blitter program
            101     active video page [1..3]
            110     synth register select
            111     synth register data
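To make the bit layout concrete, here is a Python sketch that encodes a few of the simpler instructions. It assumes that the leftmost bit in the listing is bit 15 of the instruction word; the function names are invented for this example and have nothing to do with the real assembler.

```python
ALU_OPS = ["add", "adc", "sub", "and", "or", "xor", "cmp", "mov"]


def encode_alu(op, d, s):
    """Register form of an arithmetic/logic instruction: 01 ooo ddd 00000 sss."""
    return (0b01 << 14) | (ALU_OPS.index(op) << 11) | (d << 8) | s


def encode_in(d, port):
    """in d, p: 11 010 ddd 00ppp 000."""
    return (0b11 << 14) | (0b010 << 11) | (d << 8) | (port << 3)


def encode_out(port, d):
    """out p, d: 11 011 ddd 00ppp 000."""
    return (0b11 << 14) | (0b011 << 11) | (d << 8) | (port << 3)


print(hex(encode_alu("mov", 0, 2)))  # mov r0, r2
print(hex(encode_out(4, 3)))         # out 4, r3 (blitter program port)
```

Under that bit-order assumption, the two lines print 0x7802 and 0xdb20.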

And here is some example code, which implements signed multiplication — the CPU only provides unsigned multiplication.

    muls
            ; r2 * r3 -> r1:r0
            ; clobbers product register

            mul     r2, r3
            in      r1, 1
            mov     r0, r2
            add     r0, r0
            bcc     .muls_1
            sub     r1, r3
    .muls_1
            mov     r0, r3
            add     r0, r0
            bcc     .muls_2
            sub     r1, r2
    .muls_2
            in      r0, 0
            ret
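The trick in this routine is that a signed product differs from the unsigned one only in the high half: for each operand that is negative, the other operand has to be subtracted from the high word. A Python transcription (the helper `to_signed` is added here just for checking) makes it easy to convince oneself that this works:

```python
MASK = 0xFFFF


def muls(r2, r3):
    """Mirror of the assembly routine; r2 and r3 are 16-bit two's complement words."""
    product = (r2 & MASK) * (r3 & MASK)  # what the unsigned mul instruction computes
    r1 = product >> 16                   # in r1, 1 (product, high half)
    if r2 & 0x8000:                      # add r0, r0 sets carry when r2 is negative
        r1 = (r1 - r3) & MASK
    if r3 & 0x8000:
        r1 = (r1 - r2) & MASK
    r0 = product & MASK                  # in r0, 0 (product, low half)
    return r1, r0


def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


r1, r0 = muls(-3 & MASK, 7)
print(to_signed((r1 << 16) | r0, 32))
```

This prints -21, and the same holds for the corner cases such as -32768 times -32768.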

The demo is written in assembly language, so I obviously had to write my own assembler. It's quite limited — for instance, values must be either numeric constants or labels — but it was sufficient for my purposes. Shader code, which will be described presently, is inlined with the rest of the code and handled by the same assembler.

First shader running.

Blitter

The blitter is a coprocessor that executes a small shader program for each pixel in a sub-rectangle of memory. The work is distributed across ten identical shader cores, thus exploiting the parallel nature of the FPGA.

First, the CPU writes the address of some shader code into output register 4. This instructs the blitter to start copying the shader from main memory into local RAM blocks within each of the ten shader cores. The first word contains the size of the shader, and is followed by that many longwords (in little endian order) of shader instructions and data. Then, for any number of rectangles, the CPU loads the row, column, width and height into output registers 0 through 3, where the final write to register 3 starts the blitter operation. Before each operation, the CPU must ensure that the blitter has completed the previous job, by waiting on a status bit.
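The order of port writes can be summarised with a small Python mock. The port numbers are taken from the output port table; everything else (the `PortLog` class, the function names, and the choice to also wait before loading a new shader) is an assumption for illustration.

```python
PORT_ROW, PORT_COLUMN, PORT_WIDTH, PORT_HEIGHT_START, PORT_PROGRAM = range(5)


class PortLog:
    """Stand-in for the CPU's out and wait instructions; it just records what happens."""

    def __init__(self):
        self.writes = []

    def out(self, port, value):
        self.writes.append((port, value))

    def wait_blitter_done(self):
        self.writes.append(("wait", None))


def load_shader(io, shader_address):
    io.wait_blitter_done()               # assumption: don't swap shaders mid-blit
    io.out(PORT_PROGRAM, shader_address)


def blit(io, row, column, width, height):
    io.wait_blitter_done()               # previous job must have finished
    io.out(PORT_ROW, row)
    io.out(PORT_COLUMN, column)
    io.out(PORT_WIDTH, width)
    io.out(PORT_HEIGHT_START, height)    # this final write starts the operation


io = PortLog()
load_shader(io, 0x4000)
blit(io, 0x100, 0, 320, 240)
```

Once a shader is loaded, `blit` can be called any number of times with different rectangles.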

The shader cores deal with 32-bit words (longwords). Each core has a 256-word memory, where execution starts at address 0. The instruction set has a DSP-like flavour, because each instruction consists of several sub-instructions that are executed simultaneously. There are eight 32-bit registers, which are treated as 16.16 fixed-point numbers. Unlike the CPU registers, these are not general purpose. Registers r0 through r3 receive the results of simple ALU operations (add, xor etc.), r4 and r5 can be used to hold values (and are primed with the current x and y coordinates within the blitting rectangle), r6 contains the result of the latest multiplication and r7 contains the result of the latest shader RAM access. Of these, registers r0 through r5 keep their value unless it's explicitly modified by an instruction, whereas r6 and r7 are volatile and get clobbered unless you use them immediately after assigning them. Expressed in a different way, registers r6 and r7 get written at every clock cycle, regardless of whether there's an instruction in the shader assembly code describing what to put into them.
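For readers who haven't worked with 16.16 numbers: the upper 16 bits hold the integer part and the lower 16 bits the fraction, so a product of two such numbers has to be shifted down by 16 bits to get back to the same format. A minimal Python sketch, with invented helper names and with rounding towards negative infinity as an assumption about the hardware:

```python
ONE = 1 << 16  # the value 1.0 in 16.16 format


def to_fixed(x):
    return int(round(x * ONE))


def to_float(f):
    return f / ONE


def fx_mul(a, b):
    """Fixed-point adjusted product of two 16.16 numbers."""
    return (a * b) >> 16


print(to_float(fx_mul(to_fixed(1.5), to_fixed(-2.0))))
```

This prints -3.0. Python integers are unbounded, so the sketch ignores what happens when a result doesn't fit in 32 bits.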

Here's the shader instruction set:

    Instructions come in two varieties:

            : aop rd, ra, rb  : mv rd, rs  : mul ra, rb  : ld ..., ...

            1aaaaaaa aaaapppp ppccccrr rrrrrrrr

            a = alu op,
                    000 dd aaa bbb  register d becomes a & b
                    001 dd aaa bbb  register d becomes a + b
                    010 dd aaa bbb  register d becomes a - b
                    011 dd aaa bbb  register d becomes a | b
                    100 dd aaa bbb  register d becomes a ^ b
                    101 dd aaa bbb  register d becomes a min b
                    110 dd aaa bbb  register d becomes a max b
                    111 dd aaa bbb  register d is read from global ram
                                    at col, row according to registers a, b

            p = product op,
                    aaa bbb         register 6 becomes signed fixed-point
                                    adjusted product of registers a, b

            c = copy op,
                    0 sss           register 4 is read from register s
                    1 sss           register 5 is read from register s

            r = ram op,
                    0 aaaaaasss     register 7 is read from shader ram
                                    at aaaaaa00 + floor(register s)
                    10 aaaaaaaa     register 7 is read from shader ram at a
                    11 dddaaaaa     register 7 is trashed; register d is
                                    written to shader ram at 110aaaaa

            : aop rd, ra, rb  : endp rr  : jsr xyz

            0aaaaaaa aaaa---- ---sssss ssssssss

            a = alu op, same as before

            s = special op,
                    00000 --------  no operation
                    00001 --------  terminate with no pixel
                    00010 -----rrr  terminate with pixel according to register r
                    00100 --------  store sign bits of all registers into rSign
                    00101 --sssttt  r7 <- (rx[t] & 0xffff) ^ (rSign[sss]? 0 : 0xffff)
                    00110 iiiijjjj  add signed integer i to r4 and j to r5
                    10aaa aaaaarrr  jump to a if r >= 0
                    11aaa aaaaarrr  jump to a if r < 0

    Execution uses alternating fetch/execute cycles, where the execute
    part may be stalled when global ram is accessed.

    00000000 00000000 00000000 00000000 is a nop instruction.
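The two instruction formats can be pulled apart with a few shifts and masks. The following Python sketch assumes that the leftmost bit in the listing is bit 31; the function names and the labels "compute" and "control" for the two varieties are made up for this example.

```python
def decode(word):
    """Split a 32-bit shader instruction into its sub-instruction fields."""
    fields = {"alu": (word >> 20) & 0x7FF}      # 11 bits, present in both formats
    if word >> 31:
        fields.update(kind="compute",
                      product=(word >> 14) & 0x3F,  # 6 bits
                      copy=(word >> 10) & 0xF,      # 4 bits
                      ram=word & 0x3FF)             # 10 bits
    else:
        fields.update(kind="control",
                      special=word & 0x1FFF)        # 13 bits
    return fields


def decode_alu(alu):
    """Split the 11-bit alu field into (operation, d, a, b)."""
    return alu >> 8, (alu >> 6) & 3, (alu >> 3) & 7, alu & 7


# sub r0, r4, r7 combined with a ram op reading shader ram address $40
word = (1 << 31) | (0b01000100111 << 20) | (0b10 << 8) | 0x40
fields = decode(word)
```

Here `decode_alu(fields["alu"])` gives (2, 0, 4, 7), that is, operation 010 (subtract) with d = 0, a = 4 and b = 7.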

Here's an example shader for visualising the Julia set:

    sh_julia
            shader  .end

            :ld r7, .xmid
            :sub r0, r4, r7    :ld r7, .ymid
            :sub r1, r5, r7    :ld r7, .scale
            :mul r6, r0, r7    :ld r7, .scale
            :mov r0, r6    :mul r6, r1, r7    :st $d8, r4
            :mov r1, r6    :ld r7, .initcount
            :mov r3, r7    :mul r6, r0, r0    :st $d9, r5
            :mov r4, r6    :mul r6, r1, r1
            :mov r5, r6
    .loop
            ; square z
            :mul r6, r0, r1
            :add r1, r6, r6
            :sub r0, r4, r5    :ld r7, .c_re
            ; add c
            :add r0, r0, r7    :ld r7, .c_im
            :add r1, r1, r7    :mul r6, r0, r0
            ; determine length
            :mov r4, r6    :mul r6, r1, r1
            :mov r5, r6    :add r2, r4, r6    :ld r7, .limit
            :sub r2, r2, r7    :ld r7, .step
            :sub r3, r3, r7    :jpos r2, .break
            :jpos r3, .loop
    .break
            :ld r7, .topcount
            :sub r1, r3, r7    :ldd r7, .palette, r3
            :mov r1, r7    :jpos r1, .bg
            :emit r1
    .bg
            :skip

    .xmid           long    $00a00000
    .ymid           long    $00780000
    .c_re           long    $fffff000
    .c_im           long    $ffff8000
    .scale          long    $00000300
    .initcount      long    $00100000

                    shalign         ; aligns to 4-longword address, for ldd instruction

    .topcount       long    $000f0000
    .step           long    $00010000
    .limit          long    $00040000
                    long    #000    ; the '#' encodes a colour into a longword
    .palette        long    #000
                    long    #100
                    long    #211
                    long    #322
                    long    #433
                    long    #544
                    long    #655
                    long    #766
                    long    #877
                    long    #988
                    long    #a99
                    long    #baa
                    long    #988
                    long    #766
                    long    #544
    .end
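For comparison, here is roughly the same computation for a single pixel in Python, using the constants from the shader. It is a reading of what the shader appears to do rather than an exact emulation: the function names are invented, the rounding of the fixed-point product is an assumption, and register timing is ignored entirely.

```python
def fx_mul(a, b):
    return (a * b) >> 16  # 16.16 product; the rounding mode is an assumption


XMID, YMID = 0x00A00000, 0x00780000   # screen centre: 160.0 and 120.0
C_RE, C_IM = -0x1000, -0x8000         # $fffff000 and $ffff8000 as signed values
SCALE = 0x00000300
INITCOUNT, TOPCOUNT = 0x00100000, 0x000F0000
STEP, LIMIT = 0x00010000, 0x00040000


def julia_pixel(x, y):
    """Return the palette index for pixel (x, y), or None where no pixel is emitted."""
    re = fx_mul((x << 16) - XMID, SCALE)
    im = fx_mul((y << 16) - YMID, SCALE)
    count = INITCOUNT
    while True:
        # z <- z*z + c
        re, im = (fx_mul(re, re) - fx_mul(im, im) + C_RE,
                  2 * fx_mul(re, im) + C_IM)
        count -= STEP
        if fx_mul(re, re) + fx_mul(im, im) - LIMIT >= 0:
            break                     # jpos r2, .break
        if count < 0:
            break                     # jpos r3, .loop not taken
    if count - TOPCOUNT >= 0:
        return None                   # escaped on the first iteration: skip
    return count >> 16                # index into .palette, -1 is the entry before it
```

The top left corner escapes immediately and is left untouched, while the centre of the screen never escapes and ends up with index -1, the black entry stored just before the palette.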

A shader produces a single word of output, which gets stored at the predetermined memory position for which the shader was executed. Alternatively, the shader may choose to terminate itself without writing to memory. Writing is done to the external SDRAM directly, bypassing the cache, because in most situations the blitter will be constructing a frame buffer that will be consumed by the VGA generator (which also accesses the SDRAM directly), so there's no need to pollute the cache. However, when reading main memory, the blitter uses the cache, because many pixel computations typically depend on the same data, such as textures and the sine table. Sometimes (as in the shadebob effect), a shader depends on data written by earlier blits. In these situations, the CPU must invalidate the cache in between the blitter operations, in order to make the output from earlier blits visible.

Synthesiser

The final part of the logic design is a 16-channel, 4-op FM synthesiser with resonant low-pass filters on each channel, and a global echo facility. Each channel is independently controlled using 32 hardware registers, arranged as follows:

    00      osc 0 frequency, low word
    01      osc 0 frequency, high word
    02      osc 0 gain
    03      filter cutoff
    04      osc 1 frequency, low word
    05      osc 1 frequency, high word
    06      osc 1 gain
    07      filter resonance
    08      osc 2 frequency, low word
    09      osc 2 frequency, high word
    0a      osc 2 gain
    0b      left fader
    0c      osc 3 frequency, low word
    0d      osc 3 frequency, high word
    0e      osc 3 gain
    0f      right fader
    10      osc 0 amount of modulation from osc 0
    11      osc 0 amount of modulation from osc 1
    12      osc 0 amount of modulation from osc 2
    13      osc 0 amount of modulation from osc 3
    14      osc 1 amount of modulation from osc 0
    15      osc 1 amount of modulation from osc 1
    16      osc 1 amount of modulation from osc 2
    17      osc 1 amount of modulation from osc 3
    18      osc 2 amount of modulation from osc 0
    19      osc 2 amount of modulation from osc 1
    1a      osc 2 amount of modulation from osc 2
    1b      osc 2 amount of modulation from osc 3
    1c      osc 3 amount of modulation from osc 0
    1d      osc 3 amount of modulation from osc 1
    1e      osc 3 amount of modulation from osc 2
    1f      osc 3 amount of modulation from osc 3

Each operator is based on a sine oscillator which is phase modulated by a weighted sum of the (previous) output of each of the four operators. When an operator modulates itself, the result is noise. The filter then receives a weighted sum of the operators as input, and produces a mono output signal, which is panned and attenuated by two faders (left and right) to produce a stereo mix.
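The operator structure of one channel can be sketched in floating-point Python. The real design works on fixed-point numbers inside the FPGA and also has the filter and faders, which are left out here; the function name `render` and its parameters are assumptions for the example.

```python
import math


def render(freqs, modulation, gains, count, rate=44100):
    """Render `count` samples of one channel's operator mix (filter omitted).

    freqs[i] is the frequency of operator i in Hz, modulation[i][j] is how much
    operator j's previous output modulates the phase of operator i, and
    gains[i] is the weight of operator i in the sum fed to the filter.
    """
    phases = [0.0] * 4
    previous = [0.0] * 4
    samples = []
    for _ in range(count):
        outputs = []
        for op in range(4):
            offset = sum(modulation[op][src] * previous[src] for src in range(4))
            outputs.append(math.sin(2 * math.pi * phases[op] + offset))
        samples.append(sum(g * out for g, out in zip(gains, outputs)))
        previous = outputs
        phases = [(p + f / rate) % 1.0 for p, f in zip(phases, freqs)]
    return samples


no_mod = [[0.0] * 4 for _ in range(4)]
plain = render([441, 0, 0, 0], no_mod, [1, 0, 0, 0], 100)
```

With an all-zero modulation matrix, operator 0 is just a sine wave; putting a large value on the diagonal of the matrix makes an operator modulate itself, which is the noise case mentioned above.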

Channels 5 through 15 are connected to the echo buffer. This, as well as the interrupt rate and hence the tempo of the song, is hardcoded in the logic design, because there was no need to make it CPU-controllable for the Parallelogram soundtrack. The echo facility has a small input FIFO and a small output FIFO, but the bulk of the echo buffer is stored in main memory, which is accessed by stalling the CPU just before it's about to fetch an instruction. The left and right parts of the echo output are flipped and mixed into the final sound signal, as well as fed back into the echo buffer.

The synthesiser, as described above, is only concerned with what goes on at sample rate (44.1 kHz). The CPU then modifies these parameters at control rate (100 Hz), in order to implement e.g. envelopes for the operator modulation parameters. This playroutine also updates some global variables reflecting the song position, the current bass drum level and so on, which are then accessed by the visual effects.

C-One hooked up to a UART via an opto isolator.

Toolchain

Apart from the assembler mentioned above, I wrote a tracker which could emulate the FM synthesiser. This allowed me to compose the music interactively on my regular computer. Another tool converts the music data into binary data that can be accessed by the demo, specifically by the playroutine executing in the timer interrupt.

The assembled demo is compressed by a custom packer, and prepended with decompression code. This becomes the demo binary, and is used as initial RAM contents when compiling the FPGA core. However, during development, I didn't want to recompile the logic design for every little change in the demo software. After all, recompiling all the Verilog code and mapping it to the FPGA takes approximately 40 minutes (with ten shader cores and the highest optimisation settings). Hence, I placed a little bootloader in the UART interrupt, and wrote a communication tool to send a demo binary over a serial cable into the chip. The C-One (somewhat surprisingly) does not have a serial port, so I just attached some wires to the mdb bus which is accessible from the extender board.

Finally, to get a nice video capture, I designed a communication protocol for transmitting compressed video frames from within the FPGA over the UART to the computer, where they get uncompressed and stored as pnm files. First I ran the demo in realtime, transmitting the current system time whenever a frame was generated. This gave me a log of which frames were actually present: it wouldn't be honest to present a video capture with a higher frame rate than the actual hardware, and besides some of the effects are stateful and depend on the timing of earlier frames. The demo was then restarted in a non-realtime mode, where the host requests frames (using the log) and the demo computes all effects according to the communicated timestamps rather than the system clock.

Demo code

The demo itself is organised in a pretty straightforward manner. As mentioned, the first thing that happens is that the code is decompressed. Then, the synthesiser is initialised and the screen displays a solid blue framebuffer for a couple of seconds, to allow the monitor to synchronise. Then, the timer interrupt is enabled, starting music playback. A main loop reads out the current song position and advances along a script, where the different parts of the demo are described using code pointers (there's a song position, a setup routine, and a per-frame routine).
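In Python, such a script-driven main loop might look like the sketch below. The entry layout follows the description above, but the function name, the part names and the use of a log list in place of actual rendering are all hypothetical.

```python
def run_script(script, song_positions):
    """script: list of (start_position, setup, per_frame), sorted by start position.

    song_positions yields the song position read at the start of each frame.
    """
    log = []
    current = -1
    for position in song_positions:
        # Advance along the script when the song has reached the next part.
        while current + 1 < len(script) and script[current + 1][0] <= position:
            current += 1
            script[current][1](log)            # setup routine, run once
        if current >= 0:
            script[current][2](log, position)  # per-frame routine
    return log


script = [
    (0, lambda log: log.append("setup A"), lambda log, pos: log.append(("A", pos))),
    (4, lambda log: log.append("setup B"), lambda log, pos: log.append(("B", pos))),
]
result = run_script(script, [0, 2, 5])
```

Because the song position drives everything, a dropped frame only means that a per-frame routine runs less often; the parts still change in sync with the music.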

Most effects calculate some per-frame parameters in the CPU, store the resulting values right into a shader, load the shader into the blitter, then blit. There are utility routines for common functionality, such as invalidating the cache or computing A*sin(B*t+C) where t is the global time.

Standalone extender board

Since the demo runs entirely on the extender board, the C-One mainboard isn't necessary. To make the demo platform a bit more portable, I made my own mainboard replacement. It contains a microcontroller for reading the core image off an SD card and transmitting it to the FPGA at power-on, and it has a bunch of discrete components doing digital-to-analogue conversion of the audio and video signals.

However, the demo is fully C-One compatible, meaning that if you own a C-One you can simply drop the core file into your machine and run it.

Final words

This project was quite a ride, as it basically involved learning Verilog, FPGAs and hardware design. I did have some contact with FPGAs during my engineering education, but in those courses we would just modify existing VHDL code, and all the tricky parts had already been taken care of. Hardware bugs are quite different from software bugs, and it was very frustrating and rewarding to learn about all the gotchas the hard way. Looking back it has been very enjoyable. Hopefully this will also inspire other people to learn new skills and to build cool things!

Posted Thursday 12-Apr-2012 00:03