There are various FPGA projects which could benefit from a really small CPU core to handle things like loading ROMs from SD card. The Minimig project either uses an external microcontroller (for the original Minimig and now also the MiST board), or puts a second fully-fledged CPU into the FPGA itself. This is either a second instance of the TG68 or, in the case of Chaos’s DE1 port, an OpenRISC CPU.

The only problem with this approach is that the second CPU takes up valuable resources. The OpenRISC CPU is smaller than the TG68 but still takes up over 2,000 logic elements, so there’s definitely a need for a really small CPU core.

A working CPU is only half the battle, however: to be useful it also needs to be easy to program! Ideally what I’m looking for is a really tiny CPU, well under 1,000 logic elements, that has GCC support so it can be programmed in C. As it happens, one such CPU does exist: the Zylin ZPU.

The ZPU is an interesting design. There’s no register file: instead, it’s stack based, and the instruction set is split into compulsory and optional instructions which allows an implementation to trade speed for resource usage. Unimplemented instructions are emulated by way of an exception table stored in low memory, and if any particular emulated instruction turns out to be a performance killer for your particular application it’s easy to add support for just that one instruction to the CPU implementation.
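To make the emulation mechanism a little more concrete, here’s a minimal sketch of how the jump into that low-memory table can be computed. It assumes, from my reading of the ZPU documentation, that the emulated opcodes are the group 001xxxxx (0x20 to 0x3F), that each one gets a 32-byte slot, and that the core pushes the return address before jumping. The function and macro names are made up for illustration, so check the details against the core you’re actually using.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: opcodes 0x20..0x3F (binary 001xxxxx) form the emulated group. */
#define ZPU_EMULATE_FIRST 0x20u
#define ZPU_EMULATE_LAST  0x3Fu
/* Assumption: each emulated opcode owns a 32-byte slot in low memory. */
#define ZPU_VECTOR_STRIDE 32u

/* Returns non-zero if the opcode falls in the (assumed) emulated range. */
static int zpu_is_emulated(uint8_t opcode)
{
    return opcode >= ZPU_EMULATE_FIRST && opcode <= ZPU_EMULATE_LAST;
}

/* Address the core would jump to for an unimplemented opcode:
 * the low five bits of the opcode select the slot. */
static uint32_t zpu_emulate_vector(uint8_t opcode)
{
    assert(zpu_is_emulated(opcode));
    return (uint32_t)(opcode & 0x1Fu) * ZPU_VECTOR_STRIDE;
}
```

With that layout the code in each slot is ordinary ZPU code built from the compulsory instructions, which is why moving one hot instruction into hardware doesn’t disturb anything else.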

There are a number of different implementations of the CPU, the most interesting of which for this project is zpu4/zpu_core_small.vhd. This version of the core implements only the barest minimum subset of the CPU, and runs entirely from BlockRAM (more on that later).

The ZPU instruction set is really minimalist, which means you need a lot of instructions to do anything useful. You might think this would lead to very poor code density, and you’d be right if it weren’t for the fact that each opcode is only a single byte in size. A nice reference list of the opcodes can be found here: http://www.alvie.com/zpuino/zpu_instructions.html
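One place the single-byte opcodes show their cost is in loading constants. As I understand the instruction set, the IM opcode has its top bit set and carries 7 bits of immediate data: the first IM in a run pushes those 7 bits sign-extended, and each IM immediately following shifts the value left by 7 and ORs in 7 more. Here’s a rough sketch of that encoding under those assumptions; zpu_encode_im and zpu_decode_im are hypothetical helper names, not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a signed 32-bit constant as a run of IM opcodes (1 to 5 bytes).
 * Assumption: IM is 1xxxxxxx, with 7 payload bits per byte. */
static int zpu_encode_im(int32_t value, uint8_t out[5])
{
    int n = 1;

    /* Find the fewest 7-bit groups whose sign-extension reproduces value. */
    while (n < 5) {
        int bits = 7 * n;
        int32_t min = -((int32_t)1 << (bits - 1));
        int32_t max = ((int32_t)1 << (bits - 1)) - 1;
        if (value >= min && value <= max)
            break;
        n++;
    }

    /* Emit the groups most-significant first, each tagged with the IM bit. */
    for (int i = 0; i < n; i++) {
        int shift = 7 * (n - 1 - i);
        out[i] = (uint8_t)(0x80u | (((uint32_t)value >> shift) & 0x7Fu));
    }
    return n;
}

/* Decode a run of IM opcodes back into the constant they would push. */
static int32_t zpu_decode_im(const uint8_t *bytes, int n)
{
    assert(n >= 1 && n <= 5);

    /* First IM: sign-extend the 7-bit payload. */
    uint32_t v = bytes[0] & 0x7Fu;
    if (bytes[0] & 0x40u)
        v |= 0xFFFFFF80u;

    /* Following IMs: shift left by 7 and OR in the next payload. */
    for (int i = 1; i < n; i++)
        v = (v << 7) | (bytes[i] & 0x7Fu);

    return (int32_t)v;
}
```

Under these assumptions small constants such as 1 or -4 fit in a single byte, -128 takes two (0xFF 0x80), and a full 32-bit value needs five.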

A ready-built GCC toolchain can be found here: http://www.alvie.com/zpuino/download.html

So let’s try it out! For this test I’m going to use the DE1 board, and write a small program to write to the HEX display. To do this we need to define a hardware register for the program to poke. The program will look like this:

#define COUNTER *(volatile unsigned int *)(0xFFFFFF80)

int main(int argc, char **argv)
{
    int c = 0;
    while (1)
        COUNTER = c++;
    return 0;
}

Let’s compile this, and see what the ZPU assembly language looks like:

> zpu-elf-gcc -S countertest.c

This will generate countertest.s, as follows:

    .file "countertest.c"
    .text
    .globl main
    .type main, @function
main:
    im _memreg+12
    load
    pushsp
    im _memreg+12
    store
    im -1
    pushspadd
    popsp
    im 0
    nop
    im _memreg+12
    load
    im -4
    add
    store
.L2:
    im _memreg+12
    load
    im -4
    add
    load
    im _memreg+12
    load
    im -4
    add
    load
    im 1
    add
    im _memreg+12
    load
    im -4
    add
    store
    loadsp 0
    im -128
    store
    storesp 4
    impcrel .L2
    poppcrel
    .size main, .-main
    .ident "GCC: (GNU) 3.4.2"

Whoa! That’s a *lot* of assembly output for such a simple test case. Oh, but we didn’t specify optimization when compiling – try again:

> zpu-elf-gcc -O3 -S countertest.c

This time we get:

    .file "countertest.c"
    .text
    .globl main
    .type main, @function
main:
    im -1
    pushspadd
    popsp
    im 0
    storesp 8
.L2:
    loadsp 4
    im 1
    addsp 12
    storesp 12
    im -128
    store
    loadsp 4
    im 1
    addsp 12
    storesp 12
    im -128
    store
    impcrel .L2
    poppcrel
    .size main, .-main
    .ident "GCC: (GNU) 3.4.2"

OK, that’s more like it. I’ll pick this apart in more detail in a future post, but for now, here’s a git repo containing a DE1 demo project. The program running on the ZPU simply writes consecutive numbers to the HEX display. That of course happens far too quickly to observe, but you can freeze the display with either KEY0 or SW0, either of which will reset the project.

According to the Quartus fitting report, the ZPU’s resource usage is as follows:

Logic cells: 644, Dedicated logic registers: 234, Memory bits: 65536

The entire project, according to the summary, takes only 712 Logic Elements.

The memory usage is quite high (65,536 bits is 8 KB), but that’s because the stack, the program itself and any working RAM all live in BlockRAM in the zpu_small design. What I plan to explore is the possibility of using external ROM and RAM, with a much smaller BlockRAM for a fixed-size stack. That approach has apparently been used successfully in the ZPU-Extreme variant of the processor, though that version is much larger.