This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

We’re at the point now where the CPU can run some more involved examples. The examples we’ve run to date on the simulator have been fairly simple, and more to the point, tailored to what we have available. I wanted to take a look back at the ISA, to see where we can make some worthwhile changes before moving forward.

Our more complex example code

Trivial 16-bit multiply!

It’s incredibly simple, again. But that’s because we are missing some pretty fundamental functionality from the TPU. Even this tiny example exposes the gaps.

The example I came up with is as follows:

1. Nominate a register for a stack location and set it.
2. Set up a simple stack frame to execute a multiply function which takes two 16-bit operands.
3. Call the ‘mul16’ function.
4. In mul16():
   - grab arguments from the stack
   - perform the multiplication
   - return our result in r0
5. Perform some sort of jump away to a safe place of code where we halt using an infinite loop.

This example, in code form, is similar to this:

ushort mul16(ushort a, ushort b)
{
  ushort sum = 0;
  while (b != 0)
  {
    sum += a;
    b--;
  }
  return sum;
}

main()
{
  ushort ret = mul16(3, 7);
  while (1)
  {
    ret |= ret;
  }
}

For this example, I defined r7 as the stack register. It was set to the top of our embedded ram block, and the stack will grow downwards. We need to store the two mul16 parameters, as well as our return address. As we address 16 bit words instead of the more typical 8-bit bytes, we only subtract 3 from the current stack pointer value. We then need to write in at various offsets our parameters:

sp = return PC

sp+1 = ushort a

sp+2 = ushort b

The first thing to notice is that we are writing these values to constant offsets from a register value, r7 (our SP). At the moment, our ISA can only write to an address held in a register, so we either perform writes and additions via a temporary register, or we implement new functionality in TPU.

Reads and Writes to memory with offset

Currently our write instruction takes a destination memory address specified in rA and a value to write specified in rB. The Read memory instruction is similar, but uses rD for the destination register, and rA as the address. This is due to rD being the only internal data select path into the register file.

Looking at the old instruction forms we have various unused bits that are enough to hold a significant offset value for our memory operations. In the case of the write instruction, these bits are non-contiguous, but we can solve that in the decoder. Our new read instruction looks like the following.

With our write instruction a little less clear coming in at

This is where having a 16-bit immediate data output from the decoder becomes useful. We extend the decoder to make the top 8 bits dependent on the instruction opcode, so that when a write is decoded, the immediate offset value is recombined ready for use by the ALU.

when OPCODE_WRITE =>
  O_dataIMM(15 downto 8) <= I_dataInst(IFO_RD_BEGIN downto IFO_RD_END) & I_dataInst(IFO_F2_BEGIN downto IFO_F2_END) & "000";
  O_regDwe <= '0';
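As a rough software model of that recombination, here is a minimal Python sketch. The bit positions are assumptions inferred from the assembled listing later in this post (rD in bits 11..9, the two flag bits in bits 1..0), and the function names are hypothetical, not part of the TPU sources.

```python
def write_offset_imm(inst: int) -> int:
    """Model the 16-bit immediate the decoder emits for a write instruction."""
    rd = (inst >> 9) & 0x7          # assumed rD field, bits 11..9
    f2 = inst & 0x3                 # assumed F2 flag field, bits 1..0
    top8 = (rd << 5) | (f2 << 3)    # rD & F2 & "000", as in the VHDL
    low8 = inst & 0xFF              # low byte of the instruction passes through
    return (top8 << 8) | low8


def offset_field(imm: int) -> int:
    """Extract the 5-bit offset the ALU reads from bits 15..11."""
    return (imm >> 11) & 0x1F
```

For example, the word 0x70E6 from the listing (write r7, r1, 2) gives an offset field of 2.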

The changes to the ALU are minimal, and we just do the inefficient thing of adding another adder. Knowing from the previous part that TPU currently takes up a tiny 3% of the Spartan6 LX25 resources, we can concentrate on getting functionality in rather than optimizing for space.

when OPCODE_WRITE =>
  -- The result is the address we want.
  -- First 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(15 downto 11)));
  s_shouldBranch <= '0';
when OPCODE_READ =>
  -- The result is the address we want.
  -- Last 5 bits of the Imm value is an offset.
  s_result(15 downto 0) <= std_logic_vector(signed(I_dataA) + signed(I_dataIMM(4 downto 0)));
  s_shouldBranch <= '0';

You can see the ALU code is very similar. We treat the 5-bit immediate as a signed value, as [-16, 15] is a wide enough range of offsets, and being able to offset back as well as forward will come in very handy.
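To make the signed offset concrete, the following Python sketch models the address calculation: sign-extend the 5-bit field, add it to the base, and wrap to 16 bits. The helper names are hypothetical and this is only a model of the arithmetic, not the ALU itself.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def effective_address(base: int, offset5: int) -> int:
    """Base register plus signed 5-bit offset, wrapped to a 16-bit address."""
    return (base + sign_extend(offset5, 5)) & 0xFFFF
```

With the stack pointer at 0x24, an offset field of 2 addresses 0x26, while a field of 0b11111 (which is -1) addresses 0x23.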

Calling Functions

Getting back to our example, we need to store the program location that we need to return to after executing our mul16 function. Amazingly, we didn’t have an instruction for getting the current PC, so this was impossible. It was very easy to add, though. The current PC is forwarded to the ALU – just use one of the two reserved opcodes we have free to define a set of special state operations.

The ALU code to serve these instructions is trivial.

when OPCODE_SPEC => -- special
  case I_dataIMM(IFO_F2_BEGIN downto IFO_F2_END) is
    when OPCODE_SPEC_F2_GETPC =>
      s_result(15 downto 0) <= I_PC;
    when OPCODE_SPEC_F2_GETSTATUS =>
      s_result(1 downto 0) <= s_result(17 downto 16);
    when others =>
  end case;
  s_shouldBranch <= '0';

The sstatus, or get status instruction, will be used to get overflow and carry status bits – which currently are not implemented.

Now that we can get the current PC value, we can use this to calculate the return address for our callee function to jump to on return. The assembly looks as follows.

start:
  load.l r7, 0x27   # Top of the stack
  load.l r1, 7      # constant argument 2
  load.l r2, 3      # constant argument 1
  subi r7, r7, 3    # reserve 3 words of stack
  write r7, r1, 2   # write argument at offset +2
  write r7, r2, 1   # write argument at offset +1
  spc r6            # get current pc
  addi r6, r6, 4    # offset to after the call
  write r7, r6      # put return PC on stack
  bi $mul16         # call
  addi r7, r7, 3    # pop stack

This creates a call stack for mul16 containing its two parameters, and the location it should branch to when it returns.
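The frame this sequence builds can be sketched in Python as below. It assumes spc yields the address of the spc instruction itself, which is what the +4 adjustment and the assembled addresses suggest. The memory is modelled as a plain dictionary, and build_frame is a hypothetical name.

```python
def build_frame(mem: dict, sp: int, a: int, b: int, spc_addr: int) -> int:
    """Model the caller's stack setup and return the new stack pointer."""
    sp = (sp - 3) & 0xFFFF       # subi r7, r7, 3: reserve 3 words
    mem[sp + 2] = b & 0xFFFF     # write r7, r1, 2
    mem[sp + 1] = a & 0xFFFF     # write r7, r2, 1
    mem[sp] = (spc_addr + 4) & 0xFFFF   # return PC: the instruction after the call
    return sp
```

Starting from a stack top of 0x27 with spc at address 0x0006, this leaves the stack pointer at 0x24 and a return PC of 0x000A, the address of the instruction that pops the stack.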

Immediate arithmetic

You may have noticed two new instructions in the above code snippet – addi and subi. These were added to account for the fact simply incrementing/decrementing registers needed an immediate load, which then used up one of our registers.

The add and sub instructions both have two unused flag bits, so one of them was used to signal immediate mode. In this mode, rD and rA are used as normal, but rB is disregarded, and the bits it frees up hold an unsigned immediate value. Note that the ALU code below reads 4 bits for the value (bits 4 downto 1), with bit 0 acting as the mode flag.

I took the decision to use only unsigned versions of this instruction, as I thought if someone was really interested in proper overflow detection, they wouldn’t mind taking the additional register penalty, and use the existing add instruction using a register.

In the VHDL, I again didn’t care about resources, and simply added yet another if conditional with adders.

when OPCODE_ADD =>
  if I_aluop(0) = '0' then
    if I_dataImm(0) = '0' then
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned('0' & I_dataB));
    else
      s_result(16 downto 0) <= std_logic_vector(unsigned('0' & I_dataA) + unsigned('0' & X"000" & I_dataIMM(4 downto 1)));
    end if;
  else
    s_result(16 downto 0) <= std_logic_vector(signed(I_dataA(15) & I_dataA) + signed(I_dataB(15) & I_dataB));
  end if;
  s_shouldBranch <= '0';

The last 8 bits in dataImm always contain the last 8 bits of our instruction word, so we just use that for both the immediate mode check (bit 0) and the value itself (bits 4 downto 1).
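A small Python model of the unsigned add path may help here. It mirrors the VHDL above: bit 0 of the immediate selects the mode, bits 4..1 carry the value, and the 17th result bit is returned as a carry. The function name and the tuple return are assumptions made for this sketch.

```python
def alu_add_unsigned(data_a: int, data_b: int, imm: int):
    """Return (16-bit result, carry bit) for the unsigned add."""
    if imm & 1:
        operand = (imm >> 1) & 0xF   # immediate mode: value in bits 4..1
    else:
        operand = data_b & 0xFFFF    # register mode: use rB
    result17 = (data_a & 0xFFFF) + operand   # 17-bit sum, like s_result(16 downto 0)
    return result17 & 0xFFFF, (result17 >> 16) & 1
```

The low byte of the assembled addi r6, r6, 4 word is 0xC9, so bit 0 is set and bits 4..1 hold 4. Adding that to 6 gives 10 (0x000A).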

The mul16 Function

Let’s recap the C style version of our function:

ushort mul16(ushort a, ushort b)
{
  ushort sum = 0;
  while (b != 0)
  {
    sum += a;
    b--;
  }
  return sum;
}

And in the TPU assembly written so far, our stack pointed to by r7 resembles the following:

The assembly code therefore, for the mul16 function, is as follows.

mul16:
  read r1, r7, 2
  read r2, r7, 1
  load.l r0, 0
mul16_loop:
  cmp.u r5, r2, r2
  bro.az r5, %mul16_fin
  add r0, r0, r1
  subi.u r2, r2, 1
  bi $mul16_loop
mul16_fin:
  read r6, r7, 0
  br r6

Pretty simple stuff, but again – a new instruction! bro.az = branch to relative offset when A is zero.
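For checking expected results, the loop can be modelled in Python with 16-bit wrapping. This is only a sketch that follows the register usage in the listing; mul16_model is a hypothetical name and the comparison and branch pair is collapsed into a plain while condition.

```python
def mul16_model(a: int, b: int) -> int:
    """Repeated-addition multiply with 16-bit wraparound, as in the assembly."""
    r1 = b & 0xFFFF   # read r1, r7, 2
    r2 = a & 0xFFFF   # read r2, r7, 1
    r0 = 0            # load.l r0, 0
    while r2 != 0:    # cmp.u / bro.az: leave the loop once r2 is zero
        r0 = (r0 + r1) & 0xFFFF   # add r0, r0, r1
        r2 = (r2 - 1) & 0xFFFF    # subi.u r2, r2, 1
    return r0
```

mul16_model(3, 7) returns 21. Because overflow is not checked, a product that does not fit in 16 bits simply wraps.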

Conditional Branch to relative offset

If you remember our previous parts discussing the conditional branching, and even our first part, you’ll remember that they could only branch to a target stored in a register. It was incredibly inefficient for small loops, taking up a register and bloating the code.

Before implementing relative offset branching, there was a need to make the conditional branching instructions more sane. The conditional bits in the instruction which form the type of condition were split and spread out in the instruction form, despite us not using the rD bits. This was changed, so we have a new instruction coding for conditional jumps:

With this now done, adding relative branch targets was fairly simple. The flag bit (8) is used to detect whether we branch to a register value or an immediate offset from the current PC:

The VHDL checks for the flag bit, and selects a different branch target.

when OPCODE_JUMPEQ =>
  if I_aluop(0) = '1' then
    s_result(15 downto 0) <= std_logic_vector(signed(I_PC) + signed(I_dataIMM(4 downto 0)));
  else
    s_result(15 downto 0) <= I_dataB;
  end if;

You can see the 5-bit immediate is signed, allowing conditional jumps backwards in the instruction stream. As any TIS-100 player will know, JRO’s backwards are very useful – especially in a multiplier 😉
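The target selection can be modelled in a few lines of Python. This sketch assumes the flag is passed in as a boolean and that the offset is relative to the PC of the branch instruction itself, which matches the assembled listing further down. The function name is hypothetical.

```python
def branch_target(pc: int, data_b: int, imm: int, relative: bool) -> int:
    """Pick the branch target: PC plus signed 5-bit offset, or the rB value."""
    if relative:
        offset = imm & 0x1F
        if offset & 0x10:          # sign bit set: negative offset
            offset -= 0x20
        return (pc + offset) & 0xFFFF
    return data_b & 0xFFFF
```

A branch at 0x0010 with an offset field of 4 lands on 0x0014, and a field of 0b11100 (which is -4) at 0x0013 jumps back to 0x000F.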

The full multiplier test

I’ve put the full multiplier assembly listing below, which is bulky but I think helps in understanding the flow.

start:
  load.l r7, 0x27   # Top of the stack
  load.l r1, 7      # constant argument 1
  load.l r2, 3      # constant argument 2
  subi r7, r7, 3    # reserve 3 words of stack
  write r7, r1, 2   # write argument at offset +2
  write r7, r2, 1   # write argument at offset +1
  spc r6            # get current pc
  addi r6, r6, 4    # offset to after the call
  write r7, r6      # put return PC on stack
  bi $mul16         # call
  addi r7, r7, 3    # pop stack
  bi $end

# Multiply two u16s. Doesn't check for overflow.
mul16:
  read r1, r7, 2
  read r2, r7, 1
  load.l r0, 0
mul16_loop:
  cmp.u r5, r2, r2
  bro.az r5, %mul16_fin
  add r0, r0, r1
  subi.u r2, r2, 1
  bi $mul16_loop
mul16_fin:
  read r6, r7, 0
  br r6

halt:
  bi $halt
end:
  or r0,r0,r0
  bi $end

If this test works, we should be able to see r0 containing the result of our multiply (21 or 0x15), and the waveform should show the shouldBranch signal oscillating as the end loop alternates between the or and the jump. If shouldBranch is high at all times, we know we’ve hit halt, so something isn’t quite right. I’ve not done typical calling convention things such as saving out volatile registers, but it’s easy to see how that would work. I’m sure those reading by now will be wondering how I get those assembly listings into my test benches in VHDL.

The TPU Assembler – TASM

I have written a 1-file assembler in C# for the current ISA of TPU. In its thousand lines of uncommented splendour lies an abundance of coding horrors – fit for the Terrible Processing Unit. It works perfectly well for what I want – just don’t look too deep into it.

I wrote this in a few hours early on in the project, because as you can imagine, writing out instruction forms manually is tedious. The assembler is very simple and is fully self-contained without any dependencies. It contains definitions for instructions, how to parse instruction forms, and how to write out their binary representation.

The functional flow for the assembler is as follows:

1. Parse arguments and open input file.
2. For each line in the input file:
   - if it starts with a ‘#’, ignore it as a comment
   - split the line into strings by whitespace and commas
   - if the first element ends with a ‘:’, treat it as a label and note its location
   - add the rest as instruction definitions to a list of inputs
3. For each input definition, replace label names with actual values.
4. Parse all definitions into a list of Operation Data objects.
5. Open output file.
6. Output the instruction data using a particular format generator.
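The label handling in that flow can be sketched in Python as a two-pass process. This is not TASM's code (which is C#). It assumes one memory word per instruction, that trailing ‘#’ comments are stripped as well, and that a relative label resolves to the target address minus the address of the instruction using it. resolve_labels is a hypothetical name, and instruction encoding is left out entirely.

```python
import re


def resolve_labels(source: str):
    """Pass 1 records label addresses; pass 2 substitutes $ and % references."""
    labels = {}
    inputs = []  # token lists; the list index is the instruction address
    for raw in source.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments (assumption: inline too)
        if not line:
            continue
        tokens = [t for t in re.split(r"[\s,]+", line) if t]
        if tokens[0].endswith(":"):
            labels[tokens[0][:-1]] = len(inputs)   # label points at the next instruction
            tokens = tokens[1:]
        if tokens:
            inputs.append(tokens)

    resolved = []
    for addr, tokens in enumerate(inputs):
        out = []
        for tok in tokens:
            if tok.startswith("$") and tok[1:] in labels:
                out.append("0x%04x" % labels[tok[1:]])      # absolute address
            elif tok.startswith("%") and tok[1:] in labels:
                out.append(str(labels[tok[1:]] - addr))     # relative offset
            else:
                out.append(tok)   # unknown names pass through unchecked
        resolved.append(out)
    return resolved
```

Like the flow described above, the sketch does no range checking on relative offsets.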

Assembler Features

The assembler accepts instruction mnemonics as per the ISA document, but will accept some additional ones – like add, which is simply treated as add.u.

There is a data definition (data/dw) which outputs 16-bit hex values directly to the instruction stream. Labels can be output as absolute ($ prefix) or relative (% prefix) values. There is currently no way to set the memory location of definitions – the first line is location 0x0000, and it continues from there.

Errors are not handled gracefully, and there is no real input checking. You could pass a relative offset into a conditional branch which is outside of the bounds of the instruction, and it will generate incorrect code. I’ll fix this stuff at a later date.

Output from the assembler is either binary, hex, or ‘eram’. The Embedded Ram (eram) format is basically VHDL initialization, with the original listing and offsets as comments. The example above assembles to the following:

X"8F27", -- 0000: load.l r7 0x27 # Top of the stack
X"8307", -- 0001: load.l r1 7 # constant argument 1
X"8503", -- 0002: load.l r2 3 # constant argument 2
X"1EE7", -- 0003: subi r7 r7 3 # reserve 3 words of stack
X"70E6", -- 0004: write r7 r1 2 # write argument at offset +2
X"70E9", -- 0005: write r7 r2 1 # write argument at offset +1
X"EC00", -- 0006: spc r6 # get current pc
X"0CC9", -- 0007: addi r6 r6 4 # offset to after the call
X"70F8", -- 0008: write r7 r6 # put return PC on stack
X"C10C", -- 0009: bi 0x000c # call
X"0EE7", -- 000A: addi r7 r7 3 # pop stack
X"C117", -- 000B: bi 0x0017
X"62E2", -- 000C: read r1 r7 2
X"64E1", -- 000D: read r2 r7 1
X"8100", -- 000E: load.l r0 0
X"9A48", -- 000F: cmp.u r5 r2 r2
X"D3A4", -- 0010: bro.az r5 4
X"0004", -- 0011: add r0 r0 r1
X"1443", -- 0012: subi.u r2 r2 1
X"C10F", -- 0013: bi 0x000f
X"6CE0", -- 0014: read r6 r7 0
X"C0C0", -- 0015: br r6
X"C116", -- 0016: bi 0x0016
X"2000", -- 0017: or r0 r0 r0
X"C117", -- 0018: bi 0x0017

And this is simply pasted into our VHDL ram objects. We need to pad it out to the correct size of the ram – but that is something I want to add as a feature, so you pass in the size of the eRAM and it automatically initializes the rest to zero. We can then simulate and see the TPU running well with the ISA additions.
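As an illustration of what such a padding feature might look like, here is a Python sketch of an eram-style formatter. It is not TASM's generator. to_eram is a hypothetical name, the line format copies the listing above, and leaving the comma off the final element is an assumption made so the output can sit inside a VHDL aggregate.

```python
def to_eram(words, listing, ram_words):
    """Format words as X"...." initialisers, zero-padded to ram_words entries."""
    if len(words) > ram_words:
        raise ValueError("program does not fit in RAM")
    lines = []
    for addr in range(ram_words):
        sep = "," if addr < ram_words - 1 else ""   # no comma after the last element
        if addr < len(words):
            lines.append('X"%04X"%s -- %04X: %s'
                         % (words[addr], sep, addr, listing[addr]))
        else:
            lines.append('X"0000"%s' % sep)         # padding word
    return "\n".join(lines)
```

Calling it with the 25 words above and the size of the RAM block would emit the same lines followed by zero words up to that size.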

Wrapping Up

I hope this has shown how easy it was to go in and fix some ISA mistakes made in the past and implement some new functionality. Also, it’s been nice to introduce TASM, despite the assembler itself being about as robust as a matchstick house.

The changes made to the VHDL have increased the resource requirement of the TPU on a Spartan6 LX25 from 3% to 5%, but an increase was expected given so many additional adders.

For next steps, I’m going to concentrate on the top-level VHDL entities for further deployment to miniSpartan6+.

Thanks for reading, comments as always to @domipheus.