
Ok, let's make a small stack-based CPU.

I will start where the rubber meets the road - the PC/stack subsystem that I like referring to as the 'legs'. As usual, I will present a design with a twist.

Not having a large design team, deadlines and million-dollar fab runs when designing CPUs creates a truly different environment. I can actually sit at the kitchen table and doodle around with CPU designs to my heart's content. I can try really ridiculous approaches, and work without a plan, just to see what happens. When something interesting happens, I can adjust the rest of my design to fit. I am an artist, man!

The Legs

When normal people (that is, not artists :) build CPUs, they will generally designate a register as a Program Counter (PC) and use it to address memory. The PC needs to be incremented normally; in addition it must support jumps and calls, so it is generally constructed as a loadable counter.

For calls and returns, we use the Stack Pointer (SP) that addresses the memory, either the same one as the PC or a different one. SP can be either incremented or decremented.

The stack semantics dictate that the SP must be pre-decremented on push and post-incremented on pop (or the reverse). In spite of its apparent simplicity, this pre/post distinction can be tricky to implement. Some minimal implementations (the J1 stack processor, for one) give up and leave the post-increment for the next instruction (for the datastack anyway), leaving it up to the assembler to deal with the complexity.
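To make the pre/post distinction concrete, here is a minimal sketch of a conventional descending stack. It is not the design developed in this article; the module name `classic_stack`, its ports and its parameters are made up for illustration. The point to notice is the mux on the address: a push has to address SP-1 before SP itself is updated, while a pop addresses SP.

```verilog
// Hypothetical illustration only, not part of this article's design.
// Descending stack: push writes at SP-1 (pre-decrement),
// pop reads at SP and then increments it (post-increment).
module classic_stack #(parameter AW = 4, parameter DW = 16) (
    input           clk,
    input           push,
    input           pop,
    input  [DW-1:0] din,
    output [DW-1:0] dout
);
    reg [AW-1:0] sp = 0;               // power-up value is an assumption
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // One address port shared by reads and writes: the pre-decrement
    // has to be visible in the same cycle as the push.
    wire [AW-1:0] addr = push ? sp - 1'b1 : sp;

    assign dout = mem[addr];           // top of stack whenever push is low

    always @(posedge clk) begin
        if (push) begin
            mem[addr] <= din;          // write at the pre-decremented address
            sp        <= sp - 1'b1;
        end else if (pop) begin
            sp        <= sp + 1'b1;    // post-increment, after the read
        end
    end
endmodule
```

Even in this toy version, the address seen by the memory depends on which operation is in flight, which is exactly the kind of detail that minimal designs push off to the next instruction.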

The interaction of the PC incrementor and the return address that winds up pushed onto the stack is yet another source of complexity that is hard to describe until you try to implement it. Suffice it to say that you have to either push an incremented address or increment the popped address to avoid running the same instruction twice. It is amazing how many real processors implement the PC/stack pointer subsystem in a clumsy way.
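As a rough illustration of the first option (pushing an incremented address), here is a sketch of a conventional loadable PC. To keep it short, a single hypothetical `link` register stands in for the stack, so it handles only one level of call. The name `classic_pc` and all port names are assumptions for this example, not taken from any real processor.

```verilog
// Hypothetical sketch: conventional loadable PC with a one-level return "stack".
module classic_pc(
    input             clk,
    input             call,     // jump to target and remember the way back
    input             ret,      // go back to the remembered address
    input      [15:0] target,
    output reg [15:0] pc
);
    reg [15:0] link;            // stands in for the top of a real return stack

    initial begin               // zero start-up state is an assumption
        pc   = 16'd0;
        link = 16'd0;
    end

    always @(posedge clk) begin
        if (call) begin
            link <= pc + 16'd1; // save the address AFTER the call instruction
            pc   <= target;
        end else if (ret) begin
            pc   <= link;       // already incremented, so no second +1 here
        end else begin
            pc   <= pc + 16'd1; // normal sequential fetch
        end
    end
endmodule
```

Saving `pc` instead of `pc + 1` would return to the call instruction itself and run it again, unless the return path added the +1 instead.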

The traditional PC/SP implementation impacts the rest of the processor in a very significant way. Both the PC and the SP need to address memory, often simultaneously. Given that requirement, we are faced with a hard choice to make - either dual-port the RAM or require multiple cycles for instructions. Traditionally, the first choice is not an option, but with FPGAs we could do it easily (although I am loath to do so for other reasons). The second choice is not attractive either, as it incurs a significant speed penalty and increases the complexity of the design.

Decoupled Stack

Luckily, there is a third alternative: decouple the stack memory. There is little reason to keep the stack in the same memory space as the code or data, for minimal processors. Especially if you are not planning on running C on it, and I have little interest in that.

A distributed RAM can be implemented very compactly on Xilinx chips: a single slice can house two 16x1-bit RAMs, one in each LUT. This leads to a very compact stack memory: a 16-level 16-bit stack needs 16 such RAMs and takes up only 8 slices!

But wait, it gets better. Each half-slice also has free incrementor logic. With that, we can eliminate the PC register altogether, and use the memory addressed by SP as PC.

This arrangement makes subroutine calling really easy. We don't have to push anything - the PC is on the stack to start with!

There are consequences to this decoupled approach. Since the stack memory is outside the normal memory space, it is inaccessible to regular memory reads. For instance, you cannot take an address of data on the return stack. Running out of the stack without a separate PC also makes it entirely impossible to store data on the return stack - there simply is no pathway to move data there. This is a little traumatic, as even Forth uses the return stack sometimes to store data. However, there are workarounds.

Let's implement the legs. I will break up the functionality into small modules - the map report will show 'utilization by hierarchy' to let us identify how big each module is.

    /******************************************************************************
    A 16-bit 16-level stack memory. Infer a RAM16_S1.
    We write it every cycle with DIN and output DOUT, which may be incremented.
    ******************************************************************************/
    module STACKRAM(
        input C,
        input [3:0] A,
        input [15:0] DIN,
        output [15:0] DOUT,
        input inc
      );
      reg [15:0] ram[0:15];
      assign DOUT = ram[A] + inc;
      always @(posedge C)
        ram[A] <= DIN;
    endmodule

    /******************************************************************************
    A 5 bit stack pointer
    There is no penalty for using it as a 4-bit pointer
    ******************************************************************************/
    module SP(
        input C,
        input push,
        input pop,
        output [4:0] dout
      );
      reg [4:0] SP;     //Stack Pointer
      reg [4:0] newsp;
      //combinational: must also react to SP, not just to push and pop
      always @*
        case ({push,pop})
          2'b01: newsp = SP+1;
          2'b10: newsp = SP-1;
          default: newsp = SP;
        endcase
      always @(posedge C)
        SP <= newsp;
      assign dout = newsp;
    endmodule

    /******************************************************************************
    The complete PC/SP subsystem
    ******************************************************************************/
    module PC(
        input clk,
        input [15:0] in,    //input vector data
        input inc,          //when set, increment PC
        input vec,          //when set, accept vector
        input push,         //push new value onto stack
        input pop,          //return value (increment SP for next cycle)
        output [15:0] out,
        output [3:0] addr
      );
      //stack pointer - 2 slices...
      SP mysp(clk,push,pop,addr);
      wire [15:0] min;
      wire [15:0] mout;
      STACKRAM mem(clk,addr,min,mout,inc);
      //mux between direct input or old PC/inc
      assign min = vec? in : mout;
      assign out = min;   //output new address or inced old.
    endmodule

    ...
    reg pc_inc, pc_vec, pc_push, pc_pop;
    //non-blocking assignments: these registers are read by the PC module
    always @(posedge cpuclk) begin
      case (btn[3:1])
        3'b100: begin //jump
          pc_inc<=0; pc_vec<=1; pc_push<=0; pc_pop<=0;
        end
        3'b010: begin //call
          pc_inc<=0; pc_vec<=1; pc_push<=1; pc_pop<=0;
        end
        3'b001: begin //return
          pc_inc<=1; pc_vec<=0; pc_push<=0; pc_pop<=1;
        end
        default: begin //increment PC
          pc_inc<=1; pc_vec<=0; pc_push<=0; pc_pop<=0;
        end
      endcase
    end
    wire [3:0] sp;
    //switches for low 8 bits of vector
    PC mypc(cpuclk,{8'h00,sw[7:0]},pc_inc,pc_vec,pc_push,pc_pop,ab,sp);
    ...
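The same four control patterns can also be tried in a simulator. What follows is a minimal, self-contained sketch, not the code that ran on the board: `legs_sketch` is a flattened restatement of the stack RAM, stack pointer and mux with an `initial` block added, because the modules above have no reset and a simulator starts registers as X. The names `legs_sketch`, `legs_tb` and `step`, and the zeroed start-up state, are assumptions made for this example.

```verilog
`timescale 1ns/1ps

// Simulation-only sketch with hypothetical names.
module legs_sketch(
    input         clk,
    input  [15:0] in,
    input         inc, vec, push, pop,
    output [15:0] out
);
    reg [3:0]  sp;
    reg [15:0] ram [0:15];
    integer    i;

    initial begin                       // assumed start-up state: all zeros
        sp = 4'd0;
        for (i = 0; i < 16; i = i + 1) ram[i] = 16'd0;
    end

    // Next stack pointer; the RAM is addressed with this, not with sp.
    wire [3:0] nsp = (push & ~pop) ? sp - 4'd1 :
                     (pop & ~push) ? sp + 4'd1 : sp;

    wire [15:0] mout = ram[nsp] + inc;  // top of stack, optionally incremented
    assign out = vec ? in : mout;       // new address: vector or old PC (+1)

    always @(posedge clk) begin
        sp       <= nsp;
        ram[nsp] <= out;                // written back every cycle
    end
endmodule

module legs_tb;
    reg         clk = 0;
    reg  [15:0] in  = 16'd0;
    reg         inc = 1, vec = 0, push = 0, pop = 0;
    wire [15:0] out;

    legs_sketch dut(clk, in, inc, vec, push, pop, out);

    always #5 clk = ~clk;

    // Drive one control pattern for a full clock cycle and print the address.
    task step(input i, input v, input pu, input po, input [15:0] vector);
        begin
            inc = i; vec = v; push = pu; pop = po; in = vector;
            #1 $display("inc=%b vec=%b push=%b pop=%b  out=%h", i, v, pu, po, out);
            @(posedge clk);
            #1;
        end
    endtask

    initial begin
        step(1, 0, 0, 0, 16'h0000);  // increment
        step(1, 0, 0, 0, 16'h0000);  // increment
        step(0, 1, 1, 0, 16'h0040);  // call 0x0040
        step(1, 0, 0, 0, 16'h0000);  // increment inside the subroutine
        step(1, 0, 0, 1, 16'h0000);  // return
        $finish;
    end
endmodule
```

With the zeroed start-up state assumed here, the printed addresses should be 0001, 0002, 0040, 0041 and 0003: the address before the call was 0002, and the return comes back to 0003. Any Verilog-2001 simulator should do; with Icarus Verilog, for instance, `iverilog -o legs legs.v` followed by `vvp legs` runs it, assuming the code is saved as `legs.v`.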

| +mypc | | 10/24 |
| ++mem | | 9/9 |
| ++mysp | | 5/5 |

The listings above are the stack memory (STACKRAM), the Stack Pointer (SP), and finally the entire PC/SP module (PC). Pretty simple.

To test it, I implemented the design on a Digilent Spartan S3 board. I connected the 4-digit display to the address bus, 8 sliding switches to the vector input, and 3 buttons to signify jump, call and return. Running with a slow clock, I can watch my CPU incrementing the address, jumping or calling to a specified address, and returning to the original PC + 1! The button instructions are decoded into control wires by the fragment above, which also instantiates the PC module.

The tools report the size as 24 slices, which is pretty close to optimal. SP should really fit into 2 slices...

So there you have it. All that's left to do is to add the datastack, the ALUs and the instruction decoder....