In the article, Modern Forth, I focused on the impact of modern Forth compiler design on current register-oriented CPUs. In this article, I examine the relationship between software and silicon, and discuss a search for simplicity to improve performance and reduce chip size and power consumption.

For those of you whose software life is based around C and other "Pasgol" languages, shift your perspective of Forth and start thinking of it as a two-stack silicon machine. Forth compilers for conventional CPUs just map this model onto a register-oriented model. We will also see how to map the C virtual machine (VM) onto a two-stack VM.

Chips designed to run Forth well have been produced for more than 20 years, including the Novix NC4000, the Harris/Intersil RTX2000, and Silicon Composers SC32. There has been a flurry of cores for implementation in FPGAs, including MicroCore. Today, the state of the art is the 40-core SEAforth processor from IntellaSys. (Later in this article, I look at the C18 core and interconnects used in the SEAforth chips.) But first, I examine changes to the canonical Forth VM to achieve the goals of performance and code size.

Revisiting the Forth Virtual Machine

We will look again at the Forth and C VMs, see where the Forth VM is weak, and discover how to adjust it to improve execution of both languages. This leads to some understanding of why the IntellaSys C18 core is as it is.


The canonical Forth virtual machine is weak in several areas:

It does not execute C well, which is important for commercial exploitation of general-purpose silicon stack machines. C requires a frame pointer for access to local variables and buffers in main memory. The two stacks are not in addressable memory.

It is weak for DSP operations, which restricts performance in embedded applications without changes to the VM or increased compiler complexity.

Without index operations, dealing with complex data structures is cumbersome, especially when a base address is passed as an argument to a word/function.

DSP operations often require three or four parameters to be manipulated. For example:

source address, destination address and length,

first source address, second source address, destination address and length.

Canonical Forth requires ugly source code to deal with these situations. Several silicon implementations provide index and scratch registers, and others have provided more access to the top of the return stack. Using the top of the return stack as a loop counter has been common for a long time; for example, the FOR ... NEXT loop structure.
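To make the point concrete, here is a minimal sketch of a four-parameter operation (two sources, a destination and a length) written for a standard Forth with only the two stacks. The word name V+ and the test arrays are hypothetical, chosen for illustration; they are not from any of the chips discussed here.

```forth
\ Sketch: dst[i] = src1[i] + src2[i] using only the two stacks.
\ V+ is a hypothetical name used for illustration.
: V+  ( src1 src2 dst len -- )
  0 ?DO
    >R                      \ src1 src2          R: dst
    OVER @ OVER @ +         \ src1 src2 sum
    R@ !                    \ src1 src2
    CELL+ SWAP CELL+ SWAP   \ src1' src2'
    R> CELL+                \ src1' src2' dst'
  LOOP
  DROP 2DROP ;

CREATE XS  1 , 2 , 3 ,
CREATE YS  10 , 20 , 30 ,
CREATE ZS  3 CELLS ALLOT

XS YS ZS 3 V+
ZS @ .  ZS CELL+ @ .  ZS 2 CELLS + @ .   \ prints 11 22 33
```

Most of the loop body is stack shuffling rather than arithmetic: three live addresses have to be rotated through TOS, NOS and the return stack on every iteration.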

The Forth community has long talked about TOS (top of data stack), NOS (next/second on data stack) and TOR (top of return stack). These are not quite enough for DSP operations. Chuck Moore's current silicon includes A and B registers which are used both as index registers and for scratch storage. Efficient execution of C requires a frame pointer, and a spare index register is always useful. We end up with the model in Figure 2.

[Figure 2]

The A and B registers are used as scratch locations and for stepping through memory using auto-increment and auto-decrement addressing modes. The X and Y index registers have base+index addressing and can be used as frame and thread-local storage pointers. The X and Y registers are important for general-purpose CPUs, and are not implemented in the IntellaSys C18 core.
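The auto-increment behaviour can be emulated on a standard Forth to get a feel for it. The sketch below is an assumption-laden model, not the silicon: A-REG and B-REG are ordinary variables standing in for the registers, the step size is assumed to be one cell, and DOT is a hypothetical example word.

```forth
\ Software stand-ins for the A and B registers (assumed names).
VARIABLE A-REG   VARIABLE B-REG

: >A   ( addr -- )  A-REG ! ;
: >B   ( addr -- )  B-REG ! ;
: A@+  ( -- x )  A-REG @ @   1 CELLS A-REG +! ;   \ fetch via A, then step A
: B@+  ( -- x )  B-REG @ @   1 CELLS B-REG +! ;   \ fetch via B, then step B

\ Dot product: the two addresses live in A and B, not on the stack.
: DOT  ( addr1 addr2 n -- sum )
  >R >B >A  0              \ sum starts at zero      R: n
  R> 0 ?DO  A@+ B@+ * +  LOOP ;

CREATE XS  1 , 2 , 3 ,
CREATE YS  10 , 20 , 30 ,
XS YS 3 DOT .              \ prints 140
```

With the addresses held in A and B, the loop body contains no stack manipulation at all; only the running sum stays on the data stack.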

The impact of the A and B registers can be seen in this biquad filter implementation by Gary Bergstrom for a 16-bit embedded system. Gary commented on the previous article about Forth's return stack not getting in the way of parameters:

This has to be one of the most underrated points in Forth. Factoring words in Forth is natural and the lack of return addresses interspersed with the data allows this to be very efficient. In most languages you can't factor to the degree that you can in Forth without having severe run-time speed consequences. You can't keep passing data to lower and lower layers without building new stack frames, with the same data repeated in them, again and again.

$4000 constant +1.      \ -- n
\ Integer +1 in 2.14 fractional arithmetic format.

: *.            \ fr1 fr2 -- fr3
\ Fractional multiply.
  +1. */  ;

: 1STEP+        \ sum -- sum'
\ Perform a multiply/accumulate step, incrementing both
\ pointers.
  A@+ B@+ *. +  ;

: 1STEP-        \ sum -- sum'
\ Perform a multiply/accumulate step, incrementing the
\ coefficient pointer and decrementing the data pointer.
  B@+ A@- *. +  ;

: SHIFT2        \ fr --
\ The last step of the filter. The current data item
\ is shifted into the next data slot and replaced by fr.
  A@ SWAP A!+ A!+  ;

: (BIQUAD)      \ frx -- fry
\ The core of the biquad filter operation.
  DUP >R B@+ *.         \ initial sum = B0*input
  1STEP+ 1STEP-
  R> SHIFT2
  1STEP+ 1STEP-  ;

: BIQUAD        \ fx addr-filt addr-coef -- fry
\ A single order biquad filter.
  >B >A (BIQUAD) DUP SHIFT2  ;

: 2xBIQUAD      \ fx addr-filt addr-coef -- fry
\ A second order biquad filter.
  >B >A (BIQUAD) (BIQUAD) DUP SHIFT2  ;

In this example, the A and B registers are set up by the words >A and >B in BIQUAD. These registers are now parameters to the lower layers with no parameter passing overhead. Use of these registers has removed the need for local variables while permitting additional factorisation. They have also considerably reduced stack manipulation in both the source code and the compiled code. Because parameter passing is efficient, what would be inline code in other languages is encapsulated as factors, which in turn reduces code size. The importance of code density will become apparent in the next section.
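The 2.14 fractional format used by +1. and *. in the listing can be checked on any standard Forth with a 16-bit or wider cell. In this format a value n represents n/16384, so $4000 is 1.0 and $2000 is 0.5. The two definitions below are copied from the listing; the test values are my own.

```forth
$4000 CONSTANT +1.               \ 1.0 in 2.14 format (16384)
: *.  ( fr1 fr2 -- fr3 )  +1. */ ;

$2000 $2000 *. .                 \ 0.5 * 0.5:  prints 4096  ($1000, i.e. 0.25)
+1.   $3000 *. .                 \ 1.0 * 0.75: prints 12288 ($3000, i.e. 0.75)
```

The standard word */ keeps a double-length intermediate product, which is why the multiply does not overflow before the scaling division.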

The X and Y registers above show their worth in larger systems for indexed addressing into structures in memory. They will be used in a conventional Forth system to access local variables and buffers, and to provide a pointer to thread-local storage. One of them will be used as a frame pointer by a C compiler.
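As a rough illustration of base+index addressing, the sketch below models an X register in standard Forth. Everything here is hypothetical: X-REG is an ordinary variable, X+@ and X+! are assumed names for indexed fetch and store, and the two-field record is invented for the example.

```forth
\ Software stand-in for the X index register (assumed names).
VARIABLE X-REG
: >X   ( addr -- )       X-REG ! ;
: X+@  ( offset -- x )   X-REG @ + @ ;   \ fetch from base + offset
: X+!  ( x offset -- )   X-REG @ + ! ;   \ store to base + offset

\ A two-field record; offsets are in address units.
0        CONSTANT .COUNT
1 CELLS  CONSTANT .TOTAL

: ADD-SAMPLE  ( n -- )           \ the record base is implicit in X
  .TOTAL X+@ +   .TOTAL X+!
  .COUNT X+@ 1+  .COUNT X+! ;

CREATE STATS  0 , 0 ,
STATS >X  5 ADD-SAMPLE  7 ADD-SAMPLE
.COUNT X+@ .  .TOTAL X+@ .       \ prints 2 12
```

The base address is set once and never appears on the data stack again, which is the same property a C compiler relies on when it addresses locals relative to a frame pointer.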

These changes to the Forth VM improve code density and performance in Forth. They also permit two-stack machines to run C efficiently. A more in-depth look at this VM will appear in the EuroForth 2008 conference proceedings and on the EuroForth conference website in October 2008.