Lisp Machine Inc. K-machine:

The Deffenbaugh, Marshall, Powell, Willison architecture

as remembered by Joe Marshall

Abstract: The LMI K-machine was the last processor designed and built by Lisp Machine, Inc. Unlike other Lisp Machines, the K-machine is not descended from Tom Knight's original CONS architecture; the K-machine is an original design. This paper provides an overview of the more interesting elements of the K-machine architecture.

Introduction

The LMI K-machine was the last processor designed and built by Lisp Machine, Inc. The K-machine was designed in late 1985 and early 1986, and the first instructions were executed on December 31, 1986. The hardware was significantly debugged by April 1987, and an effort to port the legacy Lisp Machine code to the K-machine was underway. Financial difficulties kept the K-machine from reaching the marketplace. Lisp Machine's successor, Gigamos, was beset by legal difficulties which again prevented marketing of the K-machine processor. Nonetheless, several K-machine board sets were produced and debugged before the demise of Lisp Machine, Inc.

The K-machine processor is notable for several reasons: Unlike other Lisp Machines, the K-machine is not descended from Tom Knight's original CONS architecture; the K-machine is an original design. The K-machine is designed to run Lisp, and Lisp only. As a result, the performance of the K-machine on the Gabriel benchmarks is significantly higher than that of any of the several contemporary competing products of the mid-to-late 1980s, among them the TI Explorer chip, the Symbolics Ivory, and Lucid Common Lisp. In fact, the K-machine was competitive with Lisp implementations on the Cray 1 and the IBM 3670. Since the project was kept relatively secret during its development, very few people know the details of its architecture.

The authors of this paper were the primary architects of the original K-machine as designed and built by Lisp Machine, Inc. The hardware implementation of the architecture was primarily designed by Robert Powell, Bruce Deffenbaugh, and Kent Hoult. Subsequent minor hardware changes were made by the staff of Gigamos. Many people were involved in the project.

This paper provides an overview of the more interesting elements of the K-machine architecture. The K-machine contained its share of mundane elements such as instruction caches, virtual memory hardware, etc. Details about the mundane hardware are beyond the scope of this paper.

Architectural Principles

In any processor design there are fundamental "principles" from which the rest of the architecture derives. For the LMI K-machine, these were as follows:

Optimistic instruction execution

The processor was designed with the assumption that certain operations are common enough that no time should be wasted processing them. For example, when two numbers are added it is assumed that they are both fixnums. If this turns out not to be the case, the addition is aborted and the computation retried with the generic arithmetic code. This is in contrast to the LMI LAMBDA which did a type dispatch on arithmetic operands before attempting to operate upon them.
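This fast-path/abort-and-retry idea can be sketched in Python. This is an illustrative model, not K-machine code: the function names are invented, and the range check stands in for the hardware trap (the 24-bit fixnum width is the one described later in this paper).

```python
FIXNUM_BITS = 24  # K-machine fixnums are 24 bits wide

def fixnum_p(x):
    # True when x fits the (assumed two's-complement) fixnum range.
    return isinstance(x, int) and -(1 << (FIXNUM_BITS - 1)) <= x < (1 << (FIXNUM_BITS - 1))

def generic_add(a, b):
    # Slow path: the generic arithmetic code (bignums, floats, etc.).
    return a + b

def optimistic_add(a, b):
    # Fast path: issue the add assuming both operands are fixnums.
    # A failed check models the hardware aborting the instruction
    # and retrying with the generic routine.
    if fixnum_p(a) and fixnum_p(b) and fixnum_p(a + b):
        return a + b          # completes on the fast path
    return generic_add(a, b)  # aborted and retried generically
```

The point is that the common case pays nothing for the check: the type test happens in parallel with (or, here, before) the add, and only the rare mixed-type or overflowing case takes the slow path.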




Fully type-safe operation

All user operations are fully type-safe. Speed and safety flags are ignored by the compiler, and full safety is provided by the hardware.




Non-expert user assumption

A typical user of an LMI LAMBDA does not have the time necessary to learn how to squeeze high performance from that processor. We therefore designed the machine to optimize the patterns of usage found in code written by non-expert users. This is not to say that inexperienced programmers can expect high performance from poorly written code, but a pedagogic and abstract style of code need not be abandoned when coding a high-performance application.




32-bit objects

As in the CONS, CADR, and LMI LAMBDA, the K-machine supports 32-bit objects. In contrast, the Symbolics 3600 had 36-bit objects. In both the 3600 and the LMI LAMBDA, the type tag is kept in the pointer, so the address space, as defined by the length of a pointer, is significantly smaller in the LMI machines.



A 32-bit object admits a larger range of stock hardware upon which to base an implementation, and allows Lisp objects to be stored and manipulated on non-specialized hardware, for instance, the standard 32-bit bus found in most modern computers. Furthermore, the smaller address space simplifies the implementation of the virtual memory system and ephemeral GC by allowing a single level of page tables.



The drawbacks of using 32-bit objects are the inability to use full IEEE floating point, the restriction of integers to fewer than 32 bits of precision, and a considerably smaller address space. Additionally it is much more difficult to ensure that all objects are tagged at all times; generally, the tag bits must be stripped off objects before they are passed through the ALU, and the objects must be re-tagged before they are stored back in memory. This increases the burden upon the system programmer to ensure that the garbage collector only sees valid pointers.




RISC-like architecture

Certain RISC principles were adhered to: Instructions are atomic; there are no "instruction prefixes" or "string operations". There are no complicated addressing modes (although the low-order bit can be forced to 1 during a read cycle to be able to take CDR's without having to perform an add). All instructions complete in a single fixed-time clock cycle.
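The low-order-bit trick can be modeled in a few lines of Python. The layout is our assumption, chosen to show why OR-ing in the low bit replaces an add: cons cells sit on even word addresses, with the CAR in the even word and the CDR in the following odd word.

```python
# Illustrative memory model (assumed layout: cons cells double-word
# aligned, CAR at word addr, CDR at word addr|1).
memory = {}

def make_cons(addr, car_val, cdr_val):
    assert addr % 2 == 0      # cells are aligned on even word addresses
    memory[addr] = car_val
    memory[addr | 1] = cdr_val

def car(addr):
    return memory[addr & ~1]  # clear the low bit

def cdr(addr):
    # Forcing the low bit to 1 reaches the CDR word; no adder needed,
    # since addr|1 == addr+1 for an even addr.
    return memory[addr | 1]
```

For an even address, `addr | 1` and `addr + 1` are the same value, so the read cycle can form the CDR address with a single wired-OR rather than a pass through the ALU.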



Other RISC principles, such as a uniform instruction set, were discarded. The K-machine has a way to fit an ALU operation into virtually any instruction. There are even "short" conditional branches for tight loops that have enough bits in them to execute an arithmetic (but not a shift/rotate) instruction while branching.



Unlike some RISC machines, while the instruction set was "reduced" as above, complicated instructions that performed several operations in parallel were introduced so long as they could be completed in one clock cycle.



In retrospect, we inherited many of the advantages and disadvantages of RISC processors. We had simpler and cheaper hardware, interrupt processing was easier, and the compiler was easy to target. However, we also had very wide instructions (64-bits), and often the compiler could not find useful work to be done while waiting for condition codes to become available. This caused several NOPs to be inserted. We also found that the performance of the machine can be limited by the rate at which the instruction cache is loaded.




50ns clock cycle

Our target was a 50ns clock cycle. Each instruction was to complete within this time. The actual machine came in at 80ns. It turns out that the AMD 29332 computes a stable output some 20ns or so after the inputs become stable, except for one crucial line: the integer overflow line can take up to 47ns to stabilize. The critical timing path runs from the ALU receiving correct inputs to the ALU output stabilizing, plus a couple of gate delays to recognize an error condition, plus the time necessary to abort the current instruction fetch and set the program counter to 0.




3-address machine

The CONS, CADR, LAMBDA, and Symbolics 3600 emulated a stack machine with the microcode. While a stack machine is an easy architecture to compile to, there is a performance penalty to be paid. For instance, in the evaluation of a simple addition, 3 instructions must be used:

push arg1

push arg2

add

A register machine, by contrast, may accomplish the same operation in one instruction. A 3 address machine, however, requires a much wider instruction than a stack machine because the arguments to a stack machine operation are implicit. The K-machine has a 64-bit instruction word. An instruction word of this size creates some instruction bandwidth difficulties, but there are compensatory advantages in that the instruction can specify a number of operations to be done in parallel.
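The trade-off can be made concrete with a toy interpreter for each style. This is a Python sketch with invented instruction encodings, not K-machine semantics: the stack machine needs three instructions for the addition, the 3-address machine one.

```python
def run_stack(program, args):
    # Minimal stack machine: operands are implicit (on the stack).
    stack = []
    for op, *rest in program:
        if op == "push":
            stack.append(args[rest[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

def run_3addr(program, regs):
    # Minimal 3-address machine: each instruction names two sources
    # and a destination, so the same add is a single (wider) instruction.
    for op, dst, s1, s2 in program:
        if op == "add":
            regs[dst] = regs[s1] + regs[s2]
    return regs

stack_prog = [("push", "arg1"), ("push", "arg2"), ("add",)]  # 3 instructions
reg_prog   = [("add", "r2", "r0", "r1")]                     # 1 instruction
```

Each 3-address instruction carries three register fields the stack machine leaves implicit, which is exactly why the K-machine instruction word had to grow to 64 bits.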




No CDR codes

Analysis of the storage on existing Lisp Machines indicated that only about 20 percent of the storage was actually stored in lists (the rest being vectors, structures, or classes), and that about 20 percent of the lists benefited from CDR codes. CDR codes take their toll in increased memory references (the CAR must be examined before the CDR can be located) and reduced address space (2 bits per pointer reserved for every object in the system). By omitting them, one former CDR-code bit was added to the address space, and the other to the tag.
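As a back-of-envelope check of those figures (treating the two 20-percent numbers as independent fractions is our reading, not a claim from the measurements):

```python
# Fractions quoted from the storage analysis above.
list_fraction    = 0.20   # of all storage, held in lists
benefit_fraction = 0.20   # of list storage, actually helped by CDR codes

# Fraction of total storage that CDR codes actually help.
helped = list_fraction * benefit_fraction   # about 4 percent
```

So CDR coding could compress at most a few percent of total storage, while its 2 reserved bits taxed every pointer in the system, which is the argument for dropping it.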




"Level" implementation technology

The entire K-machine is implemented in standard CMOS and TTL. No ECL or other high-performance parts were used. The point of doing this is to be able to incorporate faster technology simply by replacing all the parts with their faster equivalents.





The final data layout is a 6-bit tag at the most-significant end of the (4-byte) word, and a 26-bit word pointer in the least significant end. This makes a 2^26-word (64 MWord) address space, or a 1/4 GB address space. Again, this is smaller than Symbolics, but nothing to sneeze at. The machine is correct-endian, i.e. little.
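The layout and the resulting address-space sizes are easy to check in a few lines of Python (the helper names are ours; the field widths are the ones just stated):

```python
TAG_BITS = 6    # tag in the most-significant bits
PTR_BITS = 26   # word pointer in the least-significant bits

def box(tag, ptr):
    # Pack a tag and a word pointer into one 32-bit word.
    assert 0 <= tag < (1 << TAG_BITS) and 0 <= ptr < (1 << PTR_BITS)
    return (tag << PTR_BITS) | ptr

def tag_of(word):
    return word >> PTR_BITS

def ptr_of(word):
    return word & ((1 << PTR_BITS) - 1)

WORDS = 1 << PTR_BITS   # 2^26 words = 64 MWords
BYTES = WORDS * 4       # 4-byte words: 256 MB = 1/4 GB
```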

The Pipeline

The K machine has a 4 stage pipeline. While this is transparent to the end user, and mostly transparent to the system programmer, there are occasions where the pipeline is exposed. Notes about this will be interspersed below.

The clock to the machine registers runs at twice the speed of the main processor clock and alternates between read and write cycles. This means that each instruction can both read from the registers and write back to them.

The four pipeline stages are as follows:

Fetch

The instruction is fetched from the instruction cache.



Read

The instruction source data are fetched from the register file, functional input bus, or instruction register (for immediate values).



ALU

The data are passed through the ALU and latched at the output register.



Write

The value in the output register is written back to the register file or to a functional destination.

An instruction may be aborted at any time before the output register value is written.

A four-stage pipeline generally means that branch instructions are not taken immediately. Usually, there are "branch delay instructions" that are unconditionally executed after the branch is specified. The K-machine branch delay is unusual in that the condition codes are set in the instruction before the conditional branch instruction and the branch target is specified in the instruction after the conditional branch. Thus, when the condition codes are generated by the ALU, the conditional branch is in the instruction register and the machine can then decide whether to follow the branch at this point. This ameliorates the effect of branch delay by spreading the branch across several instructions. Of course, the compiler hides the details of conditional branches.

The Primary Data Paths

Within the processor itself, the data paths are 33 bits wide. The extra "boxed" bit allows us to manipulate 32-bit non-pointer data (such as floating point numbers) directly. The 33rd bit is constructed on the fly from context when a pointer is read into the machine, and appropriate action is taken to "box" non-pointer data when it is written to memory.

The register file

Unlike the CONS derivatives, but much like a RISC machine, the K-machine has a register file consisting of 4096 33-bit words. This register file is duplicated to allow 2 simultaneous reads from different registers. Writes to the register file always update both copies. The output of the register file is latched and presented only to the ALU. In order to use the contents of a register as a virtual address or to write it to memory, it must be passed through the ALU. This incurs a one-cycle delay.

The register file is both read and written during each machine instruction, and thus requires fast RAM. Static RAM is used both for speed, and to allow single stepping of the K-machine hardware for debugging purposes.

The register file is addressed by the function calling hardware. This will be described in detail below.

The ALU

Integer unit

The K-machine was designed to use the AMD 29332 ALU. This is a 32-bit ALU that has quite a bit of functionality including a barrel shifter (for tag extraction), and 8, 16, 24, and 32-bit modes of operation. The decision was made to make fixnums 24-bits wide to take advantage of the natural ability of the ALU to handle this. This was a step backward, as the LMI LAMBDA had 25-bit fixnums, but the gain in performance was considered more important.

Floating point unit

In parallel with the integer ALU (the AMD 29332), is a floating point adder and a floating point multiplier. The output register, which holds the value to be written back to the registers, may receive its value from any of these sources.

Type checking unit

Neither the CADR nor the LMI LAMBDA had hardware for checking the types of operands in parallel with the actual computation done by the ALU. It is my understanding that the Symbolics 3600 has a type checking unit, but we are unfamiliar with it. We invented a simple, but effective, type checking hardware module that operates in parallel with the arithmetic units. This hardware consists of a 256 Kbit x 1 bit RAM. This RAM is addressed by 7 bits each from the left and right operands of the ALU (the boxed and tag bits), two bits from the instruction to specify the type check operation (for instance, operands must be boxed fixnums, or left operand is an aggregate data structure and the right operand is a fixnum), and two bits from a mode register (this allowed the program to change type checking strategies when garbage collecting, for instance). These 7+7+2+2 bits are concatenated to form an 18-bit address for the type check RAM. The single-bit output indicates whether the operation should succeed or cause a type check error. This turns out to be an elegant, simple, and inexpensive mechanism for parallel type checking.
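The address formation can be sketched as follows. This is a Python model: the ordering of the fields within the address and the polarity of the output bit are our assumptions, since only the field widths (7+7+2+2) are given.

```python
def typecheck_address(left7, right7, op2, mode2):
    # Concatenate 7 bits per operand (boxed bit + tag), 2 instruction
    # bits (which check to perform), and 2 mode-register bits into an
    # 18-bit RAM address. Field order here is an assumption.
    assert left7 < 128 and right7 < 128 and op2 < 4 and mode2 < 4
    return (((left7 << 7 | right7) << 2 | op2) << 2) | mode2

RAM_SIZE = 1 << 18           # 256K single-bit entries
ram = bytearray(RAM_SIZE)    # 1 = trap, 0 = proceed (polarity assumed)

def check(left7, right7, op2, mode2):
    # One RAM lookup, in parallel with the arithmetic units.
    return ram[typecheck_address(left7, right7, op2, mode2)]
```

Because the whole policy lives in a 2^18 x 1 lookup table, changing the type system (or the GC-time checking strategy, via the mode bits) is a matter of reloading the RAM, not rewiring logic.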

The type check RAM is loaded at boot time by presenting all possible combinations to the RAM (by sending synthesized data through the ALU). Traps are turned off during programming, and a special control register toggles the write line at the appropriate time as the data flies by. It is tricky, but it avoids the necessity of adding special data paths in order to program the type checking hardware. By implementing the checking in a RAM, we can experiment with which types are represented, and even add new hardware tags on the fly.

Boxed unit

A small piece of logic computes the boxed bit for the ALU result. The instruction can specify that the result is forced to be boxed, forced to be unboxed, is to have the same boxedness as the left-hand operand, or is to have the opposite boxedness as the left-hand operand (this last one has no discernible use, but it is what the hardware does).

Interconnection

As mentioned previously, the output of the register file is latched and presented to the inputs of the ALU. The output of the ALU is similarly latched (by the output register) and written back to the register file in the next instruction. Thus the primary data paths in the machine are composed of a two-stage pipeline.

A set of address comparators enable a pass-around path when the result of an ALU operation is needed immediately for the next computation. This relieves the compiler from part of the instruction scheduling burden. In general, the compiler can ignore the effect of the pipeline except when functional sources and destinations are involved.
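A minimal model of those comparators (Python sketch; the names are ours): if the register an instruction reads is the one the previous instruction is about to write, the value is taken from the ALU output register instead of the not-yet-updated register file.

```python
def read_operand(src, regfile, prev_dest, output_reg):
    # Pass-around (forwarding) path: the comparator matches the
    # source address against the previous instruction's destination.
    if prev_dest is not None and src == prev_dest:
        return output_reg      # forward the in-flight result
    return regfile[src]        # otherwise a normal register-file read
```

With this path in hardware, the compiler does not have to insert a NOP between an instruction and a dependent successor; the two-stage register/ALU pipeline is invisible for ordinary register-to-register code.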

There are additional points to inject and extract data from the main data paths. Functional destinations can be written with the value of the output register. These consist of various control registers, the memory maps, and the virtual memory address and data registers. Functional sources may be injected into the right hand side of the ALU input registers. Additionally, the virtual memory data may be injected directly into the output register (see below for why).

Like the CADR, but unlike the LAMBDA, the functional busses (called the MFI and MFO bus in the CADR) are not connected. The MFI bus passes data destined for the processor; the MFO bus passes data sourced from the processor. This admits the possibility that some inconsistency may arise in a read-write register (for instance, the memory data register), but the difficulty in "turning around" the bus, quite evident in the LMI LAMBDA, is easily avoided in this manner.

Because the functional destinations are located beyond the output register, the effect of the pipeline is exposed to the compiler and care is taken to ensure that enough time is allowed for the data to arrive at its destination before an attempt is made to use it; in particular, it is important to not attempt to use the virtual memory data register until two instructions after initiating the read. Like the CADR and LAMBDA, the K-machine has interlocks to stall the pipeline while the processor awaits data from the memory system; the two instruction rule is to make sure that the request to read from the memory system precedes the request for the value read.

As mentioned before, the virtual memory data can be injected into the pipeline at the output register rather than at the right hand input to the ALU. The unusual placement of the injection point of the memory data saves time; memory data is usually written to a register and this will bypass a pointless trip through the ALU.

Since the functional destinations are delayed by a clock tick, certain operations, such as loading the type checking RAM, can take advantage of this delay by specifying that the data is to be written, and providing said data in the next instruction in the stream. The data will therefore be at the correct point in the pipeline when the write pulse is generated by the delayed functional destination. This is a bit of a hack, but it allows us to use a baroque set of instructions to program the hardware without requiring any extra data paths. (Of course, this can only be done with interrupts off.)

The instruction register can also be injected into the right hand side of the ALU. This allows us to embed immediate constants into the instruction stream. However, since the immediate constants must be part of an instruction, there is no room for an ALU operation to be specified when this mechanism is used. The hardware forces the ALU to perform a pass-right instruction, so full-width immediates must be loaded into a register before any operation can take place (note that if the immediate were loaded to a functional destination it would simply pass through the ALU on its way to the functional output bus).

A special case is made for immediate fixnums. These are represented in the instruction stream as untagged immediate values. The hardware generates the fixnum tag on the fly as it is injected, and the 8 bits saved (remember that fixnums are 24 bits), allow for an ALU operation to be specified.

The PC can be loaded from the output of the ALU. This allows computed branches to be executed.

Function calling hardware

While the LMI LAMBDA was advertised as having special hardware to speed function calls, it had no particular piece of hardware devoted to rapid function calling. No machine that I am aware of had function calling hardware like the K-machine.

Register addressing

The function calling mechanism is intimately tied to the register addressing scheme. A register address consists of 6 bits. Two bits select a "frame" and 4 bits provide an offset within that frame. Each frame therefore has 16 general purpose registers.
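A sketch of the decode (Python; taking the frame-select bits as the high bits of the address is our assumption, since only the field widths are given):

```python
def decode_register(addr6):
    # 6-bit register address: 2 bits select one of the 4 frames
    # visible to the instruction, 4 bits select one of the frame's
    # 16 registers.
    assert 0 <= addr6 < 64
    frame  = addr6 >> 4    # which of the 4 visible frames
    offset = addr6 & 0xF   # which of the 16 registers within it
    return frame, offset
```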

There are 4 frames available to each instruction (more below). Analysis of the LMI LAMBDA showed that most functions have one or two arguments and one or two local parameters. The unusual functions that overflow the register frame are handled in software. The compiler generates the calls to the register spilling code.

The register frames are not overlapped as in other RISC machines. This allows us to allocate and free register frames in a non-stack (LIFO) manner. The advantages of this are discussed below.

Unlike the CADR, LAMBDA, and 3600, registers do not have a virtual address. In the LAMBDA, virtual memory accesses to the stack cause a load from the stack cache. In the K-machine, it is not possible to address register values in this way. Thus it is not possible to create a pointer into the stack.

The 4 frames available to each instruction are as follows:

Global

A set of 16 global frames is allocated for use by all functions. Each instruction includes a 4 bit global frame selector, so only 16 of the 256 global registers are available within a single instruction. Global registers contain data such as commonly used constants (like 1, 0, NIL, -1, etc.) to avoid using the costly load immediate instruction, pointers to wired memory (like the page tables), etc.

Active

This frame contains the arguments, locals, and temporaries of the currently executing function.

Open

This frame is the one currently being prepared for the next function call.

Return

This frame is the one that was used by the last function call. Multiple values can be returned in this frame, but single values are returned via a different mechanism. In retrospect, this frame causes more problems than expected as it needs to be saved and restored on each interrupt.

Call sequence

The K-machine uses a "3-point" calling sequence. A new open frame is allocated, then it is filled with arguments, then control is transferred to the callee. When the callee returns, the open frame that was allocated is discarded (actually moved to the RETURN frame). While it may not sound like it, this implements a callee saves convention.

When a function is called as a subproblem, a continuation must be created (and pushed) in order for the function to return a value. Note, however, that the continuation does not need to be in a register. In fact, it is rare for a continuation to be manipulated as a value, so a separate "hidden" stack is used to hold it. This allows us to manipulate continuations in parallel with operations that use the register set.

The continuation has 4 parts: the caller's active frame, the caller's previous open frame, the return PC, and the return destination. The actual control sequence is complicated, but I'll make a stab at explaining it:

Open a frame

The open instruction saves the current values of the OPEN and ACTIVE register frame pointers on the hidden stack and loads the OPEN register with a new frame pointer.



Frame pointers are kept in a hardware freelist. This freelist is popped when a frame is opened, and it is pushed when a function returns. When the freelist is close to empty, an interrupt is generated to dump the stack to memory. The oldest frame on the stack contains a trampoline that reloads the stack from memory.




Multiple open statements are allowed. Each one will push the previous value of the OPEN register on the hidden stack. This means that temporaries necessary for building a stack frame need not be allocated from the currently active frame, or copied from temporary locations to the argument locations at call time.



Call a function

The call instruction saves the return PC on the hidden stack, loads the ACTIVE frame pointer from the OPEN frame pointer, and transfers control to the new function.




Return from a function

The return instruction moves the current RETURN frame pointer to the frame freelist, moves the ACTIVE frame pointer to the RETURN register, restores the caller's OPEN and ACTIVE registers, restores the PC, and transfers control back to the caller. Note: when a function is called, the callee is responsible for destroying its own frame, thus the value of the OPEN register in the caller is the value it had before the OPEN statement that created the callee's frame.
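The open/call/return sequence can be modeled in a few lines. This is a Python sketch: the register names follow the text, but the data structures (Python lists for the hidden stack and the frame freelist) are our modeling choice, not the hardware's.

```python
class CallHardware:
    def __init__(self, nframes=16):
        self.freelist = list(range(nframes))  # hardware frame freelist
        self.hidden   = []                    # hidden continuation stack
        self.OPEN = self.ACTIVE = self.RETURN = None

    def open(self):
        # Save the caller's OPEN/ACTIVE pointers on the hidden stack
        # and allocate a fresh frame into OPEN.
        self.hidden.append((self.OPEN, self.ACTIVE))
        self.OPEN = self.freelist.pop()

    def call(self, return_pc):
        # Save the return PC; the frame just filled becomes ACTIVE.
        self.hidden.append(return_pc)
        self.ACTIVE = self.OPEN

    def ret(self):
        # Free the old RETURN frame, expose the callee's frame as
        # RETURN, restore the caller's OPEN/ACTIVE, resume at the PC.
        if self.RETURN is not None:
            self.freelist.append(self.RETURN)
        self.RETURN = self.ACTIVE
        pc = self.hidden.pop()
        self.OPEN, self.ACTIVE = self.hidden.pop()
        return pc
```

Note how ret leaves the callee's frame visible as RETURN for one more call's worth of time (for multiple values), which is why a frame is recycled one return late.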





Return destination

In addition to all the above activity, the call instruction specifies a "return destination" for the computed value. Rather than return a value to a canonical location and then move the value to the desired register, a register address is specified in the call instruction, and the return value is loaded into that register when the function returns.

The open, call, and return instructions can each be executed in one clock cycle.

For the sake of illustration, here is how the compiler might generate some code:

LISP: (defun xxx (f0 f1 z0 b1 b2 f3)
        (foo f0 f1 (bar (baz z0) b1 b2) f3))

OPEN            ; open a frame for call to foo,
O0 <- A0        ; load open register 0 with contents of active
                ; register 0 (f0)
O1 <- A1        ; load open register 1 with contents of active
                ; register 1 (f1)
OPEN            ; open a frame for call to bar,
OPEN            ; open a frame for call to baz,
O0 <- A2        ; load open register 0 with contents of active
                ; register 2 (z0)
O0 <- CALL BAZ  ; call function BAZ, result to appear in slot 0
                ; in the frame for BAR.
O1 <- A3        ; BAZ's open frame is gone, result of call is in
                ; open register 0, load open register 1
                ; with contents of active register 3 (b1)
O2 <- A4        ; load open register 2 with contents of active
                ; register 4 (b2)
O2 <- CALL BAR  ; call function BAR, result to appear in slot 2
                ; of frame for FOO
O3 <- A5        ; BAR's open frame is gone, result of call is in
                ; open register 2, load open register 3
                ; with contents of active register 5 (f3)
CALL FOO        ; and away we go.

12 instructions. Now, you might think that this is pretty good code, but we can do better. The K-machine supports the following optimizations:

Call hardware operations proceed in parallel with register operations. The register fetches occur before the call hardware operation completes, and the register writeback occurs in the instruction after the call hardware operation completes. After pondering this for a while, we convinced ourselves that this is the desired behavior.





open-call instructions can open a frame in parallel with calling the function. This, in combination with the parallel moves, allows one-argument functions to be called in one instruction.




A special OPEN on RETURN destination opens a new frame when returning from a call. This, in combination with the parallel moves, allows a lazy opening strategy; the frame opening happens when the first argument to be placed in the frame is computed.




Tail recursion is implemented in hardware. The special TOPEN instruction opens a new frame without pushing the current OPEN and ACTIVE frame pointers on the hidden stack (in essence, not creating a new continuation), and the special tail-call instruction transfers control to the callee (like a jump), but because no RETURN is executed for the current frame, the current ACTIVE frame is immediately returned to the hardware freelist. This is necessary in order not to leak frames.



Note that a tail recursive call done like this violates the LIFO order of the stack, thus the hardware freelist is truly a freelist and the frames may become arbitrarily shuffled.



Also note that it is not necessary to smash the arguments in the current frame to implement tail recursion. In the worst case, this can add several move instructions to a tail-recursive procedure. By using the hardware, it is often the case that tail recursive calls are faster than the equivalent loop that mutates the arguments.



Finally, note that the return destination is stored in the hidden stack. This means that not only does the callee not know where the returned data is going, but also that the caller does not know which function returned the value! In other words, arbitrary tail recursion is supported by the hardware.
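The essential property of TOPEN/TCALL is visible even in a minimal sketch (class and method names invented here): a tail call pushes no continuation, and the caller's ACTIVE frame goes straight back to the freelist, so arbitrarily deep tail recursion never grows the hidden stack:

```python
# Toy sketch of TOPEN/TCALL (names illustrative).  A tail call reuses
# the continuation already on the hidden stack instead of pushing a
# new one, and recycles the caller's ACTIVE frame immediately.

class TailDemo:
    def __init__(self, nframes=3):
        self.freelist = list(range(nframes))
        self.OPEN = self.ACTIVE = None
        self.stack = []            # hidden stack of saved continuations

    def topen(self):
        # TOPEN: allocate an argument frame WITHOUT saving OPEN/ACTIVE;
        # the eventual RETURN will use whatever continuation is already
        # on the hidden stack, skipping this caller entirely.
        self.OPEN = self.freelist.pop()

    def tcall(self):
        # TCALL: jump to the callee; the current ACTIVE frame can never
        # be returned to, so it is freed at once (no frame leak).
        if self.ACTIVE is not None:
            self.freelist.append(self.ACTIVE)
        self.ACTIVE = self.OPEN

demo = TailDemo(nframes=3)
demo.topen(); demo.tcall()         # first tail call
for _ in range(1000):              # arbitrarily deep tail recursion...
    demo.topen(); demo.tcall()     # ...in constant space
assert demo.stack == []            # no continuations accumulated
assert len(demo.freelist) == 2     # only one frame in use at a time
```

Running the loop also shows the freelist shuffling that the text mentions: frames come back in an order unrelated to their allocation order, which is why the freelist must be a true freelist rather than a stack pointer.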

Here is what the code looks like with the above optimizations:

LISP: (defun xxx (f0 f1 z0 b1 b2 f3)
        (foo f0 f1 (bar (baz z0) b1 b2) f3))

TOPEN, O0 <- A0     ; open a frame for tail recursive call to foo;
                    ; in parallel, load open register 0 with contents
                    ; of active register 0 (f0)
O1 <- A1            ; load open register 1 with contents of active
                    ; register 1 (f1)
NEW-OPEN0 <- OPEN-CALL BAZ, O0 <- A2
                    ; open a frame for a call to baz;
                    ; load open register 0 with the contents
                    ; of active register 2 (z0), result to appear
                    ; in slot 0 of open frame to be allocated upon
                    ; return from BAZ
O1 <- A3            ; load open register 1 with contents of
                    ; active register 3 (b1)
O2 <- CALL BAR, O2 <- A4
                    ; call bar, returning value to slot 2 in FOO's
                    ; frame;
                    ; in parallel, load open register 2 with contents
                    ; of active register 4 (b2)
TCALL FOO, O3 <- A5 ; tail-recurse to foo; in parallel move
                    ; contents of active register 5 (f3) into
                    ; open register 3

The K-machine function calling hardware is implemented as a maze of registered 2-to-1 multiplexers and RAM. The only restriction that the implementation places upon programmers is that two return operations cannot be performed in sequence. In other words, when a function returns, its caller is not allowed to return in the next instruction. This is not a problem because a return immediately followed by a return is a tail-recursive call and is handled differently.

The K-machine function calling hardware is easy to describe, but many hours were spent proving that it operated correctly even in the presence of interrupts. The hardware implementation evolved through several versions. Each had equivalent functionality, but quite different approaches to getting the data to the right place at the right time.

To compare the K-machine call hardware with the LMI LAMBDA, we counted the number of microinstructions executed by a simple function call. The LMI LAMBDA executes a minimum of 100 microinstructions to implement a function call, at about 250 ns each. The K-machine executes a function call in a single 80 ns instruction. This is a performance factor of roughly 300. The LMI LAMBDA takes 7 seconds to run the tak benchmark; the K-machine executes the same code in 0.03 seconds. This is faster than PCL on a Cray-1.
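The arithmetic behind that comparison, using only the figures quoted above:

```python
# Per-call cost comparison from the quoted figures.
lambda_ns = 100 * 250      # LAMBDA: ~100 microinstructions at ~250 ns
k_ns = 80                  # K-machine: one 80 ns instruction
per_call = lambda_ns / k_ns
print(per_call)            # 312.5 -- the "factor of 300"

# The end-to-end tak numbers are the same order of magnitude:
tak_ratio = 7 / 0.03
print(round(tak_ratio))    # ~233x on the whole benchmark
```

The whole-benchmark ratio is lower than the per-call ratio, as expected: tak also spends cycles on arithmetic and comparisons, not just calls.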

Multiple values

There are two mechanisms for returning values. When a value is returned, the special return functional destination ensures the return value ends up in the appropriate slot in the appropriate frame. This implements the standard case of the caller expecting one return value and the callee providing one return value.

When a callee wishes to return multiple values, the first value is returned via a mechanism that differs from the single value return only in that it sets a flag indicating that additional values are present. Callers expecting a single value never check this flag.

When a caller is prepared to accept multiple values, it checks the multiple value return flag. If this flag is clear, the caller sets the remaining multiple values to NIL. If this flag is set, the multiple values are specified by special info left in the stack frame of the callee. This is available to the caller in the RETURN frame.

Some functions, such as unwind-protect, handle multiple values in a pass-through mode. There is a return path that leaves the multiple-value flag unchanged.
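The flag protocol can be sketched as a small model (all names here, such as mv_flag and MVMachine, are invented for illustration):

```python
# Toy model of the multiple-value return protocol: the callee sets a
# flag only when extra values exist; single-value callers never look.

NIL = None

class MVMachine:
    def __init__(self):
        self.mv_flag = False
        self.return_frame = []         # extra values left in the RETURN frame

    def return_one(self, value):
        self.mv_flag = False           # ordinary single-value return
        return value

    def return_many(self, first, *rest):
        self.mv_flag = True            # signal: more values in RETURN frame
        self.return_frame = list(rest)
        return first

    def accept_values(self, primary, nwanted):
        # A caller expecting nwanted values checks the flag; if it is
        # clear, the remaining values default to NIL.
        extras = self.return_frame[:nwanted - 1] if self.mv_flag else []
        extras += [NIL] * (nwanted - 1 - len(extras))
        return [primary] + extras

m = MVMachine()
v = m.return_one(1)
assert m.accept_values(v, 3) == [1, NIL, NIL]   # flag clear: NIL fill
v = m.return_many(1, 2)
assert m.accept_values(v, 3) == [1, 2, NIL]     # flag set: frame consulted
```

A pass-through return in this model would simply deliver the primary value without touching mv_flag or return_frame, which is exactly the unwind-protect case described above.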

Linking

As you may have noticed in the code above, functions are statically linked. Functions are re-linked on the fly to simulate value cells. Special hardware support is provided to enable "lazy linking".

Each instruction has a link bit that is normally set to 0. When a function is re-defined, the old definition is "killed" by setting the link bit to 1. No other action is taken at this point.

The link bit is noticed by the hardware when an instruction is fetched. Because of the pipeline, the processor will still be executing code within the caller. If a function is linked to a dead function, the interrupt handler calls the linker. It is important to interrupt at instruction fetch time because a call to a tail-recursive function leaves no trace of where it came from.

If a function is linked to a dead function, that does not mean it should be unconditionally re-linked to the new definition. Imagine this scenario:

(defun foo () 'foo)
(setf (symbol-function 'bar) (symbol-function 'foo))
(defun test1 () (foo))
(defun test2 () (bar))

(test1) => FOO
(test2) => FOO

Now suppose foo is re-defined as follows:

(defun foo () 'new-foo)

(test1) => NEW-FOO
(test2) => FOO

When test1 is called, the linker notices that the code for foo is dead, sees that test1 reached it through foo's function cell, and re-links test1 to the new definition of foo.

The situation is different for test2, however. In this case, the linker determines that the code for bar needs to be unshared from that associated with foo. The linker "resurrects" a copy of the dead code associated with foo and re-links bar to this new code. Essentially, if a function is called via a function cell, the linker re-links to the new definition; if the function is called via a function pointer, the linker re-links to a copy of the old definition.
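The re-linking policy boils down to a decision on how the dead code was reached. A minimal sketch (Linker, the 'cell'/'pointer' tags, and the string stand-ins for compiled code are all invented for illustration):

```python
# Toy model of the lazy re-linking policy.  A call site that went
# through a symbol's function cell follows the symbol to its new
# definition; a call site holding a raw function pointer keeps a
# resurrected copy of the code it originally pointed at.

class Linker:
    def __init__(self):
        self.function_cell = {}        # symbol -> current definition
        self.dead = set()              # definitions whose link bit is 1

    def defun(self, symbol, code):
        old = self.function_cell.get(symbol)
        if old is not None:
            self.dead.add(old)         # "kill" the old code: set link bit
        self.function_cell[symbol] = code

    def relink(self, site):
        # Called from the link-bit interrupt when a fetch hits dead code.
        kind, target, code = site
        if code not in self.dead:
            return code                # link is still live; nothing to do
        if kind == 'cell':
            return self.function_cell[target]   # follow the function cell
        return 'copy-of-' + code       # 'pointer': resurrect a private copy

lk = Linker()
lk.defun('foo', 'foo-v1')
test1 = ('cell', 'foo', 'foo-v1')      # test1 calls foo via its cell
test2 = ('pointer', None, 'foo-v1')    # test2 holds foo's code by pointer
lk.defun('foo', 'foo-v2')              # redefinition kills foo-v1
assert lk.relink(test1) == 'foo-v2'
assert lk.relink(test2) == 'copy-of-foo-v1'
```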

I have been told that this is similar to UUO-LINKS used in ITS, but we developed this mechanism independently.

Interrupt handling

The CONS, CADR, and LAMBDA machines polled for interrupts at the microcode level. The K-machine, not having microcode, does not poll for interrupts. Interrupts are either synchronous, for instance integer overflow or page fault, or asynchronous, such as the 16 millisecond clock.

Interrupts are indicated by a bit vector stored in a functional source. If multiple interrupts occur at the same time, the interrupt code can check this source to decide in what order to process the interrupts.

The Interrupt Handler

When interrupts are enabled, an interrupt causes several things to happen:

Interrupts are disabled. This prevents recursive interrupts and is "the right thing". It is under programmer control to re-enable interrupts should recursive interrupts be desired, but the race condition implicit in leaving interrupts enabled is avoided.





The register writeback is aborted.





The condition codes are copied to a special location.





The PC is forced to 0. This effects an unconditional transfer to location 0. The address of the instruction that would have been executed is saved in a special location.





The pipeline registers are frozen; the clock for the pipeline is temporarily disabled.

As the next few instructions proceed, various parts of pipeline are re-enabled. If the right code is in place at location 0, the pipeline state will be correctly saved in a global frame reserved for this purpose. The interrupt sequence goes something like the following (remember that the pipeline registers are not being clocked until further notice):

The output of the ALU is saved, then the output register clock is enabled.



The condition codes are read from a functional source. This allows us to restart conditional branches should the processor be interrupted during one of the "branch delay" instructions. The saved condition code clock is re-enabled.



The right hand side of the ALU is passed through the ALU to be saved in the global frame, then the right hand side clock is enabled.



The left hand side of the ALU is passed through the ALU to be saved in the global frame, then the left hand side clock is enabled.



The PC of the interrupted instruction is read and stored in the global frame.



At this point, the entire state of the pipeline has been saved and execution proceeds as normal. The interrupt functional source contains the value of the interrupt, so the correct interrupt handler can be located.

Note that there is no other state to indicate that the processor has been interrupted. Thus it is not necessary to balance every interrupt with an interrupt restore.

When the interrupt has been processed, there are three paths back into the interrupted code. All of them reload the pipeline in a similar way, differing only in the last instruction. As the pipeline is reloaded, the pipeline clocks are incrementally disabled in the same order that they were enabled for unloading the pipeline. Pipeline reload happens something like this:

The saved ALU output is passed through and captured in the output register. The output register is then disabled.



The saved left hand and right hand operands are read and latched into the ALU input registers.



At this point, the pipeline is reloaded. All the pipeline clocks are enabled and control is transferred to the saved PC. In addition, the saved condition codes are used to specify a conditional branch. Should a pending branch be present when restarting, it will be taken as if the condition codes had been generated normally.

As mentioned before, there were different interrupt crawlout routines. The standard one, described above, was used for asynchronous interrupts such as the clock. In this case, the pipeline state after the interrupt crawlout is identical to that when the interrupt occurred, and the interrupted instruction is re-executed.

Another interrupt crawlout is used for synchronous interrupts. These interrupts are caused by the program and generally indicate an error-like condition that can be fixed (something like integer overflow). For these interrupts, the ALU output register is not enabled until one instruction after the interrupted one. To see why this is interesting, consider the case of integer overflow.

When integer overflow occurs, an interrupt is taken. The interrupt handler conses a bignum result and proceeds to restart the instruction. If the instruction were simply re-executed, the integer overflow would occur again. Instead, the bignum value is loaded into the ALU output register during the crawlout. When the instruction is re-executed, the overflow condition and the output of the ALU are ignored, and the bignum is used instead. This presents the illusion that the addition of two fixnums resulted in a bignum pointer. Thus we are able to abstract away the limitations of the ALU.
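The overflow mechanism can be sketched as follows. This is a toy model: the 24-bit fixnum width is an illustrative choice, and Python's arbitrary-precision integers stand in for the consed bignum:

```python
# Toy model of the fixnum-overflow interrupt.  The "ALU" is limited to
# FIXNUM_BITS; on overflow the handler computes the full-width result,
# and the crawlout preloads the ALU output register with it so that the
# re-executed instruction ignores the ALU output and uses it instead.

FIXNUM_BITS = 24                       # illustrative fixnum width

def alu_add(a, b):
    # The hardware adder: produces a sum and an overflow condition.
    total = a + b
    overflow = not (-2**(FIXNUM_BITS - 1) <= total < 2**(FIXNUM_BITS - 1))
    return total, overflow

def add(a, b):
    result, overflow = alu_add(a, b)
    if overflow:
        # Interrupt: the handler conses a "bignum" (here just a full
        # Python int), and the crawlout stuffs it into the ALU output
        # register; on re-execution the ALU result is discarded.
        result = a + b
    return result

assert add(1, 2) == 3                  # fixnum fast path: no interrupt
assert add(2**23 - 1, 1) == 2**23      # overflow path: "bignum" result
```

The fast path pays nothing for this: no overflow check appears in the instruction stream, because the check is the interrupt itself.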

One more crawlout is provided. This causes an immediate interrupt before the execution of a single instruction. This seemingly useless ability allows the computer to read the contents of the instruction cache without actually executing instructions.

The critical positioning of the instructions in the interrupt crawlin and crawlout routines forced us to write them by hand. Other than a few pieces of code that clearly benefited from hand tuning (for instance, a mapcar would speculatively fetch the CDR while testing the CAR), there was no code in the system that was not written in Lisp.

Single Stepping

The interrupt logic can be programmed to re-interrupt after a single instruction. This allows one process to single-step another for debugging purposes.

Interrupt disabling

It is important to be able to disable interrupts, and the instruction to do so must be atomic. A functional source is used to disable interrupts. Reads from this source clear the interrupt enable and return the value the interrupt enable flag had before the source was read. This allows reliable control of interrupts, even if an asynchronous interrupt arrives during the disable-interrupts instruction itself.

Synchronous interrupts are discarded when interrupts are disabled, but asynchronous interrupts are simply deferred. When interrupts are re-enabled, any pending asynchronous interrupts are immediately taken. Synchronous interrupts may be disabled independently.
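The read-and-clear functional source is essentially an atomic exchange. A sketch of the whole disable/defer/discard behavior (class and method names invented for illustration):

```python
# Toy model of the disable-interrupts functional source: one read
# clears the enable flag and yields its previous value, so disabling
# is atomic even if an interrupt lands on that same instruction.

class InterruptControl:
    def __init__(self):
        self.enabled = True
        self.pending_async = []        # deferred asynchronous interrupts

    def read_disable_source(self):
        # Atomic in hardware: clear the flag, return the old value.
        old, self.enabled = self.enabled, False
        return old

    def raise_async(self, n):
        if self.enabled:
            return ('taken', n)
        self.pending_async.append(n)   # asynchronous: deferred, not lost
        return ('deferred', n)

    def raise_sync(self, n):
        if self.enabled:
            return ('taken', n)
        return ('discarded', n)        # synchronous: simply dropped

    def enable(self):
        self.enabled = True
        taken, self.pending_async = self.pending_async, []
        return taken                   # pending interrupts taken at once

ic = InterruptControl()
was_enabled = ic.read_disable_source()       # critical section begins
assert was_enabled and not ic.enabled
assert ic.raise_async(3) == ('deferred', 3)  # clock tick waits
assert ic.raise_sync(4) == ('discarded', 4)  # synchronous one is dropped
assert ic.enable() == [3]                    # deferred interrupt now taken
```

Returning the old flag value matters for nesting: code that may already be running with interrupts off can restore the previous state rather than unconditionally re-enabling.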

Bibliography

Many sources were consulted for the design of the K-machine. Unfortunately, a complete list of sources has been lost in the mists of time. For those forgotten, be assured that no intellectual dishonesty is intended. Among the sources consulted were:

The Lisp Machine Manual

Abelson, Sussman, and Sussman's "Structure and Interpretation of Computer Programs".

The MIT AI Lab technical reports relating to the Lisp Machine project

Papers relating to the MIPS processor

Almost all of Guy Steele's papers.

Tom Knight's class at MIT on how to build high-performance hardware.

The work of Jonathan Rees, David Kranz, and others on the Yale T Project, and especially the Orbit compiler

Appendix

What ever happened to the hardware?

During the investigation of Guy Montpetit, owner of Gigamos, for fraud surrounding his use of Japanese investment capital, the Canadian government requested that the United States assets be seized pending resolution of the case. Coopers and Lybrand held the assets. Gigamos eventually filed for bankruptcy in the U.S., and Coopers and Lybrand sought out a buyer. After a period of time, when no investors were found, the material assets of Gigamos, including the K-machine board sets, specifications, schematics, and printed circuit artwork were sold for salvage to Eli Hefron and Sons in Cambridge, MA. I purchased these materials from Eli Hefron and Sons and they are currently in my possession.