ForwardCom

Proposal for a forward compatible instruction set architecture

Contents:

Introduction

Highlights

A flexible instruction format

Variable length vector registers

A new efficient type of loops

Efficient memory management

Security features

Development tools

Visions for the application of ForwardCom

Current status

Resources



Introduction

ForwardCom is a project to develop a new open instruction set architecture, together with the corresponding hardware and software standards, for high performance microprocessors. The intention is to experiment and investigate what an ideal computer architecture may look like, and to develop a complete computer system that is more efficient than the currently prevailing systems, such as x86 and ARM. Starting from scratch and making a complete vertical redesign allows us to learn from past mistakes and get rid of the old quirks that hamper contemporary systems.





Highlights

The ForwardCom instruction set is neither RISC nor CISC, but a new paradigm with the advantages of both. ForwardCom has few instructions, but many variants of each instruction. A consistent template system with few instruction sizes combines the fast and streamlined decoding and pipeline design of RISC systems with the compactness and more work-done-per-instruction of CISC systems.

An instruction can do multiple things, but only if it fits into the pipeline system. There is no need for microcode.

The ForwardCom design is scalable to support small embedded systems as well as large supercomputers and vector processors without losing binary compatibility.

The instruction set is fully orthogonal. The same instruction can be coded with integer operands of different sizes and floating point operands of different precisions. The operands can be scalars or vectors of any length. One operand of each instruction can be a register, a memory operand with different addressing modes, or an immediate constant. The other operands must be registers.

Vector registers of variable length are provided for efficient handling of large data sets.

Array loops are implemented in a new flexible way that automatically uses the maximum vector length supported by the microprocessor in all but the last iteration of a loop. The last iteration automatically uses a vector length that fits the remaining number of elements. No extra code is needed to deal with remaining data and special cases. There is no need to compile the code separately for different microprocessors with different vector lengths.

No recompilation or update of software is needed when a new microprocessor with a different vector register length becomes available. The software is guaranteed to be forward compatible and take advantage of the longer vectors of new microprocessor models without recompilation.

Strong security features are a fundamental part of the hardware and software design.

Memory management is simpler and more efficient than in traditional systems. Various techniques are used for avoiding memory fragmentation. There is no memory paging and no translation lookaside buffer (TLB). Instead, there is a memory map with a limited number of sections with variable size. All code is position-independent.

There are no dynamic link libraries (DLLs) or shared objects. Instead, there is only one type of function library, which can be used for both static and dynamic linking. Only the part of the library that is actually used is loaded and linked. The library code is kept contiguous with the main program code in almost all cases. An executable file can be re-linked to update a function library or to adapt the program to a particular hardware configuration, operating system, or user interface framework.

A mechanism for calculating the required stack size is provided. This can prevent stack overflow in most cases without making the stack bigger than necessary.

A mechanism for optimal register allocation across program modules and function libraries is provided. This makes it possible to keep most variables in registers without spilling to memory. Vector registers can be saved in an efficient way that stores only the part of the register that is actually used.

Standards for software tools, ABI, file formats, system libraries, etc. are defined in order to establish compatibility between different programming languages and different platforms. It is possible to code different parts of a program in different programming languages.





A flexible instruction format

The ForwardCom instruction set is based on a consistent and flexible modular format suitable for fast superscalar processors. Each instruction uses one, two, or three 32-bit words. It is possible to add still longer instructions for application-specific purposes. Often-used instructions can also be coded in a tiny format, where a 32-bit instruction word contains two tiny instructions. Tiny instructions are always paired.

The basic instruction word is 32 bits, divided into the following fields:

Instruction length: Tells whether the instruction uses one or more 32-bit words.

Mode: Tells which template is used, what the different fields are used for, whether the instruction uses general purpose registers or vector registers, whether there is a memory operand, and which addressing mode is used.

Operation: Tells which instruction to do. There can be up to 64 multi-format instructions. A multi-format instruction can have many different formats, instruction lengths, and addressing modes. In addition, there can be a large number of single-format instructions. One operation code in ForwardCom corresponds to multiple different operation codes in other systems because it can have several different operand types, register types, vector lengths, masks, addressing modes, etc.

Destination register: There are 32 general purpose registers and 32 vector registers. The register specified in this field is used for the destination (output) of the instruction. The same register is also used as source (input) if there are not enough source registers in the other fields.

Operand type: The operands can be 8-bit, 16-bit, 32-bit, and 64-bit integers and half, single and double precision floating point numbers. There is optional support for 128-bit integers and quadruple precision floating point numbers.

Source register: There can be one source register when template B is used or two source registers when template A is used. Instructions with double length can have three source registers. These can be general purpose registers or vector registers. They can also be used for memory pointers, array index, or vector length.

Mask: A register can be used as a mask or predicate to enable or disable the operation and to specify various options. Masks are particularly useful for vector operations where an operation can be enabled or disabled for each vector element separately.

Data: Data fields can be used for immediate operands and for relative addresses. Instructions with double length can have 32-bit data fields. Instructions with triple length and 64-bit data fields are optionally supported. Data fields can contain integer or floating point numbers or option bits. Data can be compressed into the smallest field size that fits the actual value.
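To make the field list above more concrete, the following C sketch packs and unpacks a single 32-bit instruction word with two source registers. The widths of the operation field (6 bits for up to 64 instructions) and the register fields (5 bits for 32 registers) follow from the text; all other widths, all bit positions, and the names fields_t, encode_word and decode_word are assumptions made for illustration. The ForwardCom manual is the authority on the real templates.

```c
#include <stdint.h>

/* Hypothetical layout of one 32-bit instruction word.
   Bit positions and the widths of length, mode, ot and mask
   are assumptions for illustration only. */
typedef struct {
    unsigned length; /* instruction length, assumed 2 bits           */
    unsigned mode;   /* template / addressing mode, assumed 3 bits   */
    unsigned op;     /* operation, 6 bits (up to 64 instructions)    */
    unsigned rd;     /* destination register, 5 bits (32 registers)  */
    unsigned ot;     /* operand type, assumed 3 bits                 */
    unsigned rs;     /* first source register, 5 bits                */
    unsigned mask;   /* mask register, assumed 3 bits                */
    unsigned rt;     /* second source register, 5 bits               */
} fields_t;

/* Extract n bits of w starting at bit position lo. */
static unsigned get_bits(uint32_t w, unsigned lo, unsigned n) {
    return (unsigned)((w >> lo) & ((1u << n) - 1u));
}

/* Split an instruction word into its fields. */
fields_t decode_word(uint32_t w) {
    fields_t f;
    f.length = get_bits(w, 30, 2);
    f.mode   = get_bits(w, 27, 3);
    f.op     = get_bits(w, 21, 6);
    f.rd     = get_bits(w, 16, 5);
    f.ot     = get_bits(w, 13, 3);
    f.rs     = get_bits(w, 8, 5);
    f.mask   = get_bits(w, 5, 3);
    f.rt     = get_bits(w, 0, 5);
    return f;
}

/* Pack the fields back into one word; out-of-range values are truncated. */
uint32_t encode_word(fields_t f) {
    return ((uint32_t)(f.length & 3u)  << 30) |
           ((uint32_t)(f.mode   & 7u)  << 27) |
           ((uint32_t)(f.op     & 63u) << 21) |
           ((uint32_t)(f.rd     & 31u) << 16) |
           ((uint32_t)(f.ot     & 7u)  << 13) |
           ((uint32_t)(f.rs     & 31u) << 8)  |
           ((uint32_t)(f.mask   & 7u)  << 5)  |
            (uint32_t)(f.rt     & 31u);
}
```

Because every field sits at a fixed position, a decoder can extract all fields in parallel without first examining the operation code, which is the kind of streamlined decoding the template system aims for.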





Variable length vector registers

Vector registers are used for handling multiple data elements simultaneously. The computer systems that are commonly used today have vector registers with fixed lengths. Every time a new CPU model with longer vectors comes on the market, the software has to be recompiled using a new instruction set extension that supports the new vector size. Software developers have to develop a new version of their software every time a new CPU model comes on the market, and they have to maintain and support several different versions of their software for the different CPU models if they want to use all CPU models optimally. This is so expensive that it is hardly ever done. Most of the software that is sold today is optimized for CPU models that are already obsolete.

A further problem with current designs is that it is impossible to make your software save a vector register in a way that will be compatible with future extensions of the vector length, because the instructions for doing so have not yet been defined.

The need to solve these problems was a strong motivation for developing ForwardCom. The ForwardCom architecture has variable-length vector registers. The software can use the maximum vector length supported by the CPU it is running on, or it can specify any vector length less than this. The length of a vector register is stored in the register itself. This is useful when a vector register is saved to memory and you don't want to save more data than the register actually contains. It is possible to make software that automatically uses the maximum vector length that the CPU supports, even if this vector length was not supported at the time the software was written. This is what we call forward compatibility.
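The idea of a register that carries its own length can be modelled in plain C. The sketch below is only a software analogy: the type vreg_t, the constant MAX_VECTOR_BYTES and the function names are hypothetical, and the saved layout (a length followed by the used bytes) is an assumption, not the format defined by ForwardCom.

```c
#include <stddef.h>
#include <string.h>

#define MAX_VECTOR_BYTES 64  /* hypothetical maximum of the CPU we run on */

/* Software model of a vector register that stores its own length. */
typedef struct {
    size_t length;                         /* number of bytes in use */
    unsigned char data[MAX_VECTOR_BYTES];
} vreg_t;

/* Request a length; the result is clamped to the hardware maximum. */
size_t vreg_set_length(vreg_t *v, size_t requested) {
    v->length = requested < MAX_VECTOR_BYTES ? requested : MAX_VECTOR_BYTES;
    return v->length;
}

/* Save the length followed by only the bytes in use.
   Returns the number of bytes written to buf. */
size_t vreg_save(const vreg_t *v, unsigned char *buf) {
    memcpy(buf, &v->length, sizeof v->length);
    memcpy(buf + sizeof v->length, v->data, v->length);
    return sizeof v->length + v->length;
}

/* Restore a register saved by vreg_save on the same model.
   Returns the number of bytes read from buf. */
size_t vreg_restore(vreg_t *v, const unsigned char *buf) {
    memcpy(&v->length, buf, sizeof v->length);
    memcpy(v->data, buf + sizeof v->length, v->length);
    return sizeof v->length + v->length;
}
```

The code that saves and restores never needs to know the maximum length at compile time, so the same save routine keeps working when a later CPU supports longer vectors.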

The variable-length vector registers can be used in a new and very efficient type of loops that automatically uses the optimal vector length. This is described in the next section.





A new efficient type of loops

Let's consider a simple loop that does something with an array of 10 floats. It may look something like this:

float my_array[10];

for (int i = 0; i < 10; i++) {
    do_something(my_array[i]);
}

A simple implementation will use i as an index relative to the start address of the array while counting i up to 10, and load one element at a time into a register:





A vector implementation in a current system will load a number of consecutive array elements, e.g. four, into a vector register, and increment i by four for each iteration of the loop:





In this example, the loop will iterate two times and handle four array elements in each iteration. There are two remaining elements in the end because the length of the array is not divisible by the vector length. These remaining elements must be handled separately outside the loop.
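The fixed-length approach described above can be sketched in C as follows. This is an emulation in scalar code rather than real vector instructions, and the names VLEN, process_fixed and the two do_something stand-ins (which simply double each element) are hypothetical.

```c
#include <stddef.h>

#define VLEN 4  /* fixed vector length in elements, as in the example */

/* Stand-in for a vector operation on VLEN consecutive elements. */
static void do_something_vec(float *p) {
    for (size_t k = 0; k < VLEN; k++) p[k] *= 2.0f;
}

/* Stand-in for the same operation on a single element. */
static void do_something_scalar(float *p) {
    *p *= 2.0f;
}

/* Process n elements; returns the number of full vector iterations. */
size_t process_fixed(float *a, size_t n) {
    size_t i = 0, iterations = 0;
    /* Main loop: only full vectors of VLEN elements. */
    for (; i + VLEN <= n; i += VLEN) {
        do_something_vec(&a[i]);
        iterations++;
    }
    /* Extra code for the remaining n % VLEN elements. */
    for (; i < n; i++) {
        do_something_scalar(&a[i]);
    }
    return iterations;
}
```

With n = 10 the main loop runs twice and the second loop handles the last two elements. The second loop is the extra code that the ForwardCom loop method is designed to eliminate.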

The ForwardCom system can make this loop in a more efficient way. We are using a backward index from the end of the array. The backward index counts down from 10 so that it always contains the remaining number of array elements to handle. The backward index is also used for specifying the desired vector length. If we ask for a longer vector than the CPU supports, then we will automatically get the maximum vector length. In this example the maximum length is four elements. In the first iteration we ask for ten elements and get four. The backward index is now decremented by four. In the next iteration we ask for six elements and get four. In the last iteration we ask for two elements and get two.





This method has several advantages. First, we don't need any extra code to handle the remaining array elements if the array length is not divisible by the vector length. And second, it adjusts automatically to the maximum vector length of the CPU it is running on. If we run the same code on a CPU with a maximum vector length of 8 then the loop will run two iterations, handling 8 elements in the first iteration and 2 elements in the second iteration. If the maximum vector length is 16 then the loop will run only one iteration with a vector length of 10 elements.
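The backward-index loop can be emulated in C to check the iteration counts given above. This is a sketch, not ForwardCom code: grant_length models the rule that asking for more elements than the CPU supports yields the maximum length, and the names process_backward, max_len and chunks are hypothetical. The operation applied to each element (doubling) is a placeholder for do_something.

```c
#include <assert.h>
#include <stddef.h>

/* Model of "ask for `requested` elements, get at most `max_len`". */
static size_t grant_length(size_t requested, size_t max_len) {
    return requested < max_len ? requested : max_len;
}

/* Process n elements with a backward index that counts down from n.
   If chunks is not NULL, the vector length used in each iteration is
   stored there. Returns the number of loop iterations. */
size_t process_backward(float *a, size_t n, size_t max_len, size_t *chunks) {
    assert(max_len > 0);
    float *end = a + n;        /* pointer to the end of the array  */
    size_t remaining = n;      /* backward index = elements left   */
    size_t iterations = 0;

    while (remaining > 0) {
        /* The same value serves as index and as requested length. */
        size_t len = grant_length(remaining, max_len);
        float *p = end - remaining;   /* end address minus backward index */
        for (size_t k = 0; k < len; k++) {
            p[k] *= 2.0f;             /* stands for one vector operation */
        }
        if (chunks) chunks[iterations] = len;
        iterations++;
        remaining -= len;
    }
    return iterations;
}
```

Calling process_backward on 10 elements with max_len set to 4, 8 and 16 gives 3, 2 and 1 iterations respectively, with no separate remainder code in any case.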

The ForwardCom instruction set has a special addressing mode to support this loop method. It has a memory operand with a pointer register containing the end address and a backward index register that is subtracted from this pointer. A vector memory operand always uses an extra register to specify the length of the vector. We can use the same register for backward index and vector length, because we will get the maximum vector length when the specified length is more than the maximum length.

The loop may contain function calls. Assume, for example, that the code in our example involves the calculation of the logarithm of each vector element. The logarithm function is contained in a standard math function library. Now, this function uses a vector register for input and a vector register for output. The information about the vector length is contained in the vector register itself. Therefore, the logarithm function can handle a vector of any length and calculate the logarithms of all vector elements simultaneously. A scalar (single element) parameter is simply handled by the function as a vector with one element. This makes it easy for an optimizing compiler to convert scalar code to vector code, even if the code contains function calls.
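A library function that accepts vectors of any length might be modelled in C as shown below. The type fvec_t and the function names are hypothetical, and a simple squaring function stands in for the logarithm so that the sketch needs no math library. In real hardware the elements would be processed simultaneously; here an ordinary loop takes their place.

```c
#include <stddef.h>

#define MAX_ELEMENTS 16  /* hypothetical maximum vector length */

/* Software model of a float vector that carries its own length. */
typedef struct {
    size_t length;            /* number of valid elements */
    float e[MAX_ELEMENTS];
} fvec_t;

/* Library-style function: works on however many elements x contains. */
fvec_t vec_square(fvec_t x) {
    fvec_t r = {0};
    r.length = x.length;
    for (size_t k = 0; k < x.length; k++) {
        r.e[k] = x.e[k] * x.e[k];
    }
    return r;
}

/* A scalar is simply a vector with one element. */
fvec_t vec_from_scalar(float s) {
    fvec_t v = {0};
    v.length = 1;
    v.e[0] = s;
    return v;
}
```

The caller never passes a separate length argument. The same vec_square serves a scalar call, a full-length vector inside a loop, and the shorter vector in the last iteration.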





Efficient memory management

The ForwardCom system includes standards for the application binary interface (ABI), binary file format, memory organization, etc. These standards are designed so that memory fragmentation can be avoided, or at least minimized. A typical running application will have only three memory blocks: program code, read-only data, and read/write data (including static data, stack and heap). This makes memory management more efficient. The number of memory blocks that a running process or thread has access to is so small that they can all be contained in a memory map inside the CPU chip. This is very different from most common systems, which have very large page tables. A large page table requires fixed-size memory pages in order to make table lookup simple. But if we can keep the number of table entries small, then it is feasible to have variable-size table entries. The ForwardCom design has the goal of keeping all code and data that a process has access to contiguous, and of avoiding memory fragmentation as much as possible. This may make it possible to replace the huge multi-level page tables and translation lookaside buffers of current systems with a small on-chip memory map. Each process and each thread has its own memory map.
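A memory map with a handful of variable-size entries can be checked with a simple search, as the following C sketch suggests. The entry layout, the permission flags and the function map_check are assumptions for illustration; the actual format of the ForwardCom memory map is defined in the manual and may also cover address translation, which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission flags. */
enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };

/* One variable-size block in the memory map. */
typedef struct {
    uint64_t base;   /* start address of the block */
    uint64_t size;   /* size of the block in bytes */
    unsigned perms;  /* combination of PERM_ flags */
} map_entry_t;

/* Returns true if the range [addr, addr + len) lies entirely inside one
   map entry that grants all the requested access rights. */
bool map_check(const map_entry_t *map, size_t n,
               uint64_t addr, uint64_t len, unsigned access) {
    for (size_t i = 0; i < n; i++) {
        const map_entry_t *e = &map[i];
        if (addr >= e->base &&
            len <= e->size &&
            addr - e->base <= e->size - len &&   /* written to avoid overflow */
            (e->perms & access) == access) {
            return true;
        }
    }
    return false;
}
```

With only three entries (code, read-only data, read/write data) hardware could compare an address against all entries at once, which is why such a table can plausibly live on the chip, whereas a page table with fixed-size pages needs an entry for every page.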

Some of the techniques that are used for keeping data contiguous are:

All addresses are relative and all code is position-independent. Code is addressed relative to the instruction pointer. Static data are addressed relative to a special register called the data section pointer. Code address and data address are independent of each other.

The stack size is calculated by the compiler and linker so that the necessary stack size is known in advance, except when the code contains recursive function calls.

The heap size may be predicted by statistical methods. The heap is expanded exponentially if the required size exceeds the predicted size.

There are no dynamic link libraries (DLLs) or shared objects. A new re-linking feature is provided instead. There is only one type of function library, which can be used for both static and dynamic linking. Function libraries are kept contiguous with the program that calls them, even in the case of dynamic linking.
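The exponential heap expansion mentioned in the list above could look roughly like this in C. The growth factor of 2 and the function name heap_next_size are assumptions. The text only says that the heap grows exponentially when the required size exceeds the predicted size.

```c
#include <stddef.h>
#include <stdint.h>

/* Given the current (predicted) heap size and the size actually required,
   return the new heap size. The size is doubled until it is large enough.
   The factor 2 is an assumption made for this sketch. */
size_t heap_next_size(size_t current, size_t required) {
    size_t size = current ? current : 1;
    while (size < required) {
        if (size > SIZE_MAX / 2) {
            return required;   /* doubling would overflow: use exact size */
        }
        size *= 2;
    }
    return size;
}
```

Growing in exponential steps keeps the number of expansions small (logarithmic in the final size), so the heap rarely has to be moved or split, which helps keep the read/write data block contiguous.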





Security features

Security is an integral part of the hardware and software design. This includes the following planned features:

A flexible and efficient memory protection mechanism.

Separation of call stack and data stack so that return addresses cannot be compromised by buffer overflow.

Jump tables and function pointer tables are placed in read-only memory.

Features for array bounds checking are built in.

Optional methods for checking integer overflow.

Each thread can have its own protected memory space, which is not accessible to parent and sibling threads within the same process.

Device drivers and system functions have carefully controlled access rights. These functions have access only to a specific block of memory that the calling process chooses to give them access to. A device driver has access only to a controlled range of input/output ports and system registers.

Application programs have access only to specific resources, as specified in the executable file header and controlled by the system.

Mandatory standardized procedure for installing and uninstalling programs.

There is no "undefined" behavior. There is always a limited set of permissible responses to an error condition.





Development tools

High-level assembler. The assembly language for ForwardCom looks like C or Java. It understands all common operators and C-style branches and loops.

Disassembler. The output of the disassembler can be assembled again to functional code in most cases.

Linker. The ForwardCom linker supports relinking of executable files. Other features include communal sections, function-level linking, and weak symbols.

Library manager. The libraries produced by the library manager can be used for static linking, relinking, and dynamic linking.

Emulator. A ForwardCom executable program can be emulated under Windows, Linux, or other systems.

Debugger. The emulator can also be used as a debugger. There is no interactive debugging feature yet, but the debugging process produces a list of executed instructions and their results.

Libraries. A standard C library includes the most common C functions. A math library currently contains only a few functions for demonstration purposes, including trigonometric functions and numerical integration. The same mathematical functions can be used with scalars and vectors as parameters.

Code examples. A selection of code examples is provided as a starting point for experimentation.





Visions for the application of ForwardCom

ForwardCom will not readily replace the commonly used systems, even if it is better, because the users need compatibility with existing hardware and software. However, the development of an ideal instruction set architecture and a complete redesign of the ecosystem of hardware and software standards is a worthwhile exercise in itself which may produce useful results and unexpected new discoveries. This project has already generated so many valuable ideas that it is worth pursuing further.

Let's assume that the need for a new instruction set will arise in the future, for whatever reason. Then it will be good to have a ready proposal that has been through a long development process, rather than starting from scratch with a limited time budget and ending up with a suboptimal solution. An open ongoing development process with inputs from anybody interested is likely to generate better results than the usual closed industry process with its short-term commercial priorities.

ForwardCom may, for example, be useful for the following purposes:

Supercomputers with very long vector registers.

Applications where the security features of ForwardCom are needed.

Niche products where compatibility with older systems is not required.

Applications where the patent and license restrictions of other architectures would be an obstacle.

Real-time systems where the efficient memory management and fast task switching of ForwardCom are useful.

Applications that need application-specific instruction set extensions.

Some of the new ideas generated by the ForwardCom project may be applied to other systems.





ForwardCom will also be useful as a sandbox for university projects and experiments with new ideas such as:

Testing the concept of forward compatibility.

Hardware development and research on the compromise between RISC and CISC.

Research on control flow decoupling, as discussed in chapter 8.1 of the manual.

Custom instructions with on-chip FPGA.

Testing the efficiency of large vectors, variable vector length, and efficient array loops.

Testing the efficiency of memory management without translation lookaside buffer (TLB), and methods for minimizing memory fragmentation.

Re-linkable executable files as an alternative to DLLs and plugins.

Secure design to prevent common software attacks and vulnerabilities. This includes: separation of call stack from data stack, code pointers in read-only memory, private memory space for each thread, limited access rights for device drivers, specific access rights for each executable file, and standardized software installation procedure.

Experiments with improved NAN propagation as discussed in chapter 6.3 of the manual.

Testing the efficiency of half-precision floating point vectors.

Research on metaprogramming. The ForwardCom assembler includes planned and partially implemented metaprogramming features.

Calculation of required stack size by the linker.

Optimization of register allocation by providing information about register use in object files and library files.





Current status of the ForwardCom project

The ForwardCom project is still under development. The basic instruction set architecture has been designed and a complete set of application-level instructions is defined. Some system-level instructions are not fully developed yet.

The structure of the binary file format for object files, function libraries, and executable files has been defined in detail.

The details of application binary interface standards (ABI), memory management standard, etc. have been defined.

The following binary tools have been developed: high-level assembler, disassembler, linker, library manager, emulator, and debugger.

Hardware implementations in FPGA are being discussed.





Discussion forum

A discussion forum for ForwardCom development is provided at www.forwardcom.info/forum.

Resources

Comparison of ForwardCom with other instruction sets

Public repository on Github

Agner's optimization resources, mainly for x86 microprocessors

By Agner Fog, 2017 - 2018.