Checkout is an esoteric programming language created by User:ais523 in 2011. It is designed to be lower-level than assembler or even machine code, by matching the way modern processors work more closely than machine language does (machine code matches the way processors used to work decades ago, rather than the way they work nowadays). Thus, it makes operations like memory transfers (which take up the most time on a modern processor) explicit; this leads to the language's name, as memory needs to be "checked out" via copy or move instructions in order to be able to use it. The secondary effect of this is that efficient code tends to be shorter and simpler than inefficient code, although it can sometimes be harder to see how it works.

Perhaps unfortunately, typical modern CPUs, especially ones based on x86, do not allow control of their operation at that low a level, which makes Checkout a little difficult to implement efficiently on a CPU. Modern GPUs (ones released since around 2005 or so) often do, however, and it is at those that Checkout is mostly targeted, especially as they tend to be faster than CPUs for programs that can fully utilise their resources. It is envisaged that GPU-based Checkout programs will be orders of magnitude faster than C, given a modern computer and the right sort of program.

Another unusual feature of Checkout is "pseudo-portability". Different implementations of Checkout will typically not admit the same programs, requiring programs to be special-cased to each individual implementation. However, the difference between programs designed for different interpreters are mostly mechanically translatable; the language is designed such that Checkout plus a standard preprocessor (the Checkout preprocessor) can allow portable programs to be written on it that work in any implementation. The differences are in the value of implementation-defined parameters; in Checkout, these are values that an implementation can decide for itself, but must document in a machine-readable fashion, such that a preprocessor can use them to determine how to modify a program. These parameters do not actually have to be consistent within an implementation between level 5 units; implementation-dependent parameters that can have different values in different level 5 units are called profile-dependent.

The Checkout Hierarchy

Checkout implementations are conceptually made out of units; and those units are often made out of other lower-level units, although need not be (and level 1 units cannot be). At some levels, it is possible for the makeup of units (that is, which units it is made out of) to change dynamically during a program; at other levels, the makeup of a unit is fixed during the unit's lifetime. Each subunit is numbered with an identifier which is unique among other subunits of the unit it belong to; and the identifiers always go from 0 inclusive up to the number of the subunits in the unit exclusive.

Pretty much everything in Checkout – memory, commands, and units – is identified with a level, which is notated by using a slash then a number. (Constants do not belong to any level.) So for instance, nop/1 is a command, and nop/2 is also a command, but they are different commands due to acting at different levels. (Most commands have restrictions as to which levels they can act at.) The level of a memory address or command controls which unit it is scoped to. For instance, if a memory address like [0]/1 (meaning "address 0 within the level 1 unit") is given as an argument to a level 1 command, then each level 1 unit within which that command is run would refer to different memory (memory which "belonged" to that unit). Giving it to a level 2 command would typically cause a compile error, as such commands do not belong to level 1 units, and so it would make no sense (there are exceptions, explained later, where level 2 commands can refer to level 1 memory at the price of undefined behaviour if that memory happens not to have a consistent value between all level 1 subunits of the relevant level 2 unit). Giving a memory address like [0]/3 to a level 1 command would cause all level 1 units within the same level 3 unit to refer to the same memory address, but level 1 units that belonged to different level 3 units would refer to different memory addresses; commands typically (but not always) have restrictions to disallow that sort of access, though.

The level of commands has another incidental effect, that of synchronization. The correct way to think of, say, a level 3 command, is as a command which all subunits of a given level 3 unit cooperatively run simultaneously, and thus they need to each be at the same point in their programs in order to do that. This makes the nop command useful at certain levels; for instance, nop/3 would have to be executed by all level 1 subunits of a given level 3 unit simultaneously, and thus they would all have to run their programs up to the point where the level 3 command occured in order for any of them to run past that point.

There are six levels in all, each of which has different restrictions:

Level 1 contains more or less nothing but arithmetic commands (it also contains the inefficient abstain/1 flow control command). It also has a relative lack of available memory (typically only a few words, perhaps a few tens of words on some profiles), although move instructions between levels 1 and 3 can run quickly, or even in zero time in many cases, making this less of an issue, as data can be moved to level 1 only when needed and moved back just afterwards. Level 1 units cannot contain subunits, and all level 1 subunits within a given level 4 unit must have identical code.

contains more or less nothing but arithmetic commands (it also contains the inefficient flow control command). It also has a relative lack of available memory (typically only a few words, perhaps a few tens of words on some profiles), although instructions between levels 1 and 3 can run quickly, or even in zero time in many cases, making this less of an issue, as data can be moved to level 1 only when needed and moved back just afterwards. Level 1 units cannot contain subunits, and all level 1 subunits within a given level 4 unit must have identical code. Level 2 contains few commands, but the ones it does have are relatively common and useful, including the most useful of the move instructions, and an actual if that can operate in reasonable time. It has two unusual restrictions: level 2 memory does not exist (and thus, e.g. if/2 has to refer to level 1 memory, with undefined behaviour if the level 1 memory isn't consistent across level 1 subunits); and the number of level 1 subunits a level 2 unit contains is fixed to a profile-dependent value, and not configurable as at the other levels. All level 2 subunits within a given level 4 unit must have identical code.

contains few commands, but the ones it does have are relatively common and useful, including the most useful of the instructions, and an actual that can operate in reasonable time. It has two unusual restrictions: level 2 memory does not exist (and thus, e.g. has to refer to level 1 memory, with undefined behaviour if the level 1 memory isn't consistent across level 1 subunits); and the number of level 1 subunits a level 2 unit contains is fixed to a profile-dependent value, and not configurable as at the other levels. All level 2 subunits within a given level 4 unit must have identical code. Level 3 is unusual in its complete lack of commands (other than nop/3 , and a few synonyms for commands that exist at other levels), but is useful due to containing moderate amounts of memory (profile-dependent, but typically a few thousand words) that the level 1 and 2 commands can work on. Level 3 memory can be copied and moved to level 1 very quickly, in zero time in some cases, and orders of magnitude faster than level 5 memory. Writing efficient Checkout programs is mostly a task of trying to make level 3 units as self-contained as possible, to minimize the number of moves and copies needed to higher levels. All level 3 subunits within a given level 4 unit must have identical code, and each contain the same number of level 2 subunits.

is unusual in its complete lack of commands (other than , and a few synonyms for commands that exist at other levels), but is useful due to containing moderate amounts of memory (profile-dependent, but typically a few thousand words) that the level 1 and 2 commands can work on. Level 3 memory can be copied and moved to level 1 very quickly, in zero time in some cases, and orders of magnitude faster than level 5 memory. Writing efficient Checkout programs is mostly a task of trying to make level 3 units as self-contained as possible, to minimize the number of moves and copies needed to higher levels. All level 3 subunits within a given level 4 unit must have identical code, and each contain the same number of level 2 subunits. Level 4 contains Checkout's major control flow structure, parloop/4 , which creates a number of level 3 subunits that execute in parallel, each of which contains a number of level 2 subunits that execute in parallel, each of which contains a number of level 1 subunits that execute in parallel. Like at level 2, there is no such thing as level 4 memory. This is the first level at which subunits can be created and destroyed dynamically; in fact, that's all level 4 is for, with its only other command being the profile-dependent nop/4 .

contains Checkout's major control flow structure, , which creates a number of level 3 subunits that execute in parallel, each of which contains a number of level 2 subunits that execute in parallel, each of which contains a number of level 1 subunits that execute in parallel. Like at level 2, there is no such thing as level 4 memory. This is the first level at which subunits can be created and destroyed dynamically; in fact, that's all level 4 is for, with its only other command being the profile-dependent . Level 5 is useful mostly for its large stores of memory (profile-dependent, but typically a few million words; it may however start with a much smaller amount and require dynamic allocation in order to get more), and is the lowest level at which various common memory-manipulation instructions work, such as dynamic allocation. copy and move instructions targeting level 5 memory tend to be much slower than ones that work within levels 1 to 3, though, typically by a factor of over 100, and thus although it can be used for storage, it is near-useless for efficient calculation. (Although the memory is large, it is limited; for storing huge amounts of data, level 6 memory must be used.) It may also contain things like I/O commands (typically on exactly one profile). Level 4 subunits of level 5 units all have the same profile, and cannot run truly in parallel; some profiles may also place very severe restrictions on the number of level 4 subunits a level 5 unit can have, potentially restricting them to just one level 4 subunit at a time, although if multiple level 4 subunits are available, they don't have to run the same code.

is useful mostly for its large stores of memory (profile-dependent, but typically a few million words; it may however start with a much smaller amount and require dynamic allocation in order to get more), and is the lowest level at which various common memory-manipulation instructions work, such as dynamic allocation. and instructions targeting level 5 memory tend to be much slower than ones that work within levels 1 to 3, though, typically by a factor of over 100, and thus although it can be used for storage, it is near-useless for efficient calculation. (Although the memory is large, it is limited; for storing huge amounts of data, level 6 memory must be used.) It may also contain things like I/O commands (typically on exactly one profile). Level 4 subunits of level 5 units all have the same profile, and cannot run truly in parallel; some profiles may also place very severe restrictions on the number of level 4 subunits a level 5 unit can have, potentially restricting them to just one level 4 subunit at a time, although if multiple level 4 subunits are available, they don't have to run the same code. Level 6 is the only level that can do the sorts of control flow needed for Turing-completeness ( parloop/4 is not Turing-complete by itself even if its profile-dependent limits are ignored, and while/2 can only access a finite amount of memory), and has an amount of memory dynamically adjustable during the program, with the possibility to gain more memory than any other level (conceputally an infinite amount; in practical implementations, it of course will have to be limited, but typically billions of words of level 6 memory should be available on a modern system). Efficient Checkout programs avoid referencing level 6 where possible, because move/5 , copy/5 and rocopy/5 , each slow commands, are needed in order to be able to access level 6 memory from other levels (and even then, they only move or copy it to level 5, where it needs to be moved or copied again to reach the scope of, say, the arithmetic commands at level 1. Level 5 subunits of a level 6 unit each have different profiles, meaning that only a finite number are available (the number of profiles that the implementation supports, typically 2).

Syntax

Checkout programs (after being preprocessed; the preprocessor is not part of the language proper, although necessary to write portable programs) are merely lists of commands, each of which is the command's name followed by zero or more arguments; whitespace is used to separate commands, command names and arguments, and because no command is a legal argument, it's always unambiguous where each command starts. (It's recommended that arguments are separated from each other and command names via spaces, and commands are separated from each other by newlines, to make programs easier to read, though.) Command names are identifiers (sequences of letters and underscores, case-sensitive), followed by a slash and a decimal digit representing the level of the command, such as nop/1 . Arguments can be of three forms: constants, memory locations, or lists of commands enclosed in braces. What sort of arguments are allowed depends on the individual command. Constants are integers or floating-point numbers, with an implementation-dependent bitwidth (typically 32 or 64 bits); they use the same notation as in C89, except that trailing letters such as f or l to represent width are not used, and floating-point numbers must contain decimal points to distinguish them from integers. (Note that Checkout floating-point numbers might be, but are not necessarily, IEEE floating-point numbers; in particular, they do not need to support denormalised numbers, or any exceptional values such as infinity or NaN, and any operation that would produce such values is undefined behaviour. Note also that integers and floating-point numbers are the same width as each other.) Memory locations are notated as an integer constant or memory location in square brackets representing which word within the level they are, followed by a slash and a decimal digit representing the level of the memory; there is a limit on how deep memory locations can be indirected by using them as addresses (e.g. [[0]/1]/1 ), which is that indirect memory addresses cannot be indirected further, and individual commands generally also have restrictions on which levels indirect memory addresses can come from. (Memory locations just store bit-patterns, and whether they are interpreted as integers or floating-point numbers depends on the command.)

Only certain commands are allowed in the top-level program itself, and in lists that serve as arguments (depending on which command they are arguments to); this restriction is enforced at compile time and technically a syntactic restriction, although implementations are not required to enforce it in the parser (and would probably be foolish to). In particular, the top-level program can contain only level 6 commands. (Restrictions on lists of commands that are arguments depend on the individual command, as explained below.) The meaning of a list of commands is to run the first command, then the second, then the third, and so on, in order (flow control is done via commands which conditionally or repeatedly run lists of commands in their arguments).

Commands

No-ops

All profiles must allow the following instructions: nop/1 , nop/2 , nop/3 , nop/6 . nop/4 may also exist; it is profile-dependent whether it does or not. Each takes zero arguments, and has no effect; however, the level at which the command executes may have side-effects that cause synchronization of units, as explained above. ( nop/5 would be useless, because it is syntactically banned from level 4 and lower subunits, and thus there is not much point in including it in the language. nop/4 is optional because it may be hard to implement efficiently on some sorts of processor.) Each such command ideally takes zero time (or as little as possible, if synchronization in zero time is impossible on the architecture in question; nop/2 takes zero time on every system), apart from nop/1 which is specified to take a small but positive length of time (and thus should never be optimised out of programs); such commands may, however, cause other commands to slow down because opportunities to parallelise them are lost (e.g. on GPUs, nop/3 can slow down move/2 between level 3 and 5 memory by reducing opportunities to do unrelated processing during memory access requests).

Checkouts

The checkout commands that give their name to the language are move , copy , and rocopy . There are several variants of the commands, each of which runs at a different level and/or with different arguments; each takes three arguments, with the first two representing the memory location to checkout from, and the memory location to checkout to, in that order (the first memory location of the checkout is given, and the rest of the checkout is contiguous from there). The third argument typically represents the amount of memory to checkout, although the units it is in can differ between different checkout instruction variants, and there are exceptions where it means something entirely different.

Checkout has a slightly unusual view of memory, in that memory may contain something or nothing (and all values, including 0, count as "something"). Attempting to read or write to memory that contains nothing is undefined behaviour; instead, the checkout commands are used to "fill" memory that contains nothing, causing it to contain something (and trying to checkout to memory that contains something is also undefined behaviour). ( discard , explained later, is used to make memory contain nothing again, and move also discards memory in this fashion.) At the start of the program, there is no level 6 memory (although the amount can be changed via level 5 commands), and memory at other levels doesn't exist because there are no subunits. When memory is created, it contains "nothing". (Using mov/1 with a constant as its first argument, followed by a whole series of checkouts, is possibly the simplest way to initialise memory with "something" so that it's possible to start using it.)

The semantics of the checkout commands are as follows:

move moves data from one location to another, causing the location the data was moved from to contain nothing, and the location it was moved to to contain something (the same something the location it was moved from contained before the move). It's conceptually equivalent to a copy followed by a discard, but is often more efficient.

moves data from one location to another, causing the location the data was moved from to contain nothing, and the location it was moved to to contain something (the same something the location it was moved from contained before the move). It's conceptually equivalent to a copy followed by a discard, but is often more efficient. copy is like move except that it doesn't change the location the data is copied from. It tends to be less efficient than moving.

is like except that it doesn't change the location the data is copied from. It tends to be less efficient than moving. rocopy is like copy , except that it causes undefined behaviour if the location the data is copied to is changed or moved before it is next discarded; the memory it is copied into is considered "read-only", with discarding it the only way to remove that status. (Checkout compilers are encouraged to produce warnings if they notice that this undefined behaviour could happen; luckily, Checkout is quite easy to statically analyse, and it may even be possible to determine this definitively without solving the halting problem due to the limitations on where commands can be used.) Although no more efficient than a copy itself, commands that take read-only data as arguments may well be faster than commands that take read-write data.

As mentioned before, checkout instructions ignore the normal restrictions on memory levels. However, they have abnormal restrictions of their own, with only certain combinations of command and memory levels being allowed (here checkout is used to refer to any of the three checkout commands):

checkout /5 can move or copy data between level 5 and level 6 memory, in either direction. If indirect memory addresses are used, the memory address that contains the address to move or copy must be in level 6 (except that certain profiles may read the memory address from level 5 instead). Read-only copies are allowed only from level 6 to level 5, not the other way round. These commands are very slow, not to mention annoying to use as you have to get rid of all your level 4 or lower subunits in order to mention them, and efficient Checkout programs should try to use them as little as possible. The checkout also has a third argument, the number of words to move (a constant, or a memory address in level 6 (or profile-dependently, level 5) which is interpreted as an integer), and the length of time it takes is proportional to the number of words to move plus a constant. (Therefore, checking out large blocks is faster than checking out small blocks.)

can move or copy data between level 5 and level 6 memory, in either direction. If indirect memory addresses are used, the memory address that contains the address to move or copy must be in level 6 (except that certain profiles may read the memory address from level 5 instead). Read-only copies are allowed only from level 6 to level 5, not the other way round. These commands are very slow, not to mention annoying to use as you have to get rid of all your level 4 or lower subunits in order to mention them, and efficient Checkout programs should try to use them as little as possible. The checkout also has a third argument, the number of words to move (a constant, or a memory address in level 6 (or profile-dependently, level 5) which is interpreted as an integer), and the length of time it takes is proportional to the number of words to move plus a constant. (Therefore, checking out large blocks is faster than checking out small blocks.) checkout /2 can move or copy data between level 3 and 5 memory, again in either direction. This is a little faster than checkout/5 , but still very slow. (Copying or read-only-copying read-only data from level 5 to level 3, though, can be much faster, especially if the same memory location was copied recently. It might not be, though, and it's unspecified and likely to be inconsistent even within an implementation. Nonetheless, it's going to be at least as good as copying non-read-only data, and potentially much faster, so is always worth it.) Read-only copies are also allowed, although unlikely to be any faster than normal copies. The amount of data checked out is given by the third argument, which must be a constant rather than a memory location, and is measured in slabs, where a slab is a unit of memory equal to the size of a word, times the number of level 1 subunits of a level 2 unit (which is profile-dependent). Indirect memory addresses can be used for the first two arguments; if they are, the memory address that contains the address must be in level 1, and consistent between all level 1 subunits of the level 2 unit executing the checkout instruction (such that it doesn't matter which of the possible versions of that memory address are chosen). The behaviour is undefined if a level 1 memory address is used to indirect and it happens not to be consistent between level 1 subunits. The behaviour is also undefined if the level 5 access is misaligned; that is, the level 5 address must be evenly divisible by the number of level 1 subunits of a level 2 unit. (It's OK for the level 3 address to be misaligned, though.)

can move or copy data between level 3 and 5 memory, again in either direction. This is a little faster than , but still very slow. (Copying or read-only-copying read-only data from level 5 to level 3, though, can be much faster, especially if the same memory location was copied recently. It might not be, though, and it's unspecified and likely to be inconsistent even within an implementation. Nonetheless, it's going to be at least as good as copying non-read-only data, and potentially much faster, so is always worth it.) Read-only copies are also allowed, although unlikely to be any faster than normal copies. The amount of data checked out is given by the third argument, which must be a constant rather than a memory location, and is measured in slabs, where a slab is a unit of memory equal to the size of a word, times the number of level 1 subunits of a level 2 unit (which is profile-dependent). Indirect memory addresses can be used for the first two arguments; if they are, the memory address that contains the address must be in level 1, and consistent between all level 1 subunits of the level 2 unit executing the checkout instruction (such that it doesn't matter which of the possible versions of that memory address are chosen). The behaviour is undefined if a level 1 memory address is used to indirect and it happens not to be consistent between level 1 subunits. The behaviour is also undefined if the level 5 access is misaligned; that is, the level 5 address must be evenly divisible by the number of level 1 subunits of a level 2 unit. (It's OK for the level 3 address to be misaligned, though.) A different form of checkout /2 can move or (possibly readonly) copy between level 1 memory and level 3 memory. The semantics of this need a little explanation, as referring to level 1 memory from a level 2 command is mostly meaningless. What happens is that the command refers to one slab of level 3 memory, and one word of level 1 memory in each subunit, which comes to the same amount (as the size of a slab is that necessary for each level 1 subunit to get one word of it). The access is allowed to be misaligned. In fact, the slab of level 3 memory does not even need to be contiguous in an absolute sense; rather, it has to be contiguous in the segmented sense that if memory is divided into a set of power-of-2-sized blocks each of which wraps around, it's contiguous from the point of view of some block. (So, for instance, in a hypothetical system with 8 level 1 subunits per level 2 unit, [27]/3 [28]/3 [29]/3 [30]/3 [31]/3 [16]/3 [17]/3 [18]/3 would be contiguous in this sense, with block size 16.) The third argument gives the block size needed for the block in question to be considered contiguous, and must be a constant integer that's a power of 2, and at least as great as the number of level 1 subunits of a level 2 unit. (The level 1 subunit with identifier 0 gets the first word of the slab, [27]/3 in the example above, the subunit with identifier 1 gets the second, and so on.) Indirect memory addresses can be given for the first two arguments, with the same restrictions as in the previous case. This instruction is very fast compared to other checkout instructions, taking around twice as long to execute as arithmetic instructions. Additionally, two move/2 instructions with the same arguments, the first from level 3 to level 1 and the second in the other direction, can take less time between them to execute than either would individually; the condition for this to happen is that the block size must be set to the amount of level 3 memory per unit (or higher), that they are separated by nothing but arbitrary level 1 instructions and level 2 checkout instructions, and that the addresses that the arguments refer to can be determined at compile time not to change (either by being direct addresses, or indirect addresses where the address of the address is readonly).

can move or (possibly readonly) copy between level 1 memory and level 3 memory. The semantics of this need a little explanation, as referring to level 1 memory from a level 2 command is mostly meaningless. What happens is that the command refers to one slab of level 3 memory, and one word of level 1 memory in each subunit, which comes to the same amount (as the size of a slab is that necessary for each level 1 subunit to get one word of it). The access is allowed to be misaligned. In fact, the slab of level 3 memory does not even need to be contiguous in an absolute sense; rather, it has to be contiguous in the segmented sense that if memory is divided into a set of power-of-2-sized blocks each of which wraps around, it's contiguous from the point of view of some block. (So, for instance, in a hypothetical system with 8 level 1 subunits per level 2 unit, would be contiguous in this sense, with block size 16.) The third argument gives the block size needed for the block in question to be considered contiguous, and must be a constant integer that's a power of 2, and at least as great as the number of level 1 subunits of a level 2 unit. (The level 1 subunit with identifier 0 gets the first word of the slab, in the example above, the subunit with identifier 1 gets the second, and so on.) Indirect memory addresses can be given for the first two arguments, with the same restrictions as in the previous case. This instruction is very fast compared to other checkout instructions, taking around twice as long to execute as arithmetic instructions. Additionally, two instructions with the same arguments, the first from level 3 to level 1 and the second in the other direction, can take less time between them to execute than either would individually; the condition for this to happen is that the block size must be set to the amount of level 3 memory per unit (or higher), that they are separated by nothing but arbitrary level 1 instructions and level 2 checkout instructions, and that the addresses that the arguments refer to can be determined at compile time not to change (either by being direct addresses, or indirect addresses where the address of the address is readonly). There's also a form of checkout/2 for moving or (possibly readonly) copying between level 1 and level 5 memory. This is a combination of the preceding two cases, with an even more insane third argument. It again moves one slab (this time, from level 5 memory) to one word in each level 1 subunit of the level 2 unit; however, this time, the slab itself must be correctly aligned. The third argument can be a constant or memory address in level 1 (with the usual consistency restriction, that it must have the same value in every level 1 subunit of the level 2 unit); it is bitwise-XORed with the identifier of the level 1 subunit to determine which element of the slab corresponds to which level 1 subunit. (Thus, despite the alignment restriction, it's possible to checkout any level 5 word to any level 1 word by picking an appropriate value to XOR with.) This has the same performance properties as the 3-to/from-5 transfer, being very slow unless copying or read-only-copying from read-only level 5 memory, where it's very slow except fast if the memory was accessed recently, for certain values of "recently" which are far from obvious and do not have to be documented. Read-only copies into level 5 memory are not allowed via this method.

Checkout instructions cannot copy memory around within a level (not even if it's discarded first). However, level 1 memory can be copied using the "arithmetic" command mov/1 , and level 3 memory can be copied by checking it out into level 1, doing the mov there, and checking it out in the other direction again, thus triggering the faster checkout in which two checkouts can be faster than one.

Discard

The discard/n command discards data, where n can be 1, 2, 5, or 6. It operates on memory at the same level as the command itself (except that discard/2 operates on level 3 memory), causing it to become nothing rather than something. (If the memory already was nothing, it's undefined behaviour.) It takes two arguments: the (possibly indirect, using only memory on the same level to determine the indirection) address of the first element to discard, and the number of words to discard, which must be either a constant or a memory address on the same level holding an integral number of words to discard.

Arithmetic

Arithmetic takes place on level 1, and has many different commands (mostly split into two sets, those which calculate using (signed) integers, and those which calculate using floating-point). For bitwise operations, integers are interpreted as two's-complement. Floating-point calculations are at least as fast as integral calculations, and possibly faster. Overflow or underflow causes undefined behaviour (whether integral or floating-point). There are two-argument versions, which take a constant or memory location a and memory location b, overwriting their second argument to store the result; there are also three-argument versions, which take a constant or memory location a and a constant or memory location b, storing their result in the third argument, a memory location (which can be "nothing" or "something" beforehand). All memory locations involved must be in level 1; whether indirect memory locations are allowed is profile-dependent, and if they are the memory location used to do the indirection must also be in level 1.

"Unary" instructions (which still take arguments as above, but for which the three-argument form is somewhat useless, and where b is allowed to be nothing rather than something in the two-argument form): mov/1 has result a . The value of b is simply ignored. cnvi/1 has the result produced by converting the floating-point a to an integer. cnvf/1 has the result produced by converting the integral a to a floating-point number. iszi/1 has the result 1 if a is the integer 0, or 0 otherwise. isni/1 has the result 1 if a is a negative integer, or 0 otherwise. (This is usable with subi/1 to do a less-than or similar operations.)

is allowed to be nothing rather than something in the two-argument form): Integer instructions: addi/1 has result a + b . subi/1 has result a - b . muli/1 has result a * b . divi/1 has result a / b , rounding towards 0. modi/1 has a result equal to the remainder when a is divided by b . (Its behaviour is undefined if a or b is negative.) andi/1 has result a bitwise-and b . iori/1 has result a bitwise-inclusive-or b . xori/1 has result a bitwise-exclusive-or b . (Use this to complement integers.) lshi/1 has result b times 2 to the power of a . (Undefined behaviour if a is negative.) rshi/1 has result b divided by 2 to the power of a . (Rounding towards 0; and undefined behaviour if either a or b is negative.)

Floating-point instructions: addf/1 has result a + b . subf/1 has result a - b . mulf/1 has result a * b . divf/1 has result a / b .



Profile-dependent, other floating-point instructions, such as exponentiation or trigonometric instructions, may exist. They don't have standard names or semantics, though (yet).

Identification

The id/n instruction (available for n = 1, 2, 3, and 5) is used to determine which subunit is being executed. It takes one argument, a memory location in level 1, 1, 1 or 3, and 5 respectively, and writes an integer into that memory location (which could contain nothing or something beforehand) containing the identifier of the level n subunit within the level n+1 unit (i.e. 0 for the first subunit within the unit, 1 for the second, 2 for the third, etc.) There's also idthree/1 and idtwo/1 instructions that do the same thing as id/3 and id/2 respectively, but with different synchronization properties due to the different level number; idthree/1 can only take a level 1 memory address as an argument ( id/3 can also take a level 3 memory address as an argument).

Imperative control flow

There are three branching instructions, abstain/1 (that branches on lists of level 1 instructions), if/2 (that branches on lists of level 1 and 2 instructions), and if/6 (that branches on lists of level 6 instructions). They all take two arguments; the first is a memory address (in level 1, 1, 6 respectively; constants are not allowed as it would be a little silly), and the second is a list of commands (enclosed, as always with command list arguments, in braces) that contains only commands of the appropriate level. (It can indirectly contain commands of other levels, if those are allowed within command lists that are arguments to commands that are of the right level; this only happens with if/6 .) They also all have the same semantics: commands inside the list given by the second argument are executed if and only if the first argument contains an integer other than 0. (If the first argument contains "nothing" rather than a value, it causes undefined behaviour, as always.) The difference is in the timing properties when the value happens to be 0; if just skips the command list, which is almost as fast as arithmetic, whereas abstain "executes" every command in the list but causes it to do nothing, which can be much slower despite coming to the exact same thing. (Implementations might optimise abstentions into more typical conditionals in special cases; a common one is when all the level 1 subunits of a level 2 unit happen to abstain from the code simultaneously, and programs that require abstain/1 should probably be written to maximise the probability of that happening.) As usual with level 2 commands which refer to level 1 data, the behaviour is undefined if the level 1 data is not consistent across level 2 subunits of that level 1 unit. Note that despite being control flow, if/2 has the usual cooperative meaning, with all its subunits collaborating on the calculation of the conditional, then continuing on with their program; it's just that the program they continue with may or may not contain the command list, depending on the branch taken. (This is why, say, nop/3 is banned within if/2 ; it just wouldn't make any sense if some level 2 subunits of a level 3 unit took the branch, and others didn't.)

There are also three-argument versions of the above commands, which take another command list as their third argument. It is executed only if the second argument isn't, i.e. it's executed if the first argument, interpreted as an integer, does contain 0.

As well as conditionals, there are imperative loops, while/2 and while/6 , which work the same way as the corresponding if statements except that if the branch is taken, then after the command list has executed the while command executes again (the same way "while loops" typically work in other languages). There is no level 1 equivalent. There is also, profile-dependently, the possibility for if/5 and while/5 to exist, which use level 5 memory addresses and level 5 commands in their command lists, but these commands are optional and may not exist on many profiles.

Parallelization

The command parloop/4 is really the heart of Checkout, and is the preferred way to do looping when possible, despite not being completely general. It takes three arguments; the first two are constant integers or memory locations in level 6 (not 5!) holding integers, and the third is a list of commands (allowing any level 1, 2, or 3 command, and also nop/4 if it exists).

What the command does is to set up a number of level 3 subunits, each of which contains a number of level 2 subunits, each of which contains a number of level 1 subunits; and then waits for them all to finish executing. The command list is the common command list which they all run, in the normal cooperative manner (i.e. a level 1 command is run by each level 1 subunit independently, whereas a level 2 command is run independently by each level 2 unit, but the level 2 units run it via cooperation of the level 1 subunits they contain, so they can all be thought of as running a "fraction" of the level 2 command at the same time, with the subunits cooperating to produce the effect of a level 2 command.) The number of level 1 subunits of a level 2 unit is fixed to a profile-dependent value; the other two numbers, though, of level 2 subunits of a level 3 unit and level 3 subunits of the level 4 unit that run the command, are choosable by the program (although they each have profile-dependent maximums), and are the first and second arguments to parloop/4 respectively.

The performance properties of parloop/4 are rather bizarre. Although all the subunits involved must behave "effectively in parallel", it is unlikely that for large numbers of subunits they will all be able to run truly in parallel (that is, at the same time), and in fact with a typical GPU architecture it would be inefficient to do so. Instead, what happens is that level 1 subunits of a running level 2 unit run truly in parallel, but level 2 subunits of a level 3 unit run in a pre-emptive context-switched fashion that will be familiar to people used to threads running on CPUs (that is, one runs until its timeslice expires or it is waiting for a time-consuming command like checkout/2 from level 5 data or a command that requires synchronization like nop/3 to complete, then another starts running, and so on, until all are complete). Level 3 subunits of a level 4 unit can use a mix of these two approaches, with a small number running truly in parallel, and context-switched in and out as necessary. Additionally, level 3 subunits of a level 4 unit might even be serialised, with a small group running until completion, then another small group running to completion, and so on; profiles that do this will obviously not contain a nop/4 instruction, as it would be impossible to implement. The result of all this is that parloop/4 generally takes a length of time longer than true parallelisation, but faster than an imperative loop doing the whole thing would take; where it is on that scale depends on the profile and the values involved, and is perhaps best optimised via experimenting with the subunit counts (particularly the count of level 2 units in a level 3 unit) to see what values work best.

Interleaving

Similar to parallelisation, interleave/5 executes multiple pieces of code simultaneously, each of which is given as an argument, as a list of level 4 commands; a level 4 unit is created to execute each one, and destroyed again when it finishes executing, with the interleave/5 command itself finishing when all the subunits are gone again. This is done in a pre-emptive context-switched fashion, so it takes approximately the same length of time to run as it would take to run them in series, but is potentially slightly faster due to the ability for one to run while the others are waiting. The maximum number of arguments interleave/5 can have is profile-dependent, but typically very small, and might even be just 1. (It is nonetheless useful even in this case, though, because it is the only command capable of creating level 4 units, and thus is required to run parloop/4 and be able to create level 3, 2, or 1 units.)

Likewise, interleave/6 is the only way to create level 5 subunits, but it always takes a fixed number of arguments, one per profile (how many and what profiles exist, and which arguments correspond to which profile, are implementation-dependent). Different profiles are often very different, with different values for different profile-dependent restrictions. The arguments are, as could probably be guessed, lists of level 5 and 6 commands, and other than the fact that each level 5 unit is created with a different profile, this works analogously to interleave/5 .

Memory allocation

Memory allocation is done via the malloc/6 and free/6 commands. malloc/6 takes two arguments. The first is an integer constant or level 5 memory location containing an integer; what the command does is to create an amount of level 6 memory whose length is the number of words given by the first argument, and place the address of the first word of the created memory (which might not necessarily be the next "apparently available" value, because level 6 addresses need not start from 0) in the level 5 memory location given as the second argument (in every level 5 unit that exists; the location is thus typically [0]/5 ). It takes a level 5 memory location because it's entirely possible that no level 6 memory exists when it is called. free/6 does the opposite, destroying level 6 memory, and takes one argument, a level 6 memory address containing the return value from the malloc/6 instruction that created it (and it's undefined behaviour if the memory in question has already been destroyed, or the argument isn't a malloc/6 return value). There is no error condition for being out of memory, as level 6 memory is conceptually infinite; practical implementations will have to error out if it happens.

malloc/5 and free/5 may also (profile-dependently) exist, and do the same thing as malloc/6 and free/6 respectively, except that they create and destroy level 5, rather than level 6, memory, and all the memory locations involved are level 5 locations (and only refer to the one level 5 unit, not all of them). Unlike level 6 memory, a fixed amount of level 5 memory is created as a level 5 unit is. However, there are two (profile-dependent) possibilities: either a large amount is created (in which case malloc/5 and free/5 typically wouldn't exist), or a small amount is created, in which case these commands are available in order to get more, up to a point. If the requested amount of level 5 memory cannot be created, an integer 0 is written into the memory address that malloc/5 's second argument refers to.

It is undefined (and quite possibly very bad, especially on a GPU-based implementation, where memory leaks can be relatively fatal) behaviour for memory created via these instructions to not be destroyed again before the program ends.

I/O

Whether I/O commands exist are profile-dependent. (Checkout implementations will typically have an "I/O profile", which contains the I/O commands, but is rather locked down in terms of its capability to do anything else.) I/O is very simple, with just two commands: in/5 which reads a character from some source of input and stores it in a level 5 memory location (the only argument), and out/5 which writes a character to some sort of output based on the value in a level 5 memory location or constant. The mapping from memory location values to characters, including whether the locations are interpreted as integer or floating point (although integer seems more likely...), are profile-dependent.

Other I/O commands may also exist, again in a profile-dependent way, although their names and semantics have not been standardised yet. In particular, an FFI to other languages would seem useful, allowing the use of Checkout for very high performance sections, and other higher-level languages for more readable and understandable code.

Implementation considerations

Checkout is designed to map quite well to the way GPU architecture works. In particular, some levels correspond naturally to hardware concepts in a GPU (level 1 to a thread, 2 to a warp or halfwarp, 6 to an entire computer system), and others to software concepts commonly used in GPU programming (level 3 to a block, 4 to a kernel, 5 to a stream in terms of commands, or an individual CPU or GPU in terms of data). Other parallel architectures may also fit onto the Checkout model quite well, though; in particular, the way a modern CPU works is quite similar, but potentially has other restrictions on its various levels of abstraction. (Both modern CPUs and GPUs share the slow "checkout instruction" problem, incidentally; modern CPUs work by automatically generating checkouts, which is hard to do in an efficient manner. CPUs have a tendency to spend most of their time waiting for data to arrive from memory, which is a notable "distance" away in terms of response speed.)

Computational class

Checkout is pretty clearly a bounded-storage machine (or Turing-complete allowing bignum words, which would rather defeat the point of the language but which isn't barred by the specification). It contains all the control flow needed for Turing-completeness ( if/6 and while/6 combined with arithmetic), is capable of doing arbitrary effects at an arbitrary point (just chain a huge string of interleave , parloop , and checkout instructions, then abstain all the level 1 subunits you aren't using, in order to run commands of any level, although that's ugly and inefficient), and can access near-unlimited memory (although conceptually infinite, malloc/5 obviously eventually breaks down, if due to nothing else than running out of integers to use as addresses in a non-bignum implementation).

See also