
A SIMPLE DECOMPILER

William May

William May works for MassMicro Systems. He is involved with Macintosh II video products and can be reached at 303A Ridgefield Circle, Clinton, MA 01510.

On the surface a decompiler sounds like a hackers' tool, a utility for snooping through code. Although programs such as Steve Jasik's MacNosy for the Macintosh are certainly useful, the decompiler I describe in this article has more prosaic, less imaginative uses. In my case its purpose was to speed up a product that included a spreadsheet-like component.

Fast recalculation speed is an important objective of a spreadsheet, and I felt this would best be achieved by converting the user's expressions into a form that could be computed efficiently---that is, a compiled form. Although the compilation process is straightforward, a problem arises when the user wants to see and/or change an expression: The application must be able to recreate the original expression, or source code.

A simple way to satisfy this need would be to store both the source and the compiled expressions. This method is used in Smalltalk-80, which also provides interactive viewing and modification of compiled expressions. I felt, however, that with the limited syntax of spreadsheet expressions, I should be able to recreate the source code from the compiled code itself.

The problem turned out to have been solved already, and the algorithm was described in an article by P.J. Brown.[1] As you will note from the publication date (1976), the algorithm was devised before the introduction of spreadsheets or of microcomputers. Brown developed the algorithm as a component of an interpreted Basic system for minicomputers, where it was used to decompile Basic statements into source form. Given its heritage, it is clear that the algorithm is capable of handling a far wider range of expressions than my spreadsheet requires.

The algorithm has a few points that should make it of interest to DDJ readers. First, as I said, it can handle an extremely wide range of expressions, which allows it to find use in many applications. Second, the algorithm is driven by a table that is easy to set up and maintain. With its flexibility and simple setup, the algorithm is one of those tools that, once available, seems to find many unexpected uses. Finally, the subject of decompilers is worth exploring.

The remainder of this article covers my implementation of Brown's algorithm. The code, developed on a Macintosh in MPW C, should be portable to virtually any standard C compiler. I assume that readers are familiar with reverse Polish notation (RPN) and have some slight familiarity with compilers.

Example 1 shows some examples of using Brown's algorithm, with the object code on the left and the recreated source on the right. For presentation purposes, the object code uses printable characters and standard symbols, which would not usually be the case in an application. Besides handling the normal arithmetic and logical operators, Brown's algorithm can handle Lotus 1-2-3's @ functions, Basic's let...=... and if...then...else statements, and many others.

One important characteristic of Brown's algorithm is that it is forward scanning, meaning that it starts at the first byte of the RPN expression and works its way toward the end. The alternative, as you might guess, is backward scanning. Forward scanning has two advantages over backward scanning: the scanning algorithm can identify variable-length tokens, and the decompiler can emit (print) its results as it proceeds.

Backward scanning has the advantages of greater speed and better worst-case performance. A backward-scanning algorithm, however, produces its results in reverse, so a second pass is needed to unwind the results, which eliminates some of the speed advantage.

A more critical drawback of backward scanning is the difficulty of handling variable-length tokens. Because variable-length tokens are the norm, this is a serious problem. A spreadsheet, for example, will typically have tokens for operators (1 or 2 bytes), cell references (varying formats for absolute addresses, relative addresses, and so on), integers (4 bytes on a Macintosh), floating-point numbers (4 to 12 bytes, depending on type and format), strings (many formats), and so on.
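
To make the token-length problem concrete, a forward scanner can ask each token how long it is by looking at its leading opcode byte. The sketch below is offered only as an illustration; the opcode values and sizes are assumptions, not those of any particular spreadsheet.

#include <stddef.h>

/* Hypothetical token layout: each token starts with a 1-byte opcode,
   and the opcode determines how many bytes of data follow it. */
enum { TOK_ADD = '+', TOK_CELL = 'r', TOK_INT = 'i', TOK_FLOAT = 'd' };

static size_t token_length(const unsigned char *tok)
{
    switch (tok[0]) {
    case TOK_ADD:   return 1;        /* operator: opcode only               */
    case TOK_CELL:  return 1 + 4;    /* opcode plus packed row and column   */
    case TOK_INT:   return 1 + 4;    /* opcode plus a 4-byte integer        */
    case TOK_FLOAT: return 1 + 8;    /* opcode plus an 8-byte double        */
    default:        return 1;        /* single-byte tokens, as in Example 2 */
    }
}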

At some point programmers using a source-recreation algorithm must decide how faithful the recreation will be to the original text. A normal part of the compiling process will remove all "nonessential" elements, such as spaces, tabs, line feeds, comments, and so on.

There are two solutions to the question of recreation fidelity. The first option is to attempt strict fidelity, and the second is to recreate expressions in a canonical, or standard, format. Note that the approaches are not mutually exclusive: idiosyncratic formatting can be kept for some elements and ignored for others.

In my application---a spreadsheet---the scope for idiosyncratic formatting is limited. There are no line breaks, tabs, or comments; however, there is a problem with parentheses. The placement of parentheses in an expression can vary enormously from user to user, yet this information is lost during the conversion from infix to postfix notation. Users would certainly be dissatisfied if no parentheses were recreated and would undoubtedly prefer that their original placement be maintained.

An approach to handling this difficulty is to compile information on the location of parentheses into the object code. An expression evaluator would ignore the parentheses, and Brown's algorithm would use the information to recreate their exact placement. At the cost of a very small increase in object code size, the process of compiling and decompiling expressions becomes almost invisible to the spreadsheet user.
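
One way to picture this, as a sketch rather than the article's actual compiler: the compiler could emit a '(' token right after the subexpression it wraps, as the object code in Example 1 suggests. A toy evaluator then skips that token, while the decompiler's '(' row in Example 2 restores the characters. Everything in the snippet below, including the digit operands, is assumed purely for illustration.

#include <stdio.h>

/* Toy evaluator: digits 0-9 are operands, '+' adds, and '(' is the
   parenthesis marker.  '(' changes nothing on the value stack; it exists
   only so the decompiler can put the user's parentheses back. */
static double eval(const char *code)
{
    double stack[32];
    int top = 0;

    for (; *code != '\0'; code++) {
        if (*code >= '0' && *code <= '9') {
            stack[top++] = *code - '0';
        } else if (*code == '+') {
            top--;
            stack[top - 1] += stack[top];
        } else if (*code == '(') {
            /* ignored during evaluation */
        }
    }
    return top > 0 ? stack[0] : 0.0;
}

int main(void)
{
    printf("%g %g\n", eval("23+4+"), eval("23+(4+"));  /* both print 9 */
    return 0;
}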

The core of Brown's algorithm is a transformation table that describes how operators in the input stream are to be converted back into source code. Example 2 shows such a table. The example follows no particular "language" but illustrates several types of transformations that the algorithm can handle.

Column 1 of the table is an opcode, or identifier, symbol generated by the compiler; that is, the token the algorithm will see in the input stream. In the example I have used printable characters that correspond to their common usage; for example, + indicates addition. Generally the opcodes are encoded as 1- or 2-byte integers. Column 2 indicates the number of operands associated with the opcode. The examples show both unary and binary operators. The table could also include nullary operators (no operands) as well as more complex operators, such as if...then...else or @pv(pmt, int, term).

The following columns, 3 through 5, describe the strings that are to be emitted by the decompiler and their position relative to the operands. For example, + is a binary operator with no prefix_1, a prefix_2 (+), and no suffix. This means that + will apply to two operands, that there is no prefix to the first operand but there is a prefix to the second (+), and that there is no suffix. The expression op1 op2 + will thus be transformed into the string op1 + op2, as expected. Another example is =, which is a binary operator with a prefix_1 (let), a prefix_2 (=), and a suffix (;). In this case, op1 op2 = is transformed into let op1 = op2;, the classic Basic let statement.

Example 1: Source recreation.

Object code          Result
abc++                a+b+c
bcdfe+*/=            let b=c/d*f+e;
gabc+(*c,d,$=        let g=sum(a*(b+c),c,d);

Example 2: A transformation table.

typedef struct decomp_row {
    char  ident;      /* column 1: opcode as it appears in the object code  */
    short lex_type;   /* column 2: number of operands (UNARY_OP, BINARY_OP) */
    char *prefix_1;   /* column 3: emitted before the first operand         */
    char *prefix_2;   /* column 4: emitted before the second operand        */
    char *suffix;     /* column 5: emitted after the last operand           */
} decomp_row;

static decomp_row table[] = {
    {'!', UNARY_OP,  "-",    NIL, NIL},
    {'+', BINARY_OP, NIL,    "+", NIL},
    {'-', BINARY_OP, NIL,    "-", NIL},
    {'/', BINARY_OP, NIL,    "/", NIL},
    {'*', BINARY_OP, NIL,    "*", NIL},
    {'^', BINARY_OP, NIL,    "^", NIL},
    {'(', UNARY_OP,  "(",    NIL, ")"},
    {'$', UNARY_OP,  "sum(", NIL, ")"},
    {'=', BINARY_OP, "let ", "=", ";"},
    {',', BINARY_OP, NIL,    ",", NIL},
};
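
The five-column layout above has room for only two prefixes and one suffix, which covers unary and binary operators. Handling the ternary forms mentioned earlier, such as if...then...else, needs one prefix per operand. One way to generalize the row, offered only as a sketch and not as the layout used in Listing One, is an array of prefixes:

#define MAX_OPERANDS 4

typedef struct decomp_row_n {
    char  ident;                   /* opcode                                      */
    short n_operands;              /* how many operands it consumes               */
    char *prefix[MAX_OPERANDS];    /* prefix[k] precedes operand k+1; NIL if none */
    char *suffix;                  /* emitted after the last operand              */
} decomp_row_n;

/* A hypothetical three-operand if...then...else, encoded as the opcode '?':
   cond expr1 expr2 ?   would decompile to   if cond then expr1 else expr2  */
static decomp_row_n if_row = {'?', 3, {"if ", " then ", " else ", NIL}, NIL};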

Probably the best way to get a feel for Brown's algorithm is to step through an example. The example I will follow is the expression xy!z/=, which will be transformed into the string let x = -y/z;. My example will use the table in Example 2 as its basis.

The first token found in the scan is the operand x. Upon seeing the operand, the algorithm scans forward to find any operators that give rise to a prefix for the operand. In this case the next token is y, which is also an operand, not an operator. A level count is incremented (it is now 1).

The following token, !, is a unary operator. The general rule is that an operator with n operands decrements the level by n-1; because ! is unary, the level is decremented by 0 (no change). Another rule is then applied: if the current level is m, push prefix_(1 - m) for the operator. Because the level is 1, the algorithm would push prefix_0, which does not exist, so nothing is pushed on the stack for this operator.

The scan continues, finding the operand z, which increases the level (now 2), and then the binary operator /. Applying the decrementing rule, the level is decremented to 1. Using the prefix rule, the algorithm is once again called on to push prefix_0, which still does not exist. Finally, the operator = is seen. This is defined as a binary operator, so the level is decremented once again, to become 0. This indicates that the algorithm should push the prefix_1 for the = operator, which is the string let.

Finally, upon reaching the end of the input string, the stack elements are popped and their associated strings emitted, and then the operand itself, x, is emitted. At this point the output string is let x.

The scanning flow you have just observed is the basic pattern of Brown's algorithm. For each operand the algorithm scans ahead for operators that give rise to prefixes for the operand, pushing these prefixes onto the stack. When the forward scan is complete, the prefixes are popped from the stack and emitted, and the scan proceeds to the next token.

The scan now restarts at the next operand, y, and the same pattern is repeated. First, with the level at 0, the operator ! is seen. This results in pushing prefix_1 for !, which is -. At the end of the string, the = is seen once again, and its prefix_2 (=) is pushed on the stack. Now the forward scan terminates, the stack elements are popped and emitted, and finally the operand itself, y, is emitted. The output string is now let x = -y. Note that the use of the stack results in printing prefixes in the reverse of the order in which they are seen.

Once again the scan restarts, now at the operator !. When an operator is seen, the algorithm prints its suffix. In this case there isn't one, so you proceed. Next in line is the operand z. The usual routine is followed, resulting in the prefix_2 for / being printed. The output string is now let x=-y/z. Because z is followed by the operator /, which has no suffix, you proceed. Finally, you encounter the operator =, whose suffix, ;, is emitted. The input string has been entirely scanned, so the algorithm is complete, with the output string being let x=-y/z;.

Simple as this example is, it shows that the algorithm is inefficient for large volumes of data: the tail of the string may be scanned once for each operand in the string. This inefficiency, however, is not evident in an application such as a spreadsheet, where expressions are limited in size, or in a line-by-line interpreter, such as some forms of Basic.

The decompiler (see Listing One) uses several tools to provide supporting mechanics. These tools fall into three classes: input scanning, table handling, and stack handling.

Input scanning is done by the function advance_to_next_elem(). Scanning is much easier for a decompiler than it is for a compiler. Because it is scanning object code, not source, there is no need to handle white space, ends of lines, numeric conversions, and so on. The format of the input stream is much more predictable. Even so, a real scanner can get much more complicated than that shown here because it will need to identify a variety of operand formats, including integers, floating point, cell references, and so on.
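
Listing One's scanner is not reproduced here, but for the single-byte tokens used in these examples it can be as small as the following; the body is a guess for illustration, not the author's code. A production version would instead return p plus token_length(p), along the lines sketched earlier.

static char *advance_to_next_elem(char *p)
{
    return p + 1;    /* every token in the examples here is one byte long */
}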

Accessing the transformation table is the next important set of functions. The table-handling functions are responsible for searching the table and returning pointers to associated strings. I have used a very simple if inefficient design based on linear searches. A slightly more sophisticated scheme might sort the table and use binary searches.
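
As a sketch of the linear search (the function name is illustrative, not necessarily the one used in Listing One), a lookup over the Example 2 table might be:

#include <stddef.h>

/* decomp_row and table are the definitions from Example 2. */
static decomp_row *find_row(char ident)
{
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].ident == ident)
            return &table[i];
    return NULL;    /* not an operator, so it must be an operand */
}

With the table kept sorted on ident, the standard library's bsearch() could replace the loop.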

Finally, a stack is needed to store the results of the forward scans performed for each operator. These results are then unwound when the scan is complete. The stack functions shown can push and pop pointer-size objects. Although simple, this stack implementation is quite sufficient for this algorithm.
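
A minimal version of such a stack, with names and a fixed depth that are assumptions rather than Listing One's choices, might be:

#include <stddef.h>

#define STACK_DEPTH 64

static char *prefix_stack[STACK_DEPTH];
static int   prefix_sp = 0;

static void push(char *s)
{
    if (prefix_sp < STACK_DEPTH)
        prefix_stack[prefix_sp++] = s;
}

static char *pop(void)
{
    return prefix_sp > 0 ? prefix_stack[--prefix_sp] : NULL;
}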

An application program sees only one externally visible function: deparse(instr, outstr). The input string and output string cannot be the same. Some simple enhancements can be added if needed, such as a length check on the output string. Very little error checking should be needed in the decompiler. In theory at least, incorrectly formed expressions should have been caught by a compiler before they get to this stage. Bad things can happen, however, if an incorrectly formed expression is submitted for decompiling. In particular, if the RPN expression does not end with an operator, the algorithm will not terminate correctly, if it terminates at all.
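
Listing One holds the author's implementation. As a self-contained illustration of the pattern walked through above, the following compact sketch handles the Example 2 table and single-character operands, and nothing more; all of the names, the fixed stack depth, and the use of strcat() are assumptions made for the sketch, and it repeats the table and helpers shown earlier so that it compiles on its own. One detail worth calling out: the sketch ends each forward scan as soon as the level drops below zero, after pushing that operator's prefix, because any later prefixes belong to an earlier operand; the walkthrough above never shows the difference, since in that example nothing following the point where the level goes negative would contribute a prefix anyway.

#include <stdio.h>
#include <string.h>

#define UNARY_OP  1
#define BINARY_OP 2
#define NIL       ((char *)0)

typedef struct decomp_row {
    char  ident;                        /* opcode in the object code */
    short lex_type;                     /* number of operands        */
    char *prefix_1, *prefix_2, *suffix; /* strings to emit           */
} decomp_row;

static decomp_row table[] = {
    {'!', UNARY_OP,  "-",    NIL, NIL}, {'+', BINARY_OP, NIL,    "+", NIL},
    {'-', BINARY_OP, NIL,    "-", NIL}, {'/', BINARY_OP, NIL,    "/", NIL},
    {'*', BINARY_OP, NIL,    "*", NIL}, {'^', BINARY_OP, NIL,    "^", NIL},
    {'(', UNARY_OP,  "(",    NIL, ")"}, {'$', UNARY_OP,  "sum(", NIL, ")"},
    {'=', BINARY_OP, "let ", "=", ";"}, {',', BINARY_OP, NIL,    ",", NIL},
};

static decomp_row *find_row(char c)
{
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].ident == c)
            return &table[i];
    return 0;                           /* an operand, not an operator */
}

void deparse(const char *instr, char *outstr)
{
    const char *p;

    *outstr = '\0';
    for (p = instr; *p != '\0'; p++) {
        decomp_row *r = find_row(*p);
        char *stack[64];                /* prefixes found by the forward scan */
        int sp = 0, level = 0;
        const char *q;

        if (r != 0) {                   /* operator: just emit its suffix */
            if (r->suffix != NIL)
                strcat(outstr, r->suffix);
            continue;
        }
        /* Operand: scan ahead, pushing the prefixes that belong to it. */
        for (q = p + 1; *q != '\0'; q++) {
            decomp_row *op = find_row(*q);

            if (op == 0) {              /* another operand raises the level */
                level++;
                continue;
            }
            level -= op->lex_type - 1;  /* n-operand operator lowers it by n-1 */
            if (level == 0) {
                if (op->prefix_1 != NIL)
                    stack[sp++] = op->prefix_1;
            } else if (level < 0) {     /* prefix_2 goes before this operand */
                if (op->prefix_2 != NIL)
                    stack[sp++] = op->prefix_2;
                break;                  /* later prefixes belong to earlier operands */
            }
        }
        while (sp > 0)                  /* unwind the prefixes, newest first */
            strcat(outstr, stack[--sp]);
        strncat(outstr, p, 1);          /* then the operand itself */
    }
}

int main(void)
{
    char out[128];

    deparse("xy!z/=", out);
    printf("%s\n", out);                /* prints: let x=-y/z; */
    return 0;
}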

In my implementation the transformation table is compiled into the program. Others who desire more flexibility could modify the code to pass a pointer to the table as one of the parameters, allowing the use of multiple tables or changes at run time.
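
For instance, the entry point might take the table and its length as arguments; the signature below is hypothetical, not a change the article's listing actually makes.

/* Hypothetical variant: the caller supplies the transformation table,
   so different tables can be used at run time. */
void deparse_with_table(const char *instr, char *outstr,
                        const decomp_row *tbl, size_t n_rows);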

[1] P.J. Brown, "More on the Re-creation of Source Code from Reverse Polish," Software---Practice and Experience 7 (1976): 545-551.