I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.

Call this Code Generator Tactic 2 . It usually applies to larger, more complicated, (Not so Little) Languages where a parser and lexer are needed to parse the input.

Call this Code Generator Tactic 1 . It usually applies to Little Languages that are so simple that the input processing is trivial.

The first whenshape example turns into the code that we'd otherwise have to write, exactly as we saw above in our first version of nleaves() , annotated with a useful comment:

Although this does the job, it's rather long-winded compared to the beauty of the Haskell version.

Ok, in our second version, we're actually going to need to parse the %when line, check that it's syntactically valid, and then work out what C code to produce.

The obvious question is how will we parse the line? We're working in Perl, so we can't use Yacc. However, Perl has an equivalent parser generator, the module Parse::RecDescent. This could very easily do the work, but sometimes it's more fun to do simple parsing ourselves via Perl regexps. So let's do that:-)

In particular, let's gradually remove prefixes from the current line as we match tokens. This means that our "whatsleft" string will always be a suffix of the current line, and gives us a simple way of positioning the error message beautifully (the offset is the difference in lengths of the original line and the part we have left). Let's write an error reporting routine first: # # fatal( $whatsleft, $msg ); # Given $whatsleft (a suffix of $currline) and a message $msg, print # a standard-formatted fatal error, pointing to the correct place in # the line (using length(currline) - length(whatsleft) as the basic # source of indentation information), and die. # fun fatal( $whatsleft, $msg ) { $currline =~ s/^\t/ /; # expand tabs to spaces $whatsleft =~ s/^\t/ /; my $pos = length($currline) - length($whatsleft) - 1; my $indent = ' ' x $pos; my $err = "$currline$indent^ Error at line $lineno: $msg

"; die "

$err

"; } (Note that we have to expand leading tabs to spaces beforehand, which means that we have to assume hard tabs of some specific width - I've chosen 8 spaces).

Next, let's offload the work of handling a %when line to a separate handle_when() function. Start by modifying handle_line() to call it: fatal( $line, "%when expected" ) unless $line =~ /^%when/; handle_when( $line, $indent, $ofh );

Then write handle_when()'s skeleton, filling it with code that calls a second new function parse_when() that will parse the %when line into it's component pieces, then writing the line (as a comment) to $ofh: # # handle_when( $line, $indent, $ofh ); # Ok, $line starts with a %when (still in the line), and we've already # removed any leading indentation (in $indent). Handle the %when line # and [eventually] it's following '{', printing valid C output to $ofh. # fun handle_when( $line, $indent, $ofh ) { my( $command, $type, $var, $shape, $arglist ) = parse_when( $line ); print "debug: found $command type=$type, var=$var, shape=$shape, ". "arglist=$arglist

"; # produce the %when comment line print $ofh "$indent// $line:

"; }

Parsing the %when line

Right, now let's work out how to parse our %when line, recall that the grammar of a %when line is: when : '%when' type varname 'is' shape [ '(' arglist ')' ] arglist : arg | arg ',' arglist arg : type paramname where type, varname and shape are all simple identifiers, and arglist is a comma separated list of arguments, where each argument is a typename followed by a parameter name (another simple identifier).

To start parsing this, let's start writing parse_when(), it's interface is already set by the example call above, and implement the first stanza of regex-based parse code to extract and remove the %when command itself: # # my( $command, $type, $var, $shape, $arglist ) = parse_when( $line ); # After checking that $line starts with a %when, parse the rest of the # line. If it parses return (command, type, var, shape, arglist) # otherwise die via fatal() # # '%when' TYPE(ID) VAR(ID) 'is' SHAPE(ID) [ '(' ARGLIST ')' ] # where ARGLIST is a comma separated list of typename paramname pairs, # where the typename is usually an ID, but can be a '-' # fun parse_when( $line ) { $line =~ s/^(%\S+)\s*//; my $command = $1; $command =~ s/shape$//; # %when or %whenshape etc.. reduce to %when print "debug: parse_when: command=$command, line=$line

"; return ( $command, "", "", "", "" ); }

If you'd like to follow along, I've prepared a series of intermediate stages in building cpm-v2. Download the cpm-v2.tgz tarball, extract it and cd inside the cpm-v2 directory.

cpm-v2-stage1 comprises this first stage of development. Run it via: ./cpm-v2-stage1 nleaves.cpm and you'll see it display various debugging messages including: debug: parse_when: command=%when, line=tree t is node( tree l, tree r ) debug: found %when type=, var=, shape=, arglist= These show that parse_when() correctly split the %when command off from the rest of the line, returned just the command, and that handle_when() received the %when command, with the type, variable, shape and arglist all empty.

Next, we want to extract the next three tokens, or words, from the line: The first word is the typename, the second is the variable name, and the third is the plain word 'is'.

In Perl, the regex pattern \w+ matches a word (a non-empty maximal length sequence of alphanumeric characters), and we see that there must be some whitespace \s+ immediately after that. So $line =~ s/^(\w+)\s+// will match a word and some following whitespace at the beginning of the line, remove both, and remember the word that was matched in $1. This allows us to replace the last 2 statements in parse_when() with: my $sofar = $command; fatal( $line, "ID (type name) expected after <>" ) unless $line =~ s/^(\w+)\s+//; my $type = $1; $sofar .= " $type"; fatal( $line, "ID (var name) expected after <>" ) unless $line =~ s/^(\w+)\s+//; my $var = $1; $sofar .= " $var"; fatal( $line, "'is' expected after <>" ) unless $line =~ s/^is\s+//; return( $command, $type, $var, "", "" );

Running this version (cpm-v2-stage2) on nleaves.cpm, via: ./cpm-v2-stage2 nleaves.cpm You will see the encouraging: debug: found %when type=tree, var=t, shape=, arglist= This shows that handle_when() has now received the typename (tree) and the variable name (t).

Ok, now we must tackle the shape and it's optional argument list: The next token (another word) is the shape (or constructor) name. That's easy: fatal( $line, "ID (constructor name) expected after <>" ) unless $line =~ s/^(\w+)\s*//; my $shape = $1; $sofar .= " $shape"; (Note that this time there is optional whitespace following the shape name, so the regex is \s* not \s+).

After the shape name, there are two possibilities. Either the line comprises nothing but whitespace, and we're done (no arguments): # that may be all.. return( $command, $type, $var, $shape, "" ) if $line =~ /^\s*$/;

Or we need a bracketed argument list. Check for, and remove, both brackets: # or we need '(' arglist ')' fatal( $line, "'(' expected after <>" ) unless $line =~ s/^\(\s*//; fatal( substr($line,-1,1), "')' expected at end of line" ) unless $line =~ s/\s*\)$//;

Now all we have left in $line is the argument list. Of course, we should probably check that the argument list has the correct syntax, but noticing that our parse_when() function has the goal of splitting the %when line down to command, typename, variable name, shapename and argument list (as a single string), we may observe that we've met our goal. (Remember: tools don't have to be perfect).

So let's simply declare sucess and return: # should have an arglist left now. should really check it's # syntactically valid but let's not bother... return( $command, $type, $var, $shape, $line );

This gives us the finished version of parse_when($line): # # my( $command, $type, $var, $shape, $arglist ) = parse_when( $line ); # After checking that $line starts with a %when, parse the rest of the # line. If it parses return (command, type, var, shape, arglist) # otherwise die via fatal() # # '%when' TYPE(ID) VAR(ID) 'is' CONS(ID) ( '(' ARGLIST ')' ) # where ARGLIST is a comma separated list of typename paramname pairs, # where the typename is usually an ID, but can be a '-' # fun parse_when( $line ) { $line =~ s/^(%\S+)\s*//; my $command = $1; $command =~ s/shape$//; # %when or %whenshape etc.. reduce to %when my $sofar = $command; fatal( $line, "ID (type name) expected after <>" ) unless $line =~ s/^(\w+)\s+//; my $type = $1; $sofar .= " $type"; fatal( $line, "ID (var name) expected after <>" ) unless $line =~ s/^(\w+)\s+//; my $var = $1; $sofar .= " $var"; fatal( $line, "'is' expected after <>" ) unless $line =~ s/^is\s+//; $sofar .= " is"; fatal( $line, "ID (constructor name) expected after <>" ) unless $line =~ s/^(\w+)\s*//; my $shape = $1; $sofar .= " $shape"; # that may be all.. return( $command, $type, $var, $shape, "" ) if $line =~ /^\s*$/; # or we need '(' arglist ')' fatal( $line, "'(' expected after <>" ) unless $line =~ s/^\(\s*//; fatal( substr($line,-1,1), "')' expected at end of line" ) unless $line =~ s/\s*\)$//; # should have an arglist left now. should really check it's # syntactically valid but let's not bother... return( $command, $type, $var, $shape, $line ); }

This gives us cpm-v2-stage3. Running: ./cpm-v2-stage3 nleaves.cpm Gives the encouraging: debug: found %when type=tree, var=t, shape=node, arglist=tree l, tree r We've now finished parsing the %when line.

Generating some code

Next, we need to start generating some code. To refresh our memory, handle_when() currently reads: # # handle_when( $line, $indent, $ofh ); # Ok, $line starts with a %when (still in the line), and we've already # removed any leading indentation (in $indent). Handle the %when line # and [eventually] it's following '{', printing valid C output to $ofh. # fun handle_when( $line, $indent, $ofh ) { my( $command, $type, $var, $shape, $arglist ) = parse_when( $line ); print "debug: found $command type=$type, var=$var, shape=$shape, ". "arglist=$arglist

"; # produce the %when comment line print $ofh "$indent// $line:

"; }

Now we modify this, removing [eventually] from the comment, commenting out the debug statement, and appending some text at the end, giving: # # handle_when( $line, $indent, $ofh ); # Ok, $line starts with a %when (still in the line), and we've already # removed any leading indentation (in $indent). Handle the %when line # and it's following '{', printing valid C output to $ofh. # fun handle_when( $line, $indent, $ofh ) { my( $command, $type, $var, $shape, $arglist ) = parse_when( $line ); #print "debug: found $command type=$type, var=$var, shape=$shape, ". # "arglist=$arglist

"; # produce the %when comment line print $ofh "$indent// $line:

"; # produce the if-line my $test = "${type}_kind($var) == ${type}_is_${shape}"; print $ofh "${indent}if( $test )

"; # get the next line, and check that it's a bare '{', print it out my $line = nextline(); fatal( $line, "$command: { expected at eof" ) unless defined $line; fatal( $line, "$command: bare { expected at same indentation, " ) unless $line =~ /^$indent\s*\{\s*$/; print $ofh $line; # then we will need to generate code to "take the object apart" # For now, a placeholder: my $takeapart = "// TAKE APART CODE GOES HERE

"; print $ofh "${indent}\t$takeapart" if $takeapart; }

This gives us cpm-v2-stage4. Running: ./cpm-v2-stage4 nleaves.cpm Prints a bit of debugging, and terminates. If we look at the generated nleaves.c the body of nleaves now reads: int nleaves( tree t ) { // %when tree t is leaf(string name): if( tree_kind(t) == tree_is_leaf ) { // TAKE APART CODE GOES HERE return 1; } // %when tree t is node( tree l, tree r ): if( tree_kind(t) == tree_is_node ) { // TAKE APART CODE GOES HERE return nleaves(l) + nleaves(r); } }

We're nearly there! The if-statements look perfect, and our placeholder comment is inserted at the correct position inside the then-block, and properly indented.

The missing take apart code that we need to take a tree leaf(string name) apart is: string name; get_tree_leaf( t, &name ); We would like to generate this code from the type name (tree), the constructor name (leaf) and the arglist (string name).

Similarly, the code to take a tree node( tree l, tree r ) apart is: tree l; tree r; get_tree_node( t, &l, &r );

We want to generate this code from the type name (tree), the shape name (node) and the arglist (tree l, tree r). So the code we wish to generate comprises a set of local variable declarations, followed by a call to break the node apart.

Ok, so that seems relatively easy to code. Let's invent a new function take_object_apart() to do this job, first we replace the placeholder code with a call: # then inject the takeapart line $takeapart my $takeapart = take_object_apart( $type, $var, $shape, $arglist ); print $ofh "${indent}\t$takeapart" if $takeapart;

Then we define the skeleton of take_object_apart(), which will start splitting the $arglist apart, first splitting on commas (with optional whitespace before and after), giving the array of arguments, and then splitting each argument on whitespace into two pieces - the type and the name: # # my $breakdown = take_object_apart( $type, $var, $shape, $arglist ); # Generate a single long line of C code that will take the object # apart, into it's component arguments: # - Declare variables for all the arguments in $arglist. # - then call the get_{type}_{shape}() deconstructor with the # addresses of each of the argument variables. # Return the take apart string.. # fun take_object_apart( $type, $var, $shape, $arglist ) { # if the shape has no arguments, then we don't need to take it apart return "" unless $arglist; # ok, we have one or more arguments, comma separated: my @arg = split(/\s*,\s*/, $arglist ); foreach my $arg (@arg) { my( $argtype, $argname ) = split( /\s+/, $arg, 2 ); print "debug: toa: type $type, shape=$shape, ". "argtype=$argtype, argname=$argname

"; } return "// TAKE APART CODE GOES HERE"; }

This gives us cpm-v2-stage5. Running: ./cpm-v2-stage5 nleaves.cpm Gives us, among the debugging: debug: toa: type tree, shape=leaf, argtype=string, argname=name debug: toa: type tree, shape=node, argtype=tree, argname=l debug: toa: type tree, shape=node, argtype=tree, argname=r This shows that we can correctly split the argument list into pieces.

Ok, to generate the "take the object apart" string, we need to build two things: A list of all the argument names, so that we can form the get_ call with the address of each argument name.

call with the address of each argument name. A string containing all the argument variable declarations. We build them both as follows: my $declns = ""; my @argname; foreach my $arg (@arg) { my( $argtype, $argname ) = split( /\s+/, $arg, 2 ); $declns .= "$arg; "; push @argname, $argname; }

Then we generate the final "take the object apart" code by: my $argstr = join( ', ', map { "&$_" } @argname ); my $decons = "get_${type}_${shape}( $var, $argstr );"; my $result = "$declns$decons

"; return $result;

Our finished version of take_object_apart() now reads: # # my $breakdown = take_object_apart( $type, $var, $shape, $arglist ); # Generate a single long line of C code that will take the object # apart, into it's component arguments: # - Declare variables for all the arguments in $arglist # - then call the get_{type}_{shape}() deconstructor with the # addresses of each of the argument variables. # Return the take apart string.. # fun take_object_apart( $type, $var, $shape, $arglist ) { # if the shape has no arguments, then we don't need to take the object apart return "" unless $arglist; # ok, we have one or more arguments, comma separated: my @arg = split(/\s*,\s*/, $arglist ); my $declns = ""; my @argname; foreach my $arg (@arg) { my( $type, $name ) = split( /\s+/, $arg, 2 ); $declns .= "$arg; "; push @argname, $name; } my $argstr = join( ', ', map { "&$_" } @argname ); my $decons = "get_${type}_${shape}( $var, $argstr );"; my $result = "$declns$decons

"; return $result; }