Grammars are better if you also can add (some) semantics inline with the grammar rules. In particular, for the Python parser I’m building I need to control the AST node returned by each alternative, since the AST format is given.

[This is part 6 of my PEG series. See the Series Overview for the rest.]

Many grammars use the convention of allowing actions to be added to rules, usually a block of code inside {curly braces}. More precisely, actions are attached to alternatives. The code in an action block is usually written in the language that the rest of the compiler is written in, like C, augmented with some facility for referring to items in the alternative. In Python’s original pgen I did not add this feature, but for the new project I’d like to have it.

Here’s how we do it for the simplified parser generator I’m developing for this series of blog posts.

The syntax for actions is usually like this:

rule: item item item { action 1 } | item item { action 2 }

Because this makes the grammar more verbose, parser generators typically allow splitting rules across lines, e.g.

rule: item item item { action 1 }

| item item { action 2}

It makes the grammar parser a tad more complicated, but it’s important for readability, so I’ll use it.

An eternal question is when to execute the action block. In Yacc/Bison this is done once the rule is recognized by the parser, since there’s no backtracking. Executing each action exactly once means that it’s safe for actions to have global side effects (such as updating a symbol table or other compiler data structure).

In PEG parsers, with their unlimited backtracking, we have some choices:

Delay all actions until all is parsed. This would not be useful for my purpose, since I want to build up an AST during the parse.

Execute an action whenever its alternative is recognized, and require the action code to be idempotent (i.e. having the same effect no matter how many times it’s executed). This means an action may be executed but its result ultimately discarded.

Cache the action result, so each action is only executed the first time its alternative is recognized at a given position.

I’m going with the third option — we’re caching anyway using the packrat algorithm, so we might as well cache action results too.

Regarding what goes inside the {curlies}, tradition is to use C code with a $-based convention for referencing items in the recognized alternative (e.g. $1 to refer to the first item), and assignment to $$ to indicate the result of the action. This sounds awfully old-fashioned to me (I have memories of using assignment to the function name in Algol-60 to specify a return value) so I’ll go with something more Pythonic: inside the brackets you’ll need to place a single expression, whose value is the action value, and references to items are simple names giving the text of the item. As an example, here’s a simple calculator that can add and subtract numbers:

start: expr NEWLINE { expr }

expr: expr '+' term { expr + term }

| expr '-' term { expr - term }

| term { term }

term: NUMBER { float(number.string) }

When we run this with an input like 100 + 50 — 38 — 70 , it will compute the answer as it is recognizing the pieces, calculating ((100 + 50) — 38) — 70 , which of course comes out to 42 .

One little detail: in the action for term , the variable number holds a TokenInfo object, so the action must use its .string attribute to get the token in string form.

What do we do when an alternative has multiple occurrences of the same rule name? The parser generator gives each occurrence a unique name adding 1 , 2 , etc. to subsequent occurrences within the same alternative. For example:

factor: atom '**' atom { atom ** atom1 }

| atom { atom }

The implementation is boring, so I invite you to check out the repo and play with the code. Try this:

python3.8 -m story5.driver story5/calc.txt -g story5.calc.CalcParser

The visualization now allows you to go back and forth using the left and right arrow keys!

License for this article and the code shown: CC BY-NC-SA 4.0