Introduction

In this post I will demonstrate how to parse and calculate an arithmetic expression a general-purpose parser. When we’re done, we’ll be able to handle expressions such as 1 + 2 * -(-3+2)/5.6 + 3 , and hopefully you’ll have gained the tools to handle much more.

My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to show-case PlyPlus, a parser interface I’ve been working on-and-off on for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().

If you want to follow the examples on your computer, you can install PlyPlus with pip install plyplus .

Working knowledge of Python is required for the implementation section.

Grammars

For those of you who don’t know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text, by describing the rules that it’s made of.

Here an example of a pseudo-grammar for parsing 1 + 2 + 3 + 4 :



Rule #1 - add IS MADE OF add + number OR number + number



Or in EBNF:

add: add '+' number | number '+' number ;

Each pass of the parser will look for either add+number or number+number, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.

Here are the steps the parser will take:

number + number + number + number

first pass turns all numbers into a ‘number’ rule [number + number] + number + number

the parser found its first pattern! [add + number] + number

after converting the pattern, it finds the next one [add + number] add

The sequence of symbols has turned into a hierarchy of two simple rules: number+number and add+number , and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.

Operator Precedence

Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:

1 + 2 * 3 / 4 - 5 + 6



Is equivalent to:

1 + (2 * 3 / 4) - 5 + 6



We can express this structure in the grammar by nesting the rules:

add: add + mul | mul '+' mul ; mul: mul '*; number | number '*' number ;

By telling add that it operates on mul, and not on numbers, we are giving multiplications the precedence.

Let’s pretend-run this grammar on 1 + 2 * 3 * 4 with our magical parser that is in my head:

number + number * number * number number + [number * number] * number

the parser doesn’t know what a number+number is, so this is his next pick number + [mul * number] number + mul ???

Now we are in a bit of a pickle! The parser doesn’t know what to do with number+mul. We can tell it, but if we keep looking we’ll find that there are many possibilities we didn’t cover, such as mul+number, add+number, add+add, etc.

So what do we do?

Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!

This method might seem strange at first, but it makes total sense:

add: add '+' mul | mul '+' mul | mul ; mul: mul '*' number | number '*' number | number ;

But if mul can become add, and number can become mul, we have extra lines that do nothing. Removing them, we get:

add: add '+' mul | mul ; mul: mul '*' number | number ;

Let’s pretend-run on 1 + 2 * 3 * 4 again with this new grammar:

number + number * number * number

There’s no rule for number*number now, but the parser can “get creative” number + [number] * number * number number + [mul * number] * number number + [mul * number] [number] + mul [mul] + mul [add + mul] add

Success!!!

If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!

Running the parser

By now we have a fairly good idea of how we want our grammar to work. Let’s apply it and write an actual grammar:

start: add; // This is the top of the hierarchy add: add add_symbol mul | mul; mul: mul mul_symbol number | number; number: '[\d.]+'; // Regular expression for a decimal number mul_symbol: '\*' | '/'; // Match * or / add_symbol: '\+' | '-'; // Match + or -

You might want to brush up on regular expressions a bit, but otherwise this grammar is pretty straight-forward. Let’s run it on an expression!

>>> from plyplus import Grammar >>> g = Grammar("""...""") >>> print g.parse('1+2*3-5').pretty() start add add add mul number 1 add_symbol + mul mul number 2 mul_symbol * number 3 add_symbol - mul number 5

So pretty!

Study the tree and see what hierarchy the parser chose.

If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run pip install plyplus and paste the above commands inside python (make sure to put the actual grammar instead of ‘…’ 😉 ).

Shaping the tree

Plyplus automagically creates a tree, but it’s not very optimal. While putting number inside mul and mul inside add was useful for creating a hierarchy, now that we already have a hierarchy they are just a burden. We can tell Plyplus to “expand” (i.e. remove) rules by prefixing them. A @ will always expand a rule, a # will flatten it, and a ? will expand it if and only if it has one child. In this case, ? is what we want.

start: add; ?add: add add_symbol mul | mul; // Expand add if it's just a mul ?mul: mul mul_symbol number | number; // Expand mul if it's just a number number: '[\d.]+'; mul_symbol: '\*' | '/'; add_symbol: '\+' | '-';

Here’s how the tree looks with the new grammar:

>>> g = Grammar("""...""") >>> print g.parse('1+2*3-5').pretty() start add add number 1 add_symbol + mul number 2 mul_symbol * number 3 add_symbol - number 5

Ooh, that is so much cleaner, and I dare say, quite optimal!

Parenthesis and Other Features

We are missing some obvious features: Parenthesis, unary operators (-(1+2)) , and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.

The important concept is to add a new rule, we’ll call atom. Everything inside the atom (namely parenthesis and unary operations) happens before any additions or multiplications (aka binary operations). Since the atom is only a hierarchical construct with no semantic significance, we’ll make sure it’s always expanded, by adding @ to it.

The obvious way to allow spaces is with something like add: add SPACE add_symbol SPACE mul | mul; , but that’s tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.

Here’s the final grammar, with all of these features:

start: add; ?add: (add add_symbol)? mul; ?mul: (mul mul_symbol)? atom; @atom: neg | number | '\(' add '\)'; neg: '-' atom; number: '[\d.]+'; mul_symbol: '\*' | '/'; add_symbol: '\+' | '-'; WHITESPACE: '[ \t]+' (%ignore);

Make sure you understand it, so we can proceed to the next step: Calculating!

Calculating!

We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.

This is the part we start writing code, so I need to explain two things about the tree.

Each branch is an instance with two attributes: head, which is the name of the rule (say, add or number), and tail, which is the list of sub-rules that it matched. By default, Plyplus removes unnecessary tokens. In our example, the ‘(‘ and ‘)’ will already be removed from the tree, as well as neg’s ‘-‘. Those of add and mul won’t be removed, because they have their own rule, so Plyplus knows they’re important. This feature can be turned off to keep all tokens in the tree, but in my experience it’s always more elegant to leave it on and change the grammar accordingly.

Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It’s your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let’s see how it’s done.

>>> import operator as op >>> from plyplus import STransformer class Calc(STransformer): def _bin_operator(self, exp): arg1, operator_symbol, arg2 = exp.tail operator_func = { '+': op.add, '-': op.sub, '*': op.mul, '/': op.div }[operator_symbol] return operator_func(arg1, arg2) number = lambda self, exp: float(exp.tail[0]) neg = lambda self, exp: -exp.tail[0] __default__ = lambda self, exp: exp.tail[0] add = _bin_operator mul = _bin_operator

Each method corresponds to a rule name. If a method doesn’t exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.

I use float() to parse numbers, because I’m lazy, but I can implement it using the parser as well.

I use the operator module for syntactic beauty. operator.add is basically ‘lambda x,y: x+y’, etc.

Alright, let’s run the transformer and see how it turns out.

>>> Calc().transform( g.parse('1 + 2 * -(-3+2) / 5.6 + 30')) 31.357142857142858

What does eval() think?

>>> eval('1 + 2 * -(-3+2) / 5.6 + 30') 31.357142857142858

Success!

Final Step: The REPL

For aesthetic reasons, let’s wrap it up in a nice calculator REPL:

def main(): calc = Calc() while True: try: s = raw_input('> ') except EOFError: break if s == '': break tree = calc_grammar.parse(s) print calc.transform(tree)

You can see the full source code here: https://github.com/erezsh/plyplus/blob/master/examples/calc.py

Share this: Twitter

Facebook

Like this: Like Loading... Related