Parsing is one of the harder subjects in computer science. Conventionally, we reduce a programming language to a simple mathematical model -- context-free grammar rules; we developed a mathematical notation for it -- BNF; and we solved it mathematically -- LL(1), LALR(1), LL(*), GLR ...

Math is hard. For example, have you tried to learn C from its BNF syntax?

    translation_unit : external_decl | translation_unit external_decl ;
    external_decl : function_definition | decl ;
    function_definition : decl_specs declarator decl_list compound_stat | declarator decl_list compound_stat | decl_specs declarator compound_stat | declarator compound_stat ;
    decl : decl_specs init_declarator_list ';' | decl_specs ';' ;
    decl_list : decl | decl_list decl ;
    decl_specs : storage_class_spec decl_specs | storage_class_spec | type_spec decl_specs | type_spec | type_qualifier decl_specs | type_qualifier ;
    storage_class_spec : 'auto' | 'register' | 'static' | 'extern' | 'typedef' ;
    type_spec : 'void' | 'char' | 'short' | 'int' | 'long' | 'float' | 'double' | 'signed' | 'unsigned' | struct_or_union_spec | enum_spec | typedef_name ;
    type_qualifier : 'const' | 'volatile' ;
    struct_or_union_spec : struct_or_union id '{' struct_decl_list '}' | struct_or_union '{' struct_decl_list '}' | struct_or_union id ;
    struct_or_union : 'struct' | 'union' ;
    struct_decl_list : struct_decl | struct_decl_list struct_decl ;
    init_declarator_list : init_declarator | init_declarator_list ',' init_declarator ;
    init_declarator : declarator | declarator '=' initializer ;
    struct_decl : spec_qualifier_list struct_declarator_list ';' ;
    spec_qualifier_list : type_spec spec_qualifier_list | type_spec | type_qualifier spec_qualifier_list | type_qualifier ;
    struct_declarator_list : struct_declarator | struct_declarator_list ',' struct_declarator ;
    struct_declarator : declarator | declarator ':' const_exp | ':' const_exp ;
    enum_spec : 'enum' id '{' enumerator_list '}' | 'enum' '{' enumerator_list '}' | 'enum' id ;
    enumerator_list : enumerator | enumerator_list ',' enumerator ;
    enumerator : id | id '=' const_exp ;
    declarator : pointer direct_declarator | direct_declarator ;
    direct_declarator : id | '(' declarator ')' | direct_declarator '[' const_exp ']' | direct_declarator '[' ']' | direct_declarator '(' param_type_list ')' | direct_declarator '(' id_list ')' | direct_declarator '(' ')' ;
    pointer : '*' type_qualifier_list | '*' | '*' type_qualifier_list pointer | '*' pointer ;
    type_qualifier_list : type_qualifier | type_qualifier_list type_qualifier ;
    param_type_list : param_list | param_list ',' '...' ;
    param_list : param_decl | param_list ',' param_decl ;
    param_decl : decl_specs declarator | decl_specs abstract_declarator | decl_specs ;
    id_list : id | id_list ',' id ;
    initializer : assignment_exp | '{' initializer_list '}' | '{' initializer_list ',' '}' ;
    initializer_list : initializer | initializer_list ',' initializer ;
    type_name : spec_qualifier_list abstract_declarator | spec_qualifier_list ;
    abstract_declarator : pointer | pointer direct_abstract_declarator | direct_abstract_declarator ;
    direct_abstract_declarator : '(' abstract_declarator ')' | direct_abstract_declarator '[' const_exp ']' | '[' const_exp ']' | direct_abstract_declarator '[' ']' | '[' ']' | direct_abstract_declarator '(' param_type_list ')' | '(' param_type_list ')' | direct_abstract_declarator '(' ')' | '(' ')' ;
    typedef_name : id ;
    stat : labeled_stat | exp_stat | compound_stat | selection_stat | iteration_stat | jump_stat ;
    labeled_stat : id ':' stat | 'case' const_exp ':' stat | 'default' ':' stat ;
    exp_stat : exp ';' | ';' ;
    compound_stat : '{' decl_list stat_list '}' | '{' stat_list '}' | '{' decl_list '}' | '{' '}' ;
    stat_list : stat | stat_list stat ;
    selection_stat : 'if' '(' exp ')' stat | 'if' '(' exp ')' stat 'else' stat | 'switch' '(' exp ')' stat ;
    iteration_stat : 'while' '(' exp ')' stat | 'do' stat 'while' '(' exp ')' ';' | 'for' '(' exp ';' exp ';' exp ')' stat | 'for' '(' exp ';' exp ';' ')' stat | 'for' '(' exp ';' ';' exp ')' stat | 'for' '(' exp ';' ';' ')' stat | 'for' '(' ';' exp ';' exp ')' stat | 'for' '(' ';' exp ';' ')' stat | 'for' '(' ';' ';' exp ')' stat | 'for' '(' ';' ';' ')' stat ;
    jump_stat : 'goto' id ';' | 'continue' ';' | 'break' ';' | 'return' exp ';' | 'return' ';' ;
    exp : assignment_exp | exp ',' assignment_exp ;
    assignment_exp : conditional_exp | unary_exp assignment_operator assignment_exp ;
    assignment_operator : '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|=' ;
    conditional_exp : logical_or_exp | logical_or_exp '?' exp ':' conditional_exp ;
    const_exp : conditional_exp ;
    logical_or_exp : logical_and_exp | logical_or_exp '||' logical_and_exp ;
    logical_and_exp : inclusive_or_exp | logical_and_exp '&&' inclusive_or_exp ;
    inclusive_or_exp : exclusive_or_exp | inclusive_or_exp '|' exclusive_or_exp ;
    exclusive_or_exp : and_exp | exclusive_or_exp '^' and_exp ;
    and_exp : equality_exp | and_exp '&' equality_exp ;
    equality_exp : relational_exp | equality_exp '==' relational_exp | equality_exp '!=' relational_exp ;
    relational_exp : shift_expression | relational_exp '<' shift_expression | relational_exp '>' shift_expression | relational_exp '<=' shift_expression | relational_exp '>=' shift_expression ;
    shift_expression : additive_exp | shift_expression '<<' additive_exp | shift_expression '>>' additive_exp ;
    additive_exp : mult_exp | additive_exp '+' mult_exp | additive_exp '-' mult_exp ;
    mult_exp : cast_exp | mult_exp '*' cast_exp | mult_exp '/' cast_exp | mult_exp '%' cast_exp ;
    cast_exp : unary_exp | '(' type_name ')' cast_exp ;
    unary_exp : postfix_exp | '++' unary_exp | '--' unary_exp | unary_operator cast_exp | 'sizeof' unary_exp | 'sizeof' '(' type_name ')' ;
    unary_operator : '&' | '*' | '+' | '-' | '~' | '!' ;
    postfix_exp : primary_exp | postfix_exp '[' exp ']' | postfix_exp '(' argument_exp_list ')' | postfix_exp '(' ')' | postfix_exp '.' id | postfix_exp '->' id | postfix_exp '++' | postfix_exp '--' ;
    primary_exp : id | const | string | '(' exp ')' ;
    argument_exp_list : assignment_exp | argument_exp_list ',' assignment_exp ;
    const : int_const | char_const | float_const | enumeration_const ;

Did you know that typedef is syntactically in the same class as static? Both are listed under storage_class_spec.

Just when you think you know the language, you run into a line of code such as:

char *(*(**foo[][8])())[];

huh???

On the other hand, programming languages are the simplest kind of languages. There have been efforts to parse natural languages, and the conclusion so far is -- it is way too hard.

But this is counterintuitive. Do you feel that your daily language is harder than a programming language? Also, doesn't it feel wrong that we are solving a mathematics problem when parsing? As you read this blog, or as you read your code, does it feel like solving 3+(5*7)?

Lately I have started to think that the general approach to parsing has been misdirected. We have been approaching the problem declaratively -- start with rules or constraints, and solve it mathematically. However, I think that in nature, languages evolve without rules. In fact, the so-called syntax or grammar is more like a set of patterns that occur as by-products of that evolution: a posteriori phenomena rather than a priori rules. And in nature, a language is typically parsed imperatively. Observe yourself: do you look for a noun, or do you react after you encounter a noun (and then try to figure out whether it is a subject, an object, or even an adjective)?

Starting with this article, I am going to experiment with a new style of writing parsers -- an imperative style. The core of the method is the classic shift-reduce method; more specifically, operator-precedence parsing (which I believe is a fundamentally imperative algorithm). Unlike the conventional method, it does not use look-ahead tokens to decide whether to shift or reduce; rather, it uses the current token to decide whether to reduce the previous tokens. The imperative style naturally introduces context dependency. This is subject to further research, but my intuition is that the imperative style may have trouble parsing certain syntax that depends on look-ahead tokens, while it should extend to a class of context-dependent languages. The part of a language that remains difficult to parse probably lies in the same class as the C declaration example -- an intuitively difficult kind that we should try to avoid.

In this article, I am going to present a parser for a simple calculator, which is the usual "hello world" example in a parsing textbook. It will be presented in Python. At some point, I will post the same examples in C or Perl to demonstrate how the same method applies across languages. I will also follow up with examples of parsing CSV and HTML. CSV is the most basic class and does not really require parsing (lexing alone would do). HTML is the next level up; it has no nested syntax rules other than a basic bracket type. Both examples are practical yet sufficiently simple. Their main purpose is to familiarize readers with basic parsing techniques such as lexing and the parsing stack, as well as with the framework I am using -- MyDef. However, MyDef is just a convenience; the ideas I will be discussing are more general. For the convenience of readers, I also supply the raw Python code for each example in the repository, so interested readers can simply run it to experiment with the material.

After that, I will expand the calculator example into a full expression parser, starting with a subset of JavaScript expressions and adding features until we can parse a full JavaScript expression. Finally, we will have a full JavaScript parser.

The code was prepared in a meta-layer -- MyDef, which enables me to lay out the code in a more comprehensible form. For this simple calculator example, there is not much difference between the MyDef version and the vanilla Python version. In future, more complicated examples, you'll find the MyDef version much more manageable. To familiarize you with MyDef (and to hit you with it in one go), I list both versions of the calculator code, the MyDef version followed by the generated Python:

calc.def:

    page: calc, basic_frame
        module: python

        print calc("1+2*-3")

    fncode: calc(src)
        src_len=len(src)
        src_pos=0

        precedence = {'eof':0, '+':1, '-':1, '*':2, '/':2, 'unary': 99}
        DUMP_STUB regex_compile

        macros:
            type: stack[$1][1]
            atom: stack[$1][0]
            cur_type: cur[1]
            cur_atom: cur[0]

        stack=[]
        $while 1
            $do
                $if_match \s+
                    continue
                $if_match [\d\.]+
                    num = float(m.group(0))
                    cur=(num, "num")
                    break
                $if_match [-+*/]
                    op = m.group(0)
                    cur = (op, op)
                    break
                $if src_pos>=src_len
                    cur = ('', "eof")
                    break
                t=src[0:src_pos]+" - "+src[src_pos:]
                raise Exception(t)
            $do
                $if $(cur_type)=="num"
                    break
                $if len(stack)<1 or $(type:-1)!="num"
                    cur = (cur[0], 'unary')
                    break
                $if len(stack)<2
                    break
                $if precedence[$(cur_type)]<=precedence[$(type:-2)]
                    $call reduce
                    continue
            $if $(cur_type)!="eof"
                stack.append: cur
            $else
                $if len(stack)>0
                    return stack[-1][0]
                $else
                    return None

    subcode: reduce
        $if $(type:-2)=="unary"
            t = -$(atom:-1)
            stack[-2:]=[(t, "num")]
        $map reduce_binary, +, -, *, /

    subcode: reduce_binary(op)
        $elif $(type:-2)=='$(op)'
            t = $(atom:-3) $(op) $(atom:-1)
            stack[-3:]=[(t, "num")]

calc.py:

    import re

    def main():
        print(calc("1+2*-3"))

    def calc(src):
        src_len=len(src)
        src_pos=0
        precedence = {'eof':0, '+':1, '-':1, '*':2, '/':2, 'unary': 99}
        re1 = re.compile(r"\s+")
        re2 = re.compile(r"[\d\.]+")
        re3 = re.compile(r"[-+*/]")
        stack=[]
        while 1:
            # lex: match one token at src_pos
            while 1: # $do
                m = re1.match(src, src_pos)
                if m:
                    src_pos=m.end()
                    continue
                m = re2.match(src, src_pos)
                if m:
                    src_pos=m.end()
                    num = float(m.group(0))
                    cur=(num, "num")
                    break
                m = re3.match(src, src_pos)
                if m:
                    src_pos=m.end()
                    op = m.group(0)
                    cur = (op, op)
                    break
                if src_pos>=src_len:
                    cur = ('', "eof")
                    break
                # nothing matched before the end of input: report the position
                t=src[0:src_pos]+" - "+src[src_pos:]
                raise Exception(t)
                break
            # parse: reduce the stack while the current token does not bind tighter
            while 1: # $do
                if cur[1]=="num":
                    break
                if len(stack)<1 or stack[-1][1]!="num":
                    cur = (cur[0], 'unary')
                    break
                if len(stack)<2:
                    break
                if precedence[cur[1]]<=precedence[stack[-2][1]]:
                    if stack[-2][1]=="unary":
                        t = -stack[-1][0]
                        stack[-2:]=[(t, "num")]
                    elif stack[-2][1]=='+':
                        t = stack[-3][0] + stack[-1][0]
                        stack[-3:]=[(t, "num")]
                    elif stack[-2][1]=='-':
                        t = stack[-3][0] - stack[-1][0]
                        stack[-3:]=[(t, "num")]
                    elif stack[-2][1]=='*':
                        t = stack[-3][0] * stack[-1][0]
                        stack[-3:]=[(t, "num")]
                    elif stack[-2][1]=='/':
                        t = stack[-3][0] / stack[-1][0]
                        stack[-3:]=[(t, "num")]
                    continue
                break
            # shift, or return the result at eof
            if cur[1]!="eof":
                stack.append(cur)
            else:
                if len(stack)>0:
                    return stack[-1][0]
                else:
                    return None

    if __name__ == "__main__":
        main()


For a better view, the code is also available on GitHub: calc.def and calc.py.

To work with the MyDef version, you need to install both MyDef and the output_python module; just follow their installation instructions.

The basic construct of the parser is to keep matching tokens off the head of the input stream (in a while 1 loop) until the end (eof); then, for each token, we either push it onto a stack or reduce the stack according to an operator precedence table.
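To make the shift/reduce flow easier to follow, here is a small sketch that replays the same decisions on an already-lexed token list and records each action. The function `trace`, the `OPS` table and the action log are names I made up for illustration and are not part of calc.py; the guard against an unexpected eof is likewise an assumption of this sketch.

```python
import operator

PRECEDENCE = {'eof': 0, '+': 1, '-': 1, '*': 2, '/': 2, 'unary': 99}
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def trace(tokens):
    """Replay shift/reduce on (atom, type) tokens; return (result, actions).

    Hypothetical helper for illustration only.
    """
    stack, actions = [], []
    for cur in list(tokens) + [('', 'eof')]:
        while True:
            if cur[1] == 'num':
                break                        # numbers are always shifted
            if not stack or stack[-1][1] != 'num':
                if cur[1] == 'eof':          # guard added in this sketch
                    raise ValueError("unexpected end of input")
                cur = (cur[0], 'unary')      # operator in operand position
                break
            if len(stack) < 2:
                break                        # nothing to reduce yet
            op = stack[-2][1]
            if PRECEDENCE[cur[1]] > PRECEDENCE[op]:
                break                        # current token binds tighter
            if op == 'unary':
                stack[-2:] = [(-stack[-1][0], 'num')]
            else:
                value = OPS[op](stack[-3][0], stack[-1][0])
                stack[-3:] = [(value, 'num')]
            actions.append('reduce ' + op)
        if cur[1] != 'eof':
            stack.append(cur)
            actions.append('shift ' + cur[1])
    return stack[-1][0], actions

tokens = [(1.0, 'num'), ('+', '+'), (2.0, 'num'),
          ('*', '*'), ('-', '-'), (3.0, 'num')]
result, actions = trace(tokens)
print(result)
print(actions)
```

For 1+2*-3 the log shows all six tokens being shifted first; the three reductions (unary, then *, then +) happen only when eof arrives, because eof has the lowest precedence.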

The regex use is limited to re.match, which always anchors at the given position; this keeps matching efficient. For complex syntax, we may want to combine multiple regexes into one bigger pattern, much as a typical lexer generator such as flex does, but then we sacrifice readability. I view that as an optional optimization. The bottleneck of a typical compiler seldom lies in the parser, so I believe these optimizations are not critical. Similarly, I opted for infinite loops and rely on sequential if tests with break or continue statements; these could be optimized, but I chose flexibility (and maintainability).
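For reference, the combined-pattern optimization could look roughly like the sketch below, which merges the three patterns into one alternation with named groups and uses `lastgroup` to tell which one matched. `TOKEN_RE` and `lex` are names I picked for this illustration; calc.py does not contain them.

```python
import re

# One alternation; the name of the group that matched gives the token kind.
TOKEN_RE = re.compile(r"(?P<space>\s+)|(?P<num>[\d\.]+)|(?P<op>[-+*/])")

def lex(src):
    """Yield (atom, type) tuples, ending with ('', 'eof')."""
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)    # still anchored at pos
        if m is None:
            # same style of error message as the calculator listing
            raise ValueError(src[:pos] + " - " + src[pos:])
        pos = m.end()
        if m.lastgroup == 'num':
            yield (float(m.group()), 'num')
        elif m.lastgroup == 'op':
            yield (m.group(), m.group())
        # whitespace produces no token
    yield ('', 'eof')

print(list(lex("1 + 2*-3")))
```

The price is visible even at this size: the token kinds now live inside one long pattern string, and the per-token actions are separated from the pattern that triggers them.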

All tokens are represented as a tuple whose second member is always the type. I believe a tuple is faster than a dictionary or a list in Python. In C, this would probably be a struct. A tuple is not very readable, so I introduced a few macros. Macros in MyDef are used both for convenience (less typing) and for readability (a name rather than a series of tokens that require parsing).
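In the plain Python version, where the macros are not available, one possible way to get similar readability is `collections.namedtuple`, which adds field names while the token stays a tuple. This is only a sketch: `Token` and its field names `atom` and `type` are my own choice, picked to mirror the macro names, and I have not measured whether it costs anything compared with bare tuples.

```python
from collections import namedtuple

# Field order mirrors the (atom, type) tuples used in the calculator.
Token = namedtuple("Token", ["atom", "type"])

cur = Token(2.0, "num")
stack = [Token(1.0, "num"), Token("+", "+")]

# Named access reads much like the $(cur_type) and $(type:-1) macros ...
print(cur.type, stack[-1].type)

# ... while positional access still works, so code written against
# plain tuples keeps running unchanged.
print(cur[1], stack[-2][0])

# A namedtuple compares equal to the plain tuple with the same members.
print(cur == (2.0, "num"))
```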

DUMP_STUB in MyDef is a placeholder where we can inject code -- regex compilation in this case. The custom syntax $if_match injects code into the regex_compile stub. It is a convention (albeit an arbitrary one). If we accidentally omit this DUMP_STUB, the injected re compilation code will be missing, which will certainly break the code and is therefore easy to spot and fix. I chose not to fix this stub at a global location so that we can decide where best to place it. $if_match always uses the variables src and src_pos, again an arbitrary convention.

It should be noted that I am not introducing any new algorithm here. Operator-precedence parsing is a very old algorithm. What I introduce here is a new style, in which we neither lay out all the lexer tokens nor declare a full grammar; rather, we construct the lexer on the go with regex syntax, and program the logic of whether to reduce or shift as we lex. On one hand, it lacks rigor: the lexer and the shift/reduce logic depend heavily on the order of the tests, and unless we test, there is no telling whether the grammar as parsed is ambiguous. On the other hand, it is much more flexible; for example, it is easy to add a context test in the lexer and turn the same string into different tokens according to the context. Adding context increases complexity, but trying to fit complex logic into a simple model may introduce even more.
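As one illustration of a context test in the lexer: the calculator decides during the parsing step that an operator in operand position is unary, but the same decision can be taken while lexing by remembering the type of the previous token. The sketch below does that; `lex_with_context` and `prev_type` are hypothetical names, and, unlike the calculator, this version accepts only `-` as a unary operator, which is an assumption I made to keep the example strict.

```python
import re

RE_SPACE = re.compile(r"\s+")
RE_NUM = re.compile(r"[\d\.]+")
RE_OP = re.compile(r"[-+*/]")

def lex_with_context(src):
    """Return (atom, type) tokens; '-' in operand position becomes 'unary'."""
    tokens = []
    pos = 0
    prev_type = None                 # the context: type of the previous token
    while pos < len(src):
        m = RE_SPACE.match(src, pos)
        if m:
            pos = m.end()
            continue
        m = RE_NUM.match(src, pos)
        if m:
            tok = (float(m.group()), 'num')
        else:
            m = RE_OP.match(src, pos)
            if not m:
                raise ValueError(src[:pos] + " - " + src[pos:])
            op = m.group()
            if prev_type == 'num':
                tok = (op, op)       # after an operand: binary operator
            elif op == '-':
                tok = (op, 'unary')  # same character, different token
            else:
                raise ValueError("operand expected before " + repr(op))
        pos = m.end()
        tokens.append(tok)
        prev_type = tok[1]
    return tokens

print(lex_with_context("-1 - 2"))
```

The two `-` characters in the example come out as different tokens, one 'unary' and one '-', purely because of what preceded them.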

Nor do I claim my code is correct or efficient. The purpose is not to show off my programming skills, but to explore and experiment, and hopefully to solicit some feedback or find some interested buddies.