Why Perl is Just Not That Into Your Syntax

In a previous post, I noted that Perl often cannot precisely locate syntax errors in its scripts. Still less can it identify the exact problem. In this post, I will demonstrate an experimental utility which does pinpoint Perl syntax errors, precisely indicating where and what the problem is.

Here's my example from the previous post.

    my $lyric =
    'Sloopy wears a red dress, yeah
    As old as the hills
    but when sloopy wears that red dress, yeah
    you know it gives me the chills
    Sloopy when I see you walking,
    walking down the street
    I say don\'t worry sloopy, girl
    You belong to me'~!@$%^&*()_+;
    print "$lyric\n";

And here is Perl's output for the error:

    Bareword found where operator expected at bomb2.pl line 9, near ")_"
            (Missing operator before _?)
    syntax error at bomb2.pl line 9, near "You belong to me'~"
      (Might be a runaway multi-line '' string starting on line 2)
    Execution of bomb2.pl aborted due to compilation errors.

As I said in that previous post, perl clearly has very little idea where things went wrong -- it's guessing.

Pinpointing the Error

At this point, let me give away the ending. The point of failure is the very first special symbol: the tilde. When I ran my fingers from left to right across the top of my keyboard, I was hoping to produce a more complicated example. But perhaps it is just as well I did not.

Here, from my experimental Marpa-based utility, is what Perl is looking for when it encounters the tilde. The dot in each rule indicates how far the parse has already progressed:

    line -> label sideff . SEMI
    sideff -> expr . IF expr
    sideff -> expr . UNLESS expr
    sideff -> expr . WHILE expr
    sideff -> expr . UNTIL iexpr
    sideff -> expr . FOR expr
    sideff -> expr . WHEN expr
    or_expr -> or_expr . OROP and_expr
    or_expr -> or_expr . DOROP and_expr
    and_expr -> and_expr . ANDOP argexpr
    argexpr -> argexpr . COMMA
    argexpr -> argexpr . COMMA term
    term_listop -> term_cond . ASSIGNOP term_listop
    term_assign -> term_cond . ASSIGNOP term_assign
    term_cond -> term_dotdot . QUESTION term_cond COLON term_cond
    term_dotdot -> term_oror . DOTDOT term_oror
    term_oror -> term_oror . OROR term_andand
    term_oror -> term_oror . DORDOR term_andand
    term_andand -> term_andand . ANDAND term_bitorop
    term_bitorop -> term_bitorop . BITOROP term_bitandop
    term_bitandop -> term_bitandop . BITANDOP term_eqop
    term_eqop -> term_relop . EQOP term_relop
    term_relop -> term_uniop . RELOP term_uniop
    term_shiftop -> term_shiftop . SHIFTOP term_addop
    term_addop -> term_addop . ADDOP term_mulop
    term_mulop -> term_mulop . MULOP term_matchop
    term_matchop -> term_matchop . MATCHOP term_uminus
    term_powop -> term_increment . POWOP term_powop
    term_increment -> term_arrow . POSTINC
    term_increment -> term_arrow . POSTDEC
    term_arrow -> term_arrow . ARROW method LPAREN listexprcom RPAREN
    term_arrow -> term_arrow . ARROW method
    subscripted -> term_hi . ARROW LSQUARE expr RSQUARE
    subscripted -> term_hi . ARROW LCURLY expr SEMI RCURLY
    subscripted -> term_hi . ARROW LPAREN RPAREN
    subscripted -> term_hi . ARROW LPAREN expr RPAREN

The names of the symbols are based on those in perly.y. Operators are not shown symbolically, but are indicated with the name in caps: "POSTINC" instead of "++". Terms are suffixed with their precedence: "term_assign" is the symbol for terms with the same precedence as the assignment operator. "term_hi" is the symbol for terms at the highest precedence level.

A tilde, when it is a single-character Perl operator, is always a prefix unary operator. Tildes also form part of several multi-character operators, but that is not the case here. Here is what Perl is looking for when it encounters the tilde:

A semicolon that would end the statement. (Statements are called "line"s internally by perly.y.)

The postfix conditional keywords.

A lot of binary operators.

A couple of postfix unary operators. These, if found, will produce a compile-time complaint about incrementing a constant.

No prefix unary operator is in the above list, and the parse fails here.
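
The list above can be read off the dotted rules mechanically: the symbol immediately after the dot is what the parser expects next. Here is a minimal sketch of that idea. It is not the utility used in this post, it works on only a handful of the rules shown earlier, and the token name TILDE is my own assumption for illustration rather than a name taken from perly.y.

```perl
use strict;
use warnings;

# A few of the dotted rules shown earlier, one per string.
my @dotted_rules = (
    'line -> label sideff . SEMI',
    'sideff -> expr . IF expr',
    'sideff -> expr . UNLESS expr',
    'term_addop -> term_addop . ADDOP term_mulop',
    'term_increment -> term_arrow . POSTINC',
    'term_increment -> term_arrow . POSTDEC',
);

# Return the symbol that follows the dot, or undef if the dot is last.
sub expected_symbol {
    my ($rule) = @_;
    my ( undef, $rhs ) = split /\s*->\s*/, $rule, 2;
    my @symbols = split ' ', $rhs;
    for my $i ( 0 .. $#symbols ) {
        next if $symbols[$i] ne '.';
        return $symbols[ $i + 1 ];    # undef when nothing follows the dot
    }
    return;
}

# Collect the distinct expected symbols, keeping first-seen order.
my %seen;
my @expected =
    grep { defined($_) && !$seen{$_}++ }
    map  { expected_symbol($_) } @dotted_rules;

print "Expected: @expected\n";

# TILDE is a hypothetical token name for the "~" operator.
my $token = 'TILDE';
print "$token is ", ( $seen{$token} ? 'expected' : 'not expected' ), " here\n";
```

Run on these six rules, the sketch reports SEMI, IF, UNLESS, ADDOP, POSTINC and POSTDEC as expected, and reports that TILDE is not expected.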

About the Utility Used in This Post

Finding the exact point of failure and the exact reasons would seem like something that you'd want in a parser. But in fact, production languages have tended to be like Perl -- they try to indicate the general area of a syntax problem and to make a good guess as to its nature. But they leave it to the programmer to figure out exactly where the parse failed and why.

Marpa, then, is unusual, in that for any grammar you can write in BNF, and any input, it will either produce a parse, or a precise characterization of the failure. Marpa::XS::Perl is still experimental and under development. As I tackle tasks (like preparing this post, for example) I add the necessary capabilities. An example of what my utility cannot yet do is deal with floating point constants. (They're not hard, I just haven't encountered them yet in a test case.)

My original intent with Marpa::XS::Perl was to use it for snippets, and for academic and toy examples, and it cannot yet deal with production Perl code. My purpose so far has been to demonstrate that Marpa could be the basis of a practical Perl parsing utility.

Notes

Note 1: Of course, in one sense, the exact nature of the problem depends on what the person writing the script intended, and on this my utility has not a clue. In this post, "finding the exact problem" means finding the exact location of a parse failure, and finding exactly what perl was looking for when perl did not find what perl wanted to find.

In determining the "exact location of parse failure", I also avoid mind-reading. I use a definition taken from the parsing literature: In a rejected token stream, the point of failure is the first token which made a successful parse impossible. In other words, if you encounter a token which cannot possibly be part of a successful parse, given the input you've already read, that token is the point of failure. Looking at it from the opposite point of view, if you can find some additional input that makes the parse succeed, you have not yet found a point of failure.
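
To make that definition concrete, here is a minimal sketch on a toy grammar. It is neither Perl's grammar nor Marpa: a statement is simply NUM (OP NUM)* SEMI. The token names, the %expected_after table and the point_of_failure function are all invented for illustration. In this toy language every accepted prefix can be extended to a complete statement, so the first token that is not expected is the point of failure in the sense defined above.

```perl
use strict;
use warnings;

# Toy grammar: statement ::= NUM (OP NUM)* SEMI
# For each state: the tokens that can still lead to a successful parse,
# and the state each of them leads to.
my %expected_after = (
    start => { NUM => 'num' },
    num   => { OP  => 'op', SEMI => 'done' },
    op    => { NUM => 'num' },
    done  => {},
);

# Return the index of the first token that makes a successful parse
# impossible, or undef if no token read so far rules one out.
sub point_of_failure {
    my @tokens = @_;
    my $state  = 'start';
    for my $i ( 0 .. $#tokens ) {
        my $next = $expected_after{$state}{ $tokens[$i] };
        return $i if !defined $next;
        $state = $next;
    }
    return;
}

my @bad = qw(NUM OP NUM TILDE NUM SEMI);
my $at  = point_of_failure(@bad);
print "point of failure: token $at ($bad[$at])\n";

# An unfinished input has no point of failure yet:
# appending "NUM SEMI" would make it parse.
my $unfinished = point_of_failure(qw(NUM OP));
print "unfinished input: ",
    ( defined $unfinished ? $unfinished : 'no point of failure' ), "\n";
```

The second call illustrates the "opposite point of view": the input NUM OP is not a complete statement, but since more input could complete it, no point of failure has been reached.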

Note 2: To be precise, the output in this post was automatically generated by my utility, then edited for readability. Specifically, the edits removed those lines which were for rules with the dot at the end, and removed rule numbers and token numbers from the beginnings of the lines. Since rules with the dot at the end are completed, they do not generate any expectations for future tokens, and are irrelevant here. Similarly, in this context the internal rule numbers and location numbers would be clutter. While I made these readability edits by hand, they were rote and could easily have been automated.
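
As a rough idea of what that automation might look like, here is a minimal sketch. The raw output format of my utility is not shown in this post, so the sample lines are invented: I assume, purely for illustration, that each line carries a rule number and a token number ahead of the dotted rule. The numbers, the sample rules and the clean_report function are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical raw lines; the real format is not shown in this post.
# Assumed layout: rule number, token number, then the dotted rule.
my @raw = (
    '17 42 line -> label sideff . SEMI',
    '23 42 term_hi -> THING .',
    '31 42 sideff -> expr . IF expr',
);

# Drop completed rules (dot at the end) and strip the leading numbers.
sub clean_report {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        next if $line =~ /\.\s*\z/;    # dot at the end: rule is completed
        ( my $rule = $line ) =~ s/\A\s*(?:\d+\s+)+//;    # leading numbers
        push @kept, $rule;
    }
    return @kept;
}

print "$_\n" for clean_report(@raw);
```

On the invented sample, this keeps the two rules that still expect a token and drops the completed one.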