In my previous blog post, I discussed how recursive-regex has been

maturing but that it still wasn’t and was not intended to compete with

actual parser toolkits. I wanted to quantify this assumption and

present a head-to-head analysis of using recursive-regex vs using

cl-yacc.

I chose cl-yacc because



I already had a working css3-selector parser implemented. My

existing cl-yacc implementation is based on

the published

CSS3-Selectors Grammar (but the modified some to get it to work).

CL-YACC Parser

I found implementing the parser in cl-yacc to be fairly tedious,

time consuming and error prone, even for such a small language as

css-selectors. It doesn’t help that I tried to do it using the

published css3 grammar and lex files which are somewhat awkward (eg:

open parens are in the lexer and close parens are in the grammar).

I wrote a not very well documented (f)lex file reader to read in the

existing CSS3 flex

file and build a lexer compatible with cl-yacc. After getting a

valid lexer, I started working with the published grammar to convert

it to a format cl-yacc would approve of. Along the way I was finding

the syntax for cl-yacc to be pretty cumbersome, so I used some

read time execution to turn forms like the first below, into forms like

the second (macro expanded) one below which are valid cl-yacc

productions. (This is not a great idea or great code, but it did simplify

the parser def for me.)

( or-sel #. ( rule ( comb-sel :|,| spaces or-sel ) ( list :or comb-sel or-sel )) #. ( rule ( comb-sel ) comb-sel )) ( or-sel ( comb-sel :|,| spaces or-sel #' ( lambda ( comb-sel |,| spaces or-sel ) ( declare ( ignorable comb-sel |,| spaces or-sel )) ( list :or comb-sel or-sel ))) ( comb-sel #' ( lambda ( comb-sel ) ( declare ( ignorable comb-sel )) comb-sel )))

The difficulties I had in implementation were mostly concerning where

white space could appear, in which productions, at which levels in the

grammar. It seemed like it was quite a task to get an unambiguous

grammar and have all my extra unimportant white space everywhere still

be parsable. After I had finally managed to rid my grammar of

reduce/reduce conflicts, I was able to parse my language and it seemed

like a fairly peppy parser.

The only problem I really have with this implementation is that it

seems like it is totally illegible after the fact. Even knowing about

parsers and having written this one, I don’t feel comfortable

modifying the language. It seems like it would be difficult to get

working again. Thankfully, I don’t anticipate having to do much

language rewriting in this case.

CL-YACC parser definition (lexer not shown)

( yacc:define-parser *css3-selector-parser* ( :start-symbol selector ) ( :terminals ( :|,| :|*| :|)| :|(| :|>| :|+| :|~| :|:| :|[| :|]| :|=| :|-| :S :IDENT :HASH :CLASS :STRING :FUNCTION :NTH-FUNCTION :INCLUDES :DASHMATCH :BEGINS-WITH :ENDS-WITH :SUBSTRING :integer )) ( :precedence (( :left :|)| :s :|,| :|+| :|~| )) ) ( selector #. ( rule ( or-sel ) or-sel )) ( or-sel #. ( rule ( comb-sel :|,| spaces or-sel ) ( list :or comb-sel or-sel )) #. ( rule ( comb-sel ) comb-sel )) ( comb-sel #. ( rule ( and-sel combinator comb-sel ) ( list combinator and-sel comb-sel )) #. ( rule ;; need to handle trailing spaces here ;; to avoid s/r ( and-sel spaces ) and-sel )) ( combinator ( :s ( constantly :child )) ( spaces :|>| spaces ( constantly :immediate-child )) ( spaces :|~| spaces ( constantly :preceded-by )) ( spaces :|+| spaces ( constantly :immediatly-preceded-by ))) ( and-sel #. ( rule ( and-sel simple-selector ) ( list :and and-sel simple-selector )) #. ( rule ( simple-selector ) simple-selector )) ( simple-selector #. ( rule ( :HASH ) ` ( :hash , ( but-first hash ))) #. ( rule ( :CLASS ) ` ( :class , ( but-first class ))) #. ( rule ( :IDENT ) ` ( :element , ident )) ( :|*| ( constantly :everything )) ( attrib #' identity ) ( pseudo #' identity )) ( attrib #. ( rule ( :|[| spaces :IDENT spaces :|]| ) ` ( :attribute , ident )) #. ( rule ( :|[| spaces :IDENT spaces attrib-value-def spaces :|]| ) ` ( :attribute , ident , attrib-value-def )) ) ( attrib-value-def #. ( rule ( attrib-match-type attrib-value ) ( list attrib-match-type attrib-value ))) ( attrib-match-type #. ( rule ( :|=| ) :equals ) #. ( rule ( :includes ) :includes ) #. ( rule ( :dashmatch ) :dashmatch ) #. ( rule ( :begins-with ) :begins-with ) #. ( rule ( :ends-with ) :ends-with ) #. ( rule ( :substring ) :substring )) ( attrib-value #. ( rule ( :ident ) ident ) #. ( rule ( :string ) ( but-quotes string ))) ( pseudo #. ( rule ( :|:| :IDENT ) ( list :pseudo ident )) #. ( rule ( :|:| :FUNCTION spaces selector :|)| ) ( list :pseudo ( but-last function ) selector )) #. ( rule ( :|:| :NTH-FUNCTION spaces nth-expr spaces :|)| ) ` ( :nth-pseudo , ( but-last nth-function ) ,@ nth-expr ))) ( nth-expr #. ( rule ( :ident ) ( cond (( string-equal ident "even" ) ( list 2 0 )) (( string-equal ident "odd" ) ( list 2 1 )) ( T ( error "invalid nth subexpression" )))) #. ( rule ( nth-sign :integer ) ( list 0 ( if ( string-equal nth-sign "-" ) ( * -1 ( parse-integer integer )) ( parse-integer integer )))) #. ( rule ( nth-sign :integer :ident ) ( let ( extra-num ) ( cond (( string-equal "n" ident ) T ) ;; this is because our lexer will recogince n-1 as a valid ident ;; but n+1 will hit the rule below (( alexandria:starts-with-subseq "n" ident ) ( setf extra-num ( parse-integer ( subseq ident 1 )))) ( T ( error "invalid nth subexpression in (what is ~A)" ident ))) ( list ( or ( if ( string-equal nth-sign "-" ) ( * -1 ( parse-integer integer )) ( parse-integer integer )) 0 ) ( or extra-num 0 )))) #. ( rule ( nth-sign :integer :ident nth-sign :integer ) ( when ( and integer-1 ( null nth-sign-1 )) ( error "invalid nth subexpression 2n+1 style requires a sign before the second number" )) ( list ( or ( if ( string-equal nth-sign-0 "-" ) ( * -1 ( parse-integer integer-0 )) ( parse-integer integer-0 )) 0 ) ( or ( if ( string-equal nth-sign-1 "-" ) ( * -1 ( parse-integer integer-1 )) ( parse-integer integer-1 )) 0 )) )) ( nth-sign #. ( rule ( :|+| ) :|+| ) #. ( rule ( :|-| ) :|-| ) #. ( rule () ())) ( spaces ( :S ) ( )))

Recursive-Regex Parser

After having written recursive-regex, I wanted a way to beat some of

the bugs out as well as have a good example of what I could use this

tool for, and with what performance characteristics. To accomplish

this task, I converted the CL-YACC grammar and lex file into a single

parser definition and



wrote some code to turn that file into the recursive dispatch functions.

REX: a recursive expression file format based (loosely) on lex

I had a really shoddy lexish reader for the existing lex based lexer.

I converted this to a similar tool (without the c blocks) for defining

named recursive expressions. These files have the .rex extension.

They consists of options, inline-definitions, and named productions.



The definitions for css-selectors are in css3.rex.

Once I had my recursive expression definition, getting the parser was

a fairly easy task. Along the way I added some code to minimize the

parse tree results by promoting children, when a parent only had a

single child that matched the same length as the parent. I also

improved the parse tracing, so that I could observer and debug what

the expression was doing while it was matching. With the tree minimized

I also had to revise many of my tests.

css3.rex: css3-selectors parser definition

%option case-insensitive

h => [0-9a-f]

nmstart => [a-z]

nmchar => [a-z0-9-]

nl =>

|\r

|\r|\f

string1 => \"([\t !#$%&(-~]|\\{nl}|\')*\"

string2 => \'([\t !#$%&(-~]|\\{nl}|\")*\'

id => [-]?{nmstart}{nmchar}*

name => {nmchar}+

int => [0-9]+

num => [0-9]+|[0-9]*\.[0-9]+

string => {string1}|{string2}

url => ([!#$%&*-~])*

w => [ \t\r

\f]*

s => [\s\f]+

%%

comment => \/\*[^*]*\*+([^/][^*]*\*+)*\/

cdo => <!--

cdc => -->

class => \.(?<ident>)

hash => #(?<ident>{name})

ident => {id}

element => {id}

important_sym => !{w}important

selector => (?<or>)

or => (?<comb-sel>)(\s*,\s*(?<or>))?

comb-sel => (?<and>)((?<combinator>)(?<comb-sel>))?

combinator => (?<child>\s+)|(?<immediate-child>>)|(?<preceded-by>~)|(?<immediatly-preceded-by>\+)

and => (?<sim-sel>)(?<and>)?

sim-sel => (?<hash>)|(?<class>)|(?<element>)|\*|(?<attrib>)|(?<pseudo>)

attrib => \[\s*(?<ident>)\s*(?<attrib-val>)?\]

attrib-val => [\^\$\*\|\~]?=((?<ident>)|(?<string>))

pseudo => :(?<ident>)(?<parens>\s*(?<selector>)|(?<nth-expr>)\s*)?

nth-expr => even|odd|[\+-]?{int}(n([\+-]{int})?)?

Performance Numbers

These are the performance numbers for the two parsers each parsing 6

short inputs 1000 times. Also included is (a version) of that

output. (Recursive-expressions return CLOS object trees that have in

the results below been converted to a list representation for easy of

viewing.) As you can see the recursive-expressions version is ten

times slower and uses twenty times the memory.

(defun run-some-tests ()

(timed-side-by-side-parses

(list ".foo" "#bar" ":bast" "div[onclick]"

"div[onclick=foo]"

":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )" )

1000))

CSS> (run-some-tests)

;;; Recursive-Regex Results

Evaluation took:

7.548 seconds of real time

7.480000 seconds of total run time (6.850000 user, 0.630000 system)

[ Run times consist of 1.410 seconds GC time, and 6.070 seconds non-GC time. ]

99.10% CPU

18,821,417,333 processor cycles

1,167,530,480 bytes consed

((:CLASS ".foo" (:IDENT "foo")) (:HASH "#bar" (:IDENT "bar"))

(:PSEUDO ":bast" (:IDENT "bast"))

(:AND "div[onclick]" (:ELEMENT "div")

(:ATTRIB "[onclick]" (:IDENT "onclick")))

(:AND "div[onclick=foo]" (:ELEMENT "div")

(:ATTRIB "[onclick=foo]" (:IDENT "onclick")

(:ATTRIB-VAL "=foo" (:IDENT "foo"))))

(:OR ":nth-last-child( 2n+1 ), foo.bar>bast:blech( .foo )"

(:PSEUDO ":nth-last-child( 2n+1 )" (:IDENT "nth-last-child")

(:MATCHED-PARENS "( 2n+1 )" (:BODY "2n+1 " (:NTH-EXPR "2n+1"))))

(:COMB-SEL "foo.bar>bast:blech( .foo )"

(:AND "foo.bar" (:ELEMENT "foo") (:CLASS ".bar" (:IDENT "bar")))

(:IMMEDIATE-CHILD ">")

(:AND "bast:blech( .foo )" (:ELEMENT "bast")

(:PSEUDO ":blech( .foo )" (:IDENT "blech")

(:MATCHED-PARENS "( .foo )"

(:BODY " .foo" (:CLASS ".foo" (:IDENT "foo")))))))))

;;; CL-YACC Results

Evaluation took:

0.790 seconds of real time

0.790000 seconds of total run time (0.770000 user, 0.020000 system)

[ Run times consist of 0.080 seconds GC time, and 0.710 seconds non-GC time. ]

100.00% CPU

1,968,753,765 processor cycles

64,695,792 bytes consed

((:CLASS "foo") (:HASH "bar") (:PSEUDO "bast")

(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick"))

(:AND (:ELEMENT "div") (:ATTRIBUTE "onclick" (:EQUALS "foo")))

(:OR (:NTH-PSEUDO "nth-last-child" 2 1)

(:IMMEDIATE-CHILD (:AND (:ELEMENT "foo") (:CLASS "bar"))

(:AND (:ELEMENT "bast") (:PSEUDO "blech" (:CLASS "foo"))))))



TL/DR

In conclusion, I am certainly not going to replace my working, fast,

memory efficient cl-yacc parser with my recursive-expressions parser.

However, if I wanted to have a working, legible (maybe) parser

definition, that will match as I intuitively expect, I might use

recursive-expressions. Because I am so used to using regex’s for

matching, if performance was not an issue, I would probably always

prefer the recursive expressions version. I could also see the

recursive expressions solution being a nice prototyping tool to help

develop the cl-yacc parser.

Obviously some of these opinions are going to be biased because

I wrote one of these libraries and not the other

CL-YACC

Pros

Pretty quick and very memory efficient parser

Easy to customize parser output (each production has a lambda body to build whatever output is necessary)

Theoretically well grounded

Cons

Its hard to make unambiguous grammars

Not exceedingly helpful with its suggestions for how to fix your ambiguities

Recursive Regex

Pros

Relatively easy to get working

lexer and parser are the same tool, built on top of cl-ppcre, which

you presumably already know, and has a good test environment (regex-coach, repl)

you presumably already know, and has a good test environment (regex-coach, repl) Parses ambiguous grammars

Has reasonable parser tracing built in, so debugging can be somewhat easier

Cons