voidfraction:

regexes are a trash tool for trash people, real men use parser combinators

argumate:

real rugged humans realise that regexps are parser combinators! with non-deterministic choice and backtracking and everything

oh you want to parse recursive grammars? if only regexps had been extended with a recursion operator…

oh wait, they totally have, several times!

voidfraction:

real transcended sentients acknowledge that type systems are pretty cool and prefer to use parsers that output stuff like Expr(n: Either[Numeric, Expr], op: Operator, n: Either[Numeric, Expr]) instead of some stringly-typed regex gobbledygook

argumate:

beings of pure energy that float in harmony with the void agree that expressive type systems are awesome, but point out that sometimes you are working with textual data whose “type” is fundamentally an ordered pattern of characters and the simplest way of describing an ordered pattern of characters is a regexp

BOOM

voidfraction:

upright and respectable fractally-gendered ems running on the computational cluster simulating reality agree that regexps are a pretty awesome way to describe and consume patterns of characters but would respectfully point out that regexps still don’t compose in any useful way

argumate:

the administrators of the aforementioned reality simulating cluster have temporarily decohered from their usual state of quantum bliss to table the motion that regexps are as compositional as any other grammatical system:

    R ::= RR       // concatenation
    R ::= R | R    // alternation
    R ::= R*       // kleene star
    R ::= symbol
    R ::= void

Indeed, given that some regular expression formulations support negation and intersection operators, one could argue that they are compositional in a way that is more convenient than many grammars!

    A = <complex regular expression 1>
    B = <complex regular expression 2>
    C = A | B      <-- seamless composition via logical or
    C = A /\ !B    <-- (!) seamless composition via logical and / not

voidfraction:

The sysadmins who maintain the machines that simulate the aforementioned reality awake from their sarcophagi after countless eons, grab a mug-analog of piping-hot stimulants, and remark that the countless regexes used in their workflow treat [0-9]* and [a-z]* as parsers producing strings, which makes composition more trouble than it’s worth.

(Reaching into the fabric of reality, they modify their original post to ask if regexes can be mapped over and composed using type safe |, ++ and * operators, which would result in, basically, parser combinators)

Not being overly full of themselves, they also request a link to any useful cliff notes on finite automata/regular expressions/context free grammars/etc because when they took Theory of CS the stars were still young. (cc @apexys)

argumate:

The mystic forces behind this laborious scene-setting preamble quickly get out of the way so that we can sketch some regular expression combinators with typed derived attributes that get passed back up the tree:

concat : regexp T1 -> regexp T2 -> (T1 -> T2 -> T3) -> regexp T3

alt : regexp T -> regexp T -> regexp T

star : regexp T -> T -> (T -> T -> T) -> regexp T

symbol : char -> T -> regexp T

void : T -> regexp T

(For notational convenience we will write "alt R1 R2" as "R1 | R2").

Now we can parse the integers:

digit : regexp int := symbol '0' 0 | symbol '1' 1 | symbol '2' 2 | … | symbol '9' 9

integer : regexp int := star digit 0 \x y -> x*10+y
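For anyone who wants to poke at this, here is a minimal Python sketch of the five combinators and the integer parser above. The encoding is an assumption, not something the signatures dictate: a "regexp T" is modelled as a function from a string and a start position to every (value, end position) pair it can produce, which keeps the non-determinism. The helper `parse` and the rule that `star` skips empty iterations (so it always terminates) are my own additions for illustration.

```python
from functools import reduce

# Assumed encoding: a "regexp T" is a function (text, start) -> iterator of
# (value, end) pairs, one pair per way the regexp can match from `start`.

def symbol(ch, value):
    def run(s, i):
        if i < len(s) and s[i] == ch:
            yield value, i + 1
    return run

def void(value):
    # Matches the empty string and returns `value`.
    def run(s, i):
        yield value, i
    return run

def alt(r1, r2):
    # Non-deterministic choice: results from both branches are kept.
    def run(s, i):
        yield from r1(s, i)
        yield from r2(s, i)
    return run

def concat(r1, r2, f):
    def run(s, i):
        for v1, j in r1(s, i):
            for v2, k in r2(s, j):
                yield f(v1, v2), k
    return run

def star(r, init, f):
    # Folds f over the values of zero or more repetitions, starting at init.
    def run(s, i):
        pending = [(init, i)]
        while pending:
            acc, j = pending.pop()
            yield acc, j
            for v, k in r(s, j):
                if k > j:  # hypothetical guard: ignore empty iterations
                    pending.append((f(acc, v), k))
    return run

def parse(r, s):
    # Hypothetical helper: keep only the results that consume all of s.
    return [v for v, end in r(s, 0) if end == len(s)]

digit = reduce(alt, (symbol(str(d), d) for d in range(10)))
integer = star(digit, 0, lambda x, y: x * 10 + y)
```

With these definitions `parse(integer, "2024")` gives `[2024]`, `parse(integer, "007")` gives `[7]`, and, because star accepts zero repetitions, `parse(integer, "")` gives `[0]`.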

But this will accept numbers with leading zeroes, so let’s forbid them:

zero : regexp int := symbol '0' 0

nonzero_digit : regexp int := symbol '1' 1 | symbol '2' 2 | … | symbol '9' 9

digit : regexp int := zero | nonzero_digit

integer : regexp int := zero | concat nonzero_digit (star digit 0 \x y -> x*10+y) (\x y -> omitted for brevity)

And so on. There are some tricky aspects: the alternation and Kleene star operators are non-deterministic, so a given regular expression can return several possible values for the same input. If we weaken the alternation by privileging the left alternative over the right to make it deterministic, then we’ve reinvented PEGs, which isn’t what we want at all.
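To make the difference concrete, here is a small Python sketch under the same assumed encoding as before (a regexp is a function from a string and start position to all the (value, end) pairs it can produce). `ordered_alt` is a hypothetical PEG-style choice that commits to the left branch as soon as it succeeds; the names `parse`, `join`, `nondet` and `peg` are also just for illustration.

```python
from itertools import islice

def symbol(ch, value):
    def run(s, i):
        if i < len(s) and s[i] == ch:
            yield value, i + 1
    return run

def concat(r1, r2, f):
    def run(s, i):
        for v1, j in r1(s, i):
            for v2, k in r2(s, j):
                yield f(v1, v2), k
    return run

def alt(r1, r2):
    # Regexp-style choice: keep every result from both branches.
    def run(s, i):
        yield from r1(s, i)
        yield from r2(s, i)
    return run

def ordered_alt(r1, r2):
    # PEG-style choice: if the left branch matches at all, commit to its
    # first result and never look at the right branch.
    def run(s, i):
        first = next(iter(r1(s, i)), None)
        if first is not None:
            yield first
        else:
            yield from islice(r2(s, i), 1)
    return run

def parse(r, s):
    return [v for v, end in r(s, 0) if end == len(s)]

join = lambda x, y: x + y
a = symbol("a", "a")
ab = concat(symbol("a", "a"), symbol("b", "b"), join)
c = symbol("c", "c")

nondet = concat(alt(a, ab), c, join)       # (a | ab) c as a regexp
peg = concat(ordered_alt(a, ab), c, join)  # (a / ab) c as a PEG
```

Here `parse(nondet, "abc")` returns `["abc"]`, while `parse(peg, "abc")` returns `[]`: the ordered choice commits to `a`, then fails on `c` and never retries with `ab`. The flip side of keeping both branches is ambiguity: `parse(alt(symbol("a", 1), symbol("a", 2)), "a")` returns `[1, 2]`.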

A fun trick that everyone should know about regular expressions is how to implement a matching algorithm using Brzozowski derivatives: you differentiate the regular expression by each character of the input string in turn, and check whether the regular expression you end up with accepts the empty string, in which case the input matched.
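A minimal Python sketch of a derivative-based matcher, for plain (untyped) regular expressions only. The class names and the `any_of` helper are my own choices, and there is no simplification of intermediate terms, so the expressions grow as the input is consumed; treat it as an illustration rather than an efficient implementation.

```python
from dataclasses import dataclass

class Re:
    pass

@dataclass(frozen=True)
class Empty(Re):   # matches no string at all
    pass

@dataclass(frozen=True)
class Eps(Re):     # matches only the empty string
    pass

@dataclass(frozen=True)
class Sym(Re):
    ch: str

@dataclass(frozen=True)
class Alt(Re):
    left: Re
    right: Re

@dataclass(frozen=True)
class Cat(Re):
    left: Re
    right: Re

@dataclass(frozen=True)
class Star(Re):
    inner: Re

def nullable(r):
    """True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Cat):
        return nullable(r.left) and nullable(r.right)
    return False  # Empty and Sym

def deriv(r, c):
    """The regexp matching every w such that r matches c + w."""
    if isinstance(r, Sym):
        return Eps() if r.ch == c else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.left, c), deriv(r.right, c))
    if isinstance(r, Cat):
        d = Cat(deriv(r.left, c), r.right)
        return Alt(d, deriv(r.right, c)) if nullable(r.left) else d
    if isinstance(r, Star):
        return Cat(deriv(r.inner, c), r)
    return Empty()  # Empty and Eps have no non-empty matches

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

def any_of(chars):
    # Hypothetical helper: alternation over single characters.
    r = Empty()
    for ch in chars:
        r = Alt(r, Sym(ch))
    return r

# Integers without leading zeroes: 0 | [1-9][0-9]*
integer = Alt(Sym("0"), Cat(any_of("123456789"), Star(any_of("0123456789"))))
```

With this, `matches(integer, "1024")` and `matches(integer, "0")` are true, while `matches(integer, "007")` and `matches(integer, "")` are false.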

But I’m even more partial (heh!) to Antimirov’s more recent algorithm based on partial derivatives of regular expressions, which can create automata more efficiently and elegantly than traditional methods. I would recommend reading the paper; it’s short and very clever.
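As a rough Python sketch of the idea (not a transcription of the paper): a partial derivative is a set of regexps rather than a single regexp, and those sets behave like the states of a non-deterministic automaton. The class names, the `cat` smart constructor that drops a leading empty-string factor, and the `states` helper are assumptions made for this example.

```python
from dataclasses import dataclass

class Re:
    pass

@dataclass(frozen=True)
class Eps(Re):     # matches only the empty string
    pass

@dataclass(frozen=True)
class Sym(Re):
    ch: str

@dataclass(frozen=True)
class Alt(Re):
    left: Re
    right: Re

@dataclass(frozen=True)
class Cat(Re):
    left: Re
    right: Re

@dataclass(frozen=True)
class Star(Re):
    inner: Re

def nullable(r):
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Cat):
        return nullable(r.left) and nullable(r.right)
    return False  # Sym

def cat(p, r):
    # Smart constructor: Eps followed by r is just r.
    return r if isinstance(p, Eps) else Cat(p, r)

def pderiv(r, c):
    """Set of partial derivatives of r with respect to the character c."""
    if isinstance(r, Sym):
        return frozenset({Eps()}) if r.ch == c else frozenset()
    if isinstance(r, Alt):
        return pderiv(r.left, c) | pderiv(r.right, c)
    if isinstance(r, Cat):
        out = frozenset(cat(p, r.right) for p in pderiv(r.left, c))
        return out | pderiv(r.right, c) if nullable(r.left) else out
    if isinstance(r, Star):
        return frozenset(cat(p, r) for p in pderiv(r.inner, c))
    return frozenset()  # Eps

def matches(r, s):
    # Run the implicit NFA: the current state set is a set of regexps.
    current = {r}
    for c in s:
        current = {p for q in current for p in pderiv(q, c)}
    return any(nullable(q) for q in current)

def states(r, alphabet):
    # Every regexp reachable by partial derivatives: the NFA's state set.
    seen, todo = {r}, [r]
    while todo:
        q = todo.pop()
        for c in alphabet:
            for p in pderiv(q, c):
                if p not in seen:
                    seen.add(p)
                    todo.append(p)
    return seen

# (a|b)*abb
ab_star = Star(Alt(Sym("a"), Sym("b")))
r = Cat(ab_star, Cat(Sym("a"), Cat(Sym("b"), Sym("b"))))
```

For this example `matches(r, "aabb")` is true and `matches(r, "ab")` is false, and `states(r, "ab")` contains just four regexps, so the whole automaton falls out of the syntax without a separate construction step.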