Expanding Regular Expressions with LogicT

By Alex Beal

April 8, 2017

Below is code from a previous post on expanding regular expressions. The way it handles non-terminating regular expressions is not optimal. Can you spot the issue? (Hint: the issue is not that it doesn’t terminate.)

    -- The regex AST
    data Regex
      = Lit Char            -- Character literals
      | Empty               -- The empty string
      | Concat Regex Regex  -- Concatenation of two regexs
      | Alt Regex Regex     -- Choice between two regexs
      | Kleene Regex        -- The Kleene star

    produceAll :: Regex -> [String]
    produceAll (Lit s) = return [s]
    produceAll Empty = return ""
    produceAll (Concat r1 r2) = do
      a <- produceAll r1
      b <- produceAll r2
      return (a ++ b)
    produceAll (Alt r1 r2) = produceAll r1 ++ produceAll r2
    produceAll (Kleene r) = do
      let concats = (fmap (\i -> foldr Concat Empty (replicate i r)) [0..])
      let expandedKleene = foldr Alt Empty concats
      produceAll expandedKleene

The issue is that expansion happens in an undesirable way when concatenation and alternatives are combined with non-terminating expressions. Here are two examples to illustrate:

Alternatives are explored from left to right. If I have the regex a*|b, the a* alternative is expanded until exhaustion. Because a* is never exhausted, b is never expanded.

Concatenations are explored by expanding the left branch until the first success and then exhausting the right branch. If I have the regex (a|b)c*, (a|b) is expanded until the first success, yielding a, then c* is expanded until exhaustion. Because c* is never exhausted, b is never explored.

We can observe this behavior by running a*|b and (a|b)c* against the interpreter:

    -- a*|b
    > take 10 $ produceAll (Alt (Kleene (Lit 'a')) (Lit 'b'))
    ["","a","aa","aaa","aaaa","aaaaa","aaaaaa","aaaaaaa","aaaaaaaa","aaaaaaaaa"]

    -- (a|b)c*
    > take 10 $ produceAll (Concat (Alt (Lit 'a') (Lit 'b')) (Kleene (Lit 'c')))
    ["a","ac","acc","accc","acccc","accccc","acccccc","accccccc","acccccccc","accccccccc"]

So a*|b only expands a*, and (a|b)c* only expands ac*.

A better implementation would alternate between the branches, so that every branch is eventually expanded even if one of them never terminates.
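To make "alternate" concrete, here is a minimal sketch on plain lists. The helper fairAppend is a made-up name for illustration and is not part of the interpreter or of any library; it only shows the idea of taking turns between two streams of results.

```haskell
-- Hypothetical helper: take one element from each list in turn,
-- instead of draining the first list before touching the second.
fairAppend :: [a] -> [a] -> [a]
fairAppend []       ys = ys
fairAppend (x : xs) ys = x : fairAppend ys xs

-- An infinite left branch no longer hides the right branch.
demo :: String
demo = take 6 (fairAppend (repeat 'a') "b")
```

In GHCi, demo evaluates to "abaaaa": the b from the finite branch shows up second, even though the other branch never runs out.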

To understand where this undesirable behavior comes from, let’s examine the Alt and Concat cases in the interpreter, starting with the Alt case:

    produceAll (Alt r1 r2) = produceAll r1 ++ produceAll r2

It’s straightforward to see that if produceAll r1 doesn’t terminate, produceAll r2 is never evaluated.
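The same effect can be reproduced with plain lists, independent of the regex interpreter. This is only a sketch; the names evens, unfairChoice, and firstFive are invented for the example.

```haskell
-- An infinite first alternative, standing in for produceAll r1.
evens :: [Int]
evens = [0, 2 ..]

-- (++) yields every element of its left argument before any element
-- of its right argument, so the -1 on the right is never reached.
unfairChoice :: [Int]
unfairChoice = evens ++ [-1]

-- The first five results all come from the left alternative.
firstFive :: [Int]
firstFive = take 5 unfairChoice
```

Here firstFive is [0,2,4,6,8], and no finite prefix of unfairChoice ever contains -1.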

Now the Concat case:

    produceAll (Concat r1 r2) = do
      a <- produceAll r1
      b <- produceAll r2
      return (a ++ b)

This is harder to see. Essentially, the first result of produceAll r1 is concatenated to all the results of produceAll r2, but because produceAll r2 is never exhausted, the interpreter never proceeds past the first expansion of r1. It might be helpful to think of this as two nested loops, where the inner loop is never exhausted.
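The nested-loop reading can be checked on a small list example. This is an illustrative sketch with made-up names (pairs, pairs'), where "xy" plays the role of produceAll r1 and [1 ..] plays the role of a non-terminating produceAll r2.

```haskell
-- In the list monad, do-notation behaves like nested loops:
-- for each a from the outer list, run through every b of the inner list.
pairs :: [(Char, Int)]
pairs = do
  a <- "xy"     -- outer loop
  b <- [1 ..]   -- inner loop, never exhausted
  return (a, b)

-- The same computation written as a list comprehension.
pairs' :: [(Char, Int)]
pairs' = [(a, b) | a <- "xy", b <- [1 ..]]
```

Taking any finite prefix, for example take 4 pairs, gives [('x',1),('x',2),('x',3),('x',4)]. The outer loop never advances to 'y', just as the interpreter never advances to b in (a|b)c*.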

What’s the solution? LogicT was designed to address precisely this issue. Kiselyov, Shan, and Friedman write:

Most existing backtracking monad transformers, including the ones presented by Hinze, suffer from three deficiencies in practical use: unfairness, confounding negation with pruning, and a limited ability to collect and operate on the final answers of a non-deterministic computation. First, the straightforward depth-first search performed by most implementations of MonadPlus is not fair: a non-deterministic choice between two alternatives tries every solution from the first alternative before any solution from the second alternative. When the first alternative offers an infinite number of solutions, the second alternative is never tried, making the search incomplete. […] Our contribution in this regard is to implement fair disjunctions and conjunctions in monad transformers and using control operators and continuations.

Alt corresponds to the disjunction case: Alt r1 r2 produces either r1 or r2. This case is unfair in the interpreter because it inherits the unfair semantics of the (++) operator. Concat corresponds to the conjunction case: Concat r1 r2 produces r1 and r2, conjoined as a single result. This case is unfair in the interpreter because it inherits the unfair semantics of the (>>=) operator on List.

LogicT provides fair disjunction and conjunction as interleave and >>-. Comparing these to their unfair counterparts (++) and (>>=), we see that the types line up precisely:

    interleave :: Logic a -> Logic a -> Logic a
    (++)       :: [a] -> [a] -> [a]

    (>>-) :: Logic a -> (a -> Logic b) -> Logic b
    (>>=) :: [a] -> (a -> [b]) -> [b]

The similar type signatures are no coincidence. Swapping the fair operators in for the unfair gives us the semantics we want:

    -- Control.Monad.Logic comes from the logict package
    import Control.Monad.Logic (Logic, interleave, observeMany, (>>-))

    produceAllFair :: Regex -> Logic String
    produceAllFair (Alt r1 r2) = produceAllFair r1 `interleave` produceAllFair r2
    produceAllFair (Concat r1 r2) =
      produceAllFair r1 >>- \a ->
      produceAllFair r2 >>- \b ->
      return (a ++ b)
    produceAllFair (Lit s) = return [s]
    produceAllFair Empty = return ""
    produceAllFair (Kleene r) = do
      let concats = (fmap (\i -> foldr Concat Empty (replicate i r)) [0..])
      let expandedKleene = foldr Alt Empty concats
      produceAllFair expandedKleene

Running our previous test cases against the new fair interpreter, where observeMany now takes the place of take, the results look much better:

    -- a*|b
    > observeMany 10 $ produceAllFair (Alt (Kleene (Lit 'a')) (Lit 'b'))
    ["","b","a","aa","aaa","aaaa","aaaaa","aaaaaa","aaaaaaa","aaaaaaaa"]

    -- (a|b)c*
    > observeMany 10 $ produceAllFair (Concat (Alt (Lit 'a') (Lit 'b')) (Kleene (Lit 'c')))
    ["a","b","ac","bc","acc","bcc","accc","bccc","acccc","bcccc"]

In a*|b, the b branch is no longer completely ignored in favor of the a* branch.

In (a|b)c*, the b branch is no longer completely ignored in favor of the a and c* branches.
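To see the two fair operators in isolation from the regex interpreter, here is a small sketch that compares them with their unfair counterparts on simple streams. It assumes the logict package is installed; nats, unfairAlt, fairAlt, unfairPairs, and fairPairs are names invented for this example.

```haskell
import Control.Applicative ((<|>))
import Control.Monad.Logic (Logic, interleave, observeMany, (>>-))

-- An infinite stream of naturals as a Logic computation.
nats :: Logic Int
nats = go 0
  where
    go n = return n <|> go (n + 1)

-- Unfair disjunction: (<|>) drains nats before trying the right side.
unfairAlt :: [Int]
unfairAlt = observeMany 4 (nats <|> return (-1))

-- Fair disjunction: interleave alternates between the two sides.
fairAlt :: [Int]
fairAlt = observeMany 4 (nats `interleave` return (-1))

-- Unfair conjunction: (>>=) stays on the first outer choice forever.
unfairPairs :: [(Char, Int)]
unfairPairs = observeMany 4 $
  (return 'x' <|> return 'y') >>= \c -> nats >>= \n -> return (c, n)

-- Fair conjunction: (>>-) interleaves the results of each outer choice.
fairPairs :: [(Char, Int)]
fairPairs = observeMany 4 $
  (return 'x' <|> return 'y') >>- \c -> nats >>= \n -> return (c, n)
```

Under these definitions, unfairAlt is [0,1,2,3] while fairAlt is [0,-1,1,2], and unfairPairs is [('x',0),('x',1),('x',2),('x',3)] while fairPairs is [('x',0),('y',0),('x',1),('y',1)]. The pattern mirrors the interpreter results: the fair versions reach the second alternative after a single step.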