MonadFix example: compiling regular expressions

{-# LANGUAGE RecursiveDo, BangPatterns #-} import Control.Applicative import Data.Function (fix) (fix) import Data.IntMap as IntMap import Control.Monad.Fix (mfix) (mfix) import Control.Monad.Trans.State import Control.Monad.Trans.Class (lift) (lift) import Text.Read (readMaybe) (readMaybe)

MonadFix is an odd beast; many Haskell programmers will never use it in their careers. It is indeed very rarely that one needs MonadFix; and for that reason, non-contrived cases where MonadFix is needed are quite interesting to consider.

In this article, I’ll introduce MonadFix and show how it can be handy for compiling the Kleene closure (also known as star or repetition) of regular expressions.

What is MonadFix?

If you hear about MonadFix for the first time, you might think that it is needed to define recursive monadic actions, just like ordinary fix is used to define recursive functions. That would be a mistake. In fact, fix is just as applicable to monadic actions as it is to functions:

= fix $ \ repeat -> do guessNumber mfix putStrLn "Enter a guess" n <- readMaybe <$> getLine readMaybe if n == Just m then putStrLn "You guessed it!" else do putStrLn "You guessed wrong; try again" repeat

So, what is mfix for? First, recall that in Haskell, one can create recursive definitions not just for functions (which makes sense in other, non-lazy languages) or monadic actions, but for ordinary data structures as well. This is known as cyclic (or circular, or corecursive) definitions; and the technique itself is sometimes referred to as tying the knot.

The classic example of a cyclic definition is the (lazy, infinite) list of Fibonacci numbers:

= 0 : 1 : zipWith ( + ) fib ( tail fib) fib) fib (fib)

Cyclic definitions are themselves rare in day-to-day Haskell programming; but occasionally, the right hand side will be not a pure value, but a monadic computation that needs to be run in order to obtain the value.

Consider this (contrived) example, where we start the sequence with an arbitrary number entered by the user:

= do fibIO1 putStrLn "Enter the start number" <- read <$> getLine start return $ start : 1 : zipWith ( + ) fibIO1 ( tail fibIO1) start) fibIO1 (fibIO1)

This doesn’t typecheck because fibIO is not a list; it’s an IO action that produces a list.

But if we try to run the computation, it doesn’t make much sense either:

= do fibIO2 putStrLn "Enter the start number" <- read <$> getLine start <- fibIO2 fibfibIO2 return $ start : 1 : zipWith ( + ) fib ( tail fib) start) fib (fib)

This version of fibIO will ask you to enter the start number ad infinitum and never get to evaluating anything.

Of course, the simplest thing to do would be to move IO out of the recursive equation; that’s why I said the example was contrived. But MonadFix gives another solution:

= mfix $ \fib -> do fibIO3mfix\fib putStrLn "Enter the start number" <- read <$> getLine start return $ start : 1 : zipWith ( + ) fib ( tail fib) start) fib (fib)

Or, using the do-rec syntax:

= do fibIO4 rec <- do fib putStrLn "Enter the start number" <- read <$> getLine start return $ start : 1 : zipWith ( + ) fib ( tail fib) start) fib (fib) return fib fib

Compiling regular expressions

As promised, I am going to show you an example usage of MonadFix that solved a problem other than “how could I use MonadFix?”. This came up in my work on regex-applicative.

For a simplified presentation, let’s consider this type of regular expressions:

data RE = Sym Char -- symbol | Seq RE RE -- sequence | Alt RE RE -- alternative | Rep RE -- repetition

Our goal is to compile a regular expression into a corresponding NFA. The states will be represented by integer numbers. State 0 corresponds to successful completion; and each Sym inside a regex will have a unique positive state in which we are expecting the corresponding character.

type NFAState = Int

The NFA will be represented by a map

type NFA = IntMap ( Char , [ NFAState ]) , [])

where each state is mapped to the characters expected at that state and the list of states where we go in case we get the expected character.

To compile a regular expression, we’ll take as an argument the list of states to proceed to when the regular expression as a whole succeeds (otherwise we’d have to compile each subexpression separately and then glue NFAs together). This is essentially the continuation-passing style; only instead of functions, our continuations are NFA states.

During the compilation, we’ll use a stack of two State monads: one to assign sequential state numbers to Sym s; the other to keep track of the currently constructred NFA.

-- Returns the list of start states and the transition table compile :: RE -> ([ NFAState ], NFA ) ([], = runState (evalStateT (go re [ 0 ]) 0 ) IntMap.empty compile rerunState (evalStateT (go re [])) IntMap.empty -- go accepts exit states, returns entry states go :: RE -> [ NFAState ] -> StateT NFAState ( State NFA ) [ NFAState ] ) [ = go re exitStates case re of re Sym c -> do ! freshState <- gets ( + 1 ); put freshState freshStategets (); put freshState $ modify' (IntMap.insert freshState (c, exitStates)) liftmodify' (IntMap.insert freshState (c, exitStates)) return [freshState] [freshState] Alt r1 r2 -> ( ++ ) <$> go r1 exitStates <*> go r2 exitStates r1 r2go r1 exitStatesgo r2 exitStates Seq r1 r2 -> go r1 =<< go r2 exitStates r1 r2go r1go r2 exitStates

This was easy so far: alternatives share their exit states and their entry states are combined; and consequtive subexpressions are chained. But how do we compile Rep ? The exit states of the repeated subexpression should become its own entry states; but we don’t know the entry states until we compile it!

And this is precisely where MonadFix (or recursive do) comes in:

Rep r -> do rec let allEntryStates = ownEntryStates ++ exitStates allEntryStatesownEntryStatesexitStates <- go r allEntryStates ownEntryStatesgo r allEntryStates return allEntryStates allEntryStates

Why does this circular definition work? If we unwrap the State types, we’ll see that the go function actually computes a triple of three non-strict fields:

The last used state number The list of entry states The NFA map

The elements of the triple may depend on each other as long as there are no actual loops during evaluation. One can check that the fields can be indeed evaluated linearly in the order in which they are listed above:

The used state numbers at each step depend only on the regular expression itself, so it can be computed wihtout knowing the other two fields. The list of entry states relies only on the state number information; it doesn’t need to know anything about the NFA transitions. The NFA table needs to know the entry and exit states; but that is fine, we can go ahead and compute that information without creating any reverse data dependencies.

Further reading

An ASM Monad – a similar example from a different domain.

Oliver Charles’s 24 Days of GHC Extensions: Recursive Do.

Levent Erkok’s thesis which contains all you need to know about MonadFix, including several other examples.

Todd Wilson points out that Douglas McIlroy describes a similar regular expression compilation technique in his 2004 JFP Functional Pearl Enumerating the strings of regular languages. Like this article, Douglas’s paper uses a circular definition when compiling the Kleene closure. But the circular definition is not monadic there: instead of using the State monad, Douglas passes the state around by hand.