The Problem



> {-# LANGUAGE TypeSynonymInstances,FlexibleInstances,MultiParamTypeClasses #-}

> import qualified Data.Array as B

> import Data.Array.Unboxed as U

> import Data.Foldable

> import Data.Monoid

> import Data.FingerTree hiding (fromList)

> import qualified Data.List as L





A Finite State Machine

.*(.*007.*).*

fsm



> fsm 0 '(' = 1

> fsm 0 _ = 0



> fsm 1 '0' = 2

> fsm 1 _ = 1



> fsm 2 '0' = 3

> fsm 2 _ = 1



> fsm 3 '7' = 4

> fsm 3 '0' = 3

> fsm 3 _ = 1



> fsm 4 ')' = 5

> fsm 4 _ = 4



> fsm 5 _ = 5







> matches s = Prelude.foldl fsm 0 s==5





matches "(00 7)"

matches "He(007xxxxxxxxxxxx)llo"

(.)

(.)

f . g

g

f



> tabulate f = array (0,5) [(x,f x) | x <- range (0,5)]







> letters = array (' ','z') [(i,tabulate (flip fsm i)) | i <- range (' ','z')]





mappend



> type Table = UArray Int Int



> instance Monoid Table where

> mempty = tabulate id

> f `mappend` g = tabulate (\state -> (U.!) g ((U.!) f state))





Table

Ints

Int

[0..5]

Table

letters



> matches' s = table!0==5 where

> table = mconcat (map ((B.!) letters) s)





matches

Table

Table

Table

Table

Size

Table



> data Elem a = Elem { getElem :: a } deriving Show

> data Size = Size { getSize :: Int } deriving (Eq,Ord,Show)



> instance Monoid Size where

> mempty = Size 0

> Size m `mappend` Size n = Size (m+n)





measure



> instance Measured (Size,Table) (Elem Char) where

> measure (Elem a) = (Size 1,(B.!) letters a)





A Fingertree



> type FingerString = FingerTree (Size,Table) (Elem Char)







> insert :: Int -> Char -> FingerString -> FingerString

> insert i c z = l >< (Elem c <| r) where (l,r) = split (\(Size n,_) -> n>i) z







> fromList = L.foldl' (|>) empty

> string = fromList (map Elem $ take 1000000 $ cycle "the quick brown fox jumped over the lazy dog")





fromList



> matches007 string = ((U.!) (snd (measure string)) 0)==5





An Interactive Loop

ghc --make -O5 -o regexp regexp.lhs



> loop string = do

> print $ "matches? " ++ show (matches007 string)

> print "(Position,Character)"

> r <- getLine

> let (i,c) = read r

> loop $ insert i c string



> main = do

> loop string





(100,'f')

foldl'



(100,'(')

(900000,')')

(20105,'0')

(20106,'0')

(20107,'7')



Discussion

Table

Consider this problem: Fix a regular expression R. Suppose you have a string of length N. There's not much you can do about it, you'll likely have to scan all N characters to test so see if the string matches R. But once you've performed the test, how fast can you test the string again if a small edit is made to it? It seems that in general you'd have to rescan the entire string, or at least rescan from where the edit was made. But it turns out that you can do regular expression matching incrementally so that for many changes you might make to the string, you only require O(log N) time to recompute whether the string matches. This is true even if characters at opposite ends of the string interact to make a succcessful match. What's more, it's remarkably straightforward to implement if we make use of fingertrees and monoids.I'm going to assume a bit of background for which resources can be found on the web: understanding some basics about monoids , understanding apfelmus's inspiring introduction to fingertrees or the original paper , and you'll need to be completely comfortable with the idea of compiling regular expressions to finite state machines.As usual, this is literate Haskell. But you need to have the fingertree package installed and we need a bunch of imports.So let's start with an example regular expression:. We're looking for "007" enclosed between parentheses, but the parentheses could be millions of characters apart.A standard technique for finding regular expressions is to compile them to a finite state automaton. It takes quite a bit of code to do that, but it is completely standard. So rather than do that, here's a finite state machine I constructed by hand for this regular expression:I've used the convention that an unlabelled edge corresponds to any input that isn't matched by another labelled edge. We can express the transitions as a functionThe initial state is 0 and a match corresponds to state 5.We can test a string in the standard way usingTryandWe can thing of the inputs as being functions acting on the automaton. Each input character is a function that maps the automaton from one state to another. We could use the Haskell composition function,, to compose these functions. Butdoesn't really do anything,is just a closure that says "when the time comes, applyand thenOn the other hand we could tabulate our transition functions as follows:We have one such tabulated function for each letter in our alphabet:Given two tabulated functions we can easily form the table of the composition function. In fact, our tabulated functions form a monoid withfor composition. I used unboxed arrays for performance:Note that we've cheated a bit. An object of typecould be any array ofindexed by. But if we promise to only build arrays indexed by elements ofand containing elements of the same range then our claim to monoidhood is valid.Given any string, we can compute whether it matches our regular expression by looking up the correspondingin ourarray, composing them, and then checking if the tabulated function maps the initial state 0 to the final state 5:This is slower and more complex than our original implementation of. But what we've now done is 'tease out' a monoid structure from the problem. If we store a string as a sequence of characters represented by a fingertree, we can store in each subtree the element ofcorresponding to the substring it represents. Every time the tree is rebalanced we need to recompute the correspondings. But that's fine, it typically involves only O(log N) operations, and we don't need to write any code, the fingertree will do it for us automatically. Once we've done this, we end up with a representation of strings with the property that we always know what the correspondingis. We can freely split and join such trees knowing that thewill always be up to date.The only slight complication is that I want to be able to randomly access the nth character of the tree. apfelmus explains that in his post. The change I need to make is that I'm going to use both themonoid and themonoid, so I need the product monoid We need to implementas in the fingertree paper And now we can define our strings as:The insertion routine is more or less what's in the paper:Note how I project out the size from the product monoid in order to insert at the correct position.Here's an example string. Adjust the length to suit your memory and CPU horsepower:(I use a strict form ofto ensure the tree actually gets built.)The actual match function simply projects out the second component of the monoid and again tests to see if it maps the initial state to the final state:I recommend compiling with optimisation, something likeNow you can run this interactively. Input values liketo insert an 'f' at position 100. It can take a good few seconds to compute the initial tree, but after that the matching process is instantaneous. (Actually, the second match might take a few seconds, that's because despite thethe tree hasn't been fully built.)A suitable sample input is:Note there is quite an overhead for this example. I'm storing an entirefor each character. But you can easily store chunks of string (like in a rope ). This means that some chunks will be rescanned when a string is edited - but rescanning a 1K chunk, say, is a lot less expensive than scanning a gigabyte file in its entirety. Working in blocks will probably speed up the initial scan too, a much smaller tree needs to be built.When Hinze and Patterson originally wrote the fingertree paper they were motivated by parallel prefix sum methods. Just about any parallel prefix algorithm can be converted to an incremental algorithm using fingertrees. This article is based on the idea of doing this with the parallel lexing scheme described by Hillis and Steele in their classic Connection Machine paper So why would you want to match against a fixed regular expression like this? Well this method extends to a full blown incremental lexer. This will lex quickly even if placing a character in a string changes the type of lexemes billions of characters away. See the Hillis and Steele paper for details.Note there's nothing especially Haskelly about this code except that Haskell made it easy to prototype. You can do this in C++, say, using mutable red-black trees.