Ideal finger 001 2015-04-05

Finite state automaton index for /usr/share/dict/words .

A finite set of known items. Represent them as byte sequences, hereafter words. Recognize them in insipid soups of bytes. Use a device, a finite state automaton. Now build it, naïvely, in OCaml. Hold on this is only a first step.

Finite state automata are directed graphs whose nodes are states and edges labelled arcs. A label is the byte consumed from the input word when the arc is followed. We totally order labels, it is always useful.

module Label = type t = int let compare : t -> t -> int = Pervasives . compare end

States can be accepting, whenever they recognize a word. They are determined by their set of outgoing arcs. This set of arcs is just a partial map from labels to target states. We also add a mysterious integer tag to states, it will be useful later, ignore it for now.

module State = module Lmap = Map . Make ( Label ) type t = { accepts : bool ; arcs : arcs ; tag : int } and arcs = t Lmap . t

To construct states, we start from a distinguished empty state which has no arcs and does not accept.

val empty : t let empty = { accepts = false ; arcs = Lmap . empty ; tag = 0 }

Starting from this input black hole we construct new states, by making them accepting or by repeatedly adding arcs leading to other states.

val accepting : t -> bool -> t let accepting s accepts = { s with accepts } val add_arc : t -> Label . t -> t -> t let add_arc s l t = { s with arcs = Lmap . add l t s . arcs }

For recognizing words we only need two operators on states. The first one to test if a state accepts, in case we end up there with the input exhausted. The second one to determine, given a state and a label, in which target state we end up, if there is one.

val accepts : t -> bool let accepts s = s . accepts val find : t -> Label . t -> t option let find s l = try Some ( Lmap . find l s . arcs ) with | Not_found -> None

Like labels we want to be able to compare states and especially test them for equality. We ignore tags. When are two states equal ? When they recognize the same set of words. How do we know ? When they are both either accepting or non accepting and have exactly the same set of arcs leading to states that recognizes the same set of words. Recurse.

val compare : t -> t -> int let rec compare s0 s1 = let c = Pervasives . compare s0 . accepts s1 . accepts in if c <> 0 then c else Lmap . compare compare s0 . arcs s1 . arcs

Finally the following functions allow to fold over the arcs of a state and access its tag or change it.

val fold_arcs : ( Label . t -> t -> ' a -> ' a ) -> t -> ' a -> ' a let fold_arcs f s acc = Lmap . fold f s . arcs acc val tag : t -> int let tag s = s . tag val retag : t -> int -> t let retag s tag = { s with tag } end

We have states. By definition they completely define the reachable subgraph of the larger automaton they are part of. Hence with this representation a finite state automaton is simply defined by its starting state.

module Fsa = struct type t = State . t

Recognizing a word is straightforward. We start from the initial state and follow the arcs according to the input byte sequence and see if we end up in a final state.

val accepts : t -> string -> bool let accepts s w = let max = String . length w - 1 in let rec loop k s = if k > max then State . accepts s else match State . find s ( Char . code w .[ k ]) with | None -> false | Some t -> loop ( k + 1 ) t in loop 0 s

Our goal is to build automata for recognizing finite sets of items. For that we start from a distinguished empty automaton which does not accept any byte sequence. Remember the black hole state.

val empty : t let empty = State . empty

To this empty automaton we want to repeateadly add words to recognize. To do this, given a new word, we first want to see which prefix of the word is already recognized by the automaton. For example if we have an automaton recognizing the word "words" and we try to add the word "worlds" we would get to the gray state below before being stuck, lacking an arc for label l .

The function below finds that stuck state and returns the reverse list of states we traversed to get to it. The first item of the list is the stuck state itself. The integer returned is the start position of the word's suffix that could not be recognized. If this value equals the length of the word, all needed states are present in the automaton; but we may still need to change the "stuck" (not really in that case) state to accept.

val find_prefix : t -> string -> int * State . t list let find_prefix s w = let max = String . length w - 1 in let rec loop k rev_path = if k > max then k , rev_path else let s = List . hd rev_path in match State . find s ( Char . code w .[ k ]) with | None -> k , rev_path | Some t -> loop ( k + 1 ) ( t :: rev_path ) in loop 0 [ s ]

Now that we know where we are stuck we can just fork a new branch of states from there to recognize the missing suffix. Applying this to our example results in the automaton below.

This is done with two functions. The first one adds the branch to the stuck state. It does this by constructing the branch backwards, from a newly created accepting state to the stuck state. The second one rebuilds the automaton we deconstructed from the stuck state back to the start state.

val add_branch : State . t -> string -> pos : int -> State . t let add_branch stuck w ~pos = let max = String . length w - 1 in let rec loop k next = let l = Char . code w .[ k ] in if k = pos then ( State . add_arc stuck l next ) else loop ( k - 1 ) ( State . add_arc State . empty l next ) in if pos > max then State . accepting stuck true else loop max ( State . accepting State . empty true ) val rebuild : string -> pos : int -> State . t list -> t = let rec rebuild w ~pos rev_path = match rev_path with | [ start ] -> start | t :: s :: rev_path -> let s = State . add_arc s ( Char . code w .[ pos ]) t in rebuild w ~pos :( pos - 1 ) ( s :: rev_path ) | [] -> assert false

With these functions it now becomes easy to add new words to an automaton.

val add_word : t -> string -> t let add_word s w = let pos , rev_path = find_prefix s w in let stuck = List . hd rev_path in let unstuck = add_branch stuck w ~pos in rebuild w ~pos :( pos - 1 ) ( unstuck :: List . tl rev_path )

However if go back to the automaton with words and worlds, we see that creating new branches creates states that are equal, i.e. that recognize the same set of words. In this example, the target states of d -labelled arcs all recognize the same set of words, namely {s}. In fact this algorithm builds a trie where each final state is unique to each word. It is easy to see that this is very wasteful and that common word suffixes will never share the same representation with this algorithm. One way to side-step the issue is to merge equivalent states after the fact to get to a minimal automaton like the following one.

Given that our automaton is acyclic by construction, minimizing is not too hard. Just perform a post-order enumeration of the directed acyclic graph and with each state look into a set of states to see if there is not already an equivalent one that exists. If that is the case use that one instead.

module Sset = Set . Make ( State ) val minimize : t -> t let minimize s = let rec minimize exists s = try exists , Sset . find s exists with | Not_found -> let explore_arc l t ( exists , s ) = let exists , t = minimize exists t in exists , State . add_arc s l t in let empty = State .( accepting empty ( accepts s )) in let init = exists , empty in let exists , s = State . fold_arcs explore_arc s init in Sset . add s exists , s in snd ( minimize Sset . empty s )

That's it, the first step is over. We can build minimal recognizers for finite sets of items.

Recognizing items is not enough. We want to associate data to items so that they can be queried for that information. Items are also ordered by their lexicographical byte order. We want to be able to perform range queries returning the data for all the items existing between two given items.

Considering these wishes, the unminimized trie structure produced by the add_word function retrospectively feels handy: each word ends with its own accepting state. We could simply store an index value for the word in the tag of that state and use this to index a table of associated data. Ranges seem tricky though and tries are a waste of memory anyways.

Let us go back to our minimal automaton. The problem is that many words end up being recognized with the same final state, so the tag cannot directly be used as an index value. However it turns out there exists a beautiful state tagging scheme applicable to any automaton and hence to the minimal one aswell, that will allow us to recover a unique index value during word recognition.

It is very simple. Given a state, tag it with the number of words that can be recognized by starting from this state. Remember, the set of added words is finite, the automaton is not a general one. It is simpler, acyclic, this number is thus finite.

How can we compute this number ? For a given state, we start with 0 if the state does not accept or 1 if it does and sum for each of its arcs, the number of words accepted by the target state. Recurse. Here again a simple post-order state enumeration — or let us say reconstruction — accumulating accepted word counts in state tags will compute the right numbers.

We define an index data structure as an automaton whose states are tagged with accepted word counts. We build it by minimizing a given automaton and retagging it as defined above — this could of course be done in a single pass.

type index = t val index : t -> index let index s = let s = minimize s in let rec retag s = let explore_arc l t ( wcount , s ) = let t = retag t in ( wcount + State . tag t , State . add_arc s l t ) in let wcount = if State . accepts s then 1 else 0 in let empty = State .( accepting empty ( accepts s )) in let init = wcount , empty in let wcount , s = State . fold_arcs explore_arc s init in State . retag s wcount in retag s

A by-product, it is now trivial to determine how many words the automaton accepts.

val index_word_count : index -> int let index_word_count s = State . tag s

Now gather all your items, order them lexicographically in a list. For each item consider its zero-based position in the list. This gives you a unique index number for an item. It happens that for a word this number can be derived with little cost from the tagged automaton during word acceptance.

Assume you are in a state s and found the label l you need to follow. Consider all the arcs that have a label smaller than l , had you taken one of these arcs you would have ended up with a word that is lexicographically smaller than yours. Summing up the number of words that are accepted by following each of these arcs, i.e. summing up their target's tag, and adding 1 if s itself is an accepting state gives you the number of elements in the lexicographical list of items you have skipped by taking the l -labelled arc. By computing and accumulating this number in each traversed state during word recognition you end up with the word's unique position in the lexicographical list.

val find_index : index -> string -> int option let find_index s w = let max = String . length w - 1 in let rec loop k s i = if k > max then ( if State . accepts s then Some i else None ) else let l = Char . code w .[ k ] in match State . find s l with | None -> None | Some t -> let add_skip l ' t i = if l ' < l then ( State . tag t ) + i else i in let i = State . fold_arcs add_skip s i in loop ( k + 1 ) t ( if State . accepts s then ( i + 1 ) else i ) in loop 0 s 0

Definining the index number as being the position in the lexicographical list makes range queries trivial. Compute the index of the starting range. Compute the index of the end range. You have the index value interval you need to consider.

Here is a wish we did not have. Given and index i returned by find_index find the corresponding word. This means we need to count i+1 items in the lexicographical list to get to our word. As we explained above we know how much items we skip in the lexicographical list by taking or not taking an arc. At each state we need to take the smallest arc that will allow us to skip enough lexicographically smaller items to get to our word. Accumulating the labels of the arcs we follow constructs the word corresponding to the index value.

val find_word : index -> int -> string option let find_word s i = let b = Buffer . create 42 in let rec loop s count = let count = if State . accepts s then count - 1 else count in if count = 0 then Some ( Buffer . contents b ) else let find_arc l t ( found , count as acc ) = match found with | Some _ -> acc | None -> let wcount = State . tag t in if wcount < count then ( found , count - wcount ) else ( Buffer . add_char b ( Char . chr l ); Some t , count ) in match State . fold_arcs find_arc s ( None , count ) with | None , _ -> None | Some t , count -> loop t count in loop s ( i + 1 ) end

Given a set of items, its tagged finite state automaton maintains a bijective map between an item and its zero-based position in the lexicographical order of items.

Other words. What do we get here ? A perfect, minimal, order-preserving and invertible hash function for any given finite set of items. Perfect, no collisions. Minimal, maps n items to n consecutive integers. Order-preserving, maps ordered items to ordered hashes. Invertible, give me the hash, I give you back the smoking word.

Throw out the buckets !

And the baby.

This is a naïve implementation. There are two main issues.

Firstly, our finite state automaton is minimal in the sense that it maximizes data sharing among accepted words. But its memory representation is not compact, it uses a huge amount of big, fat, memory pointers. This paper has a few hints on how finite state automata can be represented compactly in memory.

Secondly, building first the trie and only minimize it later can waste a huge amount of memory when you process millions of items. The solution is to construct the minimal automaton on the fly, this paper has an algorithm.

This state numbering scheme for constructing unique indices seems to be described for the first time in this paper.

Full naïve source.