Re: Henry Spencer's "Tcl" Regex Library

"Russ Cox" <rsc@swtch.com>

Tue, 2 Oct 2007 12:12:18 -0400

From: "Russ Cox" <rsc@swtch.com> Newsgroups: comp.compilers Date: Tue, 2 Oct 2007 12:12:18 -0400 Organization: Compilers Central References: 07-10-016 Keywords: lex, DFA Posted-Date: 03 Oct 2007 13:18:36 EDT

There are (at least) three different issues you've brought up --

the use of the term Hybrid NFA/DFA, whether Tcl does what

Michela does, and on-the-fly DFA constructions.









* Hybrid NFA/DFA





The first problem is your use of the term "hybrid NFA/DFA", which

unfortunately has been defined in Friedl's _Mastering Regular

Expressions_ to mean something different from what you mean.





Friedl doesn't ever define what a DFA or NFA is. He only says

that NFAs provide submatch information and backreferences, and

DFAs don't but run faster. The closest he comes to admitting that

these aren't really definitions is in the sidebar on p. 180:





As a programmer, if you have a true (mathematically speaking)

NFA regex engine, it is a relatively small task to add support

for backreferences. A DFA's engine's design precludes the

adding of this support, but an NFA's common implementation

makes it trivial. In doing so, you create a more powerful

tool, but you also make it decidedly nonregular (mathematically

speaking). What does this mean? At most, that you should

probably stop calling it an NFA, and start using the phrase

``nonregular expressions,'' since that describes (mathematically

speaking) the new situation. No one has actually done this,

so the name ``NFA'' has lingered, even though the implementation

is no longer (mathematically speaking) an NFA.





What does all this mean to you, as a user? Absolutely

nothing. As a user, you don't care if it's regular, nonregular,

unregular, irregular, or incontinent. So long as you know

what you can expect from it (something this chapter will

show you), you know all you need to care about.





http://regex.info/blog/2006-09-15/248





This "quacks like a duck" approach to calling things NFAs or

DFAs cannot accomodate implementations that run a DFA pass first

and then an NFA pass only to find submatches, or that manage to

use nothing but DFAs to find submatches. So Friedl calls them

"hybrid NFA/DFAs", because they kind of quack like both.

Table 4-1 (p. 145), which classifies programs by "engine type",

lists GNU awk, GNU grep/egrep, and Tcl as "Hybrid NFA/DFA".





The sidebar on p. 182 expands on this classification:





I've said several times that a DFA can't provide capturing

parentheses or backreferences. This is quite true, but it

certainly doesn't preclude hybrid approaches that mix

technologies in an attempt to reach regex nirvana. The

sidebar on page 180 told how NFAs have diverged from the

theoretical straight and narrow in search of more power, and

it's only natural that the same happens with DFAs. A DFA's

construction makes it more difficult, but that doesn't mean

impossible.





GNU grep takes a simple but effective approach. It uses a

DFA when possible, reverting to an NFA when backreferences

are used. GNU awk does something similar--it uses GNU grep's

fast shortest-leftmost DFA engine for simple "does it match"

checks, and reverts to a different engine for checks where

the actual extent of the match must be known. Since that

other engine is an NFA, GNU awk can conveniently offer

capturing parentheses, and it does via its special gensub

function.





Tcl's regex engine is a true hybrid, custom built by Henry

Spencer (whom you may remember having played an important

part in the early development and popularization of regular

expression [p. 88]). The Tcl engine someitmes appears to

be an NFA--it has lookaround, capturing parentheses,

back-references, and lazy quantifiers. Yet, it has true

POSIX longest-leftmost match ([p. 177]), and doesn't suffer

from some of the NFA problems that we'll see in Chapter 6.

It really seems quite wonderful.





Thus the use of "hybrid NFA-DFA" to describe Michela's work

unfortunately makes it sound, to those who know these terms only

through Friedl's book, like it's no different than GNU grep or Tcl.





Your one-sentence description makes it sound, to me, like it

*is* different from both the awk/grep hybrid and from Tcl; if

it's not too late, you might try to change the name to something

other than "hybrid".









* What does Tcl's regex do?





Let's assume a regexp with capturing parentheses but not

backreferences. (That is, /the (cat|dog) ran/ is okay, but

/(cat|dog) eat \1/ is not.) Let's also assume POSIX-style longest

match semantics are the goal.





Tcl's regex first runs an unanchored DFA search to find p, the

ending position of the shortest match (meaning the match with

leftmost end). If there is no match, stop, returning failure.





If the caller doesn't care about any match position info, stop,

returning success. Otherwise, it's time to figure out the exact

position of the leftmost longest match.





The leftmost longest match must start at or before p. Tcl's

regex loops over every character position from the beginning of

the string to p, trying an anchored DFA search to find the length

of the longest match starting at the current position. When it

finds a match, it stops the loop: that's the leftmost longest

match.





If the caller only cares about the position of the overall match

($0) but not any of the capturing parentheses ($1, $2, ...),

then stop, returning success. Otherwise, it's time to figure

out the positions of the capturing parentheses.





The algorithm up to this point is:





match(re, text, match, nmatch)

{

etext = text + strlen(text);

p = shortest(/.*re/, text, etext);

if(p == NULL)

return false;

if(nmatch == 0)

return true;

for(q = text; q <= p; q++){

eq = longest(/re/, q, etext);

if(eq != NULL){

match[0].start = q;

match[0].end = eq;

break;

}

}

if(nmatch == 1)

return true;





dissect(re, match[0].start, match[0].end, match);

}





The function dissect determines the position of every subexpression

in re, saving the ones corresponding to capturing parentheses.

It does this by recursing over the structure of the regexp and

searching using DFAs corresponding to the subpieces.





To dissect a concatenation, Tcl figures out the longest possible

match for the first half and checks whether the remainder matches

the second half. If so, done. If not, find the next longest

possible match for the first half, and so on:





dissect(re, text, etext, match)

{

if(re is concat: left right){

for(middle = etext; middle >= text; middle = p-1){

p = longest(/left/, text, middle);

if(p == NULL)

break;

if(longest(/right/, p, etext) == etext){

dissect(left, text, p, match);

dissect(right, p, etext, match);

}

}

panic("cannot dissect concat");

}





To dissect an alternation, Tcl tries each choice looking for the

one that uses all the text:





if(re is alt: left|right){

if(longest(/left/, text, etext) == etext)

dissect(left, text, etext, match);

else

dissect(right, text, etext, match);

}





If dissect comes across a capturing parenthesis, then it can

record the match.





if(re is capture n: (subexpr)){

match[n].start = text;

match[n].end = etext;

dissect(subexpr, text, etext, match);

}

}





(The regexp is actually a directed graph that might have cycles,

not a tree; regexps like a* are rewritten into (aa*|).)





Note that there is no backtracking here: each step determines

the subdissection and then moves on. There is no "and if that

doesn't work, ...".





An obvious optimization would be to make dissect do nothing if

the re being considered contained no capturing parens. The Tcl

code I'm looking at appears not to do that.





The run-time of this algorithm is a little tricky to analyze.

On a regexp of length m on a text of length n, and ignoring the

time to construct DFAs/NFAs from regexps...





If no submatch info is required it takes O(n) time.





If only match[0] is required this it O(n^2) time.





The top-level concatenation dissection requires O(n^2) time. In

the worst case you cut off just one character, making the next

levels O(1^2) and O((n-1)^2). Repeating that gives O(n^3) time.

Alternations might force you to do this multiple times, but no

more than m, which would be O(m * n^3) time.





I'm not sure how things like repetition or empty strings affect

the run-time. The fact that a* is rewritten into an effectively

infinite-length regular expression can probably bump the m up

to n, so that you could force O(n^4) behavior if you really tried.





Anyway, at least the run time is polynomial!









* On-the-fly DFA constructions





Thompson's 1968 CACM paper is hard to classify, since it doesn't

mention the terms NFA or DFA. If you're willing to read between

the lines then the compiled code is really an encoding of an NFA

that runs a subset simulation as it reads the input. It's kind

of building the DFA on the fly, if you view each successive CLIST

as a DFA state, but it's not caching any of the DFA states that

it builds, so it's more like an NFA subset simulation. Since

it doesn't cache at all, it's slower, but it can't possibly build

an exponentially-large machine.





Spencer's Tcl regex builds the DFA on the fly from the NFA

representation, using a cache to speed the DFA simulation. The

cache has a fixed maximum size, though, so it can't end up

building an exponentially-large machine either.









Russ



Return to the comp.compilers page.

Search the comp.compilers archives again.