Jul 12, 2016

Introduction

This is an example of a primitive regex parser and matcher generator. Given a regex pattern, it generates a C++ function that matches text against that pattern. The tool is written from scratch in F#. The aim is to use the code as an example of some simple compilation techniques for F# beginners.

Check out the code

Regex matcher example usage

Let’s assume that we want to find the text lolwat along with some possible variations such as lolwut and lolwaaaaat. A regex pattern for this would be:

lolw(aa*|u)t

Here the * (star) symbol means zero or more repetitions of the preceding expression, so aa* matches “a”, “aa”, “aaaaa” etc. The | (vertical bar) separates alternatives, here either aa* or “u”.

Run the tool like so:

regexgen.exe matchLolWatText "lolw(aa*|u)t" > matchLolWatText.cpp

The produced C++ file will have a function inside:

bool matchLolWatText(std::string text) {...}

The generated function takes a std::string text to match with the pattern and produces a bool result: true if a given text matches the pattern and false otherwise.

This is how it can be used from a C++ program:

    bool m;
    m = matchLolWatText("lolwat");    // true
    m = matchLolWatText("lolwut");    // true
    m = matchLolWatText("lolwaaat");  // true
    m = matchLolWatText("lolwuuut");  // false
    m = matchLolWatText("lol");       // false
    m = matchLolWatText("cat");       // false

Algorithms overview

The algorithms are taken from the book “Compilers: Principles, Techniques, and Tools” by Aho, Sethi and Ullman, aka the “Dragon Book”. The implementation uses a recursive descent parser to parse a regex into its syntax tree, applies Thompson’s construction to convert the tree to an NFA, then uses the subset construction (built on “empty closures”, i.e. ε-closures) to convert the NFA to a DFA. In the end the DFA is used to implement a regex matcher.

regex text → AST → NFA → DFA → C++ regex matcher

The algorithms are described in more detail in this article: Writing own regular expression parser. The difference here is that the regex matcher is done in both F# and C++. A code generator converts the DFA to C++ data structures embedded in the matching function. This article focuses on the data structures used, F#-specific implementation details and the C++ regex matcher generator.

Regex parser: from string to AST

The parser supports just a few expression types: concatenation, choice between alternatives and Kleene star closure.

Here’s the F# model of the expression AST:

    type Regex =
        | RegexChar of char
        | RegexStar of Regex
        | RegexConcat of (Regex * Regex)
        | RegexChoice of (Regex * Regex)

RegexChar is a primitive expression that marks the tree leaves. An expression like “a” is parsed to the value RegexChar('a'). It matches exactly one string: “a”.

RegexStar represents a Kleene closure. For example, the expression “a*” is parsed as RegexStar(RegexChar('a')). It matches the empty string and any string composed of repetitions of ‘a’, like “a” or “aaaa”.

RegexChoice represents a choice between alternatives. For example, the regex “a|b” is parsed as RegexChoice(RegexChar('a'), RegexChar('b')). It matches just 2 strings: either “a” or “b”.

RegexConcat represents concatenation, that is, one expression following the other. For example, the regex “ab” is parsed as RegexConcat(RegexChar('a'), RegexChar('b')). It matches exactly one string: “ab”.

All composite expression types (everything except RegexChar) can be nested in each other to form more interesting patterns. In the input string this grouping is denoted with parentheses. The parentheses are dropped after parsing, because the grouping is naturally represented by the nesting of expressions. For example, “(no)*” is parsed as RegexStar(RegexConcat(RegexChar('n'), RegexChar('o'))), and it matches strings like “”, “no”, “nono”, “nonono” etc.

Parentheses can be omitted where the operator priorities already imply the grouping. For example, the regex “((yes)|(no))” is the same as “yes|no”, which means that concatenation must have a higher priority than choice. The operators’ priorities from lowest to highest are:

1. choice (lowest priority)
2. concatenation
3. star
4. grouping (highest priority)

This is the order in which the recursive descent regex parser tries to match an input regex string (see the RegexStringParser module). Most expressions have a dedicated operator symbol, which makes them easy to identify and parse. The only problem is concatenation, which doesn’t have an operator character. To solve this, note that the right-hand side expression of RegexConcat can only start with “(” or a base alphabet character. In other words, if the next character is “*” or “|” then the concatenation rule can’t be applied (GRegexConcatComposite, see parseGRegexConcat).
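The priority order and the concatenation rule above can be sketched as a small recursive descent parser. This is only an illustration in C++, not the project's F# RegexStringParser: the names RegexParser and parseRegex are hypothetical, the alphabet is assumed to be alphanumeric characters, nested concatenations and choices are assumed to nest to the right, and instead of building a tree the functions return the AST printed in the constructor notation used in this article.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch: one parse function per priority level, lowest first.
class RegexParser {
public:
    explicit RegexParser(std::string pattern) : text_(std::move(pattern)) {}

    std::string parse() {
        std::string ast = parseChoice();
        if (pos_ != text_.size())
            throw std::runtime_error("unexpected character at position " + std::to_string(pos_));
        return ast;
    }

private:
    // choice := concat ('|' choice)?        (lowest priority)
    std::string parseChoice() {
        std::string left = parseConcat();
        if (peek() != '|') return left;
        ++pos_;  // consume '|'
        std::string right = parseChoice();
        return "RegexChoice(" + left + ", " + right + ")";
    }

    // concat := star concat?
    // Concatenation has no operator symbol, so it only continues when the
    // next character can start an expression: '(' or an alphabet character.
    std::string parseConcat() {
        std::string left = parseStar();
        if (!startsExpression(peek())) return left;
        std::string right = parseConcat();
        return "RegexConcat(" + left + ", " + right + ")";
    }

    // star := atom '*'*
    std::string parseStar() {
        std::string inner = parseAtom();
        while (peek() == '*') {
            ++pos_;  // consume '*'
            inner = "RegexStar(" + inner + ")";
        }
        return inner;
    }

    // atom := '(' choice ')' | alphabet character   (highest priority)
    std::string parseAtom() {
        char c = peek();
        if (c == '(') {
            ++pos_;  // consume '('
            std::string inner = parseChoice();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;  // consume ')'; the parentheses leave no trace in the AST
            return inner;
        }
        if (isAlphabetChar(c)) {
            ++pos_;
            return std::string("RegexChar('") + c + "')";
        }
        throw std::runtime_error("expected '(' or an alphabet character at position " +
                                 std::to_string(pos_));
    }

    // Returns '\0' at the end of input; '\0' is not an alphabet character.
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }

    // Assumption: the base alphabet is letters and digits.
    static bool isAlphabetChar(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    }
    static bool startsExpression(char c) { return c == '(' || isAlphabetChar(c); }

    std::string text_;
    std::size_t pos_ = 0;
};

std::string parseRegex(const std::string& pattern) {
    return RegexParser(pattern).parse();
}
```

With these assumptions, parseRegex("((yes)|(no))") and parseRegex("yes|no") produce the same text, because the parentheses only steer the parser and are not stored.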

NFA & DFA

The regex matcher implementation is based on a state machine where each transition between states is marked with an alphabet character. Matching is done by traversing the state machine graph, following the transitions and the input string character by character.

Building the matcher state machine is done in 2 stages. The first stage builds a “nondeterministic” state machine aka NFA (see buildNFA function). It is represented by the following type:

type NFA = array< Map<Option<char>, Set<int>>>

The states are marked with numbers (from zero). The i-th item in the NFA array represents all possible transitions from state i to other states. This state machine is nondeterministic, because each character can lead to several states (Set<int>). A transition arrow can also be marked with the empty string, which is represented by the None value of Option<char>.

The second stage builds a deterministic state machine aka DFA (see convertNFA2DFA). This state machine has only one possible target state for each transition and it doesn’t have “None” values. It is easy to match some text using a regex’s DFA. After the NFA2DFA conversion the following structure is built:

    type DFA = {
        transitions: array<Map<char, int>>
        finalStates: Set<int>
    }

The “transitions” part is similar to the NFA structure, except that the “char” alphabet character is not optional and each transition has exactly one target state. In addition, the DFA might have multiple final states.
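To make the two structures and the conversion between them more concrete, here is a rough C++ sketch of the empty closure and the subset construction. It is not a port of buildNFA or convertNFA2DFA: the names Nfa, Dfa, epsilonClosure, convertNfaToDfa and makeAStarNfa are hypothetical, std::optional<char> stands in for Option<char>, and the NFA is assumed to have a single start state and a single final state, as Thompson's construction produces. The sample NFA for “a*” is written by hand.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <vector>

using Label = std::optional<char>;  // Label{} plays the role of None (empty string)
using Nfa = std::vector<std::map<Label, std::set<int>>>;

struct Dfa {
    std::vector<std::map<char, int>> transitions;
    std::set<int> finalStates;
};

// All states reachable from `states` by following only empty transitions.
std::set<int> epsilonClosure(const Nfa& nfa, std::set<int> states) {
    std::vector<int> pending(states.begin(), states.end());
    while (!pending.empty()) {
        int state = pending.back();
        pending.pop_back();
        auto empty = nfa[state].find(Label{});
        if (empty == nfa[state].end()) continue;
        for (int target : empty->second)
            if (states.insert(target).second) pending.push_back(target);
    }
    return states;
}

// Subset construction: every DFA state is a set of NFA states.
Dfa convertNfaToDfa(const Nfa& nfa, int nfaStart, int nfaFinal) {
    Dfa dfa;
    std::map<std::set<int>, int> ids;     // subset of NFA states -> DFA state id
    std::vector<std::set<int>> subsets;   // DFA state id -> subset of NFA states

    // Returns the id of a subset, registering it as a new DFA state if needed.
    auto intern = [&](const std::set<int>& subset) {
        auto [it, inserted] = ids.emplace(subset, static_cast<int>(subsets.size()));
        if (inserted) {
            subsets.push_back(subset);
            dfa.transitions.emplace_back();
            if (subset.count(nfaFinal)) dfa.finalStates.insert(it->second);
        }
        return it->second;
    };

    intern(epsilonClosure(nfa, {nfaStart}));  // becomes DFA state 0
    for (std::size_t i = 0; i < subsets.size(); ++i) {
        // Collect, per character, every NFA state reachable from this subset.
        std::map<char, std::set<int>> moves;
        for (int state : subsets[i])
            for (const auto& [label, targets] : nfa[state])
                if (label) moves[*label].insert(targets.begin(), targets.end());
        for (const auto& [symbol, targets] : moves) {
            int target = intern(epsilonClosure(nfa, targets));
            dfa.transitions[i][symbol] = target;
        }
    }
    return dfa;
}

// Hand-written NFA for "a*": state 0 is the start, state 3 is final.
Nfa makeAStarNfa() {
    Nfa nfa(4);
    nfa[0][Label{}] = {1, 3};
    nfa[1][Label{'a'}] = {2};
    nfa[2][Label{}] = {1, 3};
    return nfa;
}
```

Calling convertNfaToDfa(makeAStarNfa(), 0, 3) gives a two-state DFA in which both states are final, which reflects that “a*” accepts the empty string as well as any run of ‘a’ characters.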

Regex matcher generator

Having the DFA data structure makes it possible to generate a regex matcher in a different programming language like C++. See an example of the generated code here: mymatcher.cpp. The F# Map type becomes the STL unordered_map, and the Set type becomes unordered_set. The regex matcher algorithm simply follows the state machine rules from state 0 towards one of the final states. A transition to the next state is made if the current input character matches the transition’s character. If the end of the input string is reached while the machine is in one of the final states, a match is declared.
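As an illustration of what such a matcher could look like, here is a hand-written sketch for the lolw(aa*|u)t pattern from the usage example. It is not the actual generator output: the state numbering was derived by hand and may differ from what the tool emits, and returning false as soon as no transition exists for the current character is an assumption about the behaviour.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical hand-built DFA for lolw(aa*|u)t, embedded in the function
// as static data. transitions[i] lists the transitions out of state i.
bool matchLolWatText(std::string text) {
    static const std::vector<std::unordered_map<char, int>> transitions = {
        {{'l', 1}},                        // state 0: start
        {{'o', 2}},                        // state 1: seen "l"
        {{'l', 3}},                        // state 2: seen "lo"
        {{'w', 4}},                        // state 3: seen "lol"
        {{'a', 5}, {'u', 6}},              // state 4: seen "lolw"
        {{'a', 5}, {'t', 7}},              // state 5: seen "lolwa", more 'a' allowed
        {{'t', 7}},                        // state 6: seen "lolwu"
        std::unordered_map<char, int>{},   // state 7: final, no way out
    };
    static const std::unordered_set<int> finalStates = {7};

    int state = 0;
    for (char c : text) {
        auto next = transitions[state].find(c);
        if (next == transitions[state].end()) return false;  // no rule for c
        state = next->second;
    }
    // The whole input is consumed: it matches only if we stopped in a final state.
    return finalStates.count(state) > 0;
}
```

Keeping the tables as static data inside the function means the generated file needs nothing except the standard headers, at the cost of building the tables on the first call.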

F# usage notes

Here are some notes and warnings for novice F# developers, based on the experience from this project:

(+) F# looks great when you describe a recursive data type like an AST or an expression grammar.

(+) F# is great for traversing recursive data structures. For example, searching in a tree looks very natural.

(!) F# requires discipline to keep the code readable. It’s very easy to pile up too much with the “|>” operator.

(!) There are 2 styles of combining function calls: assigning an intermediate result with “let” and using it later, or piping with “|>”. Mixing both styles arbitrarily might worsen readability.

(!) When you convert code from imperative style to functional style it’s hard to preserve speed and memory characteristics. Immutability easily leads to memory bloating with data copies in combination with recursion. Having a profiler is a must.

(-) What used to be simple loops in the imperative style in some cases turns into pairs of recursive functions with extra parameters, and in other cases becomes “for” generators with lots of mapping, picking and plucking.

(-) The mutually recursive functions ( let rec ... and ) must be fully defined before the IDE syntax recognition fully works again.

(-) The online F# documentation is cumbersome and it’s not easy to get an offline version with a quick search.

References

Cover image by fontplaydotcom under CC BY.