Debugging Regexes

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." -- JWZ

Regexes can sometimes be cryptic and difficult to understand. On top of that, when a regex fails, regex implementations rarely give you any feedback to figure out why or where. I've spent a suboptimal portion of my life staring bemusedly at regexes that weren't quite right, so I decided to write a small hack to fix my annoyances.

Since I had a regex implementation kicking around for Myrddin, I decided to instrument it to make a tool that would make it easier to see where regexes failed.

Pattern

[0-9]*(foo|bar|baz)[0-9]*

Inputs 123bar666 123bat999 Run!

Press 'Run' to see output!

The regex dialect used is documented on the Myrddin reference page

Understanding the Output

Before describing how it works, I'm going to go through some examples of what it dumps, and how it can be used to debug things:

We're going to be ignoring the fundamental rule of validating email addresses with regex ("don't") and running with a very broken email regex:

$ redump '[a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net)'

redump will sit waiting for text to match on stdin, so let's type in an email that matches our expectations:

valid@email.org Matched: valid@email.org group 1: org coverage: [a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net) ^ ^^^ ^^^ ^^^^^

This tells us, pretty clearly: The string matched, and the group that was selected was 'org', since that was the only parenthesized section. It also shows the coverage for the regex -- in other words, the part of the regex that we passed through in order to match the string. In this case, we covered these all clauses up to the final selection, where the final match only included com.

Now, on the other hand, If the regex fails to match, we also get an indicator showing where the regex was where it failed, making it far easier to figure out why it failed:

thingy_thing@hotmail.com Match failed: [a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net) ~~~~~~~~~~~~^ thingy_thing@hotmail.com ~~~~~~^ something@university.edu Match failed: [a-zA-Z0-9]+@[a-zA-Z0-9]+.(com|org|net) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ something@university.edu ~~~~~~~~~~~~~~~~~~~~~^

This tells us where we were in the string when the regex failed, as well as where we were in the regex. For example, with the input thingy_thing@hotmail.com , the regex failed because of the '_'. Matching up with the pattern, the failure is immediately after the [a-zA-Z0-9]+ . Ah, this pattern is failing because the first section of the pattern doesn't contain a _ .

The second input fails on the edu in the string, and the net in the pattern, again highlighting the source of the failure quickly and easily, showing where the farthest mismatch in the string is.

Note: in the current implementation of coverage printing, if we have a single clause, there is only one caret printed, even if it spans multiple characters. Therefore, if we passed through

[a-z]

We would mark that in coverage as

[a-z] ^

Instead of

[a-z] ^^^^^

This is a restriction of the current way that debug information about the regex is recorded, but is probably not too difficult to fix.

Getting the Code

The redump is built on top of the Myrddin regex library, and as a result is shipped with the Myrddin programming language. The dialect of regex implemented is documented here in the API reference for the library

The code lives here:

https://github.com/oridb/mc/blob/master/lib/regex/redump.myr

Implementing The Debugger

Execution Traces

The regex engine I'm using is implemented as a relatively simple NFA virtual machine, which will walk through a set of instructions, either matching the instruction or dying. So, for example, the regex abc will generate the following code, as dumped by redump -v abc :

0: `Ilbra 0 1: `Ibyte 97 (a) 2: `Ibyte 98 (b) 3: `Ibyte 99 (c) 4: `Irbra 0 5: `Imatch

The regex starts off at instruction 0. Ilbra 0 marks the start of a capture group 0, which captures the whole expression. The next three instructions will fail the match if the next byte in the pattern is not a , followed by b and c . The next Irbra instruction closes the match group started by Ilbra , recording the capture group 0. Finally, Imatch ends the expression successfully.

But this is a pretty boring example. What if we have more interesting regexes, like a*b ? This compiles to something like:

0: `Ilbra 0 1: `Ifork (2, 4) 2: `Ibyte 97 (a) 3: `Ijmp 1 4: `Ibyte 98 (b) 5: `Irbra 0 6: `Imatch

The structure looks pretty similar, but there are a couple of very interesting instructions in there. Ifork will split the execution stream, effectively branching both ways. Executing this regex program on 'aab' will succeed, giving the following execution trace:

The red squares indicate failed attempts at matching. The final green square indicates a successful match.

However, if run over 'aacdef', the first mismatched character, 'c', will cause the final thread to exit, since it matches neither live thread. The

Failures

If we want to know why a regex failed, the last thread alive is the most interesting one. Simply mapping from the final program counter to the offset in the regex is sufficient to know where the regex failed with enough fidelity to enable debugging.

To support debugging, the regex parser and compiler needed to be instrumented to keep track of where nodes in the AST were generated, so that they could be mapped back to offsets within the pattern. The debug information is stored along side the regex, and is pulled out by redump when formatting the output:

0: offset 0: `Ilbra 0 // start 1: offset 1: `Ifork (2, 4) // from * 2: offset 0: `Ibyte 97 (a) // from 'a' 3: offset 1: `Ijmp 1 // from * 4: offset 2: `Ibyte 98 (b) // from b 5: offset 0: `Irbra 0 // don't care 6: offset 0: `Imatch // don't care

There's not too much more to it. We know what match the regex failed on, where we were when we failed, and where in the pattern that came from.

Successes

Often, it's just as confusing trying to sort out why a regex failed as why it succeeded. A bug in the pattern can allow some essential portion to be skipped, or some portion may capture too much data. Coming up with a useful method of debugging successful matches is slightly trickier, but I think that the coverage solution presented above does a good job of making it easier to understand why a pattern matched.

The debug information needed for generating the coverage info is a superset of the information needed to debug failures. In addition to knowing the farthest PC of the thread, a full trace of the states that it hit is needed.

This is recorded during regex execution as per-thread a bit set of instruction pointers, indexed by the thread id. Each step advanced through the instructions is recorded, and the winning thread is stored at the end of regex execution.

Finally, after the success is recorded, the set of 'hit' locations is generated by mapping from the instruction pointers to the regex offsets, and printing a caret for every regex offset that was hit.

There's one problem with the coverage information being used for this purpose: Currently, each pc only maps to a single point, rather than a range. This causes some loss of fidelity, and is the reason that the coverage caret only shows up under the start of a multi character entity.

Fixing this bug would involve a bit of extra book keeping, but should be fixable.

Conclusion

That's it. Regexes are debuggable, and you can use this code to do it.

If you want to learn more about how regexes work in detail, I'd suggest reading Russ Cox's excellent series: https://swtch.com/~rsc/regexp/