Replacing regexps

Regular expressions are a pain. Their power cannot be doubted; a regular expression can describe complicated text patterns in an exceedingly concise manner. Using regular expressions, a program can perform all kinds of string parsing and recognition tasks. But they are also difficult to write, difficult to read, difficult to understand, and difficult to debug. Any but the most trivial of regular expressions are quite likely to contain errors. So it is not surprising that developers would think about replacing them with something better. But, as a recent discussion in the Python community shows, that replacement, like regular expressions themselves, may be difficult.

The compactness of regular expression syntax is part of its power, but also part of the problem. Consider even a very simple expression:

<A\b[^>]*>(.*)</A>

A reader familiar with this syntax will recognize that this expression matches the HTML <A> tag and sets aside the anchor text for later processing. But even experienced regular expression developers must look at that expression for a moment and think about how the various metacharacters affect each other before being able to say for sure what it does. It takes even longer to notice the subtle bug: this expression will be confused by the presence of multiple <A> tags in the text being searched. Because ".*" is greedy, the match runs from the first opening tag all the way to the last closing tag.
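To make that bug concrete, here is a minimal sketch using Python's standard re module. The sample HTML string and the variable names are made up for illustration, and the IGNORECASE flag is an assumption added so that lowercase tags match as well. The second pattern differs only in using the non-greedy ".*?" in place of ".*".

```python
import re

# The expression from the text; IGNORECASE is an assumption added here
# so that lowercase <a> tags are matched too.
GREEDY = re.compile(r"<A\b[^>]*>(.*)</A>", re.IGNORECASE)

# Same expression with a non-greedy ".*?", which stops at the first </A>.
LAZY = re.compile(r"<A\b[^>]*>(.*?)</A>", re.IGNORECASE)

# Hypothetical input containing two anchors on one line.
html = '<a href="/one">first</a> and <a href="/two">second</a>'

# The greedy group swallows everything up to the LAST closing tag.
greedy_text = GREEDY.search(html).group(1)

# The non-greedy version finds each anchor text separately.
lazy_texts = LAZY.findall(html)

print(greedy_text)  # first</a> and <a href="/two">second
print(lazy_texts)   # ['first', 'second']
```

The non-greedy form is still not a real HTML parser (it knows nothing about nesting, comments, or tags split across lines, for example), but it does show how a single character changes what the expression means.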

So how might one do better? That was Mike Meyer's question as he sought a more "pythonic" way of doing text matching. Needless to say, he is not the first to ask that kind of question; there are a number of attempts at better string matching out there. The first of those is arguably not "pythonic" at all: it is SnoPy, a port of the venerable SNOBOL language to Python.

SNOBOL was developed during the 1960s; it included pattern matching as a core feature of the language. Unlike regular expressions, SNOBOL was anything but concise. Concatenation of strings was explicit, "[abc]" was "Any("abc")", and so on. Nonetheless, SNOBOL was highly influential in this area, and one can see echoes of the language in current regular expressions. That said, SNOBOL is not heavily used now, and the Python SNOBOL module seems to have suffered the same fate; its last release was in 2002.

Another approach is the rxb.py module by Ka-Ping Yee. This module, posted in 2005, creates a new, relatively verbose but relatively readable language for the creation of patterns. Using this language, the regular expression shown above would look something like:

<A + any(wordchars + whitespace)> + label(1, anychars) + </A>

(Note to readers: the above is totally untested and should not be relied upon for production use.) This module, too, has not seen a great deal of use.
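For a feel of how such a verbose pattern language can be layered on top of regular expressions, here is a small self-contained sketch. It is not rxb's actual API: the Pat class and the helpers as_pat, any_number_of, none_of, and label are all hypothetical names invented for this illustration, and the whole thing simply builds an ordinary regular expression string behind the scenes.

```python
import re


class Pat:
    """Hypothetical wrapper around a fragment of regular expression source."""

    def __init__(self, source):
        self.source = source

    def __add__(self, other):
        # Concatenation is explicit: pattern + pattern (or pattern + string).
        return Pat(self.source + as_pat(other).source)

    def __radd__(self, other):
        # Handles string + pattern.
        return Pat(as_pat(other).source + self.source)

    def compile(self, flags=0):
        return re.compile(self.source, flags)


def as_pat(value):
    # Plain strings are treated as literal text, so they get escaped.
    return value if isinstance(value, Pat) else Pat(re.escape(value))


def any_number_of(pat):
    # Zero or more repetitions, like "*".
    return Pat("(?:" + as_pat(pat).source + ")*")


def none_of(chars):
    # Any single character not in chars, like "[^...]".
    return Pat("[^" + re.escape(chars) + "]")


def label(name, pat):
    # A named group, so the matched text can be retrieved later.
    return Pat("(?P<" + name + ">" + as_pat(pat).source + ")")


WORD_BOUNDARY = Pat(r"\b")
ANY_CHAR = Pat(".")

# The <A> expression from earlier in the article, spelled out in words.
anchor = (
    "<A" + WORD_BOUNDARY + any_number_of(none_of(">")) + ">"
    + label("text", any_number_of(ANY_CHAR))
    + "</A>"
)

match = anchor.compile().search('<A HREF="/index.html">front page</A>')
print(match.group("text"))  # front page
```

The result is easier to read aloud than the original expression, but it is several times longer, and it faithfully reproduces the greedy-matching bug; a friendlier notation does not, by itself, make the pattern correct.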

Various other packages are out there. For example, one can try to use Icon-style pattern matching with Python. For something completely different, there is the eGenix mxTextTools module, which allows the creation of text-matching programs in an assembly-like language, complete with goto constructs. mxTextTools is intimidating and not necessarily any easier to read than regular expressions, but it is said to be powerful and fast, and there are a number of real users.

Still, none of these seem likely to replace regular expressions as the first tool Python programmers reach for when they need to perform string matching. Python creator Guido van Rossum thinks things will stay that way:

I fear that regular expressions have this market cornered, and there isn't anything possible that is so much better that it'll drive them out.

Pushing aside an established incumbent is always hard, and regular expressions are well established indeed. It is never enough to simply be better in this situation; the proposed replacement has to be a lot better. As Guido noted, nothing seems to have come along which is that much better, and it may be that nothing ever will. For some medium-term value of "ever," anyway.

But, then, one also should not underestimate the ingenuity of free software developers. Or their persistence. People will almost certainly continue to throw themselves against this problem, and, maybe, somebody will come up with something interesting. Until then, we'll have to continue beating our heads against our desks as we try to figure out why our expressions don't work as intended.

