Building complex expressions with Perl 5.10

In an earlier tip we discussed some of the changes to regular expressions in Perl 5.10. In particular Perl 5.10 allows us to name the captures we want to make inside a regular expression. In this tip we explore more powerful capturing techniques, as well as using named groups to parse complex grammars.

Named captures

Perl's regular expressions set the match variables $1, $2 and friends by counting each opening parenthesis in an expression. When we embed other variables inside regular expressions, this can make it very hard to identify which match variable will be set for later parentheses. For example, which match variable will the last sequence of digits be placed into in the following expression? What if $customer_name_regexp contains parentheses too?

/ (\d+) $customer_name_regexp (\d+) /x;

In Perl 5.10 we can name our captures:

/ (?<account>\d+) $customer_name_regexp (?<credit>\d+) /x;

and then access our values using $+{account} and $+{credit}. While you won't see it very often, we're actually looking up entries in the special hash %+.
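To make the difference concrete, here is a minimal sketch. The contents of $customer_name_regexp, the input line, and the extra \s separators are assumptions invented for illustration. Because the embedded pattern brings two captures of its own, the final run of digits lands in $4 rather than $2, while the named lookups are unaffected.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical sub-pattern that brings two captures of its own.
my $customer_name_regexp = qr{ (\w+) \s (\w+) }x;

# Made-up input: account number, customer name, credit.
my $line = "12345678 Jane Citizen 250";

my ($account, $credit, $second, $fourth);
if ($line =~ / (?<account>\d+) \s $customer_name_regexp \s (?<credit>\d+) /x) {
    # Named lookups do not depend on how many groups came before.
    ($account, $credit) = ($+{account}, $+{credit});

    # Numbered variables do: $2 is part of the name, $4 holds the credit.
    ($second, $fourth) = ($2, $4);

    say "Account: $account";
    say "Credit:  $credit";
    say "\$2 is '$second', \$4 is '$fourth'";
}
```

Named groups are still numbered as well, which is why $4 and $+{credit} hold the same value here.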

We can use the power of named captures to allow us to build up complex regular expressions out of smaller, simpler pieces - and still trust them to work. We can use qr{} (quoted regular expressions) to create our regular expression snippets (these are covered in more detail in another tip).

In this example we build an expression to match a title and another to match a name before combining them to pull information out of a letter:

    my $title = qr{ (?<title> Mrs|Mr|Ms|Miss|Dr ) }x;
    my $name  = qr{ (?<name> \w+ ) }x;

    $letter =~ m{ Dear \s $title \s $name, }x;

    say "Title: $+{title}";
    say "Name: $+{name}";

As we have used named captures, we know that $+{title} will be set for any successful regular expression that includes the expression in $title. Most importantly, this makes our expressions much more maintainable. Rather than looking at ugly regexp syntax and numbered variables, we're now looking at meaningful names. If we update a regexp, say by allowing hyphens in names and ensuring we match word boundaries, we can be sure that all code that uses $name will use that update:

my $name = qr{ (?<name> \b [\w-]+ \b ) }x;
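As a quick check that the updated snippet slots into the combined expression unchanged, here is a small self-contained sketch. The letter text is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

my $title = qr{ (?<title> Mrs|Mr|Ms|Miss|Dr ) }x;
my $name  = qr{ (?<name> \b [\w-]+ \b ) }x;   # updated: hyphens allowed

# Hypothetical letter text.
my $letter = "Dear Dr Smith-Jones,\nThank you for your enquiry.";

my ($found_title, $found_name);
if ($letter =~ m{ Dear \s $title \s $name, }x) {
    ($found_title, $found_name) = ($+{title}, $+{name});
    say "Title: $found_title";
    say "Name: $found_name";
}
```

The combining expression did not change at all; only the $name building block did.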

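If a named capture is used more than once, $+{name} refers only to the leftmost group with that name that took part in the match. However, all matches can be accessed via the special %- hash, where each entry is a reference to an array of the captures made under that name: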

    my $account = qr{ (?<account> \d{8} ) }x;
    my $money   = qr{ \$ (?<money> \d+\.\d{2} ) }x;
    my $date    = qr{ (?<date> \d{4}-\d{2}-\d{2} ) }x;

    if (/Transferred $money from $account to $account on $date/) {
        say "From account: $-{account}[0]";
        say "To account: $-{account}[1]";
    }
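The following sketch compares the two hashes directly on a made-up input string. It relies on the behaviour documented in perlvar, where $+{NAME} refers to the leftmost defined group with that name and $-{NAME} holds an array reference covering every group with that name.

```perl
use strict;
use warnings;
use 5.010;

my $account = qr{ (?<account> \d{8} ) }x;

# Hypothetical input with two account numbers.
my $text = "from 11112222 to 33334444";

my ($single, @all);
if ($text =~ / from \s $account \s to \s $account /x) {
    $single = $+{account};          # one value only
    @all    = @{ $-{account} };     # every capture named "account"

    say "Via %+: $single";
    say "Via %-: @all";
}
```

Copying the array out of %- straight away is a cautious habit, since both hashes are reset by the next successful match.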

Back references

A common problem with Perl 5's normal back references is that you can't build up patterns which use them, as they rely on knowing the number of the capture group you want to match. In the following expression, we don't know what \2 will refer to, as $some_regexp may also contain captures:

/ (\d+) $some_regexp (\d+) \s+ \2 /x;

Perl 5.10 provides a new back reference syntax. We can use \g{-1} to refer to the previous capture, which means we can always be sure of getting the same result even if $some_regexp contains captures:

/ (\d+) $some_regexp (\d+) \s+ \g{-1} /x;

We can also use \g{1} to refer to the first capture (the same as \1), and \g{label} to refer to the first capture with a name of label. This last form lets us write regexps like the following, which capture an account number by name and then look for the same account later in the regexp:

/ (?<account> \d{8}) $some_regexp \g{account} /x
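Here is a short sketch of the three forms side by side. The contents of $some_regexp and the test strings are assumptions chosen so that the embedded pattern contributes one capture of its own.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical sub-pattern containing its own capture.
my $some_regexp = qr{ \s (\w+) \s }x;

my $absolute = qr{ (\d+) $some_regexp (\d+) \s+ \2 }x;       # \2 is now the word
my $relative = qr{ (\d+) $some_regexp (\d+) \s+ \g{-1} }x;   # always the nearest group
my $named    = qr{ (?<account> \d{8} ) $some_regexp \g{account} }x;

# \2 silently refers to the capture inside $some_regexp.
my $abs_matches_word = "10 foo 20 foo" =~ $absolute ? 1 : 0;

# \g{-1} still refers to the second run of digits.
my $rel_repeat   = "10 foo 20 20" =~ $relative ? 1 : 0;
my $rel_mismatch = "10 foo 20 10" =~ $relative ? 1 : 0;

# \g{account} finds the same account number again.
my $named_repeat = "12345678 to 12345678" =~ $named ? 1 : 0;

say "absolute matched the word: $abs_matches_word";
say "relative matched repeat:   $rel_repeat";
say "relative matched mismatch: $rel_mismatch";
say "named matched repeat:      $named_repeat";
```

The relative and named forms keep their meaning no matter how many groups the interpolated pattern adds.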

Grammars and (?(DEFINE))

The (?(DEFINE)...) construct allows us to define parts of a regular expression which are not immediately executed as part of a match, but which can be recursed into later using (?&NAME). This allows us to create powerful regular expressions which can match recursive structures such as grammars.

Let's look at an example designed to recognise a simple set of algebraic expressions. For example, we'd like to match the valid expressions a=x+1 and x=2, but not the invalid expression =a.

    my $expression = qr{
        (?(DEFINE)
            (?<expr>       (?&term) (?&opterm)? )
            (?<term>       (?&identifier) | (?&number) )
            (?<opterm>     (?&operator) (?&term) (?&opterm)? )
            (?<operator>   [=+*/-] )
            (?<identifier> [A-Za-z][A-Za-z0-9]* )
            (?<number>     [0-9]+ )
        )
        (?&expr)
    }x;

There's a lot going on here, so let's look at that line by line:

    my $expression = qr{

We use qr{} to create a quoted regular expression reference and assign that into $expression. This does not run the expression; we're just building it for later.

    (?(DEFINE)

This tells Perl that we're defining a set of rules. Nothing inside a DEFINE block captures text for us to use after the match; the parentheses only act to group terms together.

    (?<expr> (?&term) (?&opterm)? )

This defines the named rule expr and says that an expr is a term which may (optionally) be followed by an opterm. We'll find out what is allowed as a term and opterm as we progress through the expression.

    (?<term> (?&identifier) | (?&number) )

A term is either an identifier or a number.

    (?<opterm> (?&operator) (?&term) (?&opterm)? )

An opterm is an operator followed by a term, which may then be optionally followed by another opterm. Note here that we're defining an opterm in terms of itself!

    (?<operator> [=+*/-] )

An operator is one of the five basic algebraic operations (equals, plus, multiply, divide, and minus).

    (?<identifier> [A-Za-z][A-Za-z0-9]* )

An identifier is a sequence of letters and numbers, but this sequence must begin with a letter. For example, value and x3 are both considered valid identifiers.

    (?<number> [0-9]+ )

A number is one or more digits between 0 and 9.

    )

Finally, we close our define rule.

    (?&expr)

A define block merely specifies a set of rules. In order to use those rules we need to specify which rule to start with. Thus we tell Perl to recurse into the expr rule to start matching.

    }x;

The end of our expression. We're using extended regular expressions, so space characters and comments are ignored.

To show this in action, let's consider how a few expressions might be broken up:

    x = 3
        => expr
        => term opterm
        => ident op term
        => (x) (=) number
        => (x) (=) (3)

    x = a + b * c
        => expr
        => term opterm
        => ident op term opterm
        => (x) (=) ident op term opterm
        => (x) (=) (a) (+) ident op term
        => (x) (=) (a) (+) (b) (*) ident
        => (x) (=) (a) (+) (b) (*) (c)

However this cannot match the following:

    x = a =
        => expr
        => term opterm
        => ident op term opterm
        => (x) (=) ident op term
        => (x) (=) (a) (=) ???

because we're missing the final term.

In order to use the regular expression we've built up in $expression, we just include it inside a regular expression where we want it, as follows:

    while (<>) {
        say "That's an expression" if /^ $expression $/x;
    }

Unfortunately, the named blocks inside a DEFINE section do not capture, so additional work may be required to extract the information you're after. However this still allows the regexp engine to be used for some very powerful tasks that were previously impossible for the average developer.
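The sketch below runs the grammar against a few candidate strings and shows one possible piece of "additional work": wrapping the whole thing in an ordinary named capture. The group name whole, the \A and \z anchors, and the list of candidates are choices made for this illustration. Note that the grammar as written has no rule for whitespace, so the candidates contain none.

```perl
use strict;
use warnings;
use 5.010;

my $expression = qr{
    (?(DEFINE)
        (?<expr>       (?&term) (?&opterm)? )
        (?<term>       (?&identifier) | (?&number) )
        (?<opterm>     (?&operator) (?&term) (?&opterm)? )
        (?<operator>   [=+*/-] )
        (?<identifier> [A-Za-z][A-Za-z0-9]* )
        (?<number>     [0-9]+ )
    )
    (?&expr)
}x;

my %is_valid;
my ($whole, $inner);

for my $candidate ('a=x+1', 'x=2', '=a', 'x=a=') {
    # "whole" is an ordinary capture outside the DEFINE block, so it is kept.
    if ($candidate =~ m{ \A (?<whole> $expression ) \z }x) {
        $is_valid{$candidate} = 1;
        ($whole, $inner) = ($+{whole}, $+{expr});   # $+{expr} stays undefined
        say "'$candidate' is an expression";
    }
    else {
        $is_valid{$candidate} = 0;
        say "'$candidate' is not an expression";
    }
}
```

Capturing individual terms would need a similar extra layer of groups around each (?&NAME) call in the main pattern.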

This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this copyright notice attached.

