Readable and compositional regexes in Perl

Regexes don’t (always!) have to be an unreadable mess. For example, see this HN post on a little Clojure DSL for readable, compositional regexes. Here is the simple Clojure example that was given:

    (def datestamp-re
      (let [d {\0 \9}]
        (regex [d d d d :as :year] \- [d d :as :month] \- [d d :as :day])))

And the equivalent Perl regex “DSL” can be equally lucid:

    sub datestamp_re {
        qr/ (?<year> \d \d \d \d) - (?<month> \d \d) - (?<day> \d \d ) /x;
    }

The two things that make it easier to grok what's going on here are:

1. The x modifier on the end of qr//, which allows whitespace and newlines to be sprinkled into your regex pattern without any effect on the pattern matching. See perlre Modifiers.

2. “Named Capture Buffers”, which were added in perl 5.10:

    (?<year> \d{4})   # stores pattern matched in "year" buffer

The above not only gives a name to that capture buffer but also provides an excellent visual placeholder to help describe what you are trying to do with the regex.
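To make the effect of the x modifier concrete, here is a minimal sketch (the variable names are mine, not from the original example). It assumes perl 5.10 or later and shows two consequences of /x: comments can sit inside the pattern, and a literal space has to be escaped.

```perl
use strict;
use warnings;

# Under /x, whitespace and "#" comments inside the pattern are ignored,
# so every named group can carry its own explanation.
my $datestamp_re = qr/
    (?<year>  \d{4} )   # four-digit year
    -
    (?<month> \d{2} )   # two-digit month
    -
    (?<day>   \d{2} )   # two-digit day
/x;

# A literal space must be escaped (or written as \s) because /x
# would otherwise throw it away.
my $spaced_re = qr/ (?<first> \w+ ) \  (?<second> \w+ ) /x;

my $date_ok  = '2007-10-23'  =~ $datestamp_re ? 1 : 0;
my $space_ok = 'hello world' =~ $spaced_re    ? 1 : 0;
my $second   = $space_ok ? $+{second} : undef;

print "date matched: $date_ok, second word: $second\n";
```

One thing to watch for: a comment inside a qr/.../x pattern must not contain the pattern delimiter, since perl finds the end of the pattern before it knows anything about comments.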

When a regex with named captures matches, the captured text is recorded in the %+ hash variable:

    for my $date (qw/2007-10-23 20X7-10-23/) {
        printf "year:%d, month:%d, day:%d\n", @+{qw/year month day/}
            if $date =~ datestamp_re;
    }

    # => year:2007, month:10, day:23
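As a rough illustration of what looking captures up by name buys you, the sketch below feeds two differently ordered patterns through the same extraction code. The names $iso_re, $dmy_re and parse_date are hypothetical and only exist for this example.

```perl
use strict;
use warnings;

# Two patterns that list the same named groups in different orders.
my $iso_re = qr{ (?<year> \d{4}) - (?<month> \d{2}) - (?<day> \d{2}) }x;
my $dmy_re = qr{ (?<day> \d{2}) / (?<month> \d{2}) / (?<year> \d{4}) }x;

# Hypothetical helper: returns a hash ref of the named captures,
# or nothing if the string does not match.
sub parse_date {
    my ($string, $re) = @_;
    return unless $string =~ $re;

    # Copy the values out of %+ before another match overwrites them.
    my %date = map { ($_ => $+{$_}) } qw/year month day/;
    return \%date;
}

my $iso = parse_date('2007-10-23', $iso_re);
my $dmy = parse_date('23/10/2007', $dmy_re);

printf "%s-%s-%s\n", @{$iso}{qw/year month day/};
printf "%s-%s-%s\n", @{$dmy}{qw/year month day/};
```

With positional captures the caller would need to know that the year is $1 in one pattern and $3 in the other; with names the consuming code stays the same.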

This is much more flexible for dealing with regex captures than positional $1, $2, $3, etc. So it is not just more readable but also more compositional:

    # nice readable regex
    sub datestamp_re {
        my $year  = qr/ (?<year>  \d{4}) /x;
        my $month = qr/ (?<month> \d{2}) /x;
        my $day   = qr/ (?<day>   \d{2}) /x;
        qr/ $year - $month - $day /x;
    }

or:

    # DRY regex
    sub datestamp_re {
        my %re = map {
            my ($name, $digits) = @$_;
            $name => qr/ (?<$name> \d{$digits}) /x;
        } [ year => 4 ], [ month => 2 ], [ day => 2 ];
        qr/ $re{year} - $re{month} - $re{day} /x;
    }

and even:

    # regex generator
    sub re { qr/ (?<$_[0]> $_[1] )/x }

    sub regex {
        my $pattern = join q{}, @_;
        qr/ $pattern /x;
    }

    sub datestamp_re {
        regex re( year => '\d{4}' ), '-', re( month => '\d{2}' ), '-', re( day => '\d{2}' );
    }

Now that is a regex DSL 🙂
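To show how far the composition can go, here is a sketch that reuses the generator to build a bigger pattern out of whole sub-patterns. re() and regex() are slightly expanded copies of the helpers above; time_re and timestamp_re are hypothetical additions of mine, and the "T" separator is just an assumption for the example.

```perl
use strict;
use warnings;

# Expanded copies of the re() and regex() helpers from the post.
sub re {
    my ($name, $pattern) = @_;
    return qr/ (?<$name> $pattern ) /x;
}

sub regex {
    my $pattern = join q{}, @_;
    return qr/ $pattern /x;
}

sub datestamp_re {
    return regex( re( year => '\d{4}' ), '-', re( month => '\d{2}' ), '-', re( day => '\d{2}' ) );
}

# Hypothetical extension: a time-of-day pattern built from the same pieces.
sub time_re {
    return regex( re( hour => '\d{2}' ), ':', re( minute => '\d{2}' ) );
}

# regex() accepts compiled patterns as well as plain strings, so whole
# sub-patterns compose in the same way as individual fields.
sub timestamp_re {
    return regex( datestamp_re(), 'T', time_re() );
}

my @fields = qw/year month day hour minute/;
my %stamp;
if ('2007-10-23T14:05' =~ timestamp_re()) {
    @stamp{@fields} = @+{@fields};
}

print join(', ', map { "$_=$stamp{$_}" } @fields), "\n";
```

The named groups survive the nesting, so the caller of timestamp_re() reads year and minute out of %+ without caring how many layers of regex() sit in between.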

Note that when a name is used more than once in a pattern, the %+ hash variable only holds the first (leftmost defined) capture for that named buffer:

    use 5.010;   # enables say

    sub numbers_re {
        my $four = qr/ (?<four> \d{4}) /x;
        my $two  = qr/ (?<two> \d{2}) /x;
        qr/ $four - $two - $two /x;
    }

    if ('2007-10-23' =~ numbers_re) {
        say 'four => ', $+{four};
        say 'two => ', $+{two};
    }

    # four => 2007
    # two => 10

To get at the second $two (i.e. 23), use the %- hash variable, which stores all the captures for the relevant named buffer in an array reference:

    if ('2007-10-23' =~ numbers_re) {
        say 'two(s) => ', join ',' => @{ $-{two} };
    }

    # two(s) => 10,23
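The difference between %+ and %- also shows up when the same name sits in both branches of an alternation. The pattern below is a made-up example (not from the original post): only one branch can match, so one of the two buffers stays undefined.

```perl
use strict;
use warnings;

# The same group name appears in both branches of the alternation.
my $id_re = qr/ \A (?: id= (?<value> \d+ ) | name= (?<value> \w+ ) ) \z /x;

my ($first_defined, @all);
if ('name=perl' =~ $id_re) {
    $first_defined = $+{value};        # leftmost *defined* capture
    @all           = @{ $-{value} };   # one slot per group, in pattern order
}

print "first defined: $first_defined; all: ",
    join(',', map { defined($_) ? $_ : 'undef' } @all), "\n";
```

Here %+ skips the unmatched id= buffer and hands back the name= capture, while %- keeps a slot for every group with that name, including the one that did not take part in the match.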

/I3az/
