Practical Perl 6 Regexes
Brian Duggan
bduggan@matatu.org
DC Baltimore Perl Workshop, April 6, 2019

Why?

Perl 5 set the standard for regexes, BUT they are:

    terse
    full of special cases
    "write only"
    not composable

Perl 6 regexes are:

    not as terse
    consistent
    more readable
    first-class objects, and building blocks for grammars

Outline

Characters

Groups

Quantifiers

Capturing

Composing

Characters

Question: Which of these print True? (in Perl 6)

    say so 'abc' =~ /b/
    ===SORRY!=== Error while compiling example.p6
    Unsupported use of =~ to do pattern matching; in Perl 6 please use ~~
    at example.p6:1
    ------> say so 'abc' =~<HERE> /b/

    say so 'abc' ~~ /b/
    True

    say so 'abc' ~~ / 'b' /
    True

    say so 'abc' ~~ regex { b }
    True

    my regex letter-b { b }
    say so 'abc' ~~ / <letter-b> /
    True

Use / or regex to make a regex.

Characters: Literals

How about these?

    say so 'good' ~~ / good /
    True

    say so 'not-good' ~~ / not-good /
    ===SORRY!===
    Unrecognized regex metacharacter - (must be quoted to match literally)
    at example.p6:1
    ------> say so 'not-good' ~~ / not<HERE>-good /
    Unable to parse regex; couldn't find final '/'
    at example.p6:1
    ------> say so 'not-good' ~~ / not-<HERE>good /

    say so 'not-good' ~~ / 'not-good' /
    True

    say so 'schőn' ~~ / schőn /
    True

Use quotes inside a regex. Everything except alphanumeric characters and underscores must be quoted.

Characters: Spaces

    say so 'abc' ~~ / abc /
    True

    say so 'abc' ~~ / a b c /
    Potential difficulties:
        Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
        at example.p6:1
        ------> say so 'abc' ~~ / a<HERE> b c /
        Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
        at example.p6:1
        ------> say so 'abc' ~~ / a b<HERE> c /
    True

    say so 'a b c' ~~ / a b c /
    Potential difficulties:
        Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
        at example.p6:1
        ------> say so 'a b c' ~~ / a<HERE> b c /
        Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
        at example.p6:1
        ------> say so 'a b c' ~~ / a b<HERE> c /
    False

    say so 'a b c' ~~ / 'a b c' /
    True

    say so 'a b c' ~~ / a ' ' b ' ' c /
    True

    say so 'a b c' ~~ / a \s+ b \s+ c /
    True

    say so 'a b c' ~~ / a    # hey, this is a comment
                        \s+ b \s+ c /
    True

Spaces are not significant. Neither are comments.

Characters: Adverbs

    say so 'a b c' ~~ /a \s* b \s* c/;
    say so 'a b c' ~~ /a <ws> b <ws> c/;
    say so 'a b c' ~~ /:s a b c/;
    say so 'a b c' ~~ /:sigspace a b c/;
    True
    True
    True
    True

    say so 'ABC' ~~ /:i b/;
    say so 'ABC' ~~ /:ignorecase b/;
    True
    True

    say so 'abc' ~~ /:r b/;
    say so 'abc' ~~ /:ratchet b/;
    True
    True

Adverbs start with : .
Ratcheting makes matching much faster because there is no backtracking.
Sigspace improves readability.

Characters: Tokens and rules

    say so 'abc' ~~ regex { :r abc }
    say so 'abc' ~~ token { abc }
    True
    True

    say so 'a b c' ~~ token { :s a b c }
    say so 'a b c' ~~ rule { a b c }
    True
    True

A token is a regex with ratcheting.
A rule is a token with sigspace.
These are deep concepts! Tokens and rules are building blocks for grammars.
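To see what ratcheting changes, here is a small sketch (not from the talk; the variable names are made up for illustration). Without ratcheting, `\w+` can give back a character so the trailing `.` finds something to match. With `:r`, which is what a token turns on, `\w+` keeps everything it consumed and the match fails.

```perl6
# Backtracking regex: \w+ first grabs 'abc', then gives back 'c' so . can match.
my $backtracking = so 'abc' ~~ / \w+ . /;

# Ratcheted regex: \w+ grabs 'abc' and never gives anything back, so . fails.
my $ratcheted = so 'abc' ~~ / :r \w+ . /;

say $backtracking;   # True
say $ratcheted;      # False
```

This is the price of the speedup: a token only matches when no backtracking is needed.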

Characters: Back to basics

Vehicle Identification Numbers

    my $vin = '1FAHP3GNXBW107581';
    if $vin ~~ / I | O | Q / {
        say "Invalid VIN"
    } else {
        say "Maybe it's okay";
    }
    Maybe it's okay

For alternation, use | .

Character classes: TMTOWTDI

Alternation

    say so 'QUIT' ~~ / I | O | Q /
    True

    say so 'QUIT' ~~ / | I | O | Q /
    True

    say so 'QUIT' ~~ / | I
                       | O
                       | Q /
    True

    say so 'QUIT' ~~ / <[IOQ]> /
    True

You can put an extra | at the beginning.
Construct character classes using <[ and ]> .

Character classes

    say so 'e' ~~ / <[a e i o u]> /
    True

    say so 'b' ~~ / <[a..e]> /
    True

    my regex vowels { <[a e i o u]> }
    say so 'e' ~~ / <vowels> /;
    True

Put lists of characters or ranges in character classes.
Spaces can be in character classes.

Character classes: Negate character classes

    my regex not-vowels { <-[aeiou]> }
    say so 'x' ~~ / <not-vowels> /;
    say so '!' ~~ / <not-vowels> /;
    True
    True

    my regex consonants { <[a..z] - [aeiou]> }
    say so '!' ~~ / <consonants> /;
    say so 'x' ~~ / <consonants> /;
    False
    True

Take the complement of a character class with <-[ ... ]> .
Or use - to take the set difference.

Groups: Grouping

Brackets make a non-capturing group.

    say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] <[0..9]> /
    False

Like (?:...) from Perl 5.

Groups: Grouping

Digression -- why did that not match?

    say so 'sat, apr 6' ~~ / 'sat, apr 6' /
    True

    say so 'sat, apr 6' ~~ / 'sat, ' 'apr ' '6' /
    True

    say so 'sat, apr 6' ~~ / sat ', ' apr ' ' 6 /
    True

    say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] ' ' <[0..9]> /
    True

Groups: Grouping

Spot the difference

    say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] <[0..9]> /
    False

    say so 'sat, apr 6' ~~ / [ sat | sun ] ', ' [ mar | apr | may ] ' ' <[0..9]> /
    True

Groups: Anyway, back to groups

    say so 'sat, apr 6' ~~ / < sat sun> ', ' < mar apr may> ' ' <[0..9]> /
    True

Start < > with a space to make a word list.

Groups: As usual, TMTOWTDI

    my @days = <sat sun>;
    my @months = <mar apr may>;
    say so 'sat, apr 6' ~~ / @days ', ' @months ' ' <[0..9]> /
    True

Or use an array. Scalars are interpolated too, btw.

How about two-digit days?
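A minimal sketch of scalar interpolation, reusing the arrays from the slide. The variables `$sep` and `$dot` are hypothetical names introduced here. A string held in a scalar is matched literally, so a `.` inside it does not act as a wildcard.

```perl6
my @days   = <sat sun>;
my @months = <mar apr may>;
my $sep    = ', ';   # hypothetical separator variable

# The scalar is interpolated into the regex as a literal string.
my $date-ok = so 'sat, apr 6' ~~ / @days $sep @months ' ' <[0..9]> /;

# Interpolated strings are not treated as regex source: '.' only matches a dot.
my $dot          = '.';
my $literal-dot  = so 'a.c' ~~ / a $dot c /;
my $not-wildcard = so 'abc' ~~ / a $dot c /;

say $date-ok;        # True
say $literal-dot;    # True
say $not-wildcard;   # False
```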

Quantifiers

    say so 'a' ~~ / a? /;      # 0 or 1
    True

    say so 'a' ~~ / a* /;      # 0 or more
    True

    say so 'a' ~~ / a+ /;      # 1 or more
    True

    say so 'a' ~~ / a**2 /;    # exactly 2
    False

    say so 'a' ~~ / a**1..5 /; # 1 to 5
    True

Use ? , * , and + as usual.
Use ** (exponentiation) for values or ranges.

Quantifiers

    my @days = <sat sun>;
    my @months = <mar apr may>;
    say so 'sat, apr 6' ~~ / @days ', ' @months ' ' <[0..9]> ** 1..2 /
    True
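The same pieces could tighten the earlier VIN check. This is only a sketch, assuming a VIN is exactly 17 uppercase letters or digits with I, O and Q excluded. The regex name `vin-char` and the second test string are made up for illustration.

```perl6
# Any uppercase letter or digit, minus the three letters a VIN may not contain.
my regex vin-char { <[A..Z 0..9] - [IOQ]> }

# Anchor both ends and require exactly 17 such characters.
my $vin-ok = so '1FAHP3GNXBW107581' ~~ /^ <vin-char> ** 17 $/;

# Same string with a letter I in place of one digit: should be rejected.
my $bad-ok = so '1FAHP3GNXBWI07581' ~~ /^ <vin-char> ** 17 $/;

say $vin-ok;   # True
say $bad-ok;   # False
```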

Quantifiers: Modified Quantifiers

    my regex part { <-[/]>+ }
    my regex path { '/' [ <part> '/' ]* <part> }
    say so '/home/brian/talk.txt' ~~ / <path> /
    True

    my regex part { <-[/]>+ }
    my regex path { '/' <part>* % '/' }
    say so '/home/brian/talk.txt' ~~ / <path> /;
    True

% means "separated by": A* % B is shorthand for [ A [ B A ]* ]? .
Works for other quantifiers too ( + , ** ).
Useful with separators like , .
See also %% , which also allows a trailing separator.
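A short sketch of how `%` and `%%` differ, using made-up comma-separated strings. With `%` the separator may only appear between items. With `%%` one trailing separator is also accepted.

```perl6
# Items separated by commas, no trailing comma: % matches.
my $plain    = so 'a,b,c'  ~~ /^ \w+ % ',' $/;

# A trailing comma is not allowed by % ...
my $trailing = so 'a,b,c,' ~~ /^ \w+ % ',' $/;

# ... but it is allowed by %% .
my $allowed  = so 'a,b,c,' ~~ /^ \w+ %% ',' $/;

say $plain;      # True
say $trailing;   # False
say $allowed;    # True
```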

Capturing

    say 'abc' ~~ / abc /;
    ｢abc｣

    my $match = 'abc' ~~ / abc /;
    say $match.WHAT;
    (Match)

A match returns a match object.

    'abc' ~~ / abc/;
    say $/.WHAT;
    say $/;
    (Match)
    ｢abc｣

The most recent match is stored in $/ .
Use say to print $/.gist , which provides the match tree.

Capturing: Captures

    'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
    say $/;
    ｢hello, world｣
     0 => ｢world｣

Parentheses will capture.

    'hello, world' ~~ /^ [ <-[,]>+ ] ', ' (.*) $/;
    say $/[0];
    say ~$/[0];
    ｢world｣
    world

You can get positional captures by treating $/ like an array.
Stringify with ~ .

Capturing: Named Captures

    my regex word { <-[,]>+ }
    'hello, world' ~~ /^ <word> ', ' (.*) $/;
    say $/;
    ｢hello, world｣
     word => ｢hello｣
     0 => ｢world｣

Named captures use the names of embedded regexes.
The match tree can help.

Capturing: Named Captures

    my regex word { <-[,]>+ }
    'hello, world' ~~ /^ <word> ', ' (.*) $/;
    say $/{'word'};
    say $/<word>;
    say $<word>;     # all the same
    ｢hello｣
    ｢hello｣
    ｢hello｣

When accessing named captures in $/ , you can omit the / .

Capturing: Named Captures

    my regex word { <-[,]>+ }
    'hello, world' ~~ /^ <word> ', ' <word> $/;
    say $<word>;
    say $<word>[0];
    [｢hello｣ ｢world｣]
    ｢hello｣

It's matches all the way down.

Capturing: Named Captures

    my regex word { <-[,]>+ }
    say 'new york, new york' ~~ /^ <word> ', ' $<word> $/;
    ｢new york, new york｣
     word => ｢new york｣

You can interpolate the match variable in the regex to be clever.

Capturing: Named Captures

    my regex word { <-[,]>+ }
    say 'oh, ho' ~~ /^ <word> ', ' <{ $<word>.flip }> $/;
    ｢oh, ho｣
     word => ｢oh｣

You can even put code in the regex if you want to be very clever.

Capturing: Restricted Captures

    my regex char { <-["]> | '\"' }
    my regex quoted { '"' <char>* '"' }
    'a "good" program' ~~ / <quoted> /;
    say ~$<quoted>;
    "good"

Capturing: Restricted Captures

    my regex char { <-["]> | '\"' }
    my regex quoted { '"' <( <char>* )> '"' }
    'a "good" program' ~~ / <quoted> /;
    say ~$<quoted>;
    good

Pro tip: use <( and )> to restrict the entire match.

Composing: Grammars

    grammar G {
        regex TOP { 'a ' <quoted> ' program' }
        regex letters { <[a..z]>+ }
        regex quoted { '"' <( <letters> )> '"' }
    }
    say G.parse('a "good" program');
    ｢a "good" program｣
     quoted => ｢good｣
      letters => ｢good｣

Put regexes together into grammars.

Composing: Grammars

    grammar G {
        rule TOP { a <quoted> program }
        token letters { <[a..z]>+ }
        token quoted { '"' <( <letters> )> '"' }
    }
    say G.parse('a "good" program');
    ｢a "good" program｣
     quoted => ｢good｣
      letters => ｢good｣

Reminders:
Use token for regexes that don't need backtracking.
Use rule for tokens with sigspace.
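The parse result is itself a match object, so the capture syntax from earlier applies to it. A small sketch using the grammar from the slide (the variable names and the failing input are made up for illustration):

```perl6
grammar G {
    rule  TOP     { a <quoted> program }
    token letters { <[a..z]>+ }
    token quoted  { '"' <( <letters> )> '"' }
}

# A successful parse returns a match tree; index into it by rule name.
my $m       = G.parse('a "good" program');
my $quoted  = ~$m<quoted>;            # restricted by <( )> to the text inside the quotes
my $letters = ~$m<quoted><letters>;   # nested capture

# An input the grammar cannot parse yields an undefined result.
my $failed = G.parse('a good program');

say $quoted;           # good
say $letters;          # good
say $failed.defined;   # False
```

Checking `.defined` on the result is one simple way to tell a successful parse from a failed one.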

Composing: Grammars

Examples are on modules.perl6.org and docs.perl6.org.
See also JSON-Tiny, or Protobuf (EBNF).

Have fun!

The End