Recently I read the course material of the Rakudo and NQP Internals Workshop, and had an idea for a small optimization for the regex engine. Last night I implemented it, and I'd like to walk you through the process.

As a bit of background, the regex engine that Rakudo uses is actually implemented in NQP, and used by NQP too. The code I am about to discuss all lives in the NQP repository, but Rakudo profits from it too.

In addition, one should note that the regex engine is mostly used for parsing with grammars, a process that involves nearly no scanning. Scanning is the process where the regex engine first tries to match the regex at the start of the string and, if it fails there, moves to the second character in the string, tries again, and so on until it succeeds or runs out of string.
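To make the notion of scanning concrete, here is a minimal sketch in Python (not NQP, and not the engine's actual code). The function name scan is my own choice, and Python's re module merely stands in for "try the regex anchored at this position".

```python
import re

def scan(pattern, text):
    """Try an anchored match at position 0, 1, 2, ... until one succeeds."""
    for pos in range(len(text) + 1):
        m = pattern.match(text, pos)  # anchored at pos, no searching
        if m is not None:
            return pos, m.group(0)
    return None  # no start position worked: the match fails

print(scan(re.compile(r"b+c"), "aaabbbc"))  # (3, 'bbbc')
```

Every failed attempt costs a full trip into the regex, which is the overhead the optimization below tries to avoid.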

But regexes that users write often involve scanning, and so my idea was to speed up regexes that scan, and where the first thing in the regex is a literal. In this case it makes sense to find possible start positions with a fast string search algorithm, for example the Boyer-Moore algorithm. The virtual machine backends for NQP already implement that as the index opcode, which can be invoked as start = index haystack, needle, startpos, where the string haystack is searched for the substring needle, starting from position startpos.
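As a rough illustration of what the index op offers, the following Python sketch uses str.find as a stand-in (it has the same haystack, needle, startpos shape and also returns -1 when nothing is found). The wrapper index and the helper candidate_starts are hypothetical names for this example only, and Python's search algorithm is not necessarily the one the VM backends use.

```python
def index(haystack, needle, startpos):
    """Stand-in for the index op: first offset >= startpos, or -1."""
    return haystack.find(needle, startpos)

def candidate_starts(haystack, needle):
    """Collect every offset where a regex starting with `needle` could match."""
    starts = []
    pos = index(haystack, needle, 0)
    while pos != -1:
        starts.append(pos)
        pos = index(haystack, needle, pos + 1)
    return starts

print(candidate_starts("xxabxxxabab", "ab"))  # [2, 7, 9]
```

Only these offsets need a real regex match attempt; all other positions can be skipped without entering the regex at all.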

From reading the course material I knew I had to search for a regex type called scan, so that's what I did:

    $ git grep --word scan
    3rdparty/libtommath/bn_error.c: /* scan the lookup table for the given message
    3rdparty/libtommath/bn_mp_cnt_lsb.c: /* scan lower digits until non-zero */
    3rdparty/libtommath/bn_mp_cnt_lsb.c: /* now scan this digit until a 1 is found
    3rdparty/libtommath/bn_mp_prime_next_prime.c: /* scan upwards
    3rdparty/libtommath/changes.txt: -- Started the Depends framework, wrote d
    src/QRegex/P5Regex/Actions.nqp: QAST::Regex.new( :rxtype<sca
    src/QRegex/P6Regex/Actions.nqp: QAST::Regex.new( :rxtype<sca
    src/vm/jvm/QAST/Compiler.nqp: method scan($node) {
    src/vm/moar/QAST/QASTRegexCompilerMAST.nqp: method scan($node) {
    Binary file src/vm/moar/stage0/NQPP6QRegexMoar.moarvm matches
    Binary file src/vm/moar/stage0/QASTMoar.moarvm matches
    src/vm/parrot/QAST/Compiler.nqp: method scan($node) {
    src/vm/parrot/stage0/P6QRegex-s0.pir: $P5025 = $P5024."new"("scan" :named("rx
    src/vm/parrot/stage0/QAST-s0.pir:.sub "scan" :subid("cuid_135_1381944260.6802")
    src/vm/parrot/stage0/QAST-s0.pir: push $P5004, "scan"

The binary files and .pir files are generated code included just for bootstrapping, and not interesting for us. The files in 3rdparty/libtommath are there for bigint handling, thus not interesting for us either. The rest are good matches: src/QRegex/P6Regex/Actions.nqp is responsible for compiling Perl 6 regexes to an abstract syntax tree (AST), and src/vm/parrot/QAST/Compiler.nqp compiles that AST down to PIR, the assembly language that the Parrot Virtual Machine understands.

So, looking at src/QRegex/P6Regex/Actions.nqp the place that mentions scan looked like this:

    $block<orig_qast> := $qast;
    $qast := QAST::Regex.new( :rxtype<concat>,
                 QAST::Regex.new( :rxtype<scan> ),
                 $qast,
                 ($anon
                      ?? QAST::Regex.new( :rxtype<pass> )
                      !! (nqp::substr(%*RX<name>, 0, 12) ne '!!LATENAME!!'
                            ?? QAST::Regex.new( :rxtype<pass>, :name(%*RX<name>) )
                            !! QAST::Regex.new( :rxtype<pass>,
                                   QAST::Var.new(
                                       :name(nqp::substr(%*RX<name>, 12)),
                                       :scope('lexical')
                                   )
                               )
                          )));

So to make the regex scan, the AST (in $qast) is wrapped in QAST::Regex.new( :rxtype<concat>, QAST::Regex.new( :rxtype<scan> ), $qast, ...), plus some stuff I don't care about.

To make the optimization work, the scan node needs to know what to scan for, if the first thing in the regex is indeed a constant string, aka literal. If it is, $qast is either directly of rxtype literal, or a concat node where the first child is a literal. As a patch, it looks like this:

    --- a/src/QRegex/P6Regex/Actions.nqp
    +++ b/src/QRegex/P6Regex/Actions.nqp
    @@ -667,9 +667,21 @@ class QRegex::P6Regex::Actions is HLL::Actions {
                 self.store_regex_nfa($code_obj, $block, QRegex::NFA.new.addnode($qast))
                 self.alt_nfas($code_obj, $block, $qast);
     
    +            my $scan := QAST::Regex.new( :rxtype<scan> );
    +            {
    +                my $q := $qast;
    +                if $q.rxtype eq 'concat' && $q[0] {
    +                    $q := $q[0]
    +                }
    +                if $q.rxtype eq 'literal' {
    +                    nqp::push($scan, $q[0]);
    +                    $scan.subtype($q.subtype);
    +                }
    +            }
    +
                 $block<orig_qast> := $qast;
                 $qast := QAST::Regex.new( :rxtype<concat>,
    -                         QAST::Regex.new( :rxtype<scan> ),
    +                         $scan,
                              $qast,

Since scan nodes have always been empty so far, the code generators don't look at their child nodes, and adding one with nqp::push($scan, $q[0]); won't break anything on backends that don't support this optimization yet (which after just this patch were all of them). Running make test confirmed that.

My original patch did not contain the line $scan.subtype($q.subtype);, and later on some unit tests started to fail, because regex matches can be case insensitive, but the index op is always case sensitive. For case insensitive matches, the $q.subtype of the literal regex node would be ignorecase, so that information needs to be carried on to the code generation backend.
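The logic of that patch can be modelled outside NQP. Below is a small Python sketch under loud assumptions: Node is a hypothetical stand-in for QAST::Regex (just a type, a subtype, and a list of children), and make_scan_node is a name I made up. It only mirrors the decision the patch makes, not the real QAST API.

```python
class Node:
    """Hypothetical stand-in for QAST::Regex: a type, a subtype, children."""
    def __init__(self, rxtype, *children, subtype=""):
        self.rxtype = rxtype
        self.subtype = subtype
        self.children = list(children)

def make_scan_node(qast):
    """Build a scan node that carries the leading literal, if there is one."""
    scan = Node("scan")
    q = qast
    # A concat node may hold the literal as its first child.
    if q.rxtype == "concat" and q.children:
        q = q.children[0]
    if q.rxtype == "literal":
        scan.children.append(q.children[0])  # the literal string itself
        scan.subtype = q.subtype             # e.g. "ignorecase"
    return scan

regex = Node("concat", Node("literal", "foo"), Node("cclass"))
print(make_scan_node(regex).children)  # ['foo']
```

A regex that does not start with a literal simply yields an empty scan node, which is what every backend saw before the patch.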

Once that part was in place, and some debug nqp::say() statements confirmed that it indeed worked, it was time to look at the code generation. For the Parrot backend, it looked like this:

    method scan($node) {
        my $ops := self.post_new('Ops', :result(%*REG<cur>));
        my $prefix := self.unique('rxscan');
        my $looplabel := self.post_new('Label', :name($prefix ~ '_loop'));
        my $scanlabel := self.post_new('Label', :name($prefix ~ '_scan'));
        my $donelabel := self.post_new('Label', :name($prefix ~ '_done'));
        $ops.push_pirop('repr_get_attr_int', '$I11', 'self', %*REG<curclass>, '"$!from"');
        $ops.push_pirop('ne', '$I11', -1, $donelabel);
        $ops.push_pirop('goto', $scanlabel);
        $ops.push($looplabel);
        $ops.push_pirop('inc', %*REG<pos>);
        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
        $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
        $ops.push($scanlabel);
        self.regex_mark($ops, $looplabel, %*REG<pos>, 0);
        $ops.push($donelabel);
        $ops;
    }

While the code is a bit intimidating at first, staring at it for a while quickly made clear what kind of code it emits. First, three labels are generated, to which the code can jump with goto $label: one as a jump target for the loop that increments the cursor position ($looplabel), one for doing the regex match at that position ($scanlabel), and $donelabel for jumping to when the whole thing has finished.

Inside the loop there is an increment (inc) of the register that holds the current position (%*REG<pos>); that position is compared to the end-of-string position (%*REG<eos>), and if it is larger, the cursor is marked as failed.

So the idea is to advance the position by one, and then instead of doing the regex match immediately, call the index op to find the next position where the regex might succeed:

    --- a/src/vm/parrot/QAST/Compiler.nqp
    +++ b/src/vm/parrot/QAST/Compiler.nqp
    @@ -1564,7 +1564,13 @@ class QAST::Compiler is HLL::Compiler {
             $ops.push_pirop('goto', $scanlabel);
             $ops.push($looplabel);
             $ops.push_pirop('inc', %*REG<pos>);
    -        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
    +        if nqp::elems($node.list) && $node.subtype ne 'ignorecase' {
    +            $ops.push_pirop('index', %*REG<pos>, %*REG<tgt>, self.rxescape($node[0]), %*REG<pos>);
    +            $ops.push_pirop('eq', %*REG<pos>, -1, %*REG<fail>);
    +        }
    +        else {
    +            $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
    +        }
             $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
             $ops.push($scanlabel);
             self.regex_mark($ops, $looplabel, %*REG<pos>, 0);

The index op returns -1 on failure, so the condition for a cursor fail is slightly different than before.
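To see how the two branches of the generated code behave, here is a Python model of the control flow. It is a sketch, not the emitted PIR: scan_loop and try_match are hypothetical names, str.find stands in for the index op, and the returned attempt counter exists only to show the effect.

```python
def scan_loop(target, try_match, literal=None, ignorecase=False):
    """Return (position, attempts); position is -1 if the cursor fails."""
    pos = 0
    eos = len(target)
    attempts = 0
    while True:
        # scan label: try the rest of the regex at the current position
        attempts += 1
        if try_match(target, pos):
            return pos, attempts
        # loop label: the attempt failed, advance the position by one ...
        pos += 1
        if literal is not None and not ignorecase:
            # ... then jump straight to the next occurrence of the literal
            pos = target.find(literal, pos)
            if pos == -1:          # index found nothing: fail
                return -1, attempts
        elif pos > eos:            # old check: ran past the end of the string
            return -1, attempts

target = "x" * 10 + "needle!"
matcher = lambda t, p: t.startswith("needle!", p)
print(scan_loop(target, matcher))                    # (10, 11)
print(scan_loop(target, matcher, literal="needle"))  # (10, 2)
```

Both calls find the match at the same position, but the second one enters the regex only twice: once at the initial position (the first attempt never goes through the loop label) and once at the offset the search returned.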

And as mentioned earlier, the optimization can only be safely done for matches that don't ignore case. Maybe with some additional effort that could be remedied, but it's not as simple as case-folding the target string, because some case folding operations can change the string length (for example ß becomes SS while uppercasing).
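The length problem is easy to reproduce. The following Python snippet is only an illustration with made-up sample strings; it shows that an offset found in an uppercased copy of the target does not necessarily point at the same place in the original string.

```python
word = "Straße"
upper = word.upper()
print(upper, len(word), len(upper))  # STRASSE 6 7

text = "Die Straße ist lang"
folded_pos = text.upper().find("IST")  # offset in the uppercased copy
real_pos = text.find("ist")            # offset in the original string
print(folded_pos, real_pos)            # 12 11
```

Starting the regex match at offset 12 of the original string would miss the match that really begins at offset 11.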

After successfully testing the patch, I came up with a small, artificial benchmark designed to show a difference in performance for this particular case. And indeed, the patch sped it up from 647 ± 28 µs to 161 ± 18 µs, which is roughly a factor of four.

You can see the whole thing as two commits on GitHub.

What remains to be done is implementing the same optimization on the JVM and MoarVM backends, and of course other optimizations. For example the Perl 5 regex engine keeps track of minimal and maximal string lengths for each subregex, and can anchor a regex like /a?b?longliteral/ to 0..2 characters before a match of longliteral, and generally use that meta information to fail faster.
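The following Python sketch illustrates the general idea of anchoring on a literal that is preceded by a prefix of bounded length. It is not how the Perl 5 engine is implemented; anchored_search and its parameters are hypothetical, and the prefix bounds (0 and 2 for a?b?) are passed in by hand rather than computed from the regex.

```python
import re

def anchored_search(text, literal, prefix_min, prefix_max, pattern):
    """Only try start positions prefix_min..prefix_max chars before the literal."""
    pos = text.find(literal)
    while pos != -1:
        first = max(0, pos - prefix_max)
        last = pos - prefix_min
        for start in range(first, last + 1):
            m = pattern.match(text, start)  # anchored attempt at start
            if m:
                return m
        pos = text.find(literal, pos + 1)
    return None  # no literal left, so the regex cannot match

pattern = re.compile(r"a?b?longliteral")
m = anchored_search("xxxxablongliteral yy", "longliteral", 0, 2, pattern)
print(m.start(), m.group(0))  # 4 ablongliteral
```

If the literal does not occur at all, the function gives up after a single string search, without ever entering the regex.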

But for now I am mostly encouraged that doing a worthwhile optimization was possible in a single evening without any black magic, or too intimate knowledge of the code generation.