I’m pleased to report that Hyperscan, the regular expression matcher that ate my life from 2008 through 2018, finally has a paper (pdf) – it’s being presented this week at NSDI ’19.

Anyone who lacks a taste for my discursive bullshit could just stop reading this blog post there, and just go read the paper…. Bye!

It’s a good paper and it works as a standalone read. I’d like to call out a few things in it as particularly interesting and worth reviewing. I’d also like to call out a few things in Hyperscan that we couldn’t write about for reasons of space – we were hard up against the space limits for any conceivable conference paper, so there were whole subsystems and implementation strategies that we couldn’t write about. I will also mention a few aspects of Hyperscan historically that didn’t make it into the Hyperscan 4.x release series (the first open source release of Hyperscan and a substantial API change) nor the Hyperscan 5.x release series (a major release number that largely celebrates the now Australian-free aspect of Hyperscan).

Hyperscan Summary

It’s a software based, large-scale regex matcher designed to match multiple patterns at once (up to tens of thousands of patterns at once) and to ‘stream‘ (that is, match patterns across many different ‘stream writes’ without holding on to all the data you’ve ever seen). To my knowledge this makes it unique.

RE2 is software based but doesn’t scale to large numbers of patterns; nor does it stream (although it could). It occupies a fundamentally different niche to Hyperscan; we compared the performance of RE2::Set (the RE2 multiple pattern interface) to Hyperscan a while back.

Most back-tracking matchers (such as libpcre) are one pattern at a time and are inherently incapable of streaming, due to their requirement to backtrack into arbitrary amounts of old input.

Hyperscan was the product of a ‘pivot’ for a startup called Sensory Networks that was founded in 2003 to do regex on FPGA. I joined in 2006 and spent some time doing regex on GPGPU, before we decided that implementing regular expressions on GPCPU was a better idea. We worked in a much diminished company from 2009 (5 staff members!) until our acquisition by Intel in October 2013.

Future Thingies

Hyperscan and RE2 continue in their niches. There are a number of other projects out there – even a few hardware regular expression matchers struggle on – but I’ll have to get to other projects in another post.

It’s possible that something much simpler than Hyperscan could get a considerable portion of the benefit of Hyperscan, while being a more tractable code base for adaptation, porting to other CPU architectures or exotic architectures (GPGPU, for example), alternate use cases like bioinformatics, etc. I have some ideas in this area under the dubious code name “Ultimate Regex Engine the Third” (ure3) – Hyperscan itself was, originally “Ultimate Engine the Second”. If anyone is interested, get in contact.

Pattern Matching Techniques Roundup

Hyperscan Things That Made It Into The Paper

Pattern Decomposition (Section 3)

A key idea of Hyperscan is that if we have a complex regular expression – say:

42:/<\s*object[^>]*data\s*\x3A[^,>]*base64/smi

We can observe that there are a lot of parts of this pattern that are simply literal strings: “object”, “data”, “base64”. It’s a lot easier and cheaper to match literals rather than do a full regular expression implementation, so breaking up this pattern (an easy one, as it is simply ‘one thing after another’) into several components allows us to remove the strings from the finite automata matching task.

A rudimentary version of this is ‘pre-filtering’ – we could just make sure we saw all 3 strings, then run the pattern (although this is tricky in streaming mode – more details at some other time). But this still means we’re matching all the characters in those three strings in an automata. Given that our methods for matching NFAs are quite sensitive to the number of states, it’s better to have smaller automata.

String Matching: “FDR” (Section 4.1)

FDR is one of a long list of string matching engines that I’ve built. It’s been fairly effective, although it gets a bit cranky at large string counts. See the paper for details, as there’s only so many times I can explain what a “bucketed SIMD shift-or matcher with supercharacters” is.

The FDR matcher is used for moderately string match cases – 80+ strings – and uses SIMD not for scanning many characters at once, but holding a much larger set of simple state machines than could be done with a general purpose register.

This is worth noting given the debates about vector machines – not every use of SIMD is oriented towards applying simple operations over great long vectors of data. FDR is a bucketed shift-or matcher that works better at 128 bits (8 buckets, 16 wide) than it would at 64 bits – there’s no obviously urgent extension to 256, 512 or for that matter, 2048 bits.

Freud said, sometimes a cigar is just a cigar; similarly, sometimes a SIMD register is just a SIMD register.

Glushkov NFA implementation: “LimEx” (Section 4.2)

The Glushkov construction is a way of getting an ε-free NFA (“epsilon free”); that is, an NFA that does not make transitions from state to state without processing a character. This makes it considerably easier to build state machines with simple bit vector operations (although the successor function, showing which states can follow other states, is not necessarily trivial).

The LimEx matcher is an evolution of several earlier matchers. The “Limited NFA” could only allow states to jump to 0,1 or 2 states ahead (generalized to a larger number, eventually, but still not allowing irregular transformations). The “General NFA” used multiple partitions to allow arbitrary jumps after the Method of Four Russians. The appallingly stupidly-named “LimGen” NFA combined the two, gathering only states requiring a ‘non-limited’ transition using a PSHUFB instruction (yes, that one again, fans), and doing a table lookup.

All of these exotic techniques fell away to the simple strategy: do ‘limited’ transitions (a certain number of fixed-distance jumps only) then work through all the remaining states and do “Exceptions” (the “Ex” in “LimEx”). This covers quite a bit of plumbing for accept states, triggering and reading of bounded repeats, etc. It’s not pretty, but it works.

Hyperscan Things That Didn’t Make It Into The Paper

The following sections are not an exhaustive list; there are many more optimizations and transformations that don’t make it in. A few are below:

Graph transformations : redundancy, unreachable components, etc. – we built an abstract NFA graph (also a Glushkov NFA, but tailored towards representing our workload, not a direct implementation) and apply transformations. We can detect redundant work in an NFA and eliminate it, expand out literal paths, etc.

: redundancy, unreachable components, etc. – we built an abstract NFA graph (also a Glushkov NFA, but tailored towards representing our workload, not a direct implementation) and apply transformations. We can detect redundant work in an NFA and eliminate it, expand out literal paths, etc. Bounded Repeat Analysis and implementation : handling of single-character bounded repeats: e.g. /foo[^

]{1024}bar/ is quite challenging. We ‘discover’ such repeats in our NFA graph, and have specialized code to implement this class of repeats.

: handling of single-character bounded repeats: e.g. /foo[^

]{1024}bar/ is quite challenging. We ‘discover’ such repeats in our NFA graph, and have specialized code to implement this class of repeats. Other string matchers : we have a small-group literal matcher called “Teddy” (that uses SIMD; to this day, the best explication of Teddy is in the Rust reimplementation of it, even if they could benefit from using the rest of my compiler algorithm; my logic to merge multiple strings into a bucket is at least mildly sophisticated).

: we have a small-group literal matcher called “Teddy” (that uses SIMD; to this day, the best explication of Teddy is in the Rust reimplementation of it, even if they could benefit from using the rest of my compiler algorithm; my logic to merge multiple strings into a bucket is at least mildly sophisticated). DFA implementation : Alex Coyte, one of our engineers (now at Google) always said “DFAs are awesome”. They are; they are faster and simpler than NFAs. We used them quite a bit (when we could, and when they made sense). We had a pretty straightforward implementation of DFAs without too many exotic components, but there are some tricks to making DFAs smaller.

: Alex Coyte, one of our engineers (now at Google) always said “DFAs are awesome”. They are; they are faster and simpler than NFAs. We used them quite a bit (when we could, and when they made sense). We had a pretty straightforward implementation of DFAs without too many exotic components, but there are some tricks to making DFAs smaller. NFA and DFA acceleration : when it’s possible to check that an automata, whether NFA or DFA, can’t leave its local neighborhood of current states without seeing easily detected ‘event’, it’s possible to use SIMD to scan for that event.

For example, the pattern fragment /[^

]*foo/ can scan for a newline, which breaks out of the

state, or an ‘f’, which may commence the start of a ‘foo’.

This extends even to NFAs, where we can take the closure of all possibly active ‘accelerable’ states and scan ahead for events.

: when it’s possible to check that an automata, whether NFA or DFA, can’t leave its local neighborhood of current states without seeing easily detected ‘event’, it’s possible to use SIMD to scan for that event. For example, the pattern fragment /[^

]*foo/ can scan for a newline, which breaks out of the

state, or an ‘f’, which may commence the start of a ‘foo’. This extends even to NFAs, where we can take the closure of all possibly active ‘accelerable’ states and scan ahead for events. Specialized engines , like “Sheng“, and many others. The ‘Castle’ engine, for example, efficiently handles many bounded repeats at once, improving overall performance but also radically reducing the amount of state we have to store.

, like “Sheng“, and many others. The ‘Castle’ engine, for example, efficiently handles many bounded repeats at once, improving overall performance but also radically reducing the amount of state we have to store. Lookaround assertions : we use a similar technique to my SMH matching engine to, upon receipt of a literal match, to augment the power of the literal match by verifying that the characters around the match will match at the regex level. So if we extracted the string ‘foobar’ from a regex (to use a modified version of our earlier example) ‘/<\s*object[^>]*\s\dfoobar\d[^,>]*base64/’ we can implement the character class lookarounds to check that we’ve seen ‘\s\dfoobar\d’ before running a full-fledged NFA matcher for the other parts of the pattern.

: we use a similar technique to my SMH matching engine to, upon receipt of a literal match, to augment the power of the literal match by verifying that the characters around the match will match at the regex level. So if we extracted the string ‘foobar’ from a regex (to use a modified version of our earlier example) … and many others

Hyperscan Things That Didn’t Make It Into Open Source Hyperscan

Ports to non-x86 platforms . Hyperscan had ports to MIPS, ARM, PowerPC and Tilera at one stage. These ports were a lot of hassle to maintain, and I don’t think it takes a genius to see how these ports were not particularly strategic for Intel to devote resources to.

. Hyperscan had ports to MIPS, ARM, PowerPC and Tilera at one stage. These ports were a lot of hassle to maintain, and I don’t think it takes a genius to see how these ports were not particularly strategic for Intel to devote resources to. Capturing subexpressions: at once stage we had an experimental product that (in block mode only, not streaming) could accurately model libpcre’s capturing semantics. This worked by scanning the data backwards with a backwards version of the pattern, making a trace of the states visited, and tracing forwards through the backward state trace in order to correctly model the behavior that libpcre would have done. This was possible since when we’re tracing forwards through the backwards trace and following 1-bits in our states, we know they will lead to an accept state. Even more fun was implementing a version of this that checkpointed our backwards state traces periodically, allowing us to store O(sqrt(N)) states in our state trace, at the expense of having to scan twice as much (a generalization would have been to allow us to store O(k*pow(N, 1/k)) states with a cost of scanning k times as much. This had the feel of one of those tricky algorithm interview questions… overall, capturing support it was fun to work on, but no-one wanted it commercially. We probably wouldn’t do it this way today.

Acknowledgements

We have already mentioned the pioneering work of Victor Glushkov. His work was introduced to us at Sensory Networks by the excellent textbook Flexible Pattern Matching in Strings by Gonzalo Navarro and Mathieu Raffinot. Professor Navarro has also written many excellent papers over the years, so if you can’t afford the book, do check out his publications page. His output is by no means confined to regular expressions, but his treatment of NFA implementations is very good.

A great deal of hard work went into Hyperscan over the years. Version 1.0 was the initiative of Peter Duthie, whose startup Ground Labs is going strong. Alex Coyte, Matt Barr and Justin Viiret worked all the way through from the early days at Sensory Networks through to the end of the Sydney-based Intel DCG Lab. I’d also like to thank Xiang Wang, Yang Hong, Harry Chang, Jiayu Hu and Heqing Zhu from Intel; KyoungSoo Park from KAIST for their work on the paper, as well as the Hyperscan team in general, past and present.