This blog post is a quick summary of a big project; we want to go into a lot more detail (and you can always look at the code), but it makes sense for us to get the fundamentals out there. This post is aimed at developers who are looking to understand how Hyperscan works or get involved in the Hyperscan source base; you certainly don't need to understand Hyperscan internals to make use of the library to match regular expressions.

Hyperscan takes an automata-based approach (built on NFAs and DFAs) rather than a backtracking one. This choice has advantages and disadvantages. On the plus side, an automata-based approach is amenable to streaming and to matching multiple regular expressions at once. On the minus side, automata-based matching cannot handle some regular expression constructs easily, or at all: backreferences and arbitrary lookaround assertions are notable regular expression features that we do not support.
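To make the streaming and multi-pattern point concrete, here is a toy sketch in Python. It is not Hyperscan code and does not reflect how Hyperscan is implemented; the `StreamMatcher` class and its `scan` method are hypothetical names invented for this illustration, and it handles only non-empty literal strings rather than real regular expressions. What it shows is the general idea: an automaton's entire progress is captured by its set of active states, so scanning can stop at the end of one block of data and resume on the next, and several patterns can advance in lockstep over a single pass.

```python
class StreamMatcher:
    """Toy illustration only: one tiny NFA per literal pattern, advanced together.

    Assumes every pattern is a non-empty literal string.
    """

    def __init__(self, patterns):
        self.patterns = list(patterns)
        # Active NFA states as (pattern id, characters matched so far).
        # This set is the only thing that must survive between blocks.
        self.active = set()
        self.offset = 0  # total characters consumed across all blocks

    def scan(self, chunk):
        """Consume one block of the stream; return (pattern id, end offset) pairs."""
        matches = []
        for ch in chunk:
            self.offset += 1
            # Any pattern may begin at any position, so add fresh start states.
            starts = {(pid, 0) for pid in range(len(self.patterns))}
            next_active = set()
            for pid, pos in self.active | starts:
                if self.patterns[pid][pos] == ch:
                    if pos + 1 == len(self.patterns[pid]):
                        matches.append((pid, self.offset))
                    else:
                        next_active.add((pid, pos + 1))
            self.active = next_active
        return matches


matcher = StreamMatcher(["abc", "bcd"])
first = matcher.scan("xab")   # both patterns are only partially matched here
second = matcher.scan("cd")   # matches complete across the block boundary
```

After the second call, `second` holds one match for each pattern even though neither pattern fits entirely inside a single block. A backtracking matcher has no equally compact state to carry over: it may need to revisit earlier input, which is awkward once that input has already gone by.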

Hyperscan makes use of many different techniques to try to make the regular expression matching task tractable for large numbers of regular expressions. We have not found a single, elegant automata approach that handles arbitrary regular expressions in arbitrary numbers, although we are still looking! Instead, we rely on a lot of optimizations to extract the best possible performance, whether we have one simple regular expression or tens of thousands.

Some of these techniques include: