Before continuing, I should explain how this algorithm works, and how it packs so much power (searching for multiple strings while inspecting every input byte only once) into just two lines of code.

The secret behind it is a finite-state automaton built on a trie, which in turn is built from the set of search strings. For example, for the search strings “SEE”, “SEAM” and “EAST”, it would look like this (I have omitted some arrows for a cleaner picture):

Figure: Trie-based Aho-Corasick automaton

The root node is marked with zero. There are three terminal (green) nodes indicating a match for the search strings “SEE” (3), “SEAM” (5) and “EAST” (9). Building this structure requires a rather sophisticated algorithm, which is beyond the scope of this article. Using it for the actual search, however, is straightforward, which is why it can be expressed in just a couple of lines of code.
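To make the node numbers concrete, here is a minimal Python sketch of the trie part only. It is an illustration, not the construction algorithm mentioned above: it leaves out the extra arrows that turn the trie into the full automaton. The function name `build_trie` and the list-of-dicts layout are my own assumptions. Numbering nodes in order of creation happens to reproduce the terminal numbers from the picture when the words are inserted in the order “SEE”, “SEAM”, “EAST”.

```python
def build_trie(words):
    """Build a plain trie; nodes are numbered in order of creation (root = 0).

    Hypothetical helper for illustration: it does NOT add the extra
    automaton arrows, only the trie edges.
    """
    children = [{}]   # children[node] maps a symbol to the child node number
    terminal = {}     # terminal[node] is the search string ending at that node
    for word in words:
        node = 0
        for symbol in word:
            if symbol not in children[node]:
                children.append({})                      # create a new node
                children[node][symbol] = len(children) - 1
            node = children[node][symbol]
        terminal[node] = word
    return children, terminal


children, terminal = build_trie(["SEE", "SEAM", "EAST"])
print(len(children))  # 10 nodes: 0..9
print(terminal)       # {3: 'SEE', 5: 'SEAM', 9: 'EAST'}
```

The two branches of the picture show up as the two arrows leaving the root: “S” leads to node 1 and “E” leads to node 6.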

Here’s how it works: we start the search at the root node “0”. Then, for each symbol of the input, we follow the arrow from the current node that is marked with that symbol (this includes the arrows not drawn in the picture). If the current node has no arrow for the symbol being processed, we jump back to the root node. Whenever we hit a terminal node, we have a match.

For example: if our input is “SEAST”, we will traverse the nodes 0 → 1 → 2 → 4 → 8 → 9 (where 9 is the terminal node for our search string “EAST”).

Another example: for input “SEEAST”, we will traverse the nodes 0 → 1 → 2 → 3 → 7 → 8 → 9. This time we hit not just one but two terminal nodes along the way: 3 (for “SEE”) and 9 (for “EAST”). As you can see, it can even find partially overlapping search words (if we don’t reset to the root node after every match).
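The search loop and both traversals can be reproduced with a short Python sketch. The transition table below is written by hand for this three-word example and is my own reconstruction: it includes the arrows that the picture leaves out (for instance 4 → 8 on “S” and 3 → 7 on “A”), and the assumption that node 6 stands for the prefix “E” is mine as well. The names `TRANSITIONS`, `TERMINALS` and `search` are hypothetical. Any symbol without an entry falls back to the root.

```python
ROOT = 0

# Hand-written arrows for the search strings SEE, SEAM and EAST.
# Assumed node meanings: 1=S, 2=SE, 3=SEE, 4=SEA, 5=SEAM, 6=E, 7=EA, 8=EAS, 9=EAST.
TRANSITIONS = {
    0: {"S": 1, "E": 6},
    1: {"S": 1, "E": 2},
    2: {"S": 1, "E": 3, "A": 4},
    3: {"S": 1, "E": 6, "A": 7},
    4: {"S": 8, "E": 6, "M": 5},
    5: {"S": 1, "E": 6},
    6: {"S": 1, "E": 6, "A": 7},
    7: {"S": 8, "E": 6},
    8: {"S": 1, "E": 2, "T": 9},
    9: {"S": 1, "E": 6},
}
TERMINALS = {3: "SEE", 5: "SEAM", 9: "EAST"}


def search(text):
    """Return the visited nodes and the matches as (end_index, word) pairs."""
    node = ROOT
    path = [node]
    matches = []
    for i, symbol in enumerate(text):
        # Follow the arrow for this symbol; with no arrow, go back to the root.
        node = TRANSITIONS[node].get(symbol, ROOT)
        path.append(node)
        if node in TERMINALS:
            matches.append((i, TERMINALS[node]))
    return path, matches


print(search("SEAST"))   # ([0, 1, 2, 4, 8, 9], [(4, 'EAST')])
print(search("SEEAST"))  # ([0, 1, 2, 3, 7, 8, 9], [(2, 'SEE'), (5, 'EAST')])
```

The body of the loop is essentially two steps, follow the arrow and check for a terminal node, and each input symbol is looked at exactly once. This sketch reports one word per terminal node, which is enough here because none of the three search strings ends inside another one.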