Shift-And Visualization

In this article I would like to describe the Shift-And algorithm for pattern matching, based on the book Flexible Pattern Matching in Strings. Apart from the actual description, you can scroll down for a demonstration of how the algorithm works.

Shift-And is categorized as a prefix-based algorithm, similar to KMP , but what Shift-And does is to keep a set of all prefixes p that matches a suffix of the text read so far. Keep in mind that is the suffix of the text read so far, not a suffix of the whole text.

The set of prefixes is updated using bit-parallelism, that is, the set of prefixes is kept in a bit mask called D . D is defined as: d m … d 1 , where m is the canonical way of referring to the pattern length. For example, if we have a six character pattern, and so far our first character matches against the text, then D would be 000001 .

With every character read from the text, D will be updated, if the bit d m is active, then we have a match. Is it possible to update this set in constant time? Yes, using bit-parallel operations.

This algorithm will first build a table B , holding a bit mask b m … b 1 for every character in the pattern. So, if we are searching for the word announce , the entry for the character n will look like this: 00100110 , that is, the word announce has an n in the 2nd, 3rd and 6th positions, which are marked in the bit mask from right to left. So to recap: the algorithm will scan the pattern we are searching, and then it will build this table, updating the bit mask as it goes. Here’s the code in javascript:

// init the table for ( var i = 0 ; i < l ; i ++ ) { b [ p . charAt ( i )] = 0 ; } // build bit mask table; for ( var i = 0 ; i < l ; i ++ ) { b [ p . charAt ( i )] = b [ p . charAt ( i )] | ( 1 << i ); }

So the bit mask table for the word announce will end up looking like this:

var B = { ' a ' : 00000001 , ' n ' : 00100110 , ' o ' : 00001000 , ' u ' : 00010000 , ' c ' : 01000000 , ' e ' : 10000000 , ' * ' : 00000000 }

Of course we will have the actual numbers, not those bit masks, with the padding and so on. We also added an * entry in the table. That’s because whenever a read character from the text doesn’t match, then the bit mask used will be 0 .

Once we have the table built, we start searching for the pattern in the text. First we set D to 0 , and then we start reading from the text, character by character. Let’s say we read an a , so we search for the a bit mask in our table, it will return 00000001 . To calculate the new D we first shift it 1 to the left, and we OR a 1 to it : (D << 1) | 1 . That value gets AND'ed to the mask we retrieved from the table, so this would be the complete line: D <- ((D << 1) | 1) & B[current-char] .

We keep iterating over the text until we find a match. How do we find a match? We need to check if the bit d m is active. To do that we have to run the following operation: D & (1 << m-1) . We create the following bit mask 10000000 and we AND it to D . If this is not 0 , then we have a match.

Here’s the complete algorithm in a vanilla implementation using Javascript:

/** * string p: pattern to search (needle); * string text: haystack **/ function shift_and ( p , text ) { var b = {}; var l = p . length ; var tl = text . length ; // init the table for ( var i = 0 ; i < l ; i ++ ) { b [ p . charAt ( i )] = 0 ; } //build bit mask table; for ( var i = 0 ; i < l ; i ++ ) { b [ p . charAt ( i )] = b [ p . charAt ( i )] | ( 1 << i ); } var d = 0 ; var matchMask = 1 << l - 1 ; for ( var i = 0 ; i < tl ; i ++ ) { d = (( d << 1 ) | 1 ) & ( b [ text . charAt ( i )] | 0 ); var matched = ( d & matchMask ); if ( matched != 0 ) { return i - l + 1 ; } } return - 1 ; }

And now to the interesting part, the visual simulation:

Shift-And visual simulation

Let me briefly explain what’s going on here. First we have a form, where you can enter the pattern to search, and the text where to search it for. That’s easy.

Once hit start , the visualization will display the text and bellow it the pattern you are searching for. Beneath them, there will be the b table and next to it, the results for each step will be reported.

For example, we will see the initial value of D, then it will be (D << 1) | 1 . Then we read a character from the text, and we get the value from the table, AND'ing it to the value of D and getting the result below, and so on.

I hope the colors are pretty self explanatory (I’m colorblind so don’t trust me on this one).

So, click start and enjoy the show.

Shift-And Pattern: Text: annual_announce Visualization Bit-mask Table Run

I hope this post has made justice to what I think is one of the most elegant algorithms that I know of. If there was a book called “Algorithms from the Book” similar to this, I’m sure this algorithm would make it.

If you are interested in learning more about Pattern Matching in Strings, then I would recommend you get a copy of the book Flexible Pattern Matching in Strings since it’s really worth every penny.

I would like to mention also Mr. @sicher who basically single handedly fixed the mess of HTML and CSS that I had created. So if you enjoyed the visual styles, then send him some props on twitter.