May 10, 2016 at 05:07 Tags Go

I have recently discovered - to my delight - that Go has an efficient implementation of suffix arrays right in the standard library - index/suffixarray.

A suffix array is an ingenious data structure that lets us take a large body of text (or any binary data, for that matter), preprocess it and then be able to find any substring in this text in logarithmic time. And the coolest thing is that a suffix array only requires O(n) space and can be constructed efficiently. For more details, turn to the Wikipedia page on Suffix Arrays - it's pretty good.

Like the rest of the Go standard library, infex/suffixarray is decently documented. However, I couldn't find good examples, so I'll provide some in this post. The full, runnable code of these examples can be found here.

Building the suffix array and simple lookups Let's start with the basics. suffixarray.New builds a new Index , which can then be used to efficiently look up substrings. One catch with suffixarray.New is that it just takes a single byte slice as the data. This suits many use cases, but sometimes we need some extra preprocessing. A common task is having a bunch of words (or sentences), and finding which word(s) a certain substring is part of. For example: words := [] string { "banana" , "apple" , "pear" , "tangerine" , "orange" , "lemon" , "peach" , "persimmon" , } We can combine these words into a single []byte , delimiting them by a byte that appears in no valid word; \x00 is a good candidate. Then, we can create the suffix array: data := [] byte ( "\x00" + strings . Join ( words , "\x00" ) + "\x00" ) sa := suffixarray . New ( data ) Now we can use the Lookup method to find all locations of some substring: indices := sa . Lookup ([] byte ( "an" ), - 1 ) // indices is now: [4 2 31 20] Lookup returns a slice of indices into the data buffer the suffix array was created with. The substrings can be found at these indices. Given our \x00 delimiters, it's fairly easy to reconstruct a word from an index: func getStringFromIndex ( data [] byte , index int ) string { var start , end int for i := index - 1 ; i >= 0 ; i -- { if data [ i ] == 0 { start = i + 1 break } } for i := index + 1 ; i < len ( data ); i ++ { if data [ i ] == 0 { end = i break } } return string ( data [ start : end ]) } Now we can get back the words from indices: // Reconstruct matches from indices found by Lookup. for _ , idx := range indices { fmt . Println ( getStringFromIndex ( data , idx )) } // prints: // banana // banana // orange // tangerine Note that banana appears twice - this is because the substring an appears twice in the word.

FindAllIndex - flexibility with a big caveat Another lookup method provided by a suffixarray is FindAllIndex , taking a *regexp.Regexp instead of a []byte . To be honest, I was initially stumped. Nothing I knew about suffix arrays included being able to match artibrary regular expressions! So I looked into the source of the package and drew a breath of calm, and alarm. On one hand, there's no magic. On the other hand, there's no magic! Therefore, I would treat the FindAllIndex method with great suspicion, since it's potentially misleading. What FindAllIndex does is first figure out whether the given regexp has a literal prefix. If it does, Lookup is used to find the prefix (efficiently) and then regexp matching is used to find the rest of the match starting from the found prefix. If there is no literal prefix, however, it just invokes the Regexp.FindAllIndex method, which scans the whole data linearly to find matches. So while I can see how FindAllIndex can be convenient in some cases, I would advise a lot of caution when using it, lest you throw the whole premise of the suffix array out of the window. Why bother building the suffix array when you're going to search through it linearly anyway? Be careful to only use FindAllIndex when you know for sure that the regexp has a literal prefix.