Jon is a Member of Technical Staff at Bell Labs. He can be reached at [email protected] .bell-labs.com.

Suffix arrays are simple, yet powerful, data structures: Fill an array with pointers to every position in a string, sort the array, then quickly search the string for long phrases. A few lines of code suffice to implement this structure, which can be near optimal in time and space. In this column, I'll introduce the structure with a simple problem, then apply it to a more subtle task.

Given an input file of text, how can a program find the longest duplicated substring of characters? For instance, the longest repeated string in "Ask not what your country can do for you, but what you can do for your country" is " can do for you," with "your country" a close second. How would you write a program to solve this problem?

If the input string is stored in c[0..n-1], a simple program could compare every pair of substrings with the following pseudocode:

    maxlen = -1
    for (i = 0; i < n; i++)
        for (j = i+1; j < n; j++)
            thislen = comlen(&c[i], &c[j])
            if thislen > maxlen
                maxlen = thislen
                maxi = i
                maxj = j

The comlen function returns the length that its two parameter strings have in common, starting with their first characters:

    int comlen(char *p, char *q)
        i = 0
        while *p && (*p++ == *q++)
            i++
        return i

Because the algorithm looks at all pairs of substrings, it takes time proportional to n^2, at least.
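As a rough illustration, the quadratic approach can be written as compilable C. This is only a sketch: the function name longest_dup_bruteforce and its start parameter are assumptions made for the example, and the inner loop starts at i+1 so that a suffix is never compared with itself.

```c
#include <stddef.h>

/* Length of the common prefix of two null-terminated strings. */
int comlen(const char *p, const char *q)
{
    int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

/* Brute force: compare every pair of suffixes c[i..] and c[j..] with i < j.
 * Returns the length of the longest duplicated substring and stores the
 * index of its first occurrence in *start. The name and signature are
 * hypothetical; the running time is at least quadratic in n. */
int longest_dup_bruteforce(const char *c, int n, int *start)
{
    int i, j, maxlen = -1;

    *start = 0;
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++) {
            int thislen = comlen(&c[i], &c[j]);
            if (thislen > maxlen) {
                maxlen = thislen;
                *start = i;
            }
        }
    return maxlen;
}
```

Called on "banana" with n = 6, the function reports a length of 3 starting at index 1, which is the repeated string "ana".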

Suffix arrays give a faster algorithm. The program processes at most MAXN characters, which it stores in the array c:

    #define MAXN 5000000
    char c[MAXN], *a[MAXN];

The suffix array is the array a of pointers to characters. As the program reads the input, it initializes a so that each element points to the corresponding character in the input string:

    while (ch = getchar()) != EOF
        a[n] = &c[n]
        c[n++] = ch
    c[n] = 0

The final element of c contains a null character, which terminates all strings.

The element a[0] points to the entire string; the next element points to the suffix of the array beginning with the second character, and so on. On the input string "banana," the array will represent these suffixes:

    a[0]: banana
    a[1]: anana
    a[2]: nana
    a[3]: ana
    a[4]: na
    a[5]: a

The pointers in the array a together point to every suffix in the string, hence the name "suffix array."

If a long string occurs twice in the array c, it appears in two different suffixes. The program therefore sorts the array to bring together equal suffixes. The "banana" array sorts to the following:

    a[0]: a
    a[1]: ana
    a[2]: anana
    a[3]: banana
    a[4]: na
    a[5]: nana

The code then scans through this array comparing adjacent elements to find the longest repeated string, which, in this case, is "ana." The suffix array structure is that simple: Fill an array with pointers, and sort them. It has been used at least since the 1970s, though the term was introduced in the 1990s.

The program will sort the suffix array with the qsort function:

qsort(a, n, sizeof(char *), pstrcmp)

The pstrcmp comparison function adds one level of indirection to the library strcmp function. This scan through the array uses the comlen function to count the number of letters that two adjacent words have in common:

    for (i = 0; i < n-1; i++)
        if comlen(a[i], a[i+1]) > maxlen
            maxlen = comlen(a[i], a[i+1])
            maxi = i
    printf("%.*s\n", maxlen, a[maxi])

The printf statement uses the "*" precision to print maxlen characters of the string.
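The build, sort, and scan steps can be gathered into one function. The sketch below is an assumption-laden illustration rather than the column's program: the name longest_dup, the caller-supplied pointer array a, and the where output parameter are invented here, and the comparison function uses the const void * signature that qsort expects.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: each array element is a pointer to a suffix,
 * so dereference once before calling strcmp. */
int pstrcmp(const void *p, const void *q)
{
    return strcmp(*(const char *const *)p, *(const char *const *)q);
}

/* Length of the common prefix of two null-terminated strings. */
int comlen(const char *p, const char *q)
{
    int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

/* Hypothetical helper: find the longest duplicated substring of the
 * n-character string c. The caller supplies a[] with room for n pointers.
 * Returns the length and stores a pointer to the substring in *where. */
int longest_dup(const char *c, int n, const char **a, const char **where)
{
    int i, maxlen = -1;

    *where = c;
    for (i = 0; i < n; i++)          /* one pointer per suffix */
        a[i] = &c[i];
    qsort(a, (size_t)n, sizeof(a[0]), pstrcmp);
    for (i = 0; i < n - 1; i++) {    /* compare adjacent suffixes only */
        int len = comlen(a[i], a[i + 1]);
        if (len > maxlen) {
            maxlen = len;
            *where = a[i];
        }
    }
    return maxlen;
}
```

On "banana" the call returns 3 and leaves where pointing at a suffix that begins with "ana"; printing with printf("%.*s\n", len, where) shows just the repeated part.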

Suffix arrays represent every substring in n characters of input text using the text itself and n additional pointers. On typical text files of n characters, the algorithm runs in O(n log n) time, due to sorting. The complete program in Listing One found the longest repeated string in the 4,460,056 characters of the King James Bible in about 36 seconds on a 600-MHz Celeron processor (check out the seventh chapter of the "Book of Numbers").

Markov Text

How can you generate random text? A classic approach is to let loose that poor monkey on his aging typewriter. If the beast is equally likely to hit any lowercase letter or the space bar, the output might look like this:

cbczqpbtvfbyak zfw ecrodtgdd bautfxqkdajxoc

This is pretty unconvincing English text.

If you count the letters in word games (like Scrabble or Boggle), you notice that there are different numbers of the various letters. There are many more As, for instance, than there are Zs. A monkey could produce more convincing text by counting the letters in a document: if A occurs 300 times in the text while B occurs just 100 times, then the monkey should be three times more likely to type an A than a B. This moves a small step closer to English:

saoo nte on sch wirT,hhths fewr loieseium rase
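One way to realize the letter-counting monkey is sketched below. Everything named here is an assumption for illustration: the 27-symbol alphabet (lowercase letters plus space, with uppercase folded to lowercase) and the helpers symbol, count_symbols, and pick_symbol. The random draw is passed in as r so that the selection logic can be checked on its own.

```c
#define NSYM 27   /* 26 letters plus space: an assumption of this sketch */

/* Map a character to a symbol index 0..26, or -1 if it is ignored. */
int symbol(int ch)
{
    if (ch == ' ')
        return 26;
    if (ch >= 'a' && ch <= 'z')
        return ch - 'a';
    if (ch >= 'A' && ch <= 'Z')
        return ch - 'A';
    return -1;
}

/* Count how often each symbol occurs in text; returns the total counted. */
long count_symbols(const char *text, long count[NSYM])
{
    long total = 0;
    int i;

    for (i = 0; i < NSYM; i++)
        count[i] = 0;
    for ( ; *text; text++) {
        int s = symbol((unsigned char)*text);
        if (s >= 0) {
            count[s]++;
            total++;
        }
    }
    return total;
}

/* Walk the counts until the running sum passes r, where 0 <= r < total.
 * With r drawn uniformly, each symbol is chosen in proportion to its count. */
int pick_symbol(const long count[NSYM], long r)
{
    int i;

    for (i = 0; i < NSYM; i++) {
        if (r < count[i])
            return i;
        r -= count[i];
    }
    return NSYM - 1;   /* reached only if r >= total */
}
```

A generator would call pick_symbol(count, rand() % total) once per output character and translate the index back to a letter or a space.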

Most events occur in context. Suppose that you wanted to generate randomly a year's worth of Fahrenheit temperature data. A series of 365 random integers between 0 and 100 wouldn't fool the average observer. You could be more convincing by making today's temperature a (random) function of yesterday's temperature: If it is 85 degrees today, it is unlikely to be 15 degrees tomorrow.

The same is true of English words: If this letter is a Q, then the next letter is quite likely to be a U. A generator can make more interesting text by making each letter a random function of its predecessor. You could, therefore, read a sample text and count how many times every letter follows an A, how many times they follow a B, and so on for each letter of the alphabet. When the generator writes the random text, it produces the next letter as a random function of the current letter. The Order-1 text was made by exactly this scheme:

Order-1. Hixt oftorawa opikie the wanos? sof I thincks my beehimofove, f fonemar

Order-2. ligs art myrrh lover comousalipper goodge shing spould bely, he not hart socks galhat is the

Order-3. Many of scarly of Israel. What spices. Thy love; and, and the like the us sword my belove; forth, and my spousalem, bund gold lilie art flook not.

Order-4. We hath banners. The foxes, as the lions? My belove, new and frankince's daughters of smote merchant? I would none throught handles of being house;

This idea extends to longer sequences of letters. The Order-2 text was made by generating each letter as a function of the two letters preceding it (a letter pair is often called a "digram"). The digram TH, for instance, is often followed in English by the vowels A, E, I, O, U and Y, less frequently by R and W, and rarely by other letters. The Order-3 text is built by choosing the next letter as a function of the three previous letters (a trigram). In Order-4 text, most words are English, and you might not be surprised to learn that it was generated from the King James Version of the "Song of Solomon." (See the "Further Reading" section at the end of this column for more examples of randomly generated text.)

Readers with a mathematical background might recognize this process as a Markov chain. One state represents each k-gram, and the odds of going from one to another don't change, so this is a "finite-state Markov chain with stationary transition probabilities."
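For the letter-level Order-1 case, the chain's transition counts fit in a small table. The following is a minimal sketch under stated assumptions: a 27-symbol alphabet (letters plus space), characters outside it simply skipped, and hypothetical helper names count_transitions, row_total, and next_symbol. The draw r is a parameter so the behavior is deterministic.

```c
#define NSYM 27   /* 26 letters plus space: an assumption of this sketch */

/* Map a character to a state index 0..26, or -1 if it is ignored. */
int symbol(int ch)
{
    if (ch == ' ')
        return 26;
    if (ch >= 'a' && ch <= 'z')
        return ch - 'a';
    if (ch >= 'A' && ch <= 'Z')
        return ch - 'A';
    return -1;
}

/* trans[a][b] counts how often symbol b immediately follows symbol a.
 * Ignored characters are skipped as if they were absent. */
void count_transitions(const char *text, long trans[NSYM][NSYM])
{
    int a, b, prev = -1;

    for (a = 0; a < NSYM; a++)
        for (b = 0; b < NSYM; b++)
            trans[a][b] = 0;
    for ( ; *text; text++) {
        int s = symbol((unsigned char)*text);
        if (s < 0)
            continue;
        if (prev >= 0)
            trans[prev][s]++;
        prev = s;
    }
}

/* Number of recorded successors of state cur. */
long row_total(long trans[NSYM][NSYM], int cur)
{
    long total = 0;
    int b;

    for (b = 0; b < NSYM; b++)
        total += trans[cur][b];
    return total;
}

/* Choose the successor of cur selected by r, 0 <= r < row_total(cur).
 * Returns -1 if cur has no recorded successor for that r. */
int next_symbol(long trans[NSYM][NSYM], int cur, long r)
{
    int b;

    for (b = 0; b < NSYM; b++) {
        if (r < trans[cur][b])
            return b;
        r -= trans[cur][b];
    }
    return -1;
}
```

Because the table never changes while text is generated, the transition probabilities are stationary in the sense described above. Higher orders need one row per k-gram, which is where a table of this shape stops being practical.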

You can also generate random text at the word level. The dumbest approach is to spew forth the words in a dictionary at random. A slightly better approach reads a document, counts each word, and then selects the next word to be printed with the appropriate probability. More interesting text, though, is produced by using Markov chains that take into account a few preceding words as they generate the next word. Here is some random text produced from my book Programming Pearls:

Order-1. the STL also adds the same sequence of the large they didn't, they are those structures, design can sometimes decrease storage allocation is trivial program in increasing y- order,

Order-2. The 40-pass algorithm in Solution 4) to run out of order with its push-button encoding, sort by column then row, and then bring together all words with the sizeof structures. We therefore used a total of 96 megabytes.

Order-3. I despise having to use a binary search. These techniques apply to any piecewise-linear functions. 3. This C code implements the sorting algorithm, using the functions defined in Solution 2.

The Order-1 text is almost readable aloud, while the Order-3 text consists of very long phrases from the original input, with random transitions between them. For purposes of parody, Order-2 text is usually juiciest. (See the "Further Reading" section for many more examples of word-level Markov text.)

Automating Parody

I first saw letter-level and word-level order-k approximations to English text in Shannon's 1948 classic Mathematical Theory of Communication. Shannon describes how he generated such text:

"To construct [order-1 letter-level text] for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. A similar process was used for [order-1 and order-2 letter-level text, and order-0 and order-1 word-level text]. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage."

Programs have been automating this laborious task since the early 1950s. Shannon's method scans the input text for the current k-gram to generate the next word; this approximation works well when k is small, but gives biased output when there are uneven gaps between k-grams. Fast machines can instead scan the complete input text to choose a truly random successor. A program at http://www.programmingpearls.com/ implements this approach, but it generates only a few outputs per second when processing the 4 million characters in the King James Bible.
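The full-scan idea can be sketched at the letter level as follows. This is not the program from the web site; the function name random_successor is hypothetical, and the uniform choice among occurrences is made with reservoir sampling, so the occurrences never need to be stored. The gram is assumed to consist of k non-null characters.

```c
#include <stdlib.h>
#include <string.h>

/* Scan all of text for occurrences of the k characters at gram and
 * return the character that follows one occurrence chosen uniformly
 * at random, or 0 if no occurrence has a successor.
 * Reservoir sampling: the i-th candidate replaces the current choice
 * with probability 1/i, which leaves every candidate equally likely. */
int random_successor(const char *text, const char *gram, int k)
{
    const char *p;
    int seen = 0, chosen = 0;

    for (p = text; *p; p++) {
        if (strncmp(p, gram, (size_t)k) == 0 && p[k] != 0) {
            seen++;
            if (rand() % seen == 0)
                chosen = (unsigned char)p[k];
        }
    }
    return chosen;
}
```

Each output character costs one pass over the whole input, which is why this approach slows to a crawl on multi-megabyte texts.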

A suffix array lets a program search for the next phrase more efficiently. The Order-k C program will store at most 5 MB of text in the array inputchars:

    int k = 2;
    char inputchars[5000000];
    char *word[1000000];
    int nword = 0;

It will employ the array word as a suffix array pointing to the characters that start on word boundaries (a common modification). The variable nword holds the number of words. It reads the file with the following code:

    word[0] = inputchars
    while scanf("%s", word[nword]) != EOF
        word[nword+1] = word[nword] + strlen(word[nword]) + 1
        nword++

Each word is appended to inputchars (no other storage allocation is needed), and is terminated by the null character supplied by scanf.

After reading the input, the program sorts the word array to bring together all pointers that point to the same sequence of k words. The following function does the comparisons:

    int wordncmp(char *p, char* q)
        n = k
        for ( ; *p == *q; p++, q++)
            if (*p == 0 && --n == 0)
                return 0
        return *p - *q

It scans through the two strings while the characters are equal. At every null character, it decrements the counter n and returns equal after seeing k identical words. When it finds unequal characters, it returns the difference.
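To see the comparison at work, it helps to try it on two small buffers of back-to-back, null-terminated words. In this sketch k is passed as a third argument instead of being read from a global variable; that change is an assumption made so the example stands alone.

```c
/* Compare the first k null-terminated words starting at p and q.
 * The words are assumed to be stored back to back, each followed by
 * a '\0'. Returns 0 if the k words match, otherwise the difference
 * of the first unequal characters. */
int wordncmp(const char *p, const char *q, int k)
{
    int n = k;

    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}
```

With a holding the words "the", "people,", "by" and b holding "the", "people", "shall", the call wordncmp(a, b, 1) returns 0 because the first words agree, while wordncmp(a, b, 2) is positive because the comma in "people," compares greater than the null that ends "people".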

After reading the input, the program appends k null characters (so the comparison function doesn't run off the end), prints the first k words in the document (to start the random output), and calls the sort:

    for (i = 0; i < k; i++)
        word[nword][i] = 0
    for (i = 0; i < k; i++)
        print word[i]
    qsort(word, nword, sizeof(word[0]), sortcmp)

The sortcmp function, as usual, adds a level of indirection to its pointers.

The space-efficient structure now contains a great deal of information about the k-grams in the text. If k is 1 and the input text is "of the people, by the people, for the people," the word array might look like this:

    word[0]: by the
    word[1]: for the
    word[2]: of the
    word[3]: people
    word[4]: people, for
    word[5]: people, by
    word[6]: the people,
    word[7]: the people
    word[8]: the people,

For clarity, this picture shows only the first k+1 words pointed to by each element of word, even though more words usually follow. To find a word to follow the phrase "the," a program looks it up in the suffix array to discover three choices: "people," twice and "people" once.
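The lookup can be tried on this small example with the sketch below. It makes several assumptions that are not part of the column's program: split_words is a hypothetical stand-in for the scanf loop that splits a buffer in place on single spaces, the buffer is expected to have at least k zero bytes after its text, and first_match and count_matches are invented names for the binary search and the scan over equal phrases.

```c
#include <stdlib.h>
#include <string.h>

static int k = 1;   /* order of the model; 1 for this example */

/* Compare the first k null-terminated words starting at p and q. */
int wordncmp(const char *p, const char *q)
{
    int n = k;

    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* qsort comparator: one extra level of indirection. */
int sortcmp(const void *p, const void *q)
{
    return wordncmp(*(char *const *)p, *(char *const *)q);
}

/* Hypothetical stand-in for the scanf loop: split buf in place on
 * single spaces and record the start of each word. Returns the count. */
int split_words(char *buf, char **word, int maxwords)
{
    char *p = buf;
    int n = 0;

    while (*p && n < maxwords) {
        word[n++] = p;
        while (*p && *p != ' ')
            p++;
        if (*p)
            *p++ = 0;
    }
    return n;
}

/* Binary search for the first element whose first k words are not
 * less than phrase; returns nword if every element is smaller. */
int first_match(char **word, int nword, const char *phrase)
{
    int l = -1, u = nword, m;

    while (l + 1 != u) {
        m = (l + u) / 2;
        if (wordncmp(word[m], phrase) < 0)
            l = m;
        else
            u = m;
    }
    return u;
}

/* Count the elements whose first k words equal phrase. */
int count_matches(char **word, int nword, const char *phrase)
{
    int u = first_match(word, nword, phrase), i = 0;

    while (u + i < nword && wordncmp(phrase, word[u + i]) == 0)
        i++;
    return i;
}
```

After splitting "of the people, by the people, for the people" into nine words and sorting with qsort(word, n, sizeof(word[0]), sortcmp), count_matches(word, n, "the") reports the three choices described above.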

The program may now generate nonsense text with the pseudocode in Example 1. The loop is initialized by setting phrase to the first characters in the input (recall that those words were already printed on the output file). The binary search locates the first occurrence of phrase (it is crucial to find the very first occurrence). The next loop scans through all equal phrases and selects one of them at random. If the k-th word of that phrase is of length zero, the current phrase is the last in the document, and an early exit is taken from the loop.

The complete pseudocode (see Example 2) implements those ideas, and also puts an upper bound on the number of words it generates. Listing Two is the complete C program. (See "Further Reading" for a description of a more typical program for generating Markov text that doesn't use suffix arrays. The suffix array approach is about half the length in code, has similar run time, and uses an order of magnitude less memory.)

Principles

Need to search for phrases in a long string? Create an array of pointers to every relevant position in the string (every character or every word), and sort it. The resulting suffix array gathers together similar strings, and lets you look up a string using binary search. It requires only a few lines of code to build, n extra pointers of space, and a small and fast binary search to answer a query.
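At the character level, the same recipe gives a substring search. The sketch below is one possible arrangement, not code from the column: build_suffix_array and sa_find are hypothetical names, and the search compares only the first strlen(pat) characters of each suffix with strncmp, which is consistent with the strcmp order used for sorting.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of pointers to suffixes. */
int pstrcmp(const void *p, const void *q)
{
    return strcmp(*(const char *const *)p, *(const char *const *)q);
}

/* Fill a[0..n-1] with pointers to every suffix of text, then sort. */
void build_suffix_array(const char *text, int n, const char **a)
{
    int i;

    for (i = 0; i < n; i++)
        a[i] = &text[i];
    qsort(a, (size_t)n, sizeof(a[0]), pstrcmp);
}

/* Return a pointer to an occurrence of pat in the text, or NULL.
 * The binary search finds the first suffix whose leading characters
 * are not less than pat; that suffix starts with pat if any does. */
const char *sa_find(const char **a, int n, const char *pat)
{
    size_t len = strlen(pat);
    int l = -1, u = n, m;

    while (l + 1 != u) {
        m = (l + u) / 2;
        if (strncmp(a[m], pat, len) < 0)
            l = m;
        else
            u = m;
    }
    if (u < n && strncmp(a[u], pat, len) == 0)
        return a[u];
    return NULL;
}
```

Each query makes about log2(n) string comparisons, so repeated searches over the same text are cheap once the array has been sorted.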

Further Reading

This column is based on Chapter 15 of the second edition of Programming Pearls (Addison-Wesley, 2000). The full column is available at http://www.programmingpearls.com/. It contains exercises, solutions, code, references to related work, and numerous examples of Markov text.

DDJ

Listing One

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* longdup.c -- Print longest string duplicated M times */

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int pstrcmp(char **p, char **q)
{   return strcmp(*p, *q); }

int comlen(char *p, char *q)
{   int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

#define M 1
#define MAXN 5000000
char c[MAXN], *a[MAXN];

int main()
{   int i, ch, n = 0, maxi, maxlen = -1;
    while ((ch = getchar()) != EOF) {
        a[n] = &c[n];
        c[n++] = ch;
    }
    c[n] = 0;
    qsort(a, n, sizeof(char *), pstrcmp);
    for (i = 0; i < n-M; i++)
        if (comlen(a[i], a[i+M]) > maxlen) {
            maxlen = comlen(a[i], a[i+M]);
            maxi = i;
        }
    printf("%.*s\n", maxlen, a[maxi]);
    return 0;
}


Listing Two

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* markov.c -- generate random text from input document */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char inputchars[4300000];
char *word[800000];
int nword = 0;
int k = 2;

int wordncmp(char *p, char* q)
{   int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

int sortcmp(char **p, char **q)
{   return wordncmp(*p, *q); }

char *skip(char *p, int n)
{   for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

int main()
{   int i, wordsleft = 10000, l, m, u;
    char *phrase, *p;
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);
    qsort(word, nword, sizeof(word[0]), sortcmp);
    phrase = inputchars;
    for ( ; wordsleft > 0; wordsleft--) {
        l = -1;
        u = nword;
        while (l+1 != u) {
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
            if (rand() % (i+1) == 0)
                p = word[u+i];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }
    return 0;
}
