Words are the basic component of documents, and many important problems can be solved by searching for words. Sometimes, however, I search long strings (my documents, or help files, or web pages, or even the entire web) for phrases, such as ``substring searching'' or ``implicit data structures''.
How would you search through a large body of text to find ``a phrase of several words''? If you had never seen the body of text before, you would have no choice but to start at the beginning and scan through the whole input; most algorithms texts describe many approaches to this ``substring searching problem''.
Suppose, though, that you had a chance to preprocess the body of text before performing searches. You could make a hash table (or search tree) to index every distinct word of the document, and store a list of every occurrence of each word. Such an ``inverted index'' allows a program to look up a given word quickly. One can look up phrases by intersecting the lists of the words they contain, but this is subtle to implement and potentially slow. (Some web search engines do, however, take exactly this approach.)
We'll turn now to a powerful data structure and apply it to a small problem: given an input file of text, find the longest duplicated substring of characters in it. For instance, the longest repeated string in ``Ask not what your country can do for you, but what you can do for your country'' is `` can do for you'', with `` your country'' a close second place. How would you write a program to solve this problem?
[Click for more examples of long repeated strings]
This problem is reminiscent of the anagram problem that we saw in Section 2.4. If the input string is stored in c[0..n-1], then we could start by comparing every pair of substrings using pseudocode like this
maxlen = -1 for i = [0, n) for j = (i, n) if (thislen = comlen(&c[i], &c[j])) > maxlen maxlen = thislen maxi = i maxj = j
int comlen(char *p, char *q) i = 0 while *p && (*p++ == *q++) i++ return i
Our program will process at most MAXN characters, which it stores in the array c:
#define MAXN 5000000 char c[MAXN], *a[MAXN];
while (ch = getchar()) != EOF a[n] = &c[n] c[n++] = ch c[n] = 0
The element a[0] points to the entire string; the next element points to the suffix of the array beginning with the second character, and so on. On the input string ``banana'', the array will represent these suffixes:
a[0]: banana a[1]: anana a[2]: nana a[3]: ana a[4]: na a[5]: a
If a long string occurs twice in the array c, it appears in two different suffixes. We will therefore sort the array to bring together equal suffixes (just as sorting brought together anagrams in Section 2.4). The ``banana'' array sorts to
a[0]: a a[1]: ana a[2]: anana a[3]: banana a[4]: na a[5]: nana
We'll sort the suffix array with the qsort function:
qsort(a, n, sizeof(char *), pstrcmp)
for i = [0, n) if comlen(a[i], a[i+1]) > maxlen maxlen = comlen(a[i], a[i+1]) maxi = i printf("%.*s\n", maxlen, a[maxi])
I ran the resulting program to find the longest repeated string in the 807,503 characters in Samuel Butler's translation of Homer's Iliad. The program took 4.8 seconds to locate this string:
[Click for more examples of long repeated strings]
Suffix arrays represent every substring in n characters of input text using the text itself and n additional pointers. Problem 6 investigates how suffix arrays can be used to solve the substring searching problem. We'll turn now to a more subtle application of suffix arrays.
Next: Section 15.3. Generating Random Text.
Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000