book cover

Principles
(Section 15.4 of
Programming Pearls)

String Problems. How does your compiler look up a variable name in its symbol table? How does your help system quickly search that whole CD-ROM as you type in each character of your query string? How does a web search engine look up a phrase? These real problems use some of the techniques that we've glimpsed in the toy problems of this column.

Data Structures for Strings. We've seen several of the most important data structures used for representing strings.

Hashing. This structure is fast on the average and simple to implement.

Balanced Trees. These structures guarantee good performance even on perverse inputs, and are nicely packaged in most implementations of the C++ Standard Template Library's sets and maps.

Suffix Arrays. Initialize an array of pointers to every character (or every word) in your text, sort them, and you have a suffix array. You can then scan through it to find near strings or use binary search to look up words or phrases.
Section 13.8 uses several additional structures to represent the words in a dictionary.

Libraries or Custom-Made Components? The sets, maps and strings of the C++ STL were very convenient to use, but their general and powerful interface meant that they were not as efficient as a special-purpose hash function. Other library components were very efficient: hashing used strcmp and suffix arrays used qsort. I peeked at the library implementations of bsearch and strcmp to build the binary search and the wordncmp functions in the Markov program.

Next: Section 15.5. Problems.

Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000