Problems
(Section 15.5 of Programming Pearls)

The Solutions to Column 15 give answers for some of these problems.

1. Throughout this column we have used the simple definition that words are separated by white space. Many real documents, such as those in HTML or RTF, contain formatting commands. How could you deal with such commands? Are there any other tasks that you might need to perform?
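
One hedged sketch of a word reader that skips HTML formatting commands; getword here is an illustrative function, not the book's code, and real HTML also needs entities such as &amp; decoded and case folded, which hints at the "other tasks":

    // Sketch: a word reader that skips HTML tags (assumed approach).
    #include <cctype>
    #include <cstdio>

    int getword(char *word, int maxlen)
    {
        int c, n = 0;
        while ((c = getchar()) != EOF) {
            if (c == '<') {                 // skip a formatting command
                while ((c = getchar()) != EOF && c != '>')
                    ;
            } else if (isalpha(c)) {
                if (n < maxlen - 1)
                    word[n++] = tolower(c); // fold case, one such extra task
            } else if (n > 0)
                break;                      // end of a word
        }
        word[n] = 0;
        return n;                           // 0 at end of input
    }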

2. On a machine with ample main memory, how could you use the C++ STL sets or maps to solve the searching problem in Section 13.8? How much memory do they consume compared to McIlroy's structure?
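
A minimal sketch of the set-based approach; the file name and the report format are assumptions, not part of the problem:

    // Sketch: dictionary lookup with an STL set.
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>
    using namespace std;

    int main()
    {
        set<string> dict;
        ifstream f("words.txt");            // assumed dictionary file
        string w;
        while (f >> w)
            dict.insert(w);
        while (cin >> w)                    // report words not in the dictionary
            if (dict.find(w) == dict.end())
                cout << w << "\n";
        return 0;
    }

A balanced-tree node costs several pointers plus per-string overhead, so expect tens of bytes per word; that is the comparison the problem asks you to make against McIlroy's compressed structure.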

3. How much speedup can you achieve by incorporating the special-purpose malloc of Solution 9.2 into the hashing program of Section 15.1?
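
For reference, the pattern of Solution 9.2 allocates nodes in large blocks; a sketch of that idea, specialized to the hash nodes of Section 15.1 (names and sizes are illustrative):

    // Sketch: carve nodes from big blocks instead of one malloc per word.
    #include <cstdlib>

    struct node { char *word; int count; node *next; };

    enum { NODEGROUP = 1000 };              // nodes allocated per block
    static node *freenode = 0;
    static int nodesleft = 0;

    node *nodealloc()
    {
        if (nodesleft == 0) {               // grab a fresh block
            freenode = (node *) malloc(NODEGROUP * sizeof(node));
            nodesleft = NODEGROUP;
        }
        nodesleft--;
        return freenode++;
    }

Replacing the per-node allocation in the word counter with nodealloc and timing both versions answers the question; the blocks are never freed, which is acceptable for a run-once program.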

4. When a hash table is large and the hash function scatters the data well, almost every list in the table has only a few elements. If either of these conditions is violated, though, the time spent searching down the list can be substantial. When a new string is not found in the hash table in Section 15.1, it is placed at the front of the list. To simulate hashing trouble, set NHASH to 1 and experiment with this and other list strategies, such as appending to the end of the list, or moving the most recently found element to the front of the list.
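
One strategy to try, sketched against a hash-chain structure modeled on the Section 15.1 program (field names assumed; this is not Bentley's code): move each found node to the front of its list, so recently seen words are cheap to find again.

    // Sketch: move-to-front lookup on a hash chain.
    #include <cstring>

    struct node { char *word; int count; node *next; };
    #define NHASH 1                         // one long list simulates bad hashing
    node *bin[NHASH];

    void incword(char *s)
    {
        int h = 0;                          // all words hash to bucket 0
        node *prev = 0;
        for (node *p = bin[h]; p != 0; prev = p, p = p->next)
            if (strcmp(p->word, s) == 0) {
                p->count++;
                if (prev != 0) {            // splice the found node to the front
                    prev->next = p->next;
                    p->next = bin[h];
                    bin[h] = p;
                }
                return;
            }
        node *p = new node;                 // not found: insert at the front
        p->word = new char[strlen(s) + 1];
        strcpy(p->word, s);
        p->count = 1;
        p->next = bin[h];
        bin[h] = p;
    }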

5. When we looked at the output of the word frequency programs in Section 15.1, the most interesting display printed the words in order of decreasing frequency. How would you modify the C and C++ programs to accomplish this task? How would you print only the M most common words, where M is a constant such as 10 or 1000?
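
A sketch of one C++ approach: copy the map into a vector and partially sort it by count. The constant m and the output format are assumptions.

    // Sketch: print the m most common words, starting from the
    // map-based counter of Section 15.1.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>
    using namespace std;

    bool bycount(const pair<string,int> &a, const pair<string,int> &b)
    {
        return a.second > b.second;         // larger counts first
    }

    int main()
    {
        map<string,int> count;
        string w;
        while (cin >> w)
            count[w]++;
        vector<pair<string,int> > v(count.begin(), count.end());
        size_t m = 10;                      // assumed constant
        if (m > v.size())
            m = v.size();
        partial_sort(v.begin(), v.begin() + m, v.end(), bycount);
        for (size_t i = 0; i < m; i++)
            cout << v[i].second << " " << v[i].first << "\n";
        return 0;
    }

In C, one could instead walk the hash table into an array of (word, count) nodes and qsort it with a comparison on count.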

6. Given a new input string, how would you search a suffix array to find the longest match in the stored text? How would you build a GUI interface for this task?
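
A sketch of the search half, assuming the sorted suffix array a[0..n-1] of Section 15.2: binary-search for where the query would fall in sorted order; the suffix sharing the longest prefix with it must be one of the two neighbors of that gap.

    // Sketch: longest match for query q in the indexed text (code assumed).
    #include <cstring>

    int comlen(const char *p, const char *q) // common-prefix length
    {
        int i = 0;
        while (*p && *p == *q)
            i++, p++, q++;
        return i;
    }

    int longestmatch(char *a[], int n, const char *q)
    {
        int lo = -1, hi = n;                // invariant: a[lo] < q <= a[hi]
        while (lo + 1 < hi) {
            int mid = (lo + hi) / 2;
            if (strcmp(a[mid], q) < 0)
                lo = mid;
            else
                hi = mid;
        }
        int best = 0;                       // the best match adjoins the gap
        if (lo >= 0 && comlen(a[lo], q) > best) best = comlen(a[lo], q);
        if (hi < n && comlen(a[hi], q) > best) best = comlen(a[hi], q);
        return best;
    }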

7. Our program for finding duplicated strings was very fast for "typical" inputs, but it can be very slow (greater than quadratic) for some inputs. Time the program on such an input. Do such inputs ever arise in practice?

8. How would you modify the program for finding duplicated strings to find the longest string that occurs more than M times?
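
A sketch of the key change to the Section 15.2 program: a string occurring more than m times heads m+1 consecutive suffixes in the sorted array, so compare each suffix with its m-th successor rather than its immediate neighbor.

    // Sketch: longest string occurring more than m times, given the
    // sorted suffix array a[0..n-1] and comlen of Section 15.2.
    int comlen(char *p, char *q);           // common-prefix length, from 15.2

    char *longestrepeat(char *a[], int n, int m, int *maxlen)
    {
        char *where = 0;
        *maxlen = 0;
        for (int i = 0; i + m < n; i++)
            if (comlen(a[i], a[i + m]) > *maxlen) {
                *maxlen = comlen(a[i], a[i + m]);
                where = a[i];
            }
        return where;                       // start of the longest such string
    }

With m = 1 this reduces to the original program's comparison of adjacent suffixes.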

9. Given two input texts, find the longest string that occurs in both.
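
One approach, sketched here under assumed names: index the suffixes of both texts in one array, tag each suffix with its text of origin, sort, and examine adjacent suffixes whose origins differ.

    // Sketch: longest string occurring in both texts x and y.
    #include <algorithm>
    #include <cstring>
    #include <vector>
    using namespace std;

    int comlen(const char *p, const char *q)
    {
        int i = 0;
        while (*p && *p == *q)
            i++, p++, q++;
        return i;
    }

    typedef pair<const char *, int> Suf;    // suffix and text of origin

    bool lexless(const Suf &a, const Suf &b)
    {
        return strcmp(a.first, b.first) < 0;
    }

    int common(const char *x, const char *y, const char **where)
    {
        vector<Suf> a;
        for (const char *p = x; *p; p++) a.push_back(Suf(p, 0));
        for (const char *p = y; *p; p++) a.push_back(Suf(p, 1));
        sort(a.begin(), a.end(), lexless);
        int best = 0;
        for (size_t i = 1; i < a.size(); i++)
            if (a[i-1].second != a[i].second) {  // suffixes of different texts
                int len = comlen(a[i-1].first, a[i].first);
                if (len > best) {
                    best = len;
                    *where = a[i].first;
                }
            }
        return best;
    }

The longest common string is a prefix of one suffix from each text, and some adjacent cross-text pair in sorted order shares a prefix at least that long, so the scan over neighbors suffices.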

10. Show how to reduce the number of pointers in the duplication program by pointing only to suffixes that start on word boundaries. What effect does this have on the output produced by the program?
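
A sketch of the change to the input loop (names follow the Section 15.2 program; details assumed): record a suffix only where a word starts.

    // Sketch: keep only suffixes that begin at word boundaries.
    #include <cctype>

    // x[0..n-1] is the text; a[] receives the suffix pointers
    int wordsuffixes(char *x, int n, char *a[])
    {
        int m = 0;
        for (int i = 0; i < n; i++)
            if (!isspace((unsigned char) x[i]) &&
                (i == 0 || isspace((unsigned char) x[i-1])))
                a[m++] = &x[i];             // a word starts here
        return m;                           // number of pointers stored
    }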

11. Implement a program to generate letter-level Markov text.
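
A minimal sketch, analogous to the word-level program of Section 15.3 but much simpler and slower: at each step scan the whole text, choose uniformly among occurrences of the current K letters with the rand() % ++seen reservoir trick, and emit the letter that follows. K, the buffer size, and the output length are assumptions.

    // Sketch: order-K letter-level Markov text.
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    const int K = 5;                        // order of the model
    static char x[500000];                  // the input text

    int main()
    {
        int n = 0, c;
        while ((c = getchar()) != EOF && n < (int) sizeof(x) - 1)
            x[n++] = c;
        if (n <= K)
            return 0;
        char phrase[K];
        memcpy(phrase, x, K);               // seed with the first K letters
        fwrite(phrase, 1, K, stdout);
        for (int out = 0; out < 2000; out++) {
            int seen = 0;
            char next = 0;
            for (int i = 0; i + K < n; i++) // pick a successor uniformly
                if (memcmp(x + i, phrase, K) == 0 && rand() % ++seen == 0)
                    next = x[i + K];
            if (seen == 0)
                break;                      // current phrase has no successor
            putchar(next);
            memmove(phrase, phrase + 1, K - 1);  // slide the window
            phrase[K - 1] = next;
        }
        return 0;
    }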

12. How would you use the tools and techniques of Section 15.1 to generate (order-0 or non-Markov) random text?
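
A sketch using the simplest representation: order-0 text emits words independently with their observed frequencies, so keeping every token and sampling uniformly gives exactly those frequencies. The output length is an assumption.

    // Sketch: order-0 random text from the input's word frequencies.
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    int main()
    {
        vector<string> words;
        string w;
        while (cin >> w)
            words.push_back(w);             // keep duplicates: frequency is implicit
        for (int i = 0; i < 100 && !words.empty(); i++)
            cout << words[rand() % words.size()] << " ";
        cout << "\n";
        return 0;
    }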

13. The program for generating word-level Markov text is at this book's web site. Try it on some of your documents.

14. How could you use hashing to speed up the Markov program?
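
One direction, sketched with assumed names and constants: hash each k-word phrase of the input into a chained table of word indices, so the generator can replace the binary search of the Section 15.3 program with a scan of one bucket.

    // Sketch: hash a phrase of k successive words (word[] as in 15.3).
    enum { NHASH = 49979 };                 // assumed table size

    unsigned int phrasehash(char *word[], int i, int k)
    {
        unsigned int h = 0;
        for (int j = 0; j < k; j++)
            for (char *s = word[i + j]; *s; s++)
                h = h * 31 + (unsigned char) *s;
        return h % NHASH;
    }

Chain index i into bucket phrasehash(word, i, k) while reading the input; at generation time, scan that one bucket, still checking candidates with string comparison, since distinct phrases can collide.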

15. Shannon's quote in Section 15.3 describes the algorithm he used to construct Markov text; implement that algorithm in a program. It gives a good approximation to the Markov frequencies, but not the exact form; explain why not. Implement a program that scans the entire string from scratch to generate each word (and thereby uses the true frequencies).
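
A sketch of the core step, under assumed names: Shannon opened the text at a random point and searched forward for the current phrase. The first match after a random point is not a uniform choice among occurrences (an occurrence preceded by a long gap is found more often), which is why the frequencies are only approximate; choosing uniformly over all occurrences, as with the reservoir scan in the Problem 11 sketch, gives the true frequencies.

    // Sketch of Shannon's step: from a random word, scan forward
    // (wrapping) for the current k-word phrase; the word after the
    // first match is the next output word.
    #include <cstdlib>
    #include <cstring>

    bool samephrase(char *word[], int i, int j, int k)
    {
        for (int a = 0; a < k; a++)
            if (strcmp(word[i + a], word[j + a]) != 0)
                return false;
        return true;
    }

    // word[0..nword-1]: input words; cur: start of the current phrase.
    // Returns the index of the next output word, or -1 if none exists.
    int nextword(char *word[], int nword, int cur, int k)
    {
        int start = rand() % nword;
        for (int t = 0; t < nword; t++) {
            int i = (start + t) % nword;
            if (i + k < nword && samephrase(word, i, cur, k))
                return i + k;
        }
        return -1;
    }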

16. How would you use the techniques of this column to assemble a word list for a dictionary (the problem that Doug McIlroy faced in Section 13.8)? How would you build a spelling checker without using a dictionary? How would you build a grammar checker without using rules of grammar?

17. Investigate how techniques related to k-gram analysis are used in applications such as speech recognition and data compression.
