The Solutions to Column 15 give answers for some of these problems.
1.
Throughout this column we have used the simple definition
that words are separated by white space.
Many real documents, such as those in HTML or RTF,
contain formatting commands.
How could you deal with such commands?
Are there any other tasks that you might need to perform?
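One simple approach for HTML, sketched below, is to delete
everything between a < and the matching >
before counting words. This naive filter ignores comments,
script bodies and character entities such as &amp;,
which a more careful program would also have to handle.

    /* Minimal sketch: strip HTML tags before counting words */
    #include <cstdio>

    int main()
    {   int c, intag = 0;
        while ((c = getchar()) != EOF)
            if (c == '<')
                intag = 1;
            else if (c == '>')
            {   intag = 0;
                putchar(' ');   /* tags also separate words */
            }
            else if (!intag)
                putchar(c);
        return 0;
    }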
2.
On a machine with ample main memory,
how could you use the C++ STL sets or maps
to solve the searching problem in Section 13.8?
How much memory does it consume compared to McIlroy's
structure?
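As a starting point, here is a minimal sketch that keeps the
dictionary in an STL set and reports input words not in it;
the dictionary file name "words" is an assumption.
Each set node typically costs several pointers of overhead
plus a separately allocated string, so expect tens of bytes
per word, far more than McIlroy's compressed structure.

    /* Minimal sketch: an STL set as an in-memory dictionary */
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>
    using namespace std;

    int main()
    {   set<string> dict;
        ifstream fin("words");  /* one word per line, by assumption */
        string word;
        while (fin >> word)
            dict.insert(word);
        while (cin >> word)     /* report words not in the dictionary */
            if (dict.find(word) == dict.end())
                cout << word << "\n";
        return 0;
    }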
3.
How much speedup can you achieve by incorporating
the special-purpose malloc of Solution 9.2
into the hashing program of Section 15.1?
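The allocator of Solution 9.2 carves many small nodes out of
one big chunk, replacing most calls on the general-purpose
malloc with a pointer decrement. A minimal sketch of that idea
follows; the names, the chunk size, and the neglect of
alignment are simplifications, not the book's code.

    /* Minimal sketch of a special-purpose allocator: hand out
       pieces of a large buffer, calling malloc only to refill.
       Assumes size is much smaller than CHUNK and ignores
       alignment, which production code must respect. */
    #include <cstdlib>

    char *nodealloc(int size)
    {   const int CHUNK = 10000;
        static char *buf = 0;
        static int avail = 0;
        if (avail < size)
        {   buf = (char *) malloc(CHUNK);
            avail = CHUNK;
        }
        avail -= size;
        return buf + avail;
    }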
4.
When a hash table is large and
the hash function scatters the data well,
almost every list in the table has only a few elements.
If either of these conditions is violated,
though,
the time spent searching down the list can be substantial.
When a new string is not found in the hash table
in Section 15.1,
it is placed at the front of the list.
To simulate hashing trouble,
set NHASH to 1 and experiment with
this and other list strategies,
such as appending to the end of the list,
or moving the most recently found element to the
front of the list.
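As one example, here is a sketch of the move-to-front
strategy on a single chain; the node fields mirror those of
the hashing program, but the names are assumptions.

    /* Minimal sketch: search one chain, moving a found node to
       the front so recently seen words stay cheap to find even
       when NHASH is 1 */
    #include <cstring>

    typedef struct node {
        char *word;
        int count;
        struct node *next;
    } node;

    node *searchmtf(node **bin, char *s)
    {   for (node **p = bin; *p != 0; p = &(*p)->next)
            if (strcmp((*p)->word, s) == 0)
            {   node *found = *p;
                *p = found->next;       /* unlink */
                found->next = *bin;     /* relink at the front */
                *bin = found;
                return found;
            }
        return 0;                       /* not in this chain */
    }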
5.
When we looked at the output of the word frequency
programs in Section 15.1,
it was most interesting to print the words in
decreasing frequency.
How would you modify the C and C++ programs to
accomplish this task?
How would you print only the M most common words,
where M is a constant such as 10 or 1000?
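A sketch of one C++ approach: copy the counts into a vector
of (count, word) pairs, sort it, and print from the back;
M is a variable here, with 10 as a placeholder value.

    /* Minimal sketch: print the M most common words in
       decreasing frequency */
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>
    using namespace std;

    int main()
    {   map<string, int> count;
        string word;
        while (cin >> word)
            count[word]++;
        vector< pair<int, string> > v;
        map<string, int>::iterator i;
        for (i = count.begin(); i != count.end(); ++i)
            v.push_back(make_pair(i->second, i->first));
        sort(v.begin(), v.end());       /* increasing count */
        int M = 10;                     /* a placeholder value */
        for (int j = (int) v.size() - 1; j >= 0 && M-- > 0; j--)
            cout << v[j].first << " " << v[j].second << "\n";
        return 0;
    }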
6.
Given a new input string,
how would you search a suffix array to find
the longest match in the stored text?
How would you build a graphical user interface for this task?
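For the search half of the question, a sketch: binary search
finds where the new string would fall in the sorted suffix
array a[0..n-1], and the longest match must then be with one
of the two neighboring suffixes. The function names follow
the spirit of the column's code but are assumptions here.

    /* Minimal sketch: the suffix sharing the longest prefix
       with query */
    #include <cstring>

    int comlen(char *p, char *q)
    {   int i = 0;
        while (*p && *p++ == *q++)
            i++;
        return i;
    }

    char *longestmatch(char *a[], int n, char *query, int *len)
    {   int lo = 0, hi = n;
        while (lo < hi)                 /* binary search for the */
        {   int mid = (lo + hi) / 2;    /* insertion point */
            if (strcmp(a[mid], query) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        *len = 0;
        char *best = 0;
        for (int i = lo - 1; i <= lo; i++)  /* try both neighbors */
            if (i >= 0 && i < n)
            {   int l = comlen(a[i], query);
                if (l >= *len) { *len = l; best = a[i]; }
            }
        return best;
    }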
7.
Our program for finding duplicated strings was
very fast for "typical" inputs,
but it can be very slow (greater than quadratic)
for some inputs.
Time the program on such an input.
Do such inputs ever arise in practice?
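One such input, as a sketch: a long run of a single letter,
on which every pair of suffixes shares a long common prefix,
so each comparison during the sort is expensive.

    /* Minimal sketch: a pathological input generator */
    #include <cstdio>

    int main()
    {   for (int i = 0; i < 100000; i++)
            putchar('a');
        return 0;
    }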
8.
How would you modify the program for finding
duplicated strings to find
the longest string that occurs more than M times?
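A sketch over a sorted suffix array: a string occurring more
than M times is a common prefix of M+1 consecutive suffixes,
so it suffices to compare each a[i] with a[i+M]
(comlen is as in the sketch under Problem 6).

    /* Minimal sketch: the longest string that occurs more than
       M times in the text behind the sorted suffix array
       a[0..n-1] */
    int longestrep(char *a[], int n, int M, char **where)
    {   int maxlen = 0;
        *where = 0;
        for (int i = 0; i + M < n; i++)
        {   int l = comlen(a[i], a[i + M]);
            if (l > maxlen) { maxlen = l; *where = a[i]; }
        }
        return maxlen;
    }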
9.
Given two input texts,
find the longest string that occurs in both.
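One sketch: concatenate the two texts around a separator
character assumed to occur in neither, build a single suffix
array, and examine only adjacent suffixes that start in
different texts; a common prefix of such a pair cannot cross
the separator, since the separator occurs only once
(comlen is as in the sketch under Problem 6).

    /* Minimal sketch: print the longest string occurring in
       both x and y, using '\1' as the separator */
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int pstrcmp(const void *p, const void *q)
    {   return strcmp(*(char *const *) p, *(char *const *) q);
    }

    void printcommon(const char *x, const char *y)
    {   int nx = strlen(x), ny = strlen(y);
        int n = nx + ny + 1;
        char *c = (char *) malloc(n + 1);
        sprintf(c, "%s\1%s", x, y);
        char **a = (char **) malloc(n * sizeof(char *));
        for (int i = 0; i < n; i++)
            a[i] = c + i;
        qsort(a, n, sizeof(char *), pstrcmp);
        int maxlen = 0;
        char *maxp = c;
        for (int i = 0; i < n - 1; i++)
            if ((a[i] - c < nx) != (a[i+1] - c < nx))
            {   int l = comlen(a[i], a[i+1]);   /* different texts */
                if (l > maxlen) { maxlen = l; maxp = a[i]; }
            }
        printf("%.*s\n", maxlen, maxp);
        free(a);
        free(c);
    }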
10.
Show how to reduce the number of pointers
in the duplication program
by pointing only to suffixes that start on word boundaries.
What effect does this have on the output
produced by the program?
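A sketch of the reduced array: record a suffix pointer only
at characters that begin a word.

    /* Minimal sketch: build the suffix array only on word
       boundaries, i.e. at a non-blank character that is either
       the first in the text or preceded by a blank */
    #include <cctype>

    int wordsuffixes(char *text, char *a[])
    {   int n = 0;
        for (char *p = text; *p; p++)
            if (!isspace((unsigned char) *p) &&
                (p == text || isspace((unsigned char) p[-1])))
                a[n++] = p;
        return n;
    }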
11.
Implement a program to generate letter-level Markov text.
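A minimal sketch of an order-K letter-level generator: keep
the last K letters as the state and, at each step, rescan the
whole text, choosing uniformly (by reservoir sampling) among
the letters that follow an occurrence of that state. The
constants and the buffer size are assumptions, and the input
is assumed to be longer than K characters.

    /* Minimal sketch: order-K letter-level Markov text */
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>

    int main()
    {   const int K = 4, NOUT = 300;
        static char text[1000000];
        int n = fread(text, 1, sizeof(text) - 1, stdin);
        text[n] = 0;
        srand(time(0));
        char state[K + 1];                  /* random seed K-gram */
        memcpy(state, text + rand() % (n - K), K);
        state[K] = 0;
        fputs(state, stdout);
        for (int out = 0; out < NOUT; out++)
        {   int seen = 0;
            char next = 0;
            for (char *q = text; (q = strstr(q, state)) != 0; q++)
                if (q[K] != 0 && rand() % ++seen == 0)
                    next = q[K];            /* keep with prob 1/seen */
            if (seen == 0)                  /* state ends the text */
                break;
            putchar(next);
            memmove(state, state + 1, K - 1);
            state[K - 1] = next;
        }
        return 0;
    }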
12.
How would you use the tools and techniques of
Section 15.1
to generate (order-0 or non-Markov) random text?
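One sketch: keep every input word, duplicates and all, in a
single array; a uniform random pick from that array then
reproduces each word's observed frequency.

    /* Minimal sketch: order-0 random text from the input words */
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    int main()
    {   vector<string> words;
        string w;
        while (cin >> w)
            words.push_back(w);
        srand(time(0));
        for (int i = 0; i < 100 && !words.empty(); i++)
            cout << words[rand() % words.size()] << "\n";
        return 0;
    }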
13.
The program for generating word-level Markov text
is at this book's web site.
Try it on some of your documents.
14.
How could you use hashing to speed up the Markov program?
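A sketch of the key step: hash the k words of the current
state together, so that finding the successors of a state
becomes an expected constant-time chain search rather than a
binary search over the sorted word array. The names here are
assumptions, not the book's code.

    /* Minimal sketch: combine the k state words into one
       hash-table index */
    unsigned int hashstate(char *word[], int k, unsigned int nhash)
    {   unsigned int h = 0;
        for (int i = 0; i < k; i++)
            for (char *p = word[i]; *p; p++)
                h = 31 * h + *p;
        return h % nhash;
    }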
15.
Shannon's quote in Section 15.3
describes the algorithm he used to construct Markov text;
implement that algorithm in a program.
It gives a good approximation to the Markov frequencies,
but not the exact form;
explain why not.
Implement a program that scans the entire string
from scratch to generate each word
(and thereby uses the true frequencies).
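A sketch of Shannon's procedure at order 1 over a word list:
print the current word, jump to a random position, scan
forward (wrapping around) to the next occurrence of that
word, and take its successor as the new state. The sketch
makes the approximation visible: it takes the first
occurrence after a random starting point rather than a
uniformly random occurrence.

    /* Minimal sketch: Shannon's construction at the word level */
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    int main()
    {   vector<string> word;
        string w;
        while (cin >> w)
            word.push_back(w);
        int n = word.size();
        if (n < 2)
            return 0;
        srand(time(0));
        string cur = word[rand() % n];
        for (int out = 0; out < 100; out++)
        {   cout << cur << " ";
            int start = rand() % n, i;
            for (i = 0; i < n; i++)         /* wrap around the text */
                if (word[(start + i) % n] == cur)
                    break;
            cur = word[(start + i + 1) % n];  /* the successor */
        }
        cout << "\n";
        return 0;
    }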
16.
How would you use the techniques of this column
to assemble a word list for a dictionary
(the problem that Doug McIlroy faced in Section 13.8)?
How would you build a spelling checker without using
a dictionary?
How would you build a grammar checker without using
rules of grammar?
17.
Investigate how techniques related to k-gram analysis
are used in applications such as speech recognition
and data compression.