
Solutions
(To Column 15 of
Programming Pearls)

1. Many document systems provide a way to strip out all formatting commands and see a raw text representation of the input. When I played with the string duplication program of Section 15.2 on long texts, I found that it was very sensitive to how the text was formatted. The program took 36 seconds to process the 4,460,056 characters in the King James Bible, and the longest repeated string was 269 characters in length. When I normalized the input text by removing the verse number from each line, so long strings could cross verse boundaries, the longest string increased to 563 characters, which the program found in about the same amount of run time.
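One way to do that normalization is a little filter that deletes the verse reference from each line. This sketch assumes the reference is the first space-delimited field on each line; the details depend on the particular e-text.

#include <iostream>
#include <string>

// delete the first space-delimited field (the verse reference)
// from each input line; lines without a space pass through intact
int main()
{   std::string line;
    while (std::getline(std::cin, line)) {
        std::string::size_type i = line.find(' ');
        if (i == std::string::npos)
            std::cout << line << "\n";
        else
            std::cout << line.substr(i + 1) << "\n";
    }
    return 0;
}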

3. Because this program performs many searches for each insertion, very little of its time is going to memory allocation. Incorporating the special-purpose memory allocator reduced the processing time by about 0.06 seconds, for a ten-percent speedup in that phase, but only a two-percent speedup for the entire program.
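For reference, here is a sketch in the spirit of that allocator: it hands out fixed-size nodes carved from large blocks, so the system malloc is called once per NODEGROUP insertions rather than once per node. The node layout and the constants are illustrative.

#include <stdlib.h>

/* an illustrative node type; the real program's node holds a
   word pointer, a count, and a next pointer */
struct node { char *word; int count; node *next; };
#define NODESIZE  sizeof(struct node)
#define NODEGROUP 1000
int nodesleft = 0;
char *freenode;

void *pmalloc(int size)
{   void *p;
    if (size != NODESIZE)        /* odd sizes go to the system malloc */
        return malloc(size);
    if (nodesleft == 0) {        /* carve nodes from a fresh block */
        freenode = (char *) malloc(NODEGROUP * NODESIZE);
        nodesleft = NODEGROUP;
    }
    nodesleft--;
    p = (void *) freenode;
    freenode += NODESIZE;
    return p;
}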

5. We could add another map to the C++ program to associate a sequence of words with each count. In the C program we might sort an array by count, then iterate through it (because some words tend to have large counts, that array will be much smaller than the input file). For typical documents, we might use key indexing and keep an array of linked lists for counts in the range (say) 1..1000.
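Here is a minimal sketch of the first approach, assuming the counts are in a map<string, int> named M as in the Section 15.1 program: invert M into a multimap keyed by count, then walk it in decreasing order.

#include <iostream>
#include <map>
#include <string>
using namespace std;

// print words in decreasing order of count: invert the word-to-count
// map M into a count-to-word multimap, then walk it backwards
void printbycount(map<string, int>& M)
{   multimap<int, string> bycount;
    map<string, int>::iterator i;
    for (i = M.begin(); i != M.end(); ++i)
        bycount.insert(make_pair(i->second, i->first));
    multimap<int, string>::reverse_iterator j;
    for (j = bycount.rbegin(); j != bycount.rend(); ++j)
        cout << j->second << " " << j->first << "\n";
}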

7. Textbooks on algorithms warn about inputs like "aaaaaaaa", repeated thousands of times. I found it easier to time the program on a file of newlines. The program took 2.09 seconds to process 5,000 newlines, 8.90 seconds on 10,000 newlines, and 37.90 seconds on 20,000 newlines. Each doubling of the input multiplies the run time by about 4.26, slightly more than the factor of 4 that purely quadratic growth would give; that is consistent with roughly n log₂ n comparisons, each at an average cost proportional to n. A more realistic bad case can be constructed by appending two copies of a large input file.

8. The subarray a[i..i+M] represents M+1 strings. Because the array is sorted, we can quickly determine how many characters those M+1 strings have in common by calling comlen on the first and last strings:

comlen(a[i], a[i+M])
Code at this book's web site implements this algorithm.
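For concreteness, here is one way the scan might look, assuming the sorted suffix array a[0..n-1] and the comlen function of Section 15.2.

#include <stdio.h>

int comlen(char *p, char *q)     /* from Section 15.2 */
{   int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

/* print the longest string that occurs at least M+1 times in the
   input, given the sorted suffix array a[0..n-1] */
void printlongest(char *a[], int n, int M)
{   int i, len, maxlen = 0, maxi = 0;
    for (i = 0; i + M < n; i++) {
        len = comlen(a[i], a[i+M]);
        if (len > maxlen) {
            maxlen = len;
            maxi = i;
        }
    }
    printf("%.*s\n", maxlen, a[maxi]);
}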

9. Read the first string into the array c, note where it ends, terminate it with a null character, then read in the second string and terminate it. Sort as before. When scanning through the sorted array, compare adjacent suffixes, and use an "exclusive or" to ensure that precisely one of the two starts before the transition point between the strings.
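Here is a sketch of that scan, with the hypothetical pointer trans marking where the second string begins inside c; since the suffix pointers all point into c, the inequality test on the addresses plays the role of the exclusive or.

#include <stdio.h>

int comlen(char *p, char *q);   /* as in Section 15.2 */

/* print the longest string that appears in both inputs, given the
   sorted suffix array a[0..n-1] over the combined array c, where
   trans points at the first character of the second string */
void printcommon(char *a[], int n, char *trans)
{   int i, len, maxlen = 0, maxi = 0;
    for (i = 0; i < n - 1; i++)
        if ((a[i] < trans) != (a[i+1] < trans)) {
            /* exactly one suffix of this adjacent pair starts
               before the transition point, so one comes from
               each input string */
            len = comlen(a[i], a[i+1]);
            if (len > maxlen) {
                maxlen = len;
                maxi = i;
            }
        }
    printf("%.*s\n", maxlen, a[maxi]);
}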

14. This function hashes a sequence of k words terminated by null characters:

unsigned int hash(char *p)
{   unsigned int h = 0;
    int n;
    for (n = k; n > 0; p++) {
        h = MULT * h + *p;
        if (*p == 0)    /* a null marks the end of each word */
            n--;
    }
    return h % NHASH;
}
A program at this book's web site uses this hash function to replace the binary search in the Markov text generation algorithm, which reduces the O(n log n) time to O(n), on the average. The program uses a list representation for the elements in the hash table, so it adds only nwords additional 32-bit integers, where nwords is the number of words in the input.
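A sketch of that list representation, with illustrative names bin, next and MAXWORDS: bin[h] indexes the first word whose k-word key hashes to h, and next[i] chains the rest, so the only per-word storage added is the integer next[i].

#define NHASH    29989        /* illustrative table size */
#define MAXWORDS 800000
extern char *word[MAXWORDS];  /* the word array of Section 15.3 */
extern int k;                 /* the Markov order */
extern unsigned int hash(char *p);
int bin[NHASH], next[MAXWORDS];

void buildchains(int nword)   /* word[0..nword-1] already read */
{   int i, h;
    for (h = 0; h < NHASH; h++)
        bin[h] = -1;          /* -1 marks an empty chain */
    for (i = 0; i + k <= nword; i++) {
        h = hash(word[i]);    /* hash the k-word key at word[i] */
        next[i] = bin[h];     /* push position i onto its chain */
        bin[h] = i;
    }
}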

Copyright © 1999 Lucent Technologies. All rights reserved. Wed 15 Nov 2000