CS 232: Data Structures and Problem Solving

Spring 1999

Assignment 9 (2 Week Lab)

Hashing for Efficient File Compression

First Review is Due Monday April 26 (5 pts)

Final Program and Report is Due Wednesday May 5 (35 pts)

For this assignment, you will be implementing the LZ78 compression algorithm, which uses a dictionary (ie. a hash table) for storing and indexing common sequences of characters. A rudimentary hash table (the CodeWriter class) has already been implemented and will simplify the completion of part 1. Part 2 has you extending the Table class in order to perform experiments on hash table performance. Finally, you will extend the Table class so that its table automatically resizes when necessary.

Implement the LZ78 encoding algorithm as described online. While we will go through an example in class, this reference provides a sketch of the algorithm. For this assignment, a code word is represented by an integer. In particular, the integer 0 represents the empty string and each subsequent code word (created by adding one to the previous code) represents each string added to the dictionary (i.e. the hash table). The files in ~miller/cs232/lab9 contain source files that will also help you get started. You should copy them to your own lab9 directory.

Among these files is the Press.java file, which has a good demonstration of the CodeWriter classe. The CodeWriter class allows you to easily write out compressed integer--character pairs to a file as required by the LZ78 algorithm. The Table class serves as the dictionary where each string is a key and the integer code is the object to be retrieved. You will need to use the Integer wrapper class to store integers in the table.

Also supplied is the decoding implementation Depress.java. Currently the demonstration Press program produces a file that can be successfully interpreted by the Depress program. However, you will note that no file compression is obtained in its current form.

Once you have implemented the encoding algorithm, run it on at least four similar text files of different sizes. Graph your results, where the x-ordinate is the file size and the y-ordinate is the compression ratio (the size of the original file divided by the size of the compressed file). Describe the results and draw an appropriate conclusion in a short, written summary (about two paragraphs).

From our discussion in class, you should recall that the linear probing method may produce large clusters around certain hash indexes and thus cause a large number of collisions. To investigate their frequency, write a display member function that writes out the following for each hash entry:

  1. the key value (the string)
  2. the code value (the integer)
  3. how far from the original hash value the entry is placed
Your display function should also provide the average for the last figure. Using your display function, collect average results for at least five data points, three of which should be from the following situations:
  1. using a hash table just large enough to handle a large text file
  2. using a hash table with twice the size as above
  3. using a hash table with four times the size as the first

Graph your results, where the x-ordinate is the hash table size and the y-ordinate is the average number of probes needed to find the key. Also plot the number of predicted probes based on Knuth's formula (p. 559 in the text). Describe the results and draw an appropriate conclusion in a short, written summary (about two paragraphs).

Finally, extend the Table class so that the hash table is automatically resized when it becomes more than half full. This extension is not quite as simple as it sounds since each old item needs to be rehashed into the new array.

Submission

The file compression program should be written and turned in by the first review. For the final submission, turn in all modified source files as well as the written reports (must be word-processed and exquisitely edited) and graphs.