Computer Science 132: Analysis of Algorithms

Assignment 2

The code for Part 1 and Part 2 is due Tuesday October 6 before 23:00

Written summaries are due at class time.

For this assignment, you will be implementing the LZ78 compression algorithm, which uses a dictionary (ie. a hash table) for storing and indexing common sequences of characters. A rudimentary hash table (the CodeTable class) has already been implemented and will simplify the completion of part 1. Part 2 has you extending the CodeTable class in order to perform experiments on hash table performance. Finally, you will extend CodeTable so that its table automatically resizes when necessary.

Part 1

Implement the LZ78 encoding algorithm as described online. While we will go through an example in class, this reference provides a comprehensive description of the algorithm. The files in ~miller/cs231/hw2 contain source files that will also help you get started. You should copy them to your own hw2 directory.

Among these files is the Press.cpp file, which has a good demonstrations of the CodeTable and CodeStream classes. The CodeTable implements the dictionary for the LZ78 algorithm. The CodeStream class allows you to easily write out compressed integer--code pairs to a file as required by the LZ78 algorithm. Running the make command compiles the CodeTable and CodeStream classes and the source Press.cpp to create the executable program called press.

Also supplied is the decoding implementation Depress.cpp (the executable is called depress). Currently the demonstration press program produces a file that can be successfully interpreted by the depress program. However, you will note that no file compression is obtained in its current form.

Once you have implemented the encoding algorithm, run it on at least three similar text files of different sizes. Data collection (and only data collection) may be done in groups of up to three people. Graph your results, where the x-ordinate is the file size and the y-ordinate is the compression ratio. Describe the results and draw an appropriate conclusion in a short, written summary (about one to two paragraphs). If you collected data in a group, be sure to list your collaborators. The code and written summary must be your own work.

Part 2

From our discussion in class, you should recall that the linear probing method may produce large clusters around certain hash indexes and thus cause a large number of collisions. To investigate their frequency, write a display member function that writes out the following for each hash entry:

  1. the key value (the string)
  2. the code value (the integer)
  3. how far from the original hash value the entry is placed
Your display function should also provide the average for the last figure. Using your display function, collect average results for the following three situations:
  1. using a hash table just large enough to handle a large text file
  2. using a hash table with twice the size as above
  3. using a hash table with four time the size as the first

Graph your results, where the x-ordinate is the hash table size and the y-ordinate is the average number of probes needed to find the key. Describe the results and draw an appropriate conclusion in a short, written summary (about one to two paragraphs). Again, data collection may be done in a group of up to three people, but the code and written summary must be your own work.

Finally, extend the CodeTable class so that the hash table is automatically resized when it becomes more than half full.

Submission

The files will be automatically collected shortly after midnight in the early morning after the due date. However, written summaries (use a word-processor!) should be turned in at class time on the due date.