CS 132 Lab 12 and 13

CS 132: Intro to Computer Science II

Spring 1999

Assignments 12 and 13

File I/O, Recursive Sorting, and Class Design

Part I is due Monday, April 26

Part II is due Monday, May 3

Some authors are well known for using their expansive vocabularies when writing their masterpieces. Other authors aim for simplicity and choose only among the most basic words to express their ideas. These differences in style are generally revealed by word usage statistics, which might include a list of the most commonly used words, the average length of a word, and how often specific words are used as a percentage of all words. For the next two labs, you will create a program that performs word usage analysis on some literary texts, whose files can be found in the directory ~miller/cs132/lab12.

Part I

For the first week, you will design and implement a class, called WordList, for storing and manipulating the list of words and their statistics from a text. The class should be able to add a new word when it is not already in the list and increment the word's usage count when it is. Your class should also provide a method for sorting the words in the list by their usage. Since the list may be long, it should use one of the O(n lg n) sorts discussed in class. Finally, the class should provide methods for retrieving the average word length, a word given its usage rank, and the usage statistics (count and percentage) of a given word.
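The requirements above could be sketched roughly as follows. This is only one possible shape, not the expected design: the names WordListSketch, Entry, countOf, percentageOf, wordAtRank, and sortByUsage are all hypothetical, merge sort is assumed as the O(n lg n) sort, the average length is assumed to be taken over every occurrence of every word, and words are compared exactly as given (so "The" and "the" would count separately). It also leans on the java.util collection classes, which your own version need not use.

```java
import java.util.ArrayList;
import java.util.List;

/** A minimal sketch of a word list; every name here is hypothetical. */
class WordListSketch {
    /** One word together with its usage count. */
    static class Entry {
        final String word;
        int count = 1;
        Entry(String word) { this.word = word; }
    }

    private List<Entry> entries = new ArrayList<>();
    private long totalWords = 0;    // every occurrence added so far
    private long totalLetters = 0;  // summed length of every occurrence

    /** Adds a new word, or increments the count of a word already present. */
    void add(String word) {
        totalWords++;
        totalLetters += word.length();
        Entry e = find(word);
        if (e == null) entries.add(new Entry(word));
        else e.count++;
    }

    /** Sequential search; returns null when the word is absent. */
    private Entry find(String word) {
        for (Entry e : entries) {
            if (e.word.equals(word)) return e;
        }
        return null;
    }

    int countOf(String word) {
        Entry e = find(word);
        return e == null ? 0 : e.count;
    }

    double percentageOf(String word) {
        return totalWords == 0 ? 0.0 : 100.0 * countOf(word) / totalWords;
    }

    /** Assumption: averaged over all occurrences, not over distinct words. */
    double averageWordLength() {
        return totalWords == 0 ? 0.0 : (double) totalLetters / totalWords;
    }

    /** Word at the given rank (0 = most used); call sortByUsage() first. */
    String wordAtRank(int rank) { return entries.get(rank).word; }

    /** Sorts by descending count with a top-down merge sort. */
    void sortByUsage() { entries = mergeSort(entries); }

    private static List<Entry> mergeSort(List<Entry> list) {
        if (list.size() <= 1) return list;
        int mid = list.size() / 2;
        List<Entry> left = mergeSort(new ArrayList<>(list.subList(0, mid)));
        List<Entry> right = mergeSort(new ArrayList<>(list.subList(mid, list.size())));
        List<Entry> merged = new ArrayList<>(list.size());
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            // >= keeps entries with equal counts in their original order
            if (left.get(i).count >= right.get(j).count) merged.add(left.get(i++));
            else merged.add(right.get(j++));
        }
        while (i < left.size()) merged.add(left.get(i++));
        while (j < right.size()) merged.add(right.get(j++));
        return merged;
    }
}
```

The sequential search in find keeps add simple at the cost of a linear scan per word, which the grading criteria below explicitly allow.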

In designing your class, you will decide on methods that will later be useful in analyzing and reporting word usage statistics. Because you will be graded on the usefulness of your design, you should take the opportunity to discuss your class design with the instructor before implementing it. You may find the design and implementation of the Card and DeckOfCards classes to be good starting points for your WordList class. While it is conceivable that the word statistics class could directly handle file I/O, you should address I/O issues outside of your class when you complete Part II. After implementing your class, make sure that you test it thoroughly. A test driver can be written in the static main method and should be left in your code when it is submitted.

In addition to the usual criteria, the grade for Part I will be based on the following:

Usefulness of the class design
Completeness of the class documentation (see the RowOfColors class in lab5 as a model)
Efficiency of the class implementation (although sequential search in an unsorted list is okay)
Adequacy of the test driver

Submission for Part I

Turn in the source of your WordList class (and any classes you developed for it) as well as a hardcopy of the script running it.

Part II

For the second week, your assignment is to use your class from Part I to write the word statistics program, which you should call Analyze.java. It should read from the text file specified on the command line, store words and calculate statistics using your word list class, and write a report in HTML that summarizes the statistics. For example, if java Analyze cask.poe were typed at the Unix prompt, the program would read and analyze the file cask.poe and write the HTML report to cask.poe.html.

The filename given on the command line is available in the string array that is passed as a parameter to the main method. Reading from a file word by word is simplified by using the StreamTokenizer class. The following code is an example of using both of these:

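  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.io.Reader;
  import java.io.StreamTokenizer;

  public class Analyze {

    public static void main(String[] args) {
      // The file to read is the first command-line argument
      if (args.length < 1) {
        System.out.println("Usage: java Analyze <filename>");
        return;
      }
      String fileName = args[0];
      System.out.println("Reading from...  " + fileName);

      // try-with-resources closes the reader when the block exits
      try (Reader r = new BufferedReader(new FileReader(fileName))) {
        StreamTokenizer st = new StreamTokenizer(r);

        // Treat an apostrophe as part of a word, so "don't" stays one word
        st.wordChars('\'', '\'');
        // Treat a period as its own token rather than as part of a word
        st.ordinaryChar('.');

        int code;
        while ((code = st.nextToken()) != StreamTokenizer.TT_EOF) {
          if (code == StreamTokenizer.TT_WORD) {
            // st.sval is a single word that can be written out
            //    (or added to the word list class)
            System.out.println(st.sval);
          }
        }
      }
      catch (IOException e) {
        // Covers both a file that cannot be opened and a failed read
        System.out.println("Couldn't read " + fileName + ": " + e.getMessage());
      }
    }
  }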
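To see what the wordChars and ordinaryChar calls in the example above change, it can help to run the same tokenizer settings on a short string instead of a file. The sketch below does that with a StringReader. The class name TokenizerDemo and the helper words are made up for illustration and are not part of the assignment.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that applies the tokenizer settings to a string. */
class TokenizerDemo {
    /** Returns the word tokens found in text, in order of appearance. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.wordChars('\'', '\'');  // apostrophe counts as a word character
        st.ordinaryChar('.');      // period becomes a token of its own
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    result.add(st.sval);
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail, but nextToken declares IOException
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Don't stop. It's the end."));
    }
}
```

By default StreamTokenizer treats the apostrophe as a quote character and the period as part of a number, so without these two calls contractions would be swallowed as quoted strings and a word at the end of a sentence would keep its trailing period.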

At minimum, your report should contain:

A list of the most commonly used words
The usage percentage of each of the commonly used words
The average word length
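A report with these three items might be produced along the following lines. This is only a sketch: the class HtmlReportSketch, the method writeReport, its parallel-array parameters, and the title text are all assumptions made for illustration, and the layout (one table plus one paragraph) is just one reasonable choice. In your program the values would come from your WordList class and the Writer would wrap the output file.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Locale;

/** Hypothetical report writer; names and layout are illustrative only. */
class HtmlReportSketch {
    /**
     * Writes a small HTML page to out. words[i] is the word at rank i + 1
     * and percentages[i] is its share of all words in the text.
     */
    static void writeReport(Writer out, String title, String[] words,
                            double[] percentages, double averageLength) {
        PrintWriter pw = new PrintWriter(out);
        pw.println("<html>");
        pw.println("<head><title>" + title + "</title></head>");
        pw.println("<body>");
        pw.println("<h1>" + title + "</h1>");
        pw.println("<table border=\"1\">");
        pw.println("<tr><th>Rank</th><th>Word</th><th>Usage (%)</th></tr>");
        for (int i = 0; i < words.length; i++) {
            pw.println("<tr><td>" + (i + 1) + "</td><td>" + words[i] + "</td><td>"
                    + String.format(Locale.ROOT, "%.2f", percentages[i])
                    + "</td></tr>");
        }
        pw.println("</table>");
        pw.println("<p>Average word length: "
                + String.format(Locale.ROOT, "%.2f", averageLength) + "</p>");
        pw.println("</body>");
        pw.println("</html>");
        pw.flush();
    }

    public static void main(String[] args) {
        // Made-up numbers, used only to show the shape of the output
        StringWriter sw = new StringWriter();
        writeReport(sw, "Word usage report", new String[] {"the", "of"},
                new double[] {6.5, 3.25}, 4.2);
        System.out.print(sw);
    }
}
```

Writing to a Writer rather than opening the file inside the method keeps the formatting code easy to test on its own, since a StringWriter can stand in for the file.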

Here's an excellent reference for basic HTML.

Submission for Part II

Turn in hardcopies of your Java source code (Analyze.java), the HTML code that your program generated for the file "wuther.txt", and a printout of Netscape's display of this HTML code.