Some authors are well known for using their expansive vocabularies when writing their masterpieces. Other authors aim for simplicity and choose only from the most basic words to express their ideas. These differences in style are generally revealed by word usage statistics, which might include a list of the most commonly used words, the average length of a word, and the percentage of the text that specific words account for. For the next two labs, you will create a program that performs word usage analysis on some literary texts, whose files can be found in the directory ~miller/cs132/lab12.
For the first week, you will design and implement a class, called WordList, for storing and manipulating the list of words and their statistics from a text. The class should be able to add a new word when it is not already in the list and increment its usage count when it already is. Your class should also provide a method for sorting the list of words by their usage. Since the list may be long, it should use one of the O(n lg n) sorts discussed in class. Finally, the class should provide methods for retrieving the average word length, a word given its usage rank, and usage statistics (count and percentage) given a word.
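The operations listed above can be sketched as follows. This is only one possible design, not the required one: the class name WordListSketch, every method name, the choice of merge sort as the O(n lg n) sort, and the decision to average word length over every occurrence (rather than over distinct words) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; all names and design choices here are assumptions.
class WordListSketch {
    // One distinct word together with its usage count.
    private static class Entry {
        final String word;
        int count = 1;
        Entry(String word) { this.word = word; }
    }

    private List<Entry> entries = new ArrayList<>();
    private int totalWords = 0;  // number of occurrences added so far
    private int totalChars = 0;  // sum of word lengths over all occurrences

    // Add a new word, or increment its count if it is already in the list.
    // A linear search keeps the sketch short; matching is case-sensitive.
    void add(String word) {
        totalWords++;
        totalChars += word.length();
        for (Entry e : entries) {
            if (e.word.equals(word)) { e.count++; return; }
        }
        entries.add(new Entry(word));
    }

    // Sort by descending usage count with a top-down merge sort, O(n lg n).
    void sortByUsage() { entries = mergeSort(entries); }

    private static List<Entry> mergeSort(List<Entry> list) {
        if (list.size() <= 1) return list;
        int mid = list.size() / 2;
        List<Entry> left = mergeSort(new ArrayList<>(list.subList(0, mid)));
        List<Entry> right = mergeSort(new ArrayList<>(list.subList(mid, list.size())));
        List<Entry> merged = new ArrayList<>(list.size());
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            // ">=" keeps ties in their original (insertion) order
            if (left.get(i).count >= right.get(j).count) merged.add(left.get(i++));
            else merged.add(right.get(j++));
        }
        while (i < left.size()) merged.add(left.get(i++));
        while (j < right.size()) merged.add(right.get(j++));
        return merged;
    }

    // Average length over every occurrence (an assumption; see lead-in).
    double averageWordLength() {
        return totalWords == 0 ? 0.0 : (double) totalChars / totalWords;
    }

    // Word at a 1-based usage rank; meaningful only after sortByUsage().
    String wordAtRank(int rank) { return entries.get(rank - 1).word; }

    // Usage count of a word, or 0 if it never appeared.
    int countOf(String word) {
        for (Entry e : entries) if (e.word.equals(word)) return e.count;
        return 0;
    }

    // Share of all occurrences taken by this word, as a percentage.
    double percentageOf(String word) {
        return totalWords == 0 ? 0.0 : 100.0 * countOf(word) / totalWords;
    }
}
```

The linear search in add makes each insertion O(n); your own design may well choose a different internal representation.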
In designing your class, you will decide on methods that will later be useful in analyzing and reporting word usage statistics. Because you will be graded on the usefulness of your design, you should take the opportunity to discuss your class design with the instructor before implementing it. You may find the design and implementation of the Card and DeckOfCards classes to be good starting points for your WordList class. While it is conceivable that the word statistics class could directly handle file I/O, you should address I/O issues outside of your class when you complete Part II. After implementing your class, make sure that you test it thoroughly. A test driver can be written in the static main method and should be left in your code when it is submitted.
In addition to the usual criteria, the grade for Part I will be based on the following:
The filename given on the command line is available in the string array that is passed as a parameter to the main method. Reading from a file word by word is simplified by using the StreamTokenizer class. The following code is an example of using both of these:
At minimum, your report should contain:
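import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

// Place this method inside your own class.
public static void main(String[] args) {
    String fileName = args[0];  // first command-line argument
    System.out.println("Reading from... " + fileName);
    // try-with-resources closes the file even if reading fails
    try (Reader r = new BufferedReader(new FileReader(fileName))) {
        StreamTokenizer st = new StreamTokenizer(r);
        st.wordChars('\'', '\'');  // keep apostrophes inside words, as in "don't"
        st.ordinaryChar('.');      // return '.' as its own token, not as part of a word
        int code;
        while ((code = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (code == StreamTokenizer.TT_WORD) {
                // st.sval is a single word that can be written out
                // (or added to the word list class)
                System.out.println(st.sval);
            }
        }
    } catch (IOException e) {
        System.out.println("Couldn't open or read " + fileName);
    }
}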
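To see what the two tokenizer settings in the example do without needing a text file, the same loop can be run over an in-memory string. The sketch below is only an illustration: the class name TokenizerSketch and the helper method words are hypothetical, and StringReader stands in for the FileReader used above.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for experimenting with StreamTokenizer settings.
class TokenizerSketch {
    // Returns the word tokens found in the given text, in order.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.wordChars('\'', '\'');  // apostrophes become part of a word
        st.ordinaryChar('.');      // '.' becomes a separate one-character token
        List<String> result = new ArrayList<>();
        try {
            int code;
            while ((code = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (code == StreamTokenizer.TT_WORD) {
                    result.add(st.sval);
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints each word of the sample sentence on its own line
        for (String w : words("Don't stop. It's fine.")) {
            System.out.println(w);
        }
    }
}
```

Note that by default the tokenizer also treats '/' as the start of a comment and '"' as a string delimiter, so literary texts containing those characters may need further settings.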
Here's an excellent reference for basic HTML.