Intelligent Information Retrieval

Intelligent Information Retrieval
CSC 575

Assignment 4
Due: Wednesday, March 8, 2023

Part 1. Text Categorization

In this problem you will explore two approaches for automatic text categorization (a.k.a. document classification). You will first implement a basic multinomial Naïve Bayesian classifier. Then you will implement a simple document classifier based on the Rocchio method (adapted for classification). For both methods you will use this modified subset of BBC News stories to train and test your classifiers. The assignment dataset consists of an already processed set of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Each document is represented as a feature vector over a set of 6167 features (after tokenization, stop word removal, stemming, and frequency filtering). Each document belongs to one of 5 classes (business, entertainment, politics, sport, tech). The data has already been split (80%, 20%) into training and test data both of which are provided in document x term matrix format containing a row for each document and a column for each feature in the vocabulary (note, however, that the first column of each row represents the documeet id in the original dataset, and this column should be removed to get the doc-term frequency matrices). The class labels for instances in the training and the test data are provided in separate files. Also, the set of features used in the data (representing the columns in the train and test matrices) is provided as a separate file. Please be sure to read the readme.txt file in the distribution for a more detailed description of the data.

[40 pts] Document Categorization Using Naïve Bayesian Classifier)

Implement your own version of a Naïve Bayesian classifier. Your program should consist of two separate components: training and test. The training component, NB_Train(train_matrix, train_class_labels), learns from the training data by computing the priors for each class and the conditional (posterior) class probabilities associated with different features (in this case terms in the vocabulary). The test component, NB_classify(posteriors, priors, test_instance), uses the learned model (posterior and prior probabilities returned by the training function) to classify a test instance that was not used for training. Your program should work for any number of classes and not just for the number of classes in the provided data set. Note: to avoid floating point underflow, you should use the sum of log probabilities when computing the class labels using Bayes's rule (for uniformity, please use natural log when computing the log probabilities).

Your task is to train your classifier using the provided training data. Then you need to evaluate your classifier using the provided test data. For the latter task you should create an evaluation function to measure the accuracy of your trained classifier on the full test set. This function will apply your learned model returned from the training component to predict a class label for each of the test instances in the test data. Then for each test instance, it compares the actual test class label to the predicted class label. Your evaluation function should return the overall Classification Accuracy (ratio of correct predictions to the number of test instances in the full test set). It should also have the option to output the predicted class labels together with the sum of log probabilities for a portion of the test data (this is to illustrate that your classifier is working properly). Your program should also provide the capability to view the learned class probabilities for each class label and the posteriors for specified terms (again, this is to provide optional intermediate results demonstrating the proper functioning of your training function). In addition to your code, provide the following outputs as part of your submission (clearly identify which output corresponds to which part of the problem):
1. After training your classifier, provide the prior probabilities associated with each class label.
2. Based on your trained model, provide the posteriors (conditional class probabilities) for each of the following features: "hollywood", "footbal", "comput", "labour". Note that this is Pr(feature | class) for each class. Here is an example output using the term "tax".
3. Provide an output showing actual and predicted class labels for the first 20 document instances (instances 0-19) in the test data along with the log probability values for each test instance. For example, see this sample output for test instances 30-39.
4. Using your evaluation function provide the overall classification accuracy across all test instances.
[20 pts] Document Categorization Using Rocchio Method

Implement your own version of the Rocchio method for text classification discussed in class (See: Text Categorization - Part 1). This is a simple algorithm, but often produces good results in text classification tasks. You will train and test your implementation using the same BBC News dataset used in the previous problem. As before, you should separate the training function from the classification function. The training part for the classifier can be implemented as a function that takes as input the training data matrix and the training labels, returning the prototype vectors for each class. The classification part can be implemented as another function that would take as input the prototypes returned from the training function and the instance to be classified. This function should measure Cosine similarity of the test instance to each prototype vector. Your output should indicate the predicted class for the test instance and the similarity values of the instance to each of the category prototypes. Also, as in the previous problem, you should create an evaluation function that iterates over all test instances and report the overall classification accuracy. Your program should work for any number of classes and not just for the number of classes in the provided data set. Note: for this problem do not perform TFxIDF transformation of the data. You should use the original raw term counts in the provided training and test data sets. In addition to your code, provide the following outputs as part of your submission (clearly identify which output corresponds to which part of the problem):
1. Provide an output showing actual and predicted class labels for the first 20 document instances (instances 0-19) in the test data along with the similarity of the test instance to the most similar class prototype. For example, see this sample output for test instances 30-39.
2. Using your evaluation function provide the overall classification accuracy across all test instances.

Part 2. Hyperlink Analysis

In this part of the assignment you will explore well-known, simple, yet powerful hyperlink analysis methods. Specifically, you will implement your own versions of the HITS algorithm (Hubs and Authorities) and the PageRank algorithm. For this problem, you should not use external libraries that directly compute PageRank or hubs/authorities for a graph. For the purposes of this problem, you can represent a directed graph with an adjacency list (for each node p in the graph representing a Web page, the adjacency list for that node includes all the nodes to q to which p points i.e., there is a directed edge from p to q). For example, the simple directed graph, G, given below:

may be represented using the folllowing adjacency list:

G = {
    'A': ['B','C'],
    'B': ['C','D'],
    'C': ['B'],
    'D': ['C']
}

A bidirectional edge between nodes p and q in the graph represents two edges, one for p to q and one from q to p. Alternatively, you can use an adjacency matric to represent the graph (however, this type of representation may not be scalable for large graphs). Your program should work for any input graph, given an appropriate representation for the graph.

[20 pts] PageRank

Implement the iterative version of PageRank algorithm given in lecture notes on Hyperlink Analysis (see slide 46). Your implementation should include a function PageRank(G, alpha, num_iter) that takes the representation of any graph G, the PageRank damping factor, alpha, and maximum number of iterations, num_iter, and returns a vector R of PageRank values for each node in the graph (after the max number of iterations). Your function may optionally provide intermediate PageRank vectors at the end of each iteration (after the normalization step). For instance, using PageRank(G, 0.25, 5) with the above example graph may produce something similar to this sample output.

To demonstrate your program, submit the output of your program on the following graph, G1, using alpha = 0.5 and number of iterations = 5.

The adjacency list for the graph is given below:
```
G1 = {
    'A': ['C','D','E'],
    'B': ['C','D','E','F'],
    'C': ['D','F'],
    'D': ['C','E'],
    'E': ['A','C'],
    'F': ['A','B','C','D','E']
}
```
[20 pts] HITS Algorithm (Hubs and Authorities)

Implement the iterative version of HITS algorithm given in lecture notes on Hyperlink Analysis (see slide 27). The main component of your implementation should be a function, HITS(G, num_iter) which takes, as input, the representation of any graph G and the maximum number of iterations, num_iter, and returns two vectors: a vector A representing authority values for each node and a vector H representing the hub values for each node. These vectors represent the updated hub and authority values for the nodes in the graph after num_iter update cycles. Your function may optionally display intermediate hub and authority vectors at the end of each iteration (after the normalization step). You may find it helpful to have helper functions for updating the hub and the authority vectors during each iteration in the main function. As an example, using HITS(G, 5) with the aforementined example graph, G, may produce something similar to this sample output.

To demonstrate your program, submit the output of your program on the graph G1, from the previous problem, using num_iter = 5.

Back to Assignments/strong>