Intelligent Information Retrieval
CSC 575

Assignment 4
Due: Wednesday, March 8, 2023

Part 1. Text Categorization

In this problem you will explore two approaches to automatic text categorization (a.k.a. document classification). You will first implement a basic multinomial Naïve Bayesian classifier. Then you will implement a simple document classifier based on the Rocchio method (adapted for classification). For both methods you will use this modified subset of BBC News stories to train and test your classifiers. The assignment dataset consists of an already processed set of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. Each document is represented as a feature vector over a set of 6167 features (after tokenization, stop word removal, stemming, and frequency filtering). Each document belongs to one of 5 classes (business, entertainment, politics, sport, tech). The data has already been split (80%, 20%) into training and test sets, both of which are provided in document x term matrix format, containing a row for each document and a column for each feature in the vocabulary (note, however, that the first column of each row represents the document id in the original dataset; this column should be removed to obtain the doc-term frequency matrices). The class labels for instances in the training and the test data are provided in separate files. Also, the set of features used in the data (representing the columns in the train and test matrices) is provided as a separate file. Please be sure to read the readme.txt file in the distribution for a more detailed description of the data.
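
For reference, the loading step might look like the minimal sketch below. The file names used here (train_matrix.txt, test_matrix.txt, train_classes.txt, test_classes.txt) are placeholders, not the actual names in the distribution; consult readme.txt for the real file names and formats and adjust accordingly.

import numpy as np

# NOTE: file names are placeholders; use the actual names described in readme.txt.
train = np.loadtxt("train_matrix.txt", dtype=int)   # one row per training document
test = np.loadtxt("test_matrix.txt", dtype=int)     # one row per test document

# The first column of each row is the original document id, not a term count,
# so drop it to obtain the doc-term frequency matrices.
train_matrix = train[:, 1:]
test_matrix = test[:, 1:]

# Class labels for the training and test documents are provided in separate files;
# adjust the parsing if the label files also carry a document id column.
train_labels = np.loadtxt("train_classes.txt", dtype=str)
test_labels = np.loadtxt("test_classes.txt", dtype=str)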
 
[40 pts] Document Categorization Using Naïve Bayesian Classifier

Implement your own version of a Naïve Bayesian classifier. Your program should consist of two separate components: training and test. The training component, NB_Train(train_matrix, train_class_labels), learns from the training data by computing the priors for each class and the conditional (posterior) class probabilities associated with different features (in this case, terms in the vocabulary). The test component, NB_classify(posteriors, priors, test_instance), uses the learned model (posterior and prior probabilities returned by the training function) to classify a test instance that was not used for training. Your program should work for any number of classes, not just the number of classes in the provided data set. Note: to avoid floating point underflow, you should use the sum of log probabilities when computing the class labels using Bayes's rule (for uniformity, please use the natural log when computing the log probabilities).
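
As a starting point, a minimal sketch of these two components is shown below. It assumes the doc-term matrices are NumPy arrays of raw term counts and uses add-one (Laplace) smoothing; both the data structures and the smoothing choice are illustrative assumptions, not requirements of the assignment.

import numpy as np

def NB_Train(train_matrix, train_class_labels):
    """Multinomial Naive Bayes training sketch (assumes add-one smoothing)."""
    labels = np.asarray(train_class_labels)
    n_docs, n_terms = train_matrix.shape
    priors, posteriors = {}, {}
    for c in sorted(set(labels)):
        docs_c = train_matrix[labels == c]
        priors[c] = len(docs_c) / n_docs
        term_counts = docs_c.sum(axis=0)          # total count of each term in class c
        # Pr(term | class) with Laplace smoothing to avoid zero probabilities
        posteriors[c] = (term_counts + 1) / (term_counts.sum() + n_terms)
    return priors, posteriors

def NB_classify(posteriors, priors, test_instance):
    """Return the predicted class and the sum of log probabilities for each class."""
    scores = {}
    for c in priors:
        # log Pr(c) + sum over terms of tf * log Pr(term | c), using natural logs
        scores[c] = np.log(priors[c]) + np.sum(test_instance * np.log(posteriors[c]))
    return max(scores, key=scores.get), scores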
 
Your task is to train your classifier using the provided training data. Then you need to evaluate your classifier using the provided test data. For the latter task you should create an evaluation function to measure the accuracy of your trained classifier on the full test set (a minimal sketch of such a function follows the output list below). This function will apply the learned model returned by the training component to predict a class label for each of the instances in the test data. Then, for each test instance, it compares the actual class label to the predicted class label. Your evaluation function should return the overall Classification Accuracy (the ratio of correct predictions to the number of test instances in the full test set). It should also have the option to output the predicted class labels together with the sum of log probabilities for a portion of the test data (this is to illustrate that your classifier is working properly). Your program should also provide the capability to view the learned class probabilities for each class label and the posteriors for specified terms (again, this is to provide optional intermediate results demonstrating the proper functioning of your training function). In addition to your code, provide the following outputs as part of your submission (clearly identify which output corresponds to which part of the problem):
 
 
 
		
1. After training your classifier, provide the prior probabilities associated with each class label.

2. Based on your trained model, provide the posteriors (conditional class probabilities) for each of the following features: "hollywood", "footbal", "comput", "labour". Note that this is Pr(feature | class) for each class. Here is an example output using the term "tax".

3. Provide an output showing the actual and predicted class labels for the first 20 document instances (instances 0-19) in the test data, along with the log probability values for each test instance. For example, see this sample output for test instances 30-39.

4. Using your evaluation function, provide the overall classification accuracy across all test instances.
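
The evaluation function described above might look like the following sketch; the function name, the show_first option, and the assumption that NB_classify returns per-class log-probability scores are all illustrative.

def evaluate_NB(posteriors, priors, test_matrix, test_labels, show_first=0):
    """Return overall accuracy; optionally print predictions for the first few test instances."""
    correct = 0
    for i, (instance, actual) in enumerate(zip(test_matrix, test_labels)):
        predicted, scores = NB_classify(posteriors, priors, instance)
        correct += int(predicted == actual)
        if i < show_first:
            print(i, "actual:", actual, "predicted:", predicted, "log probs:", scores)
    return correct / len(test_labels)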
 
 
 
[20 pts] Document Categorization Using Rocchio Method

Implement your own version of the Rocchio method for text classification discussed in class (see: Text Categorization - Part 1). This is a simple algorithm, but it often produces good results in text classification tasks. You will train and test your implementation using the same BBC News dataset used in the previous problem. As before, you should separate the training function from the classification function. The training part of the classifier can be implemented as a function that takes as input the training data matrix and the training labels and returns the prototype vectors for each class. The classification part can be implemented as another function that takes as input the prototypes returned from the training function and the instance to be classified. This function should measure the Cosine similarity of the test instance to each prototype vector. Your output should indicate the predicted class for the test instance and the similarity values of the instance to each of the category prototypes. Also, as in the previous problem, you should create an evaluation function that iterates over all test instances and reports the overall classification accuracy (a minimal sketch of these functions appears after the output list below). Your program should work for any number of classes and not just the number of classes in the provided data set. Note: for this problem do not perform a TFxIDF transformation of the data. You should use the original raw term counts in the provided training and test data sets. In addition to your code, provide the following outputs as part of your submission (clearly identify which output corresponds to which part of the problem):
 
 
		
1. Provide an output showing the actual and predicted class labels for the first 20 document instances (instances 0-19) in the test data, along with the similarity of each test instance to the most similar class prototype. For example, see this sample output for test instances 30-39.

2. Using your evaluation function, provide the overall classification accuracy across all test instances.
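
One possible shape for the training, classification, and evaluation functions is sketched below. It uses the mean raw term-count vector of each class as its prototype, which is a common adaptation of Rocchio for classification; check the lecture formulation (e.g., summed or normalized prototype vectors) and adjust if it differs.

import numpy as np

def Rocchio_Train(train_matrix, train_class_labels):
    """Return a prototype vector (here, the mean term-count vector) for each class."""
    labels = np.asarray(train_class_labels)
    return {c: train_matrix[labels == c].mean(axis=0) for c in sorted(set(labels))}

def cosine_sim(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def Rocchio_classify(prototypes, test_instance):
    """Return the predicted class and the similarity of the instance to each prototype."""
    sims = {c: cosine_sim(test_instance, proto) for c, proto in prototypes.items()}
    return max(sims, key=sims.get), sims

def evaluate_Rocchio(prototypes, test_matrix, test_labels):
    """Overall classification accuracy of the Rocchio classifier on the test set."""
    predictions = [Rocchio_classify(prototypes, inst)[0] for inst in test_matrix]
    return float(np.mean([p == a for p, a in zip(predictions, test_labels)]))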
 
 
Part 2. Hyperlink Analysis

In this part of the assignment you will explore well-known, simple, yet powerful hyperlink analysis methods. Specifically, you will implement your own versions of the HITS algorithm (Hubs and Authorities) and the PageRank algorithm. For this problem, you should not use external libraries that directly compute PageRank or hubs/authorities for a graph. For the purposes of this problem, you can represent a directed graph with an adjacency list (for each node p in the graph, representing a Web page, the adjacency list for that node includes all the nodes q to which p points, i.e., there is a directed edge from p to q). For example, a simple directed graph, G, may be represented using the following adjacency list:

G = {
    'A': ['B','C'],
    'B': ['C','D'],
    'C': ['B'],
    'D': ['C']
}

A bidirectional edge between nodes p and q in the graph represents two edges: one from p to q and one from q to p. Alternatively, you can use an adjacency matrix to represent the graph (however, this type of representation may not be scalable for large graphs). Your program should work for any input graph, given an appropriate representation for the graph.
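
If you do choose a matrix representation, an adjacency list such as G above can be converted into a 0/1 adjacency matrix along the lines of the sketch below (the helper name and node ordering are illustrative).

import numpy as np

def to_adjacency_matrix(G):
    """Convert an adjacency-list dict into a 0/1 adjacency matrix (row = source node)."""
    nodes = sorted(G)                                 # fix an ordering of the nodes
    index = {n: i for i, n in enumerate(nodes)}
    M = np.zeros((len(nodes), len(nodes)), dtype=int)
    for p, out_links in G.items():
        for q in out_links:
            M[index[p], index[q]] = 1                 # directed edge from p to q
    return nodes, M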
[20 pts] PageRank
Implement the iterative version of the PageRank algorithm given in the lecture notes on Hyperlink Analysis (see slide 46). Your implementation should include a function PageRank(G, alpha, num_iter) that takes the representation of any graph G, the PageRank damping factor alpha, and the maximum number of iterations num_iter, and returns a vector R of PageRank values for each node in the graph (after the maximum number of iterations). Your function may optionally provide intermediate PageRank vectors at the end of each iteration (after the normalization step). For instance, using PageRank(G, 0.25, 5) with the above example graph may produce something similar to this sample output.
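
A sketch of one common iterative formulation is shown below; it treats alpha as the teleportation component of the update, renormalizes the PageRank vector after every iteration, and assumes every node has at least one out-link (true for the example graphs). Conventions for alpha and for the normalization step vary, so verify the exact update rule against slide 46 and adjust the sketch accordingly.

def PageRank(G, alpha, num_iter):
    """Iterative PageRank sketch; confirm the update and normalization against slide 46."""
    nodes = list(G)
    N = len(nodes)
    R = {p: 1.0 / N for p in nodes}                    # start from a uniform vector
    for _ in range(num_iter):
        new_R = {p: alpha / N for p in nodes}          # teleportation component
        for p in nodes:
            for q in G[p]:                             # p links to q
                new_R[q] += (1 - alpha) * R[p] / len(G[p])
        total = sum(new_R.values())
        R = {p: new_R[p] / total for p in nodes}       # normalization step
    return R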
 
To demonstrate your program, submit its output on the following graph, G1, using alpha = 0.5 and number of iterations = 5.
 
 
  
 The
adjacency list for the graph is given below:
 
 
G1 = {
    'A': ['C','D','E'],
    'B': ['C','D','E','F'],
    'C': ['D','F'],
    'D': ['C','E'],
    'E': ['A','C'],
    'F': ['A','B','C','D','E']
}
 
[20 pts] HITS Algorithm (Hubs and Authorities)
Implement the iterative version of the HITS algorithm given in the lecture notes on Hyperlink Analysis (see slide 27). The main component of your implementation should be a function, HITS(G, num_iter), which takes as input the representation of any graph G and the maximum number of iterations num_iter, and returns two vectors: a vector A representing the authority values for each node and a vector H representing the hub values for each node. These vectors represent the updated hub and authority values for the nodes in the graph after num_iter update cycles. Your function may optionally display intermediate hub and authority vectors at the end of each iteration (after the normalization step). You may find it helpful to have helper functions for updating the hub and authority vectors during each iteration in the main function. As an example, using HITS(G, 5) with the aforementioned example graph, G, may produce something similar to this sample output.
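
The sketch below shows one common iterative formulation, normalizing the hub and authority vectors by their sums at the end of each iteration; confirm the update order and the normalization used in class (slide 27) and adjust if they differ.

def HITS(G, num_iter):
    """Iterative HITS sketch; confirm the normalization step against slide 27."""
    nodes = list(G)
    A = {p: 1.0 for p in nodes}        # authority scores
    H = {p: 1.0 for p in nodes}        # hub scores
    for _ in range(num_iter):
        # authority update: a(q) is the sum of the hub scores of pages pointing to q
        A = {q: sum(H[p] for p in nodes if q in G[p]) for q in nodes}
        # hub update: h(p) is the sum of the authority scores of the pages p points to
        H = {p: sum(A[q] for q in G[p]) for p in nodes}
        # normalization step (here: divide each vector by the sum of its values)
        a_sum, h_sum = sum(A.values()), sum(H.values())
        A = {q: A[q] / a_sum for q in A}
        H = {p: H[p] / h_sum for p in H}
    return A, H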
 
To demonstrate your program, submit its output on the graph G1, from the previous problem, using num_iter = 5.
 
 