
Assignment 2

Due: Thursday, October 24


For this assignment you will experiment with various classification models using subsets of some real-world data sets. In particular, you will use the K-Nearest-Neighbor algorithm to classify text documents, experiment with and compare classifiers that are part of the Scikit-learn machine learning package for Python, and you will also construct and apply a decision tree classifier using the Weka data mining software.


  1. K-Nearest-Neighbor (KNN) classification on Newsgroups [Dataset: newsgroups.zip]

For this problem you will use a subset of the 20 Newsgroups data set. The full data set contains 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, and has often been used for experiments in text applications of machine learning techniques, such as text classification and text clustering (see the description of the full dataset). The assignment data set contains a subset of 1000 documents and a vocabulary of terms. Each document belongs to one of two classes: Hockey (class label 1) or Microsoft Windows (class label 0). The data has already been split (80%, 20%) into training and test data. The class labels for the training and test data are provided in separate files. The training and test data contain a row for each term in the vocabulary and a column for each document; the values in the table are raw term frequencies. The data has already been preprocessed to extract terms, remove stop words, and perform stemming (so the vocabulary contains stems, not full terms). Please be sure to read the readme.txt file in the distribution.

Your tasks in this problem are the following: [Note: for this problem you should not use scikit-learn or any external libraries other than NumPy, standard Python libraries, and Matplotlib (if you would like to add some visualizations to your answers).]

  1. Create your own KNN classifier. Your classifier should accept as input the training data matrix, the training labels, the instance to be classified, and the value of K, and should return the predicted class for the instance along with the top K neighbors. Your classifier should work with Euclidean distance as well as Cosine similarity (see class examples). You may create two separate classifiers, or add this capability as a parameter of the classifier function. [A minimal sketch covering this task and the next appears after this list.]
  2. Create a function to compute the classification accuracy over the test data set (the ratio of correct predictions to the number of test instances). This function should call the classifier function on each test instance and in each case compare the actual test class label to the predicted class label.
  3. Run your accuracy function on a range of values for K in order to compare accuracy values for different numbers of neighbors. Do this using both Euclidean distance and the Cosine similarity measure. [For example, you can evaluate your classifiers on values of K from 1 through 20 and present the results as a table or a graph.]
  4. Using Python and NumPy, modify the training and test data sets so that term weights are converted to TFxIDF weights (instead of raw term frequencies). [See class notes on Text Categorization; a TFxIDF sketch also appears after this list.] Then rerun your evaluation on the range of K values (as above) and compare the results to the results obtained without TFxIDF weights.
  5. Discuss your observations based on the above experiments.
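
A minimal sketch of one way to structure tasks 1 and 2 is shown below. It assumes the term-document matrices have been transposed so that each row is a document, and that the labels are NumPy arrays of 0/1 integers; the function and parameter names are illustrative only, not required.

    import numpy as np

    def knn_classify(train, train_labels, instance, k, metric="euclidean"):
        """Return (predicted_class, indices_of_top_k_neighbors).

        `train`: 2-D array with one document per row (transpose the assignment
        matrices, which store one term per row); `instance`: 1-D array with the
        same number of columns as `train`.
        """
        if metric == "euclidean":
            # Smaller distance = more similar
            dists = np.sqrt(((train - instance) ** 2).sum(axis=1))
            neighbors = np.argsort(dists)[:k]
        else:
            # Cosine similarity: larger value = more similar
            sims = train.dot(instance) / (
                np.linalg.norm(train, axis=1) * np.linalg.norm(instance) + 1e-12)
            neighbors = np.argsort(sims)[::-1][:k]
        votes = train_labels[neighbors].astype(int)   # majority vote among the k neighbors
        return np.bincount(votes).argmax(), neighbors

    def knn_accuracy(train, train_labels, test, test_labels, k, metric="euclidean"):
        """Ratio of correct predictions to the number of test instances."""
        correct = sum(
            knn_classify(train, train_labels, test[i], k, metric)[0] == int(test_labels[i])
            for i in range(test.shape[0]))
        return correct / test.shape[0]

For task 3, knn_accuracy can then be called in a loop over K = 1 through 20 for each metric to produce the required table or graph.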
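
For task 4, one possible TFxIDF conversion is sketched below, again assuming a terms x documents matrix of raw frequencies. The IDF formulation shown (log2 of N over document frequency) is a common one; substitute the exact formula from the class notes if it differs.

    import numpy as np

    def to_tfidf(term_doc_matrix, idf=None):
        """Convert a raw term-frequency matrix (terms x documents) to TFxIDF weights.

        If `idf` is None it is computed from this matrix; for consistency,
        compute the IDF vector once from the training matrix and pass it in
        when converting the test matrix.
        """
        tf = term_doc_matrix.astype(float)
        if idf is None:
            n_docs = tf.shape[1]
            df = np.maximum((tf > 0).sum(axis=1), 1)   # document frequency of each term
            idf = np.log2(n_docs / df)
        return tf * idf[:, np.newaxis], idf            # broadcast idf down each term row

For example: train_tfidf, idf = to_tfidf(train_matrix) followed by test_tfidf, _ = to_tfidf(test_matrix, idf).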

  2. Classification using scikit-learn [Dataset: bank_data.csv]

For this problem you will experiment with various classifiers provided as part of the scikit-learn (sklearn) machine learning module, as well as with some of its preprocessing and model evaluation capabilities. [Note: if this module was not already installed as part of your Python distribution, you will need to obtain and install it first.] You will work with a modified subset of a real data set of customers for a bank. The data is provided in a CSV-formatted file whose first row contains the attribute names. The description of the different fields in the data is provided in this document.

Your tasks in this problem are the following:

  1. Load and preprocess the data using NumPy and preprocessing functions from scikit-learn. Specifically, you need to separate the target attribute from the portion of the data to be used for training and testing. You will need to convert the selected dataset into the standard spreadsheet format (scikit-learn functions generally assume that all attributes are in numeric form). Finally, split the transformed data into training and test sets (using an 80%-20% split). [For an illustration of some of these tasks, see the class examples; a loading/splitting sketch also appears after this list.]
  2. Run scikit-learn's KNN classifier on the test set. Note: in the case of KNN, you must first normalize the data so that all attributes are on the same scale (normalize so that the values fall between 0 and 1). Generate the confusion matrix (visualize it using Matplotlib) as well as the classification report. Experiment with different values of K and the weight parameter to see if you can improve accuracy (you do not need to provide the details of your experimentation, but provide a short discussion of what worked best). [A KNN sketch appears after this list.]
  3. Repeat the classification using scikit-learn's decision tree classifier and the naive Bayes (Gaussian) classifier. As above, generate the confusion matrix and classification report for each classifier. [See the decision tree / naive Bayes sketch after this list.]
  4. Discuss your observations based on the above experiments.
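
One possible way to load, encode, and split the data for step 1 is sketched below. It uses pandas purely for convenient loading and dummy-variable encoding; the column names "id" and "pep" are placeholders, so substitute the actual identifier and target attribute names from the field-description document. (In older scikit-learn versions, train_test_split lives in sklearn.cross_validation rather than sklearn.model_selection.)

    import pandas as pd
    from sklearn.model_selection import train_test_split

    bank = pd.read_csv("bank_data.csv")

    # "id" and "pep" are placeholder column names -- use the names given in
    # the field-description document for the identifier and target attributes.
    target = bank["pep"].values
    records = bank.drop(columns=["id", "pep"])

    # Standard spreadsheet / numeric form: expand each categorical attribute
    # into 0/1 dummy variables so every column is numeric.
    records = pd.get_dummies(records)

    x_train, x_test, y_train, y_test = train_test_split(
        records.values, target, test_size=0.2, random_state=33)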
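
For step 2, a sketch of min-max normalization, the KNN classifier, and the evaluation output follows; n_neighbors=5 and weights="distance" are just example settings from which to start your experimentation.

    import matplotlib.pyplot as plt
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix, classification_report

    # Min-max normalization so all attributes fall in [0, 1];
    # fit on the training data only, then apply to both splits.
    scaler = MinMaxScaler()
    x_train_n = scaler.fit_transform(x_train)
    x_test_n = scaler.transform(x_test)

    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # example settings
    knn.fit(x_train_n, y_train)
    y_pred = knn.predict(x_test_n)

    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    print(classification_report(y_test, y_pred))

    # Simple Matplotlib visualization of the confusion matrix
    plt.matshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()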
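
And for step 3, the decision tree and Gaussian naive Bayes classifiers can be evaluated in the same way (the unscaled training data can be used here, since neither model is distance-based):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import confusion_matrix, classification_report

    for name, model in [("Decision tree", DecisionTreeClassifier(random_state=33)),
                        ("Gaussian naive Bayes", GaussianNB())]:
        model.fit(x_train, y_train)
        pred = model.predict(x_test)
        print(name)
        print(confusion_matrix(y_test, pred))
        print(classification_report(y_test, pred))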

  3. Decision Tree Classification using Weka [Dataset: bank_data.csv]

For this problem you will experiment with Weka's decision tree classifier (See online resources for Week 4). You will be using the same data set as in the previous problem for training the classifier, but a set of unclassified instances (in the file: bank_new.csv) for prediction.

Your tasks in this problem are the following:

  1. As demonstrated in class, use the WEKA version of the "C4.5" classification algorithm to construct a decision tree based on the training data. In WEKA, the C4.5 algorithm is called J48 and is implemented by "weka.classifiers.trees.J48". You should use the GUI Explorer rather than the command line interface. Use 10-fold cross-validation to evaluate your model's accuracy. Record the final decision tree and the model accuracy statistics obtained from your model, and be sure to indicate the parameters you used in building your classification model. You can save the statistics and results by right-clicking the last result set in the "Result list" window and selecting "Save result buffer." You should also generate a screenshot of your tree by selecting the "Visualize tree" command from the same menu. Provide the decision tree together with the accuracy results from the cross-validation as part of your submission.

  2. Next, apply the classification model from the previous part to the new customers data set (bank_new.csv). For your convenience, this data has been saved in Weka's native ARFF format (bank_new.arff) after removing the ID attribute, and is ready to be loaded as the test set into Weka. Be sure to select "Output predictions" in the classifier evaluation options. Rank the new customers in decreasing order of their probability of responding positively to the offer. Note that you will first need to map the predictions back to the original customer "id" field for the new customers, so that it is clear which customer corresponds to which instance (this can be done using a spreadsheet program such as Excel, or with a short script such as the optional sketch below). Provide your resulting predictions for the 200 new cases and other supporting documentation as part of your submission.
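
If you prefer a script to a spreadsheet, the sketch below shows one way to join the predictions back to the customer IDs and rank them. It assumes you have copied Weka's "Output predictions" block into a CSV file (here called predictions.csv, a placeholder name) with columns named predicted and probability, listed in the same instance order as bank_new.csv; adjust the file names, column names, and positive class label to match your actual output.

    import pandas as pd

    # Placeholder file/column names: predictions.csv is assumed to hold Weka's
    # "Output predictions" block with a predicted-class column and a probability
    # column, in the same row order as bank_new.csv.
    preds = pd.read_csv("predictions.csv")
    customers = pd.read_csv("bank_new.csv")        # still contains the "id" field

    ranked = customers.assign(predicted=preds["predicted"].values,
                              probability=preds["probability"].values)

    # Keep the customers predicted to respond positively (the positive class is
    # assumed to be labelled YES) and rank them by probability, highest first.
    positives = ranked[ranked["predicted"].str.contains("YES", case=False, na=False)]
    positives = positives.sort_values("probability", ascending=False)
    positives.to_csv("ranked_predictions.csv", index=False)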

Notes on Submission: You must submit the (documented) code for your scripts, any output files, and your interactive Python sessions showing the results of your computations. The preferred option is to use an IPython Notebook (similar to the examples in class) to record your interactive session along with your comments/answers to the various questions, and to submit the notebook in HTML format (along with any auxiliary files). Another option is to copy and paste your results into a Word document and then add your discussion and answers as necessary. All files should be submitted as a single Zip archive via COL Web.


Copyright © 2013-2015, Bamshad Mobasher, DePaul University.