Assignment 2
Due: Thursday, October 24
For this assignment you will experiment with various classification models using
subsets of some real-world data sets. In particular, you will use the
K-Nearest-Neighbor algorithm to classify text documents, experiment with and
compare classifiers from the Scikit-learn machine learning package for Python,
and construct and apply a decision tree classifier using the Weka data
mining software.
- K-Nearest-Neighbor (KNN) classification on Newsgroups data [Dataset:
newsgroups.zip]
For this problem you will use a subset of the 20 Newsgroup data set.
The full data set contains 20,000 newsgroup documents, partitioned (nearly)
evenly across 20 different newsgroups, and has often been used for
experiments in text applications of machine learning techniques, such as
text classification and text clustering (see the
description of the full
dataset). The assignment data set contains a subset of 1000 documents
and a vocabulary of terms. Each document belongs to one of two classes:
Hockey (class label 1) and Microsoft Windows (class label 0). The data has
already been split (80%, 20%) into training and test data. The class labels
for the training and test data are also provided in separate files. The
training and test data contain a row for each term in the vocabulary and a
column for each document; the values in the table are raw term
frequencies. The data has already been preprocessed to extract terms, remove
stop words, and perform stemming (so the vocabulary contains stems, not full
terms). Please be sure to read the readme.txt file in the
distribution.
Your tasks in this problem are the following. [Note: for this
problem you should not use scikit-learn or any external libraries other
than NumPy, the standard Python libraries, and Matplotlib (if you would like to
add some visualizations to your answers).]
- Create your own KNN classifier. Your classifier should take as input
the training data matrix, the training labels, the instance to be
classified, and the value of K, and should return the predicted class for the
instance along with the top K neighbors. Your classifier should work with
Euclidean distance as well as Cosine similarity (see
class examples). You may create
two separate classifiers, or add this capability as a parameter of the
classifier function. [A sketch of one possible implementation appears after
this task list.]
- Create a function to compute the classification accuracy over the test
data set (the ratio of correct predictions to the number of test instances).
This function should call the classifier function on every test instance
and, in each case, compare the actual test class label to the predicted class
label.
- Run your accuracy function on a range of values for K in order to
compare accuracy values for different numbers of neighbors. Do this using
both Euclidean distance and the Cosine similarity measure. [For example,
you can evaluate your classifiers on values of K from 1 through 20 and
present the results as a table or a graph; see the sketch after this list.]
- Using Python and NumPy, modify the training and test data sets so that
term weights are converted to TFxIDF weights (instead of raw term
frequencies). [See class notes on Text Categorization; a sketch appears
after this list.] Then rerun your evaluation on the range of K values (as
above) and compare the results to those obtained without TFxIDF weights.
- Discuss your observations based on the above experiments.
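
The following is a minimal sketch of one possible classifier, not a required design: the function name, parameter names, and tie-breaking behavior are all illustrative. It assumes the training matrix has already been transposed to one row per document (the assignment files store one column per document) and that the labels are a NumPy integer array.

```python
import numpy as np

def knn_classify(train, labels, instance, k, metric="euclidean"):
    """Predict the class of `instance` from its K nearest training documents.

    train:    2-D array with one row per training document
    labels:   1-D integer array of class labels for the training documents
    instance: 1-D array of term weights for the document to classify
    k:        number of neighbors to consider
    metric:   "euclidean" (distance) or "cosine" (similarity)
    """
    if metric == "euclidean":
        dists = np.sqrt(((train - instance) ** 2).sum(axis=1))
        neighbors = np.argsort(dists)[:k]           # smallest distances first
    else:
        # Cosine similarity: larger values mean closer neighbors
        norms = np.linalg.norm(train, axis=1) * np.linalg.norm(instance)
        sims = train.dot(instance) / np.where(norms == 0, 1, norms)
        neighbors = np.argsort(sims)[::-1][:k]      # largest similarities first
    # Majority vote among the K neighbors (ties resolved by lowest label)
    predicted = np.bincount(labels[neighbors]).argmax()
    return predicted, neighbors
```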
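An accuracy function and the evaluation loop over K might then look like the sketch below, again assuming `train`, `test`, `train_labels`, and `test_labels` have been loaded with one row per document:

```python
def knn_accuracy(train, train_labels, test, test_labels, k, metric="euclidean"):
    """Ratio of correct predictions to the number of test instances."""
    correct = sum(
        knn_classify(train, train_labels, test[i], k, metric)[0] == test_labels[i]
        for i in range(test.shape[0]))
    return correct / float(test.shape[0])

# Compare both measures over a range of K values, e.g. K = 1 through 20
for metric in ("euclidean", "cosine"):
    scores = [knn_accuracy(train, train_labels, test, test_labels, k, metric)
              for k in range(1, 21)]
    print(metric, ["%.3f" % s for s in scores])
```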
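For the TFxIDF conversion, one common formulation weights each raw frequency by log2(N/df), where N is the total number of documents and df is the number of documents containing the term; use whichever variant appears in the class notes. A sketch assuming a terms-by-documents matrix, with the training IDF reused on the test data:

```python
import numpy as np

def to_tfidf(matrix, idf=None):
    """Convert a terms-by-documents matrix of raw counts to TFxIDF weights.

    If `idf` is None it is computed from `matrix`; pass in the training IDF
    when transforming the test matrix so both use the same term weights.
    """
    if idf is None:
        n_docs = matrix.shape[1]
        df = np.count_nonzero(matrix, axis=1)     # document frequency per term
        idf = np.log2(float(n_docs) / np.maximum(df, 1))
    return matrix * idf[:, None], idf

train_tfidf, idf = to_tfidf(train_matrix)         # train_matrix: terms x docs
test_tfidf, _ = to_tfidf(test_matrix, idf)        # reuse the training IDF
```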
- Classification using scikit-learn [Dataset:
bank_data.csv]
For this problem you will experiment with various classifiers provided as
part of the scikit-learn (sklearn) machine learning module, as well
as with some of its preprocessing and model evaluation capabilities. [Note:
if this module is not already installed as part of your Python
distribution, you will need to obtain and install it first.]
You will work with a modified subset of a real data set of customers of a
bank. The data is provided in a CSV-formatted file whose first row
contains the attribute names. The descriptions of the different
fields in the data are provided in this
document.
Your tasks in this problem are the following:
- Load and preprocess the data using NumPy and preprocessing functions
from scikit-learn. Specifically, you need to separate the target attribute
from the portion of the data to be used for training and testing. You will
also need to convert the selected dataset into the standard spreadsheet
format (scikit-learn functions generally assume that all attributes are in
numeric form, so categorical attributes must be converted into numeric dummy
variables). Finally, split the transformed data into training and test sets
(using an 80%-20% split). [For an illustration of some of these tasks, see
class examples and the sketch after this task list.]
- Run scikit-learn's KNN classifier on the test set. Note: in the case of
KNN, you must first normalize the data so that all attributes are on the
same scale (normalize so that the values are between 0 and 1). Generate the
confusion matrix (visualize it using Matplotlib) as well as the
classification report. Experiment with different values of K and the weight
parameter to see if you can improve accuracy (you do not need to provide the
details of your experimentation, but provide a short discussion of what
worked best). [A sketch appears after this task list.]
- Repeat the classification using scikit-learn's decision tree classifier
and the naive Bayes (Gaussian) classifier. As above, generate the confusion
matrix and classification report for each classifier (see the sketch after
this list).
- Discuss your observations based on the above experiments.
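
The sketch below illustrates one way to carry out the preprocessing step. It is not the only approach: pandas is used here for convenience, the target attribute is assumed to be a column named "pep" (substitute the actual name from the field-description document), and in very old scikit-learn versions train_test_split is imported from sklearn.cross_validation rather than sklearn.model_selection.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

bank = pd.read_csv("bank_data.csv")
bank = bank.drop(columns=["id"])       # the id field carries no predictive value

# Separate the target attribute ("pep" is assumed here) from the predictors
y = bank["pep"]
X = pd.get_dummies(bank.drop(columns=["pep"]))  # categorical -> numeric dummies

# 80%-20% train/test split (the random_state value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=33)
```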
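For the KNN step, a hedged sketch follows; the K value and weights setting shown are starting points to experiment with, not recommendations:

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report

# Rescale every attribute to the [0, 1] range; fit the scaler on the
# training data only, then apply the same transformation to the test data
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train_s, y_train)
pred = knn.predict(X_test_s)

cm = confusion_matrix(y_test, pred)
print(classification_report(y_test, pred))

# Simple Matplotlib visualization of the confusion matrix
plt.matshow(cm, cmap=plt.cm.Blues)
plt.colorbar()
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```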
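The decision tree and Gaussian naive Bayes runs follow the same pattern, reusing the split from the preprocessing sketch above (the entropy criterion is one option; the defaults also work):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

for clf in (DecisionTreeClassifier(criterion="entropy", random_state=33),
            GaussianNB()):
    clf.fit(X_train, y_train)     # [0, 1] scaling is not required for these
    pred = clf.predict(X_test)
    print(type(clf).__name__)
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
```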
- Decision Tree Classification using
Weka [Dataset:
bank_data.csv]
For this problem you will experiment with Weka's decision tree classifier
(see online resources for Week 4). You will use the same data set as in the
previous problem for training the classifier, but a separate set of
unclassified instances (in the file bank_new.csv) for prediction.
Your tasks in this problem are the following:
- As demonstrated in class, use the Weka version of the "C4.5" classification
algorithm to construct a decision tree based on the training data. In Weka, the C4.5 algorithm is
called J48 and is implemented by
"weka.classifiers.trees.J48". You
should use the GUI Explorer rather than the command line interface. Use 10-fold cross-validation to evaluate your model
accuracy. Record the final decision tree and the model accuracy statistics obtained from your
model, and be sure to indicate the parameters you used in building your classification model.
You can save the statistics and results by right-clicking the last result set in the
"Result list" window and selecting "Save result buffer." You should also generate a
screenshot of your tree by selecting the "Visualize tree" command from the same menu.
Provide the decision tree together with the accuracy results from the
cross-validation as part of your submission.
- Next, apply the classification model from the previous part to the new customer
data set (bank_new.csv). For your
convenience, this data has been saved in Weka's native ARFF format (bank_new.arff)
after removing the ID attribute, and is ready to be loaded as the test set into
Weka. Be sure to select "Output predictions" in the classifier evaluation
options. Rank the new customers in decreasing order of their probability of
responding positively to the offer. Note that
you will first need to map the predictions back to the original customer "id" field
for the new customers, so that it is clear which customer corresponds to which
instance (this could be done using a spreadsheet program such as
Excel, or with a short Python script as sketched below). Provide your resulting
predictions for the 200 new cases and other supporting
documentation as part of your submission.
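
If you prefer a script over a spreadsheet for the ranking step, a sketch along the following lines could work. The file name "predictions.csv" and its columns are hypothetical: they stand for the predicted class and positive-class probability copied out of Weka's output buffer, one row per instance in the same order as bank_new.csv.

```python
import pandas as pd

new_customers = pd.read_csv("bank_new.csv")   # still contains the id field
preds = pd.read_csv("predictions.csv")        # hypothetical: predicted, prob_yes

# Attach predictions by row order, then rank by the positive-class probability
ranked = new_customers[["id"]].assign(
    predicted=preds["predicted"].values,
    prob_yes=preds["prob_yes"].values,
).sort_values("prob_yes", ascending=False)

ranked.to_csv("ranked_new_customers.csv", index=False)
```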
Notes on Submission: You must submit the (documented) code
for your scripts, any output files, and your interactive Python sessions
showing the results of your computations. The preferred option is to use
IPython Notebook (similar to the examples in class) to record your
interactive session together with your comments and answers to the various
questions, and to submit the notebook in HTML format (along with any auxiliary
files). Another option is to copy and paste your sessions into a Word document
and then add your discussion and answers as necessary. All files should be
submitted as a single Zip archive via COL Web.