Assignment 4
Due: Sunday, November 6
For this assignment you will experiment with Principal Component Analysis as a
dimensionality reduction approach to assist in clustering high-dimensional data.
You will also experiment with item-based recommendation
for a joke recommender system.
- PCA for Reduced Dimensionality in Clustering [Dataset:
segmentation_data.zip]
For this problem you will use an image segmentation data set
for clustering. You will experiment with using PCA as an approach to reduce
dimensionality and noise in the data. You will compare the results of
clustering the data with and without PCA using the provided image class
assignments as the ground truth. The data set is divided into three files.
The file "segmentation_data.txt" contains data about images with each line
corresponding to one image. Each image is represented by 19 features (these
are the columns in the data and correspond to the feature names in the file
"segmentation_names.txt". The file "segmentation_classes.txt" contains the
class labels (the type of image) and a numeric class label for each of the
corresponding images in the data file. After clustering the image data, you
will use the class labels to measure completeness and homogeneity of the
generated clusters. The data set used in this problem is based on the
Image Segmentation data set at the UCI Machine Learning Repository.
Your tasks in this problem are the following:
- [5 pts] Load in the image data matrix (with rows
as images and columns as features). Also load in the numeric class labels
from the segmentation class file. Using your favorite method (e.g.,
sklearn's min-max scaler), perform min-max normalization on the data matrix
so that each feature is scaled to [0,1] range.
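A minimal sketch of this step, assuming the features file is comma-separated and that the numeric class label is the last (tab-separated) column of the classes file; both layouts are assumptions to verify against the actual files:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed layouts: comma-separated features; classes file pairs each image's
# class name with a numeric label (numeric label assumed to be the last column).
data = np.genfromtxt("segmentation_data.txt", delimiter=",")
classes = np.genfromtxt("segmentation_classes.txt", delimiter="\t", dtype=str)
labels = classes[:, -1].astype(int)

# Min-max normalization: scale each feature (column) to the [0, 1] range.
data_norm = MinMaxScaler().fit_transform(data)
```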
- [10 pts] Using the KMeans implementation in scikit-learn,
perform clustering on the
image data (use K = 7 in your clustering so that later we can compare the
clusters to the 7 pre-assigned image classes). Print the cluster centroids (use
some formatting so that they are visually understandable). To evaluate your
clusters, first perform Silhouette analysis on the clusters (compute
Silhouette values for all instances in the data, and then compute the
overall mean Silhouette value; optionally, you can provide a visualization
of the Silhouettes). Next, compare your 7
clusters to the 7 pre-assigned classes by computing the
Completeness and
Homogeneity values of the generated clusters.
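A sketch of the clustering and evaluation steps with scikit-learn, using `data_norm` and `labels` as loaded above:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, completeness_score, homogeneity_score

km = KMeans(n_clusters=7, n_init=10, random_state=0)
cluster_labels = km.fit_predict(data_norm)

# Centroids, formatted for readability (one row per cluster).
np.set_printoptions(precision=3, suppress=True)
print(km.cluster_centers_)

# Silhouette analysis: mean Silhouette value over all instances.
print("Mean Silhouette:", silhouette_score(data_norm, cluster_labels))

# External validation against the 7 pre-assigned classes.
print("Completeness:", completeness_score(labels, cluster_labels))
print("Homogeneity: ", homogeneity_score(labels, cluster_labels))
```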
- [10 pts] Do your own experiments with the number of clusters to
see if a different value of K results in more cohesive clustering based on
Silhouette analysis. Please do not provide all your clustering results, but
you should include the best result according to your analysis and provide a
brief discussion of why this particular clustering was selected.
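One way to organize these experiments is a simple loop over candidate values of K (a sketch; only the best result should appear in your write-up):
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Record the mean Silhouette value for a range of K values.
scores = {}
for k in range(2, 13):
    preds = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data_norm)
    scores[k] = silhouette_score(data_norm, preds)

best_k = max(scores, key=scores.get)
print("Best K by mean Silhouette:", best_k, scores[best_k])
```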
- [10 pts] Perform PCA on the normalized image data matrix. You may use the
linear algebra package in NumPy or the decomposition module in scikit-learn
(the latter is much more efficient).
Analyze the principal components to determine the number, r, of PCs needed
to capture at least 95% of the variance in the data. Provide a scree
plot of the PC variances. Then use these r components
as features to transform the data into a reduced dimension space.
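A sketch using scikit-learn's decomposition module (the 0.95 threshold follows the task description):
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the variance distribution.
pca = PCA().fit(data_norm)

# Smallest r whose cumulative explained variance reaches 95%.
cum_var = np.cumsum(pca.explained_variance_ratio_)
r = int(np.argmax(cum_var >= 0.95)) + 1
print("r =", r)

# Scree plot of per-component variances.
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance")
plt.title("Scree plot")
plt.show()

# Transform the data into the reduced r-dimensional space.
data_reduced = PCA(n_components=r).fit_transform(data_norm)
```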
- [5 pts] Perform KMeans again, but this time on
the lower-dimensional transformed data. Then compare the Silhouette values as
well as the Completeness and
Homogeneity values of the new clusters. Compare these results with those
obtained on the full data in part b.
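The evaluation mirrors part b, now on `data_reduced` (sketch, reusing the imports and `labels` from above):
```python
km_r = KMeans(n_clusters=7, n_init=10, random_state=0)
preds_r = km_r.fit_predict(data_reduced)

print("Mean Silhouette:", silhouette_score(data_reduced, preds_r))
print("Completeness:", completeness_score(labels, preds_r))
print("Homogeneity: ", homogeneity_score(labels, preds_r))
```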
- Item-Based Joke Recommendation [Dataset:
jokes.zip]
For this problem you will use a modified version of the item-based
recommender algorithm from Ch. 14 of Machine Learning in Action and use it
on joke ratings data based on the
Jester Online Joke
Recommender System. The modified version of the code is provided in the
module itemBasedRec.py. Most of the
module will be used as is, but you will add some additional functionality.
The data set contains two files. The file "modified_jester_data.csv"
contains the ratings on 100 jokes by 1000 users (each row is a user
profile). The ratings have been normalized to be between 1 and 21 (a
20-point scale), with 1 being the lowest rating. A zero indicates a missing
rating. The file "jokes.csv" contains the joke ids mapped
to the actual text of the jokes.
Your tasks in this problem are the following (please also see comments
for the function stubs in the provided module):
- [15 pts] Load in the joke ratings data and the joke text data into appropriate data
structures. Use the "recommend" function to provide the top 5 joke
recommendations for the user with id 4, using both Pearson
and cosine similarity measures. Note the differences. Use the standard item-based collaborative
filtering (based on the rating prediction function "standEst").
Next, find the top 5 recommendations for the user with id 25, this time
using Pearson similarity only, with both the standard estimator
and the SVD-based version (using "svdEst"
as the prediction engine) to generate these recommendations. Note the
differences. When outputting recommendations, you should show both the id and the text of the recommended jokes
(in decreasing order of predicted rating) as
well as the predicted ratings for each.
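A sketch of this step, assuming the provided module keeps the Ch. 14 names (`recommend`, `standEst`, `svdEst`, `pearsSim`, `cosSim`) and that `recommend` returns (item, predicted rating) pairs; the CSV parsing also assumes an "id,text" layout for "jokes.csv". Check these against the actual module and files:
```python
import numpy as np
import itemBasedRec as ibr

# Ratings: 1000 users x 100 jokes, 0 = missing rating.
ratings = np.genfromtxt("modified_jester_data.csv", delimiter=",")

# Joke id -> text (assumed "id,text" layout; the text may itself contain commas).
jokes = {}
with open("jokes.csv", encoding="utf-8") as f:
    for line in f:
        jid, text = line.strip().split(",", 1)
        jokes[int(jid)] = text

dataMat = np.asmatrix(ratings)  # the Ch. 14 code operates on numpy matrices

# Top-5 for user 4 with the standard estimator, Pearson then cosine similarity.
# Note: recommended item ids are column indices and may be offset by one
# relative to the ids in jokes.csv; adjust as needed.
for sim in (ibr.pearsSim, ibr.cosSim):
    for item, score in ibr.recommend(dataMat, 4, N=5, simMeas=sim,
                                     estMethod=ibr.standEst):
        print(item, round(float(score), 2), jokes.get(item, ""))

# Top-5 for user 25, Pearson similarity, standard vs. SVD-based estimator.
for est in (ibr.standEst, ibr.svdEst):
    for item, score in ibr.recommend(dataMat, 25, N=5, simMeas=ibr.pearsSim,
                                     estMethod=est):
        print(item, round(float(score), 2), jokes.get(item, ""))
```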
- [15 pts] Complete the definition for the function "test".
This function iterates over all users and for
each performs evaluation (by calling the provided "cross_validate_user"
function) and returns the error information necessary to compute Mean Absolute
Error (MAE). Use this function to perform evaluation (with a 20%
test ratio for each user), comparing MAE results using the rating prediction function "standEst" with results
using the "svdEst"
prediction function (in both cases using the Pearson similarity measure). Note that
this may take several minutes depending on your computational environment. [Note: See comments provided in the module
for hints on accomplishing these tasks.]
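A minimal sketch of "test", assuming `cross_validate_user(dataMat, user, test_ratio, estMethod, simMeas)` returns the total absolute error and the count of held-out ratings for that user; verify the actual signature and return values against the stub in the module:
```python
def test(dataMat, test_ratio, estMethod, simMeas):
    # Accumulate absolute error and held-out-rating counts over all users.
    total_error, total_count = 0.0, 0
    for user in range(dataMat.shape[0]):
        error, count = ibr.cross_validate_user(dataMat, user, test_ratio,
                                               estMethod, simMeas)
        total_error += error
        total_count += count
    mae = total_error / total_count
    print("MAE:", mae)
    return mae

# Compare the standard and SVD-based estimators with Pearson similarity.
test(dataMat, 0.2, ibr.standEst, ibr.pearsSim)
test(dataMat, 0.2, ibr.svdEst, ibr.pearsSim)
```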
- [15 pts] Write a new function "print_most_similar_jokes" which
outputs the most similar jokes (based on user ratings) to a specified query
joke. Your function should take as input the joke ratings
data, a query joke id, a parameter k for the number of similar
jokes, and a
similarity metric function. It should output the text of the query joke as well as the
texts of the top k most similar jokes in decreasing order of similarity (you
should also provide the similarity values). Test your function as follows:
* Show the top 3 most similar jokes to the joke with id 9 using Pearson similarity.
* Show the top 3 most similar jokes to the joke with id 9 using cosine similarity.
[Note:
see comments at the end of the
provided module as well as comments for the provided stub function.]
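A sketch of one possible implementation, reusing the rated-by-both-users overlap pattern from "standEst" and the assumed module names from part a:
```python
def print_most_similar_jokes(dataMat, jokes, queryJoke, k, metric=ibr.pearsSim):
    # Rank all other jokes by similarity of their rating columns to the query.
    print("Query joke:", jokes.get(queryJoke, ""))
    sims = []
    for j in range(dataMat.shape[1]):
        if j == queryJoke:
            continue
        # Use only users who rated both jokes (0 = missing), as in standEst.
        overlap = np.nonzero(np.logical_and(dataMat[:, queryJoke].A > 0,
                                            dataMat[:, j].A > 0))[0]
        if len(overlap) == 0:
            continue
        sims.append((j, metric(dataMat[overlap, queryJoke], dataMat[overlap, j])))
    for j, s in sorted(sims, key=lambda t: t[1], reverse=True)[:k]:
        print(round(float(s), 3), jokes.get(j, ""))

# Tests from the task description:
print_most_similar_jokes(dataMat, jokes, 9, 3, metric=ibr.pearsSim)
print_most_similar_jokes(dataMat, jokes, 9, 3, metric=ibr.cosSim)
```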
- [15 pts] The implementation of
item-based collaborative filtering provided in the module is not scalable
since for each prediction it attempts to compute pairwise similarities among
all items. Develop your own
item-based collaborative filtering recommender that uses a model-based
approach (separating the training and the prediction tasks). In the
training component, item-item similarities for all pairs of items are
computed and stored in an appropriate data structure such as a pairwise
similarity matrix. Your training function
should be able to use different similarity functions (passed as a parameter)
including cosine similarity or Pearson correlation. The prediction (or
estimation) function should take as parameters a target user, an item, a
value of k, and the similarity matrix computed in the
training phase. It should then return
the predicted rating on the target item for the target user. The predicted
rating should be the weighted average of the target user's ratings
on the k most similar items to the target item (obtained
from the similarity matrix). Demonstrate that your function works by
computing predicted ratings for users 4 and 25, using k = 10, on the top two items recommended
to each user in part a (using both Pearson and cosine similarities).
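A sketch of the two components; the names `train` and `predict` are illustrative, and the similarity computation reuses the overlap pattern from the module:
```python
def train(dataMat, simMeas=ibr.pearsSim):
    # Training: precompute the symmetric item-item similarity matrix once.
    n = dataMat.shape[1]
    simMat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            overlap = np.nonzero(np.logical_and(dataMat[:, i].A > 0,
                                                dataMat[:, j].A > 0))[0]
            if len(overlap) > 0:
                simMat[i, j] = simMat[j, i] = simMeas(dataMat[overlap, i],
                                                      dataMat[overlap, j])
    return simMat

def predict(dataMat, user, item, k, simMat):
    # Prediction: weighted average of the user's ratings on the k items most
    # similar to the target item, looked up in the precomputed matrix.
    rated = np.nonzero(dataMat[user, :].A.ravel() > 0)[0]
    neighbors = sorted(rated, key=lambda j: simMat[item, j], reverse=True)[:k]
    num = sum(simMat[item, j] * float(dataMat[user, j]) for j in neighbors)
    den = sum(abs(simMat[item, j]) for j in neighbors)
    return num / den if den > 0 else 0.0

# Example call with k = 10; item id 27 is a placeholder for one of the
# items actually recommended in part a.
simMat = train(dataMat, ibr.pearsSim)
print(predict(dataMat, 4, 27, 10, simMat))
```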
- [Extra Credit - 10 pts] Modify the "cross_validate_user"
and "test" functions as necessary to use the new
version of the prediction function (from part d). First test
the prediction accuracy of your prediction function (similarly to part b, above)
using both cosine and Pearson similarity measures. Next, provide a plot of
cross-validation accuracies across a range of values of k
(running the "test" function for each value of k).
Your plot may look similar to this
example. Next, modify the "recommend" function to use your new
prediction function. Using the best observed value of k
from your plot demonstrate the functionality of your recommender by
generating top 3 recommendations for users 4 and 25.
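A sketch of the plotting loop, assuming your modified evaluation function (here called `test_model_based(dataMat, test_ratio, k, simMat)`, a hypothetical name) returns the MAE for a given k:
```python
import matplotlib.pyplot as plt

ks = [1, 3, 5, 10, 15, 20, 30, 50]
# test_model_based is the part-d version of "test"; name and signature are
# illustrative, not from the provided module.
maes = [test_model_based(dataMat, 0.2, k, simMat) for k in ks]

plt.plot(ks, maes, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("MAE")
plt.title("Cross-validation error vs. k")
plt.show()
```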
Notes on Submission: You must submit your Jupyter Notebook
(similar to examples in class) which includes your documented code, results of
your interactions, and any discussions or explanations of the results. Please
organize your notebook and label sections so that it's clear what parts of the notebook correspond
to which problems in the assignment (submissions that are not
well-organized, not well-documented, or are difficult to read will be penalized). Please submit the notebook in both IPYNB
and HTML formats (along with any auxiliary files). Do not compress or
zip your submission files; each file should be submitted independently. Your assignment should be submitted
via D2L.