Assignment 4
Due: Saturday, November 23
For this assignment you will experiment with Principal Component Analysis as a
dimensionality reduction approach to assist in clustering high-dimensional data.
You will also perform association rule mining using the implementation provided
in the textbook (Machine Learning in Action, Chapter 11). Finally, you'll experiment with item-based recommendation
for a joke recommender system.
- PCA for Reduced Dimensionality in Clustering [Dataset:
segmentation_data.zip]
For this problem you will use an image segmentation data set
for clustering. You will experiment with using PCA as an approach to reduce
dimensionality and noise in the data. You will compare the results of
clustering the data with and without PCA using the provided image class
assignments as the ground truth. The data set is divided into three files.
The file "segmentation_data.txt" contains data about images with each line
corresponding to one image. Each image is represented by 19 features (these
are the columns in the data and correspond to the feature names in the file
"segmentation_names.txt". The file "segmentation_classes.txt" contains the
class labels (the type of image) and a numeric class label for each of the
corresponding images in the data file. After clustering the image data, you
will use the class labels to measure completeness and homogeneity of the
generated clusters. The data set used in this problem is based on the
Image Segmentation data set at the UCI Machine Learning Repository.
Your tasks in this problem are the following:
- Load in the image data matrix (with rows
as images and columns as features). Also load in the numeric class labels
from the segmentation class file. Using your favorite method (e.g.,
sklearn's min-max scaler), perform min-max normalization on the data matrix
so that each feature is scaled to [0,1] range.
- Next, perform Kmeans clustering on the
image data (since there are a total of 7 pre-assigned image classes, you should
use K = 7 in your clustering). Use Euclidean distance as
your distance measure for the clustering. Print the cluster centroids (use
some formatting so that they are visually understandable). Compare your 7
clusters to the 7 pre-assigned classes by computing the
Completeness and
Homogeneity values of the generated clusters.
- Perform PCA on the normalized image data matrix. You may use the
linear algebra package in Numpy or the Decomposition module in scikit-learn
(the latter is much more efficient).
Analyze the principal components to determine the number, r, of PCs needed
to capture at least 95% of variance in the data. Then use these r components
as features to transform the data into a reduced dimension space. [See the
PCA Clustering
Notebook from class for an example of how these steps are performed.]
- Perform Kmeans again, but this time on
the lower dimensional transformed data. Then, compute the Completeness and
Homogeneity values of the new clusters.
- Discuss your observations based on the
comparison of the two clustering results. (A sketch of the above steps appears
below.)
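The following is a minimal sketch of this pipeline using scikit-learn. The file
delimiters (a comma-separated data file, a whitespace-separated class file with
the numeric label as the last token on each line) are assumptions about the
archive's format; adjust them to match the actual files.

```python
# Minimal sketch of the Problem 1 pipeline; file formats are assumptions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, homogeneity_score

# Load the image data (one image per row, 19 feature columns); the
# comma delimiter is an assumption about segmentation_data.txt.
data = np.genfromtxt('segmentation_data.txt', delimiter=',')

# Each line of segmentation_classes.txt is assumed to hold a class name
# followed by a numeric label; keep only the numeric label.
labels = np.array([int(float(line.split()[-1]))
                   for line in open('segmentation_classes.txt')])

# Min-max normalize every feature to the [0, 1] range.
X = MinMaxScaler().fit_transform(data)

# Kmeans with K = 7 (scikit-learn's KMeans uses Euclidean distance).
km = KMeans(n_clusters=7, random_state=0).fit(X)
np.set_printoptions(precision=2, suppress=True)
print(km.cluster_centers_)                      # formatted centroids
print('Completeness:', completeness_score(labels, km.labels_))
print('Homogeneity: ', homogeneity_score(labels, km.labels_))

# PCA: find the smallest r that captures at least 95% of the variance.
pca = PCA().fit(X)
r = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95)) + 1
print('r =', r)

# Transform into the r-dimensional space and cluster again.
X_r = PCA(n_components=r).fit_transform(X)
km_r = KMeans(n_clusters=7, random_state=0).fit(X_r)
print('Completeness:', completeness_score(labels, km_r.labels_))
print('Homogeneity: ', homogeneity_score(labels, km_r.labels_))
```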
- Association Rule Discovery [Dataset:
playlists.zip]
For this problem you will experiment with association rule mining using the
Apriori algorithm discussed in class. You will use a
modified version of the Apriori
implementation in Machine Learning in Action (it has been modified to
compute lift values for rules in addition to confidence). [See
Associations
Notebook from class for an example of using this module.]
The data set you will use is based on a music playlist data set obtained
from Yes.com. [See the
full description of this data]. We will only use a portion of this data.
The provided data archive contains two files. The file "playlists.txt"
contains on each line a sequence of songs played as part of one playlist.
The songs are represented by integer values. The file "song_names.txt"
contains the mapping between the integer codes and song titles and artists
(format of the song names is [song title]::[artist]). You will need both of
these files to generate association rules.
Your tasks in this problem are the following:
- Load the playlist data into a Python nested list, and the
song_names data into an appropriate data structure.
- Run Apriori on the playlist data using a
min-support value of 0.002.
- Generate rules (you can try different metrics, 'lift' or 'confidence', to
see which gives you more useful and interesting results). At a minimum, you
should generate rules with a minimum lift value of 20.0. If you use
confidence, you may want to set the min-confidence threshold to about 0.5. (A
sketch of these steps appears after this list.)
- Identify 3-4 rules and explain why you think they are
interesting.
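A minimal sketch of these steps is given below. It assumes the modified module
is importable as apriori and keeps the book's apriori()/generateRules() entry
points; the metric/threshold keyword arguments and the rule tuple layout are
assumptions about how the provided implementation was modified, and the
whitespace-delimited file formats are also assumptions.

```python
# Sketch for Problem 2; module interface and file formats are assumptions.
import apriori  # the modified Machine Learning in Action implementation

# Each line of playlists.txt is one playlist of integer-coded songs
# (whitespace separation between codes is an assumption).
playlists = [line.split() for line in open('playlists.txt')]

# song_names.txt is assumed to pair each integer code with a
# "[song title]::[artist]" string, separated by whitespace.
song_names = {}
with open('song_names.txt') as f:
    for line in f:
        code, name = line.strip().split(None, 1)
        song_names[code] = name

# Frequent itemsets at min-support 0.002.
L, support_data = apriori.apriori(playlists, minSupport=0.002)

# Rule generation; the keyword names below are assumptions about how the
# module was modified to support lift in addition to confidence.
rules = apriori.generateRules(L, support_data, metric='lift', minMetric=20.0)

# Translate integer codes back to song names for inspection; the
# (antecedent, consequent, score) tuple layout is also an assumption.
for antecedent, consequent, score in rules[:10]:
    print([song_names[s] for s in antecedent], '->',
          [song_names[s] for s in consequent], round(score, 1))
```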
- Item-Based Joke Recommendation [Dataset:
jokes.zip]
For this problem you will use a modified version of the item-based
recommender algorithm from Ch. 14 of Machine Learning in Action, applying it
to joke ratings data based on the
Jester Online Joke
Recommender System. The modified version of the code is provided in the
module itemBasedRec.py. Most of the
module will be used as is, but you will add some additional functionality.
The data set contains two files. The file "modified_jester_data.csv"
contains the ratings on 100 jokes by 1000 users (each row is a user
profile). The ratings have been normalized to be between 1 and 21 (a
20-point scale), with 1 being the lowest rating. A zero indicates a missing
rating. The file "jokes.csv" contains the joke ids mapped
to the actual text of the jokes.
Your tasks in this problem are the following (please also see comments
for the function stubs in the provided module):
- Load in the joke ratings data and the joke text data into appropriate data
structures.
- Complete the definition for the
function "test". This function iterates over all users and, for each,
performs cross-validation on items (by calling the provided "cross_validate_user"
function), and returns the error information necessary to compute the Mean
Absolute Error (MAE). Use this function to perform 5-fold cross-validation
(i.e., a 20% test ratio) comparing MAE results using standard item-based
collaborative filtering (based on the rating prediction function "standEst")
with results using the SVD-based version of item-based CF (using "svdEst" as
the prediction engine). [Note: See comments provided in the module for hints
on accomplishing these tasks. A sketch of both tasks in this problem appears
after this list.]
- Write a new function "print_most_similar_jokes" which takes the joke ratings
data, a query joke id, a parameter k for the number of nearest neighbors, and a
similarity metric function, and prints the text of the query joke as well as the
texts of the top k most similar jokes based on user ratings. [Note:
For hints on how to accomplish this task, please see comments at the end of the
provided module as well as comments for the provided stub function.]
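Below is a minimal sketch of both tasks. The call signature and return value
assumed for cross_validate_user (a per-user absolute-error total and a rating
count), the jokes.csv layout, and the use of a locally defined cosine
similarity in place of the module's own measures are all assumptions; match
them to the actual itemBasedRec.py interface.

```python
# Sketch for Problem 3; the cross_validate_user signature/return value
# and the jokes.csv layout are assumptions about the provided module.
import numpy as np
import itemBasedRec as ibr

# Ratings: 1000 users x 100 jokes, 0 = missing rating.
ratings = np.genfromtxt('modified_jester_data.csv', delimiter=',')

# jokes.csv is assumed to hold one "id,joke text" line per joke.
jokes = {}
with open('jokes.csv') as f:
    for line in f:
        jid, text = line.strip().split(',', 1)
        jokes[int(jid)] = text

def test(dataMat, test_ratio, estMethod):
    """Cross-validate every user and return the overall MAE."""
    total_error, total_count = 0.0, 0
    for user in range(dataMat.shape[0]):
        # Assumed to return (sum of absolute errors, number of test ratings).
        err, cnt = ibr.cross_validate_user(dataMat, user, test_ratio, estMethod)
        total_error += err
        total_count += cnt
    mae = total_error / total_count
    print('MAE:', mae)
    return mae

def cosine_sim(a, b):
    """Cosine similarity between two 1-D rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def print_most_similar_jokes(dataMat, jokes, query_joke, k, metric=cosine_sim):
    """Print the query joke and its k most similar jokes by user ratings."""
    sims = []
    for j in range(dataMat.shape[1]):
        if j == query_joke:
            continue
        # Compare only over users who rated both jokes (0 = missing).
        both = (dataMat[:, query_joke] > 0) & (dataMat[:, j] > 0)
        if both.sum() == 0:
            continue
        sims.append((metric(dataMat[both, query_joke], dataMat[both, j]), j))
    print('Query joke:', jokes[query_joke])
    for sim, j in sorted(sims, reverse=True)[:k]:
        print(jokes[j])

# 20% test ratio: compare standEst with svdEst as the prediction engine.
test(ratings, 0.2, ibr.standEst)
test(ratings, 0.2, ibr.svdEst)
# Joke ids are assumed here to align with column indices; adjust if the
# ids in jokes.csv are 1-based.
print_most_similar_jokes(ratings, jokes, query_joke=5, k=3)
```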
Notes on Submission: You must submit the (documented) code
for your scripts, any output files, as well as your interactive Python sessions
showing the results of your computations. The preferred option would be to use
IPython Notebook (similar to examples in class), in order to record your
interactive session as well as your comments/answers to various questions and
submit the notebook in HTML format (along with any auxiliary files). Another
option is to copy and paste into a Word document and then add your discussion
and answers as necessary. All files should be submitted as a single Zip archive
via COL Web.