Intelligent
Information
Retrieval
CSC 575
Assignment 3
Due:
Saturday,
February 24
 Query Expansion Using
Automatic Global Analysis
For this problem you will use the
documentterm matrix given in problem 3 of Assignment 2. It is provided here as
an
Excel spreadsheet.
 Create a termbyterm association matrix based corresponding to this
documentterm matrix (see Lecture: "Query Operations; Relevance
Feedback; Personalization"  Slide 6
 on the Class Material
page). Hint: You can facilitate this computation by using matrix operations
in Excel (see: Using
Microsoft Excel 2007 to Perform Matrix Operations). An
intermediate termterm association matrix can be created by multiplying the
transpose of the above documentterm matrix (which is a termdocument
matrix) by the documentterm matrix, itself. This will result in a symmetric
termterm matrix where each entry corresponds to the dot product of the
corresponding terms. The final association matrix can be computed by
normalizing the entries as described in the lecture material. For your
convenience, the transpose of the above matrix is already provided in the
Excel document.
 Consider a query
containing the terms A, C, F. What would be the new query after expansion
based on global analysis with n=1 (i.e., for each query term, the top 1
associated term is added to the original query)? Explain your answer.
 Relevance Feedback Using Rocchio's Method
Suppose that after receiving the results of a query Q_{0} = "dog
race", a user has provided relevance feedback by rating the following 3 document as nonrelevant:
DOC1: "greyhound race track betting"
DOC2: "dog race betting"
DOC3: "greyhound dog training"
and the following 4documents as relevant:
DOC4: "iditarod dog sled race"
DOC5: "husky dog sled race malamute dog sled"
DOC6: "betting alaska dog sled race"
DOC7: "dog race alaska iditarod"
Assuming simple term frequency weights, use Rocchio’s relevance feedback method
to compute a new query Q_{1} (use a positive feedback factor of
1.0 and negative feedback factor of 0.5). Show Q1 as a vector over the above
index terms with the corresponding weights generated by Rocchio. Explain any significant increase or
decrease in term weights. Show your work.
 Collaborative Filtering for Recommendation
Suppose that an online bookseller has collaborative filtering recommender
system. The bookseller has collected ratings information from
20 past users (U1U20) on a selection of recent books. The ratings range from 1 = worst to 5 = best.
Two new users (NU1 and NU2) who have recently visited the site and rated some of the books
as follows ("?" represents missing ratings):
Using the KNearest Neighbor algorithm predict the ratings of these new users for
each of the books they have not yet rated. Use the Pearson correlation coefficient as the similarity measure. For your convenience, this data is given in
the Excel spreadsheet "knn.xls".
Note: In Microsoft Excel, you can use the CORREL function to
compute correlation.
 First compute the correlations between the new users (NU1 and NU2) and all
other users (you can show these as added columns in original spreadsheet). Then
for each new user give the predicted rating for each of the unrated items using
K=3
(i.e., 3 nearest neighbors). Use the
weighted average function to compute the
predictions based on ratings of the nearest neighbors. Be sure to show
the intermediate steps in your work (or provide a short explanation of how you
computed the predictions).
 Measure the Mean Absolute Error (MAE) on the predictions for NU1 and
NU2. You can compute MAE by generating predictions for items already rated by
the target user (e.g., for NU1 these are all items except "The DaVinci Code"
and "Runny Babbit"). Then, for each of these items you can compute the
absolute value of the difference between the predicted and the actual ratings.
Finally, you can average these errors across all compared items to obtain the
MAE.
 Consider the following simple "popularitybased" recommendation algorithm: Given a user U and an item
I, compute the predicted rating of U on
I as the mean rating for I among all users who have rated I.
Using this algorithm instead of KNN recompute the MAE on the predictions
for NU1 and NU2 (as in part b). Which one of the algorithms (KNN or this algorithm)
perform better? Briefly explain what you think are the pros and cons of each
of these two recommendation approaches.

Document Categorization
Consider the following documentterm matrix:
Assume that documents have been manually assigned to two
prespecified categories as follows:Cat1 = {Doc1, Doc2,
Doc5} Cat2 = {Doc3, Doc4, Doc6, Doc7}
 Using the KNearestNeighbor approach for document
categorization with K = 3 (see notes on
Document Categorization), determine
how each of the new documents given below will be classified. Show your steps.
Note: Use nonnormalized tf*idf weights and
cosine
similarity when computing similarities.

T1 
T2 
T3 
T4 
T5 
T6 
T7 
T8 
Doc8 
3 
1 
0 
4 
1 
0 
2 
1 
Doc9 
0 
0 
3 
0 
1 
5 
0 
1 
 Repeat the classification problem above, but this time use the RocchioBased vector
space model to determine how Doc 8 and Doc 9 given above will be classified. As
before, use nonnormalized tf*idf weights and cosine similarity when
computing similarities.
Back to Assignments
