
Intelligent Information Retrieval
CSC 575

Assignment 3
Due:
Saturday, February 24

1. Query Expansion Using Automatic Global Analysis

For this problem you will use the document-term matrix given in problem 3 of Assignment 2. It is provided here as an Excel spreadsheet.

1. Create a term-by-term association matrix corresponding to this document-term matrix (see Lecture: "Query Operations; Relevance Feedback; Personalization" -- Slide 6 -- on the Class Material page). Hint: You can facilitate this computation by using matrix operations in Excel (see: Using Microsoft Excel 2007 to Perform Matrix Operations). An intermediate term-term association matrix can be created by multiplying the transpose of the document-term matrix (which is a term-document matrix) by the document-term matrix itself. This will result in a symmetric term-term matrix in which each entry corresponds to the dot product of the corresponding terms. The final association matrix can be computed by normalizing the entries as described in the lecture material. For your convenience, the transpose of the above matrix is already provided in the Excel document.
2. Consider a query containing the terms A, C, F. What would be the new query after expansion based on global analysis with n=1 (i.e., for each query term, the top 1 associated term is added to the original query)? Explain your answer.
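The computation in parts 1 and 2 can be sketched as follows. The document-term matrix below is a toy placeholder (the real values are in the Excel spreadsheet), and the normalization shown is the standard Jaccard-style normalization s(i,j) = c(i,j) / (c(i,i) + c(j,j) - c(i,j)); check that it matches the one in the lecture slides.

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = terms A..F).
# These values are placeholders; use the spreadsheet data for the assignment.
D = np.array([
    [2, 0, 4, 3, 0, 1],
    [0, 2, 4, 0, 2, 3],
    [4, 0, 1, 3, 0, 0],
    [0, 1, 0, 2, 0, 0],
], dtype=float)

# Intermediate association matrix: C = D^T D, so C[i, j] is the dot
# product of the document vectors of terms i and j.
C = D.T @ D

# Normalized association matrix (Jaccard-style normalization; confirm
# against the lecture material): s_ij = c_ij / (c_ii + c_jj - c_ij).
diag = np.diag(C)
S = C / (diag[:, None] + diag[None, :] - C)

def expand(query_terms, n=1):
    """For each query term, add its top-n most associated non-query terms."""
    expanded = set(query_terms)
    for t in query_terms:
        row = S[t].copy()
        row[t] = -1.0                    # exclude the term itself
        order = np.argsort(row)[::-1]    # most associated first
        added = 0
        for cand in order:
            if cand not in query_terms:
                expanded.add(int(cand))
                added += 1
                if added == n:
                    break
    return sorted(expanded)
```

With terms indexed 0..5 for A..F, `expand([0, 2, 5], n=1)` returns the expanded query for Q = {A, C, F} under this toy matrix.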

2. Relevance Feedback Using Rocchio's Method

Suppose that after receiving the results of a query Q0 = "dog race", a user has provided relevance feedback by rating the following 3 documents as non-relevant:
DOC1: "greyhound race track betting"
DOC2: "dog race betting"

DOC3: "greyhound dog training"
and the following 4 documents as relevant:
DOC4: "iditarod dog sled race"
DOC5: "husky dog sled race malamute dog sled"
DOC6: "betting alaska dog sled race"
DOC7: "dog race alaska iditarod"

Assuming simple term frequency weights, use Rocchio’s relevance feedback method to compute a new query Q1 (use a positive feedback factor of 1.0 and negative feedback factor of 0.5). Show Q1 as a vector over the above index terms with the corresponding weights generated by Rocchio. Explain any significant increase or decrease in term weights. Show your work.
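The Rocchio update itself can be sketched as below. The vectors here are toy term-frequency vectors, not the ones from the documents above; note also that some formulations divide by |Dr| and |Dn| (averaging, as here) while others use plain sums, so follow whichever variant the lecture uses.

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=0.5):
    """Rocchio update:
        q1 = alpha*q0 + (beta/|Dr|) * sum(relevant)
                      - (gamma/|Dn|) * sum(nonrelevant)
    Negative weights are clipped to zero (a common convention)."""
    q1 = alpha * np.asarray(q0, dtype=float)
    if relevant:
        q1 += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q1 -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q1, 0.0)
```

For example, over a 3-term vocabulary, `rocchio([1, 1, 0], [[1, 0, 1], [2, 0, 1]], [[0, 2, 0]])` yields the weights (2.5, 0, 1): the second term is pushed to zero by the negative feedback.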

3. Collaborative Filtering for Recommendation

Suppose that an online bookseller has a collaborative filtering recommender system. The bookseller has collected ratings information from 20 past users (U1-U20) on a selection of recent books. The ratings range from 1 = worst to 5 = best. Two new users (NU1 and NU2) have recently visited the site and rated some of the books as follows ("?" represents a missing rating):

Using the K-Nearest Neighbor algorithm predict the ratings of these new users for each of the books they have not yet rated. Use the Pearson correlation coefficient as the similarity measure. For your convenience, this data is given in the Excel spreadsheet "knn.xls". Note: In Microsoft Excel, you can use the CORREL function to compute correlation.

1. First compute the correlations between the new users (NU1 and NU2) and all other users (you can show these as added columns in original spreadsheet). Then for each new user give the predicted rating for each of the unrated items using K=3 (i.e., 3 nearest neighbors). Use the weighted average function to compute the predictions based on ratings of the nearest neighbors. Be sure to show the intermediate steps in your work (or provide a short explanation of how you computed the predictions).
2. Measure the Mean Absolute Error (MAE) on the predictions for NU1 and NU2. You can compute MAE by generating predictions for items already rated by the target user (e.g., for NU1 these are all items except "The DaVinci Code" and "Runny Babbit"). Then, for each of these items you can compute the absolute value of the difference between the predicted and the actual ratings. Finally, you can average these errors across all compared items to obtain the MAE.
3. Consider the following simple "popularity-based" recommendation algorithm: Given a user U and an item I, compute the predicted rating of U on I as the mean rating for I among all users who have rated I. Using this algorithm instead of KNN, re-compute the MAE on the predictions for NU1 and NU2 (as in part b). Which of the two algorithms (KNN or the popularity-based algorithm) performs better? Briefly explain what you think are the pros and cons of each of these two recommendation approaches.
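The full pipeline for parts a-c can be sketched as below. The `ratings` dictionary is a small toy dataset (0 stands in for the "?" missing ratings); substitute the data from knn.xls. The Pearson, weighted-average, MAE, and popularity-baseline functions follow the definitions given above.

```python
def pearson(a, b):
    """Pearson correlation over items rated by both users (0 = unrated)."""
    common = [(x, y) for x, y in zip(a, b) if x and y]
    if len(common) < 2:
        return 0.0
    xs, ys = zip(*common)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Toy ratings over 5 books (0 = missing); use the knn.xls data instead.
ratings = {
    "U1":  [5, 3, 4, 4, 0],
    "U2":  [3, 1, 2, 3, 3],
    "U3":  [4, 3, 4, 3, 5],
    "U4":  [3, 3, 1, 5, 4],
    "NU1": [4, 0, 3, 5, 0],
}

def knn_predict(target, item, k=3):
    """Weighted average of the item's ratings by the k most correlated users."""
    sims = [(pearson(ratings[target], v), v[item])
            for u, v in ratings.items() if u != target and v[item]]
    top = sorted(sims, reverse=True)[:k]
    den = sum(abs(s) for s, _ in top)
    return sum(s * r for s, r in top) / den if den else 0.0

def popularity_predict(target, item):
    """Popularity baseline: mean rating of the item over all other raters."""
    rs = [v[item] for u, v in ratings.items() if u != target and v[item]]
    return sum(rs) / len(rs) if rs else 0.0

def mae(target, predict_fn):
    """Mean absolute error over the items the target user has actually rated."""
    errs = [abs(predict_fn(target, i) - r)
            for i, r in enumerate(ratings[target]) if r]
    return sum(errs) / len(errs)
```

Comparing `mae("NU1", knn_predict)` against `mae("NU1", popularity_predict)` gives the part-c comparison on this toy data.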

4. Document Categorization

Consider the following document-term matrix:

Assume that documents have been manually assigned to two pre-specified categories as follows:

Cat1 = {Doc1, Doc2, Doc5}   Cat2 = {Doc3, Doc4, Doc6, Doc7}

1. Using the K-Nearest-Neighbor approach for document categorization with K = 3 (see notes on Document Categorization), determine how each of the new documents given below will be classified. Show your steps. Note: Use non-normalized tf*idf weights and cosine similarity when computing similarities.

        T1  T2  T3  T4  T5  T6  T7  T8
  Doc8   3   1   0   4   1   0   2   1
  Doc9   0   0   3   0   1   5   0   1

2. Repeat the classification problem above, but this time use the Rocchio-Based vector space model to determine how Doc 8 and Doc 9 given above will be classified. As before, use non-normalized tf*idf weights and cosine similarity when computing similarities.
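Both classifiers can be sketched as below. The training counts here are toy placeholders (the actual document-term matrix is the one given above), and idf is taken as log2(N/df); check this against the weighting used in the course notes.

```python
import math

# Toy training document-term counts over terms T1..T8 (placeholders for
# the actual matrix given in the assignment).
train = {
    "Doc1": [2, 0, 4, 3, 0, 1, 0, 2],
    "Doc2": [0, 2, 4, 0, 2, 3, 0, 0],
    "Doc3": [4, 0, 1, 3, 0, 0, 2, 0],
    "Doc4": [0, 1, 0, 2, 0, 0, 3, 1],
    "Doc5": [1, 0, 3, 2, 0, 2, 0, 1],
    "Doc6": [0, 2, 0, 0, 4, 0, 1, 0],
    "Doc7": [2, 1, 0, 3, 0, 0, 2, 2],
}
labels = {"Doc1": "Cat1", "Doc2": "Cat1", "Doc5": "Cat1",
          "Doc3": "Cat2", "Doc4": "Cat2", "Doc6": "Cat2", "Doc7": "Cat2"}

N = len(train)
df = [sum(1 for v in train.values() if v[t]) for t in range(8)]
idf = [math.log2(N / df[t]) if df[t] else 0.0 for t in range(8)]

def tfidf(v):
    """Non-normalized tf*idf weighting of a raw count vector."""
    return [tf * w for tf, w in zip(v, idf)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, k=3):
    """Majority vote among the k training documents most similar to doc."""
    sims = sorted(((cosine(tfidf(doc), tfidf(v)), labels[d])
                   for d, v in train.items()), reverse=True)[:k]
    votes = {}
    for _, cat in sims:
        votes[cat] = votes.get(cat, 0) + 1
    return max(votes, key=votes.get)

def rocchio_classify(doc):
    """Assign doc to the category whose prototype (mean tf*idf vector of its
    training documents) has the highest cosine similarity to doc."""
    protos = {}
    for d, v in train.items():
        protos.setdefault(labels[d], []).append(tfidf(v))
    best, best_sim = None, -1.0
    for cat, vecs in protos.items():
        proto = [sum(col) / len(vecs) for col in zip(*vecs)]
        sim = cosine(tfidf(doc), proto)
        if sim > best_sim:
            best, best_sim = cat, sim
    return best
```

Calling `knn_classify` and `rocchio_classify` on the count vectors for Doc8 and Doc9 gives the two classifications requested in parts 1 and 2 (on the real matrix, not this toy one).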
