Intelligent Information Retrieval
CSC 575, Assignment 2
Due: Tuesday, February 13, 2018
 Retrieval from an Inverted Index:
Consider the inverted index constructed from three documents (similar to the inverted index of Assignment 1). Using the cosine similarity measure, determine which document is most relevant to the query "search engine index". Do this by hand-tracing the retrieval algorithm provided in slides 18 and 19 of Implementation Notes on Vector Space Retrieval. Show the intermediate document scores computed at each iteration. Also show the final ranking and the corresponding similarity scores.
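The term-at-a-time scan in those slides can be sketched as follows. The postings and document norms below are placeholders (the real values come from the Assignment 1 index); only the control flow, accumulating partial scores per document and normalizing at the end, mirrors the algorithm.

```python
import math
from collections import defaultdict

# Hypothetical inverted index: term -> list of (doc_id, term weight).
# These values are placeholders; substitute the Assignment 1 postings.
index = {
    "search": [("D1", 0.5), ("D2", 0.8)],
    "engine": [("D1", 0.7), ("D3", 0.4)],
    "index":  [("D2", 0.3), ("D3", 0.9)],
}
doc_lengths = {"D1": 1.0, "D2": 1.0, "D3": 1.0}  # precomputed vector norms

def retrieve(query_terms):
    """Term-at-a-time scoring: accumulate partial dot products per document."""
    scores = defaultdict(float)
    q_weights = {t: 1.0 for t in query_terms}      # raw query term weights
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    for term, q_w in q_weights.items():
        for doc, d_w in index.get(term, []):
            scores[doc] += q_w * d_w               # partial score update
    # Divide by the norms to get cosine similarity, then sort by score.
    return sorted(((doc, s / (q_norm * doc_lengths[doc]))
                   for doc, s in scores.items()),
                  key=lambda p: p[1], reverse=True)

print(retrieve(["search", "engine", "index"]))
```

Hand-tracing the loop means writing down the contents of `scores` after each posting is processed, which is exactly the intermediate-score table the problem asks for.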
 Indexing Models and Term Weighting
Consider the following document-term table containing raw term frequencies. Answer the following questions, and in each case give the formulas you used to perform the necessary computations. Note: you should not do these computations manually. You may use a spreadsheet program such as Microsoft Excel, or you can consider writing your own program to do the computations. In either case, include your spreadsheet or program in your assignment submission.
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8

DOC1      0      3      1      0      0      2      1      0
DOC2      5      0      0      0      3      0      0      2
DOC3      3      0      4      3      4      0      0      5
DOC4      1      8      0      3      0      1      4      0
DOC5      0      1      0      0      0      5      4      2
DOC6      2      0      2      0      0      4      0      1
DOC7      2      5      0      3      0      1      4      2
DOC8      3      3      0      2      0      0      1      3
DOC9      0      0      3      3      3      0      0      0
DOC10     1      0      5      0      2      4      0      2

 Compute the new weights for all the terms in document DOC4 using the tf x idf approach.
 Compute the new weights for all the terms in document DOC4 using the signal-to-noise ratio approach.
 Using the Keyword Discrimination approach, determine whether Term4 is a good index term (by computing its discriminant). To compute average similarities, use Cosine similarity as your similarity measure. Show your work.
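A sketch of these three computations in Python, using the table above. The formulas follow common definitions (tf x idf as tf * log2(N/df); signal as log2(TF_k) minus the noise of term k; the discriminant as the change in average pairwise Cosine similarity when the term is removed). The course notes may define the average similarity via a centroid document rather than over all pairs, so check the resulting numbers against the slides before submitting.

```python
import math

# Raw term frequencies from the table (rows = DOC1..DOC10, cols = Term1..Term8).
tf = [
    [0, 3, 1, 0, 0, 2, 1, 0],
    [5, 0, 0, 0, 3, 0, 0, 2],
    [3, 0, 4, 3, 4, 0, 0, 5],
    [1, 8, 0, 3, 0, 1, 4, 0],
    [0, 1, 0, 0, 0, 5, 4, 2],
    [2, 0, 2, 0, 0, 4, 0, 1],
    [2, 5, 0, 3, 0, 1, 4, 2],
    [3, 3, 0, 2, 0, 0, 1, 3],
    [0, 0, 3, 3, 3, 0, 0, 0],
    [1, 0, 5, 0, 2, 4, 0, 2],
]
N = len(tf)

def tf_idf(doc):
    """w_dk = tf_dk * log2(N / df_k), where df_k is the document frequency."""
    out = []
    for k in range(8):
        df = sum(1 for d in tf if d[k] > 0)
        out.append(doc[k] * math.log2(N / df))
    return out

def signal_noise(doc):
    """w_dk = tf_dk * SIGNAL_k, with SIGNAL_k = log2(TF_k) - NOISE_k and
    NOISE_k = sum_d (tf_dk / TF_k) * log2(TF_k / tf_dk) over docs with tf_dk > 0."""
    out = []
    for k in range(8):
        TF = sum(d[k] for d in tf)              # total frequency of term k
        noise = sum((d[k] / TF) * math.log2(TF / d[k])
                    for d in tf if d[k] > 0)
        out.append(doc[k] * (math.log2(TF) - noise))
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def avg_pairwise_sim(docs):
    """Average Cosine similarity over all document pairs (pairwise variant)."""
    pairs = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))]
    return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)

# Discriminant of Term4 (index 3): disc_k = AVGSIM(without term k) - AVGSIM.
# A positive value means removing the term makes documents more similar,
# i.e. the term is a good discriminator.
without4 = [d[:3] + d[4:] for d in tf]
disc4 = avg_pairwise_sim(without4) - avg_pairwise_sim(tf)

print("tf x idf, DOC4:      ", [round(w, 3) for w in tf_idf(tf[3])])
print("signal/noise, DOC4:  ", [round(w, 3) for w in signal_noise(tf[3])])
print("disc(Term4):         ", round(disc4, 4))
```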
 Vector-Space Retrieval Model
Consider the following document-term table with 10 documents and 8 terms (A through H) containing raw term frequencies. We also have a specified query, Q, with the indicated raw term weights (the bottom row in the table). Answer the following questions, and in each case give the formulas you used to perform the necessary computations. Note: you should do this problem using a spreadsheet program such as Microsoft Excel. Alternatively, you can write a program to perform the computations. Please include your worksheets or code in the assignment submission. [Download the table below as an Excel Spreadsheet]
        A  B  C  D  E  F  G  H

DOC1    0  3  4  0  0  2  4  0
DOC2    5  5  0  0  4  0  4  3
DOC3    3  0  4  3  4  0  0  5
DOC4    0  7  0  3  2  0  4  3
DOC5    0  1  0  0  0  5  4  2
DOC6    2  0  2  0  0  4  0  1
DOC7    3  5  3  4  0  0  4  2
DOC8    0  3  0  0  0  4  4  2
DOC9    0  0  3  3  3  0  0  1
DOC10   0  5  0  0  0  4  4  2

Query   2  1  1  0  2  0  3  0
 Compute the ranking score for each document based on each of the following query-document similarity measures (sort the documents in decreasing order of the ranking score):
 dot product
 Cosine similarity
 Dice's coefficient
 Jaccard's coefficient
 Compare the ranking obtained when binary term weights are used instead to the ranking obtained in part a, where raw term weights were used (do this only with the dot product as the similarity measure). Explain any discrepancy between the two rankings.
 Construct a table similar to the one above, but instead of raw term frequencies compute the (non-normalized) tf x idf weights for the terms. Then compute the ranking scores using Cosine similarity. Explain any significant differences between the ranking you obtained here and the Cosine ranking from the previous part.
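A sketch of the four measures on the table above, using the weighted generalizations of Dice's and Jaccard's coefficients over term-weight vectors (the set-based versions would count shared terms instead; use whichever form the course notes give). Part b's binary-weight dot-product ranking is included at the end.

```python
import math

# Raw term frequencies (rows = DOC1..DOC10, cols = terms A..H) and the query Q.
docs = {
    "DOC1":  [0, 3, 4, 0, 0, 2, 4, 0],
    "DOC2":  [5, 5, 0, 0, 4, 0, 4, 3],
    "DOC3":  [3, 0, 4, 3, 4, 0, 0, 5],
    "DOC4":  [0, 7, 0, 3, 2, 0, 4, 3],
    "DOC5":  [0, 1, 0, 0, 0, 5, 4, 2],
    "DOC6":  [2, 0, 2, 0, 0, 4, 0, 1],
    "DOC7":  [3, 5, 3, 4, 0, 0, 4, 2],
    "DOC8":  [0, 3, 0, 0, 0, 4, 4, 2],
    "DOC9":  [0, 0, 3, 3, 3, 0, 0, 1],
    "DOC10": [0, 5, 0, 0, 0, 4, 4, 2],
}
Q = [2, 1, 1, 0, 2, 0, 3, 0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def dice(u, v):
    return 2 * dot(u, v) / (dot(u, u) + dot(v, v))

def jaccard(u, v):
    return dot(u, v) / (dot(u, u) + dot(v, v) - dot(u, v))

def rank(measure, query=Q, table=docs):
    """Score every document and sort in decreasing order of the score."""
    return sorted(((name, measure(query, d)) for name, d in table.items()),
                  key=lambda p: p[1], reverse=True)

for name, fn in [("dot", dot), ("cosine", cosine),
                 ("dice", dice), ("jaccard", jaccard)]:
    print(name, rank(fn))

# Part b: repeat the dot-product ranking with binary (0/1) term weights.
binary = {name: [1 if w > 0 else 0 for w in d] for name, d in docs.items()}
Qbin = [1 if w > 0 else 0 for w in Q]
print("binary dot", rank(dot, query=Qbin, table=binary))
```

For part c, replace the rows of `docs` with tf x idf weights computed as in the previous problem and rerun the Cosine ranking.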
 Probabilistic Retrieval Model
We are interested in using the following document-term matrix and the associated relevance information as training data for a probabilistic retrieval model. A 1 entry indicates that the term occurs in a document, and a 0 means it does not; R or NR indicates the relevance of the document with respect to queries in the training data. Using the basic probabilistic retrieval model, compute the relevance and non-relevance probabilities associated with terms T1 through T6 (show these probabilities in a table). Then, using these probabilities and the given query Q = (1,1,0,1,0,1), compute the discriminants Disc(Q, D11) and Disc(Q, D12) for the two new documents:
 D11 = (0,1,1,0,0,1)
 D12 = (1,0,1,1,0,1)
Based on the discriminants, should these documents be retrieved? Explain your answer.
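A sketch of the computation with a small, made-up training matrix; the real matrix is the one in the handout and must be substituted. The 0.5 correction in the probability estimates is one common smoothing choice, not necessarily the one used in class, and the discriminant here is the sum of log-odds term weights over terms present in both the query and the document, a standard form of the basic model.

```python
import math

# HYPOTHETICAL training data standing in for the assignment's matrix:
# each row is (binary term vector T1..T6, relevance label).
training = [
    ([1, 0, 1, 1, 0, 1], "R"),
    ([1, 1, 0, 1, 0, 0], "R"),
    ([1, 0, 0, 1, 1, 0], "R"),
    ([0, 1, 1, 0, 1, 1], "NR"),
    ([0, 1, 1, 0, 0, 1], "NR"),
    ([0, 0, 1, 0, 1, 0], "NR"),
]

R  = [d for d, rel in training if rel == "R"]
NR = [d for d, rel in training if rel == "NR"]

def estimate(group):
    """p_i = P(term i present | group), with a 0.5 correction so no
    probability is exactly 0 or 1 (one common smoothing choice)."""
    n = len(group)
    return [(sum(d[i] for d in group) + 0.5) / (n + 1) for i in range(6)]

p = estimate(R)    # relevance probabilities for T1..T6
q = estimate(NR)   # non-relevance probabilities for T1..T6

def disc(query, doc):
    """Sum of log-odds weights over terms present in both query and doc."""
    return sum(math.log2((p[i] * (1 - q[i])) / (q[i] * (1 - p[i])))
               for i in range(6) if query[i] == 1 and doc[i] == 1)

Q   = [1, 1, 0, 1, 0, 1]
D11 = [0, 1, 1, 0, 0, 1]
D12 = [1, 0, 1, 1, 0, 1]
print("Disc(Q, D11) =", round(disc(Q, D11), 3))
print("Disc(Q, D12) =", round(disc(Q, D12), 3))
# Retrieve a document when its discriminant is positive: the log-odds
# evidence then favors relevance over non-relevance.
```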
 Read the paper Understanding User Goals in Web Search by Rose and Levinson of Yahoo!. Then write a short summary (about one single-spaced page) which includes the following:
 How can the underlying user goals in Web search be categorized and what
are the primary differences between these search types?
 What are some of the behavioral clues from which the search engine can
deduce a user's search goals?
 What were some of the main findings of this study and how might they be
used to improve future Web search engines?
