
Intelligent Information Retrieval
CSC 575

Assignment 2
Due: Tuesday, February 13, 2018


  1. Retrieval from an Inverted Index:

    Consider the inverted index constructed from three documents (similar to the inverted index of Assignment 1). Using the cosine similarity measure, determine which document is most relevant to the query: "search engine index". Do this by hand-tracing the retrieval algorithm provided in slides 18 and 19 of Implementation Notes on Vector Space Retrieval. Show the intermediate document scores computed at each iteration. Also show the final ranking and the corresponding similarity scores.
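
    To verify your hand trace, the following is a minimal Python sketch of the term-at-a-time accumulation described in the slides. The index layout ({term: [(doc_id, weight), ...]}), the function name, and the use of precomputed document lengths are illustrative assumptions, not the exact code from the slides.

        from collections import defaultdict
        from math import sqrt

        def cosine_retrieve(inverted_index, doc_lengths, query_weights):
            """Term-at-a-time cosine retrieval over an inverted index.

            inverted_index: {term: [(doc_id, term_weight), ...]}
            doc_lengths:    {doc_id: Euclidean length of the document vector}
            query_weights:  {term: weight of the term in the query}
            """
            scores = defaultdict(float)
            # Accumulate dot-product contributions one query term at a time;
            # printing `scores` here shows the intermediate values per iteration.
            for term, q_w in query_weights.items():
                for doc_id, d_w in inverted_index.get(term, []):
                    scores[doc_id] += q_w * d_w
            # Normalize by document and query vector lengths to get cosines.
            q_len = sqrt(sum(w * w for w in query_weights.values()))
            for doc_id in scores:
                scores[doc_id] /= (doc_lengths[doc_id] * q_len)
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        # Example call (hypothetical weights, not the Assignment 1 index):
        # idx = {"search": [("D1", 0.47), ("D3", 0.25)]}
        # cosine_retrieve(idx, {"D1": 1.2, "D3": 0.8},
        #                 {"search": 1, "engine": 1, "index": 1})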


  2. Indexing Models and Term Weighting

    Consider the following document-term table containing raw term frequencies. Answer the following questions, and in each case give the formulas you used to perform the necessary computations. Note: you should not do these computations manually. You may use a spreadsheet program such as Microsoft Excel, or you may write your own program to do the computations (a sketch of such a program follows the questions below). In either case, include your spreadsheet or program in your assignment submission.
            Term1 Term2 Term3 Term4 Term5 Term6 Term7 Term8
            -----------------------------------------------
       DOC1   0     3     1     0     0     2     1     0
       DOC2   5     0     0     0     3     0     0     2
       DOC3   3     0     4     3     4     0     0     5
       DOC4   1     8     0     3     0     1     4     0
       DOC5   0     1     0     0     0     5     4     2
       DOC6   2     0     2     0     0     4     0     1
       DOC7   2     5     0     3     0     1     4     2
       DOC8   3     3     0     2     0     0     1     3
       DOC9   0     0     3     3     3     0     0     0
       DOC10  1     0     5     0     2     4     0     2
             ----------------------------------------------
    
    1. Compute the new weights for all the terms in document DOC4 using the tf x idf approach.

    2. Compute the new weights for all the terms in document DOC4 using the signal-to-noise ratio approach.

    3. Using the Keyword Discrimination approach, determine whether Term4 is a good index term or not (by computing its discriminant). To compute average similarities, use Cosine similarity as your similarity measure. Show your work.
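
    If you write a program rather than use a spreadsheet, here is a minimal Python sketch of all three computations on the table above. It assumes commonly used definitions: idf = log2(N/df) for tf x idf; signal_k = log2(tf_k) - noise_k with weight tf_ik x signal_k; and the discrimination value as the change in average cosine similarity to the collection centroid when the term is removed. Check these against the lecture slides, which take precedence.

        from math import log2, sqrt

        # Raw term frequencies from the table above (rows = DOC1..DOC10).
        TF = [
            [0, 3, 1, 0, 0, 2, 1, 0],
            [5, 0, 0, 0, 3, 0, 0, 2],
            [3, 0, 4, 3, 4, 0, 0, 5],
            [1, 8, 0, 3, 0, 1, 4, 0],
            [0, 1, 0, 0, 0, 5, 4, 2],
            [2, 0, 2, 0, 0, 4, 0, 1],
            [2, 5, 0, 3, 0, 1, 4, 2],
            [3, 3, 0, 2, 0, 0, 1, 3],
            [0, 0, 3, 3, 3, 0, 0, 0],
            [1, 0, 5, 0, 2, 4, 0, 2],
        ]
        N, M = len(TF), len(TF[0])

        def tfidf_row(doc):
            """Part 1: w_ij = tf_ij * log2(N / df_j)."""
            df = [sum(1 for d in TF if d[j] > 0) for j in range(M)]
            return [TF[doc][j] * log2(N / df[j]) for j in range(M)]

        def signal_noise_row(doc):
            """Part 2: w_ij = tf_ij * signal_j, signal_j = log2(tf_j) - noise_j."""
            row = []
            for j in range(M):
                total = sum(d[j] for d in TF)  # collection frequency tf_j
                noise = sum((d[j] / total) * log2(total / d[j])
                            for d in TF if d[j] > 0)
                row.append(TF[doc][j] * (log2(total) - noise))
            return row

        def cosine(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

        def discriminant(term):
            """Part 3: disc_j = AvgSim(term j removed) - AvgSim(all terms),
            where AvgSim is the mean cosine of each document to the centroid."""
            def avg_sim(drop):
                docs = [[w for j, w in enumerate(d) if j != drop] for d in TF]
                centroid = [sum(col) / N for col in zip(*docs)]
                return sum(cosine(d, centroid) for d in docs) / N
            return avg_sim(term) - avg_sim(None)

        print("tf.idf, DOC4:      ", [round(w, 3) for w in tfidf_row(3)])
        print("signal-noise, DOC4:", [round(w, 3) for w in signal_noise_row(3)])
        print("disc(Term4):       ", round(discriminant(3), 4))  # > 0: good index term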



  3. Vector-Space Retrieval Model

    Consider the following document-term table with 10 documents and 8 terms (A through H) containing raw term frequencies. We also have a specified query, Q, with the indicated raw term weights (the bottom row in the table). Answer the following questions, and in each case give the formulas you used to perform the necessary computations. Note: You should do this problem using a spreadsheet program such as Microsoft Excel. Alternatively, you can write a program to perform the computations (a sketch of such a program follows the questions below). Please include your worksheets or code in the assignment submission. [Download the table below as an Excel Spreadsheet]

           A     B     C     D     E     F     G     H
         -----------------------------------------------
    DOC1   0     3     4     0     0     2     4     0
    DOC2   5     5     0     0     4     0     4     3
    DOC3   3     0     4     3     4     0     0     5
    DOC4   0     7     0     3     2     0     4     3
    DOC5   0     1     0     0     0     5     4     2
    DOC6   2     0     2     0     0     4     0     1
    DOC7   3     5     3     4     0     0     4     2
    DOC8   0     3     0     0     0     4     4     2
    DOC9   0     0     3     3     3     0     0     1
    DOC10  0     5     0     0     0     4     4     2
          ----------------------------------------------
    Query  2     1     1     0     2     0     3     0
    
    1. Compute the ranking score for each document based on each of the following query-document similarity measures (sort the documents in decreasing order of the ranking score):
      • dot product
      • Cosine similarity
      • Dice's coefficient
      • Jaccard's Coefficient
    2. Compare the ranking obtained when binary term weights are used instead to the ranking obtained in part 1, where raw term weights were used (do this only with the dot product as the similarity measure). Explain any discrepancies between the two rankings.
    3. Construct a table similar to the one above, but instead of raw term frequencies compute the (non-normalized) tf x idf weights for the terms. Then compute the ranking scores using the Cosine similarity. Explain any significant differences between the ranking you obtain here and the Cosine ranking from the previous part.
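
    A minimal Python sketch of part 1 is below. The weighted (vector) forms of Dice's and Jaccard's coefficients used here are common textbook versions; confirm them against the formulas you are asked to report. For part 2, binarize the vectors first (replace each frequency with 1 if it is positive, else 0) and rerun the dot product; for part 3, reweight the rows with tf x idf before computing cosines.

        from math import sqrt

        # Raw term frequencies (terms A..H) and the query from the table above.
        DOCS = {
            "DOC1":  [0, 3, 4, 0, 0, 2, 4, 0],
            "DOC2":  [5, 5, 0, 0, 4, 0, 4, 3],
            "DOC3":  [3, 0, 4, 3, 4, 0, 0, 5],
            "DOC4":  [0, 7, 0, 3, 2, 0, 4, 3],
            "DOC5":  [0, 1, 0, 0, 0, 5, 4, 2],
            "DOC6":  [2, 0, 2, 0, 0, 4, 0, 1],
            "DOC7":  [3, 5, 3, 4, 0, 0, 4, 2],
            "DOC8":  [0, 3, 0, 0, 0, 4, 4, 2],
            "DOC9":  [0, 0, 3, 3, 3, 0, 0, 1],
            "DOC10": [0, 5, 0, 0, 0, 4, 4, 2],
        }
        Q = [2, 1, 1, 0, 2, 0, 3, 0]

        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))

        def cosine(u, v):
            return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

        def dice(u, v):
            # Weighted Dice: 2 * (u . v) / (|u|^2 + |v|^2)
            return 2 * dot(u, v) / (dot(u, u) + dot(v, v))

        def jaccard(u, v):
            # Weighted Jaccard: (u . v) / (|u|^2 + |v|^2 - u . v)
            return dot(u, v) / (dot(u, u) + dot(v, v) - dot(u, v))

        for name, sim in [("dot product", dot), ("cosine", cosine),
                          ("Dice", dice), ("Jaccard", jaccard)]:
            ranking = sorted(DOCS, key=lambda d: sim(Q, DOCS[d]), reverse=True)
            print(name, [(d, round(sim(Q, DOCS[d]), 3)) for d in ranking])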



  4. Probabilistic Retrieval Model

    We are interested in using the following document-term matrix and the associated relevance information as training data for a probabilistic retrieval model. A 1 entry indicates that the term occurs in a document, and a 0 means it does not; R and NR indicate the relevance of the document with respect to queries in the training data.

    [Document-term matrix with R/NR relevance labels not reproduced here; see the original handout.]

    Using the basic probabilistic retrieval model, compute the relevance and non-relevance probabilities associated with terms T1 through T6 (show these probabilities in a table). Then, using these probabilities and the given query Q = (1,1,0,1,0,1), compute the discriminants Disc(Q, D11) and Disc(Q, D12) for the two new documents:

    • D11 = (0,1,1,0,0,1)
    • D12 = (1,0,1,1,0,1)

    Based on the discriminants, should these documents be retrieved? Explain your answer.
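
    Since the training matrix is not reproduced above, the sketch below takes the binary matrix and relevance labels as input rather than hard-coding them. It uses maximum-likelihood estimates for P(t|R) and P(t|NR) and one standard form of the binary-independence discriminant (a likelihood ratio over the query terms); the exact formula and any smoothing prescribed in the course slides take precedence. The function names are my own.

        def term_probabilities(training_docs, relevant):
            """Estimate P(t|R) and P(t|NR) for each term from binary training data.

            training_docs: list of 0/1 term vectors
            relevant:      parallel list of booleans (True for R, False for NR)
            These are plain maximum-likelihood estimates; if any probability
            comes out as 0 or 1, apply the smoothing given in the slides.
            """
            n_terms = len(training_docs[0])
            rel = [d for d, r in zip(training_docs, relevant) if r]
            nonrel = [d for d, r in zip(training_docs, relevant) if not r]
            p = [sum(d[i] for d in rel) / len(rel) for i in range(n_terms)]
            u = [sum(d[i] for d in nonrel) / len(nonrel) for i in range(n_terms)]
            return p, u

        def disc(query, doc, p, u):
            """Likelihood-ratio discriminant over the query terms: multiply
            p_i/u_i when the term occurs in the document, (1-p_i)/(1-u_i)
            when it does not. Retrieve the document if the result exceeds 1."""
            score = 1.0
            for q_i, d_i, p_i, u_i in zip(query, doc, p, u):
                if q_i == 1:
                    score *= (p_i / u_i) if d_i == 1 else ((1 - p_i) / (1 - u_i))
            return score

        # Query and new documents from the assignment:
        Q   = (1, 1, 0, 1, 0, 1)
        D11 = (0, 1, 1, 0, 0, 1)
        D12 = (1, 0, 1, 1, 0, 1)
        # p, u = term_probabilities(training_docs, relevant)  # from the given matrix
        # print(disc(Q, D11, p, u), disc(Q, D12, p, u))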



  5. Read the paper Understanding User Goals in Web Search by Rose and Levinson of Yahoo!. Then write a short summary (about one single-spaced page) which includes the following:

    1. How can the underlying user goals in Web search be categorized and what are the primary differences between these search types?
    2. What are some of the behavioral clues from which the search engine can deduce a user's search goals?
    3. What were some of the main findings of this study and how might they be used to improve future Web search engines?




Copyright © 2018-2019, Bamshad Mobasher, DePaul University.