go to CTI home page

Intelligent Information Retrieval
CSC 575

Assignment 1
Due: Saturday, January 27, 2018

  1. Suppose that we have a standard IR evaluation data set containing 1000 documents. Assume that a particular query in this data set is deemed to be relevant to the following 25 documents in the collection:


    REL = { d1, d5, d6, d10, d88, d150, d200, d210, d250, d300, d400, d405, d450, d472, 
                   d500, d501, d530, d545, d590, d600, d635, d700, d720, d800, d900 }

    Two different retrieval systems S1 and S2 are used to retrieve ranked lists of documents from this collection using the above query. The top 10 retrieved documents for these two systems are given below (each list is in decreasing order of relevance).

    RET(S1) = d2, d5, d150, d250, d11, d33, d50, d600, d500, d520
    RET(S2) = d250, d400, d150, d210, d999, d3, d501, d800, d205, d300

    1. Compute the Precision and the Recall graphs for each system as a function of the number of documents returned (for 1 document returned, 2 documents returned, etc). First, compile your calculations in a table which shows the Precision and Recall for S1 and S2 using these query results as a function of the number of documents returned (from 1 to 10). Then, create  graphs comparing Precision and Recall of the two systems.
    2. A single metric that can be used to combine precision and recall is the F Measure (see Section 8.3 of the IR Book). Using the F1 measure (F measure with B=1), create graph similar to the above comparing the two systems.
    3. Which system is better? Explain your answer.
       

  2. Use the query "jaguar" on three search engines Google, Yahoo, and Bing. Assume that you are interested in the sense of "Jaguar" as an animal (so a document is relevant if it is about the animal Jaguar). For each of your searches record the top 10 results (do not count the "sponsored" results and only count the main results, not sub-categories of results which are often other documents from the same site as the main result). Make a precision plot for the first 10 returned links as function of the number of returned documents, comparing the precision of the three search engines for this query (use a similar approach as part (b) of problem 1, except only for precision). Which of the search engines did better? Which would have done better if you were interested in "jaguar" as a car? Why do you think there might be a difference among these search engines?


  3. Inverted Index Construction:

    Consider the following three short documents:

    Doc #1
    Glimpse is an indexing and query system that allows for search through a file system or document collection quickly. Glimpse is the default search engine in a larger information retrieval system. It has also been used as part of some web based search engines.

    Doc #2
    The main processes in an retrieval system are document indexing, query processing, query evaluation and relevance feedback. Among these, efficient updating of the index is critical in large scale systems.

    Doc #3
    Clusters are created from short snippets of documents retrieved by web search engines which are as good as clusters created from the full text of web documents.

    1. First remove stop words, and punctuation, and apply Porter's stemming algorithm to the 3 documents (Note: use the online stemming application for this purpose).

    2. Create an inverted index of the three documents, including the dictionary and the postings. The dictionary should also contain (for each term) statistics such as total number of occurrences in the collection and the document frequency. The postings for each term should contain the document ids and the term frequencies (depict multiple postings for a term as a linked list, similar to Figure 1.3 in the IR Book).

    3. What are the search results for the following Boolean queries (in each case explain how you obtained them from the inverted index):
      1. index AND query
      2. index OR query
      3. index AND (NOT query)
      4. (search AND query) OR (search AND retrieve)
      5. (index OR cluster) AND (web OR system)


  4. Finite State Automata and Lexical Analysis (Note: your FSAs should only accept all terms corresponding to the specified regular expression or pattern, but no other terms).

    1. Construct a Finite State Automata recognizing the set of all binary strings with an odd number of 1's. Examples include: 1, 010, 111, 10110, 0010100111, etc.
    1. Suppose that during document indexing we are interested in identifying all dates in the format mm/dd/yyyy, where mm is a 2 digit string representing the month (ranging from 01 through 12), dd is a 2-digit string representing the day (01-31), and yyyy are 4 digits representing the year. Furthermore, we are only interested in the years between 1990 and 2019. Example: 02/30/2012. Construct a Finite State Automata that exactly recognizes date tokens as described above. Notes: You do not need to match the month with the correct number of days. You may assume that dig is a special token representing all digits (0-9). Also, specify a regular expression that corresponds to the finite automata for this language.

  5. Ngrams: Using bigrams, which of the following terms is more likely to be considered "equivalent" to the term "informational"? Justify your answer by using Dice's coefficient to measure similarity of each term to "informational".
    • informal
    • formalization
     
  6. The following table shows the raw occurrence frequencies of some words in a hypothetical collection. If x and y represent two distinct words, f(x) and f(y) denote the raw frequencies for words x and y, respectively. Assume that the frequency of co-occurrence of x and y has already been computed and is denoted by f(x,y). [Note that the probability of occurrence is estimated as the raw frequency divided by the total number of terms.]
        x        y      f(x)      f(y)     f(x,y)
      -------------------------------------------
      color    blue      20        40        10
      color    green     20        10         5
      blue     green     40        10         5
      united   states    25        20        15
      united   airline   25        10         5
      states   airline   20        10         2
    
    Assuming that the total number of unique words, N,  in the collection is 1000, rank these pairs of words in decreasing order of expected mutual information. Show your detailed calculations in at least one case and the final values for all cases.
     
  7. WordNet is an advanced lexical ontology and thesaurus for the English language with many applications in information retrieval and Web information agents. Write a one page document about WordNet containing the following elements. (1) Your own description of WordNet. (2) What are the unique features of WordNet that distinguish it from other types of standard thesauri? And (3) what are some of the specific ways WordNet can be (or has been) used in the context of information retrieval?



Back to Assignments
 

Copyright © 2018-2019, Bamshad Mobasher, DePaul University.