Intelligent
Information
Retrieval
CSC 575
Assignment 1
Due: Saturday, January 27, 2018

Suppose that we have a standard IR evaluation data set containing 1000
documents. Assume that a particular query in this data set is deemed to be
relevant to the following 25 documents in the collection:
REL = { d1, d5, d6, d10, d88, d150, d200, d210, d250, d300, d400, d405, d450,
d472,
d500, d501, d530, d545, d590, d600, d635, d700, d720, d800, d900 }
Two different retrieval systems S1 and S2 are used to retrieve ranked lists
of documents from this collection using the above query. The top 10 retrieved
documents for these two systems are given below (each list is in decreasing
order of relevance).
RET(S1) = d2, d5, d150, d250, d11, d33, d50, d600, d500, d520
RET(S2) = d250, d400, d150, d210, d999, d3, d501, d800, d205, d300
 Compute the Precision and the Recall graphs for each system as a
function of the number of documents returned (for 1 document returned, 2
documents returned, etc). First, compile your calculations in a table which
shows the Precision and Recall for S1 and S2 using these query results as a
function of the number of documents returned (from 1 to 10). Then, create graphs comparing Precision
and Recall of the two systems.
 A single metric that can be used to combine precision and recall is the
F Measure (see Section 8.3 of the IR Book). Using the F1 measure (F measure
with B=1), create graph similar to the above
comparing the two systems.
 Which system is better? Explain your answer.
 Use the query "jaguar" on three search engines Google, Yahoo, and
Bing. Assume that you are interested in the sense of "Jaguar" as an animal
(so a document is relevant if it is about the animal Jaguar). For each of your
searches record the top 10 results (do not count the "sponsored" results and
only count the main results, not subcategories of results which are often other
documents from the same site as the main result).
Make a precision plot for the first 10 returned links as function of the number
of returned documents, comparing the precision of the three search engines for
this query (use a similar approach as part (b) of problem 1, except only for
precision). Which of the search engines did better? Which would have done better
if you were interested in "jaguar" as a car? Why do you think there might be a
difference among these search engines?
 Inverted Index Construction:
Consider the following three short documents:
 Doc #1
 Glimpse is an indexing and query system that allows for search through a file system or document collection quickly. Glimpse is the default search engine in a larger information retrieval system.
It has also been used as part of some web based search engines.
 Doc #2
 The main processes in an retrieval system are document indexing, query processing, query evaluation and relevance feedback.
Among these, efficient updating of the index is critical in large scale systems.
 Doc #3
 Clusters are created from short snippets of documents retrieved by web search engines which are as good as clusters created from the full text of web documents.

 First remove stop words, and punctuation, and apply Porter's stemming algorithm to the 3 documents (Note:
use the
online stemming application
for this purpose).
 Create an inverted index of the three documents, including the dictionary
and the postings. The dictionary should also contain (for each term) statistics such as
total number of occurrences in the collection and the document frequency. The postings for each
term should contain the document ids and the term frequencies (depict
multiple postings for a term as a linked list, similar to Figure 1.3 in the IR
Book).
 What are the search results for the following Boolean queries (in each case
explain how you obtained them from the inverted index):
 index AND query
 index OR query
 index AND (NOT
query)
 (search AND query) OR (search AND
retrieve)
 (index OR cluster) AND (web
OR
system)
 Finite State Automata and Lexical Analysis (Note: your FSAs should only accept all terms corresponding to the
specified regular expression or pattern, but no other terms).
 Construct a Finite State Automata recognizing the set of all
binary strings with an odd number of 1's. Examples include: 1, 010, 111, 10110,
0010100111, etc.
 Suppose that
during document indexing we are interested in identifying all dates in the
format mm/dd/yyyy, where mm is a 2
digit string representing the month (ranging from 01 through 12), dd
is a 2digit string representing the day (0131), and yyyy are 4
digits representing the year. Furthermore, we are only interested in the years
between 1990 and 2019. Example: 02/30/2012.
Construct a Finite State Automata that exactly recognizes date tokens as
described above. Notes: You do not need to match the month with the
correct number of days. You may assume that dig
is a special token representing all digits (09).
Also, specify a regular expression that corresponds to the finite automata for
this language.
 Ngrams: Using bigrams, which of the following terms
is more likely to be considered "equivalent" to the term "informational"? Justify your answer by using Dice's coefficient
to measure similarity of each term to "informational".
 The following table shows the raw occurrence frequencies of some words in a hypothetical collection. If
x and y represent two distinct words, f(x) and
f(y) denote the raw frequencies for words x and
y, respectively.
Assume that the frequency of cooccurrence of x and y
has already been computed and is denoted by f(x,y).
[Note that the probability of occurrence is estimated as the raw frequency
divided by the total number of terms.]
x y f(x) f(y) f(x,y)

color blue 20 40 10
color green 20 10 5
blue green 40 10 5
united states 25 20 15
united airline 25 10 5
states airline 20 10 2
Assuming that the total number of unique words, N, in the collection is
1000, rank these pairs of words in decreasing order of expected mutual information. Show your detailed calculations in at least one case
and the final values for all cases.
 WordNet is an advanced lexical ontology and thesaurus for the English language with many applications in information retrieval and Web information agents. Write a one page document about WordNet containing the following elements. (1) Your own description of WordNet. (2) What are the unique features of WordNet that distinguish it from other types of standard thesauri? And (3) what are some of the specific ways WordNet can be (or has been) used in the context of information retrieval?
Back to Assignments
