go to CTI home page

Intelligent Information Retrieval
CSC 575

Assignment 4
Due:
Saturday, March 10

  1. Consider the following document-term matrix containing raw term frequencies (See Excel Spreadsheet).
    1. Construct a term-term distance matrix using Euclidean distance as the measure.

    2. Determine the binary term relationship matrix using a distance threshold of 4.00, and then construct the graph corresponding to this matrix. Note: there is an edge between each pair of nodes whose Euclidean distance is less than or equal to 4.00.

    3. Using the above graph, perform clustering based on the Clique algorithm.

    4. Using the above graph, perform clustering based on the Single Link algorithm.

     



  2. Consider again the same document-term matrix for the previous problem. In this problem we focus on clustering documents.

    1. Starting with documents 1 and 2 in Cluster 1, and with documents 5, 6 and 7 in Cluster 2, perform K-means clustering of the documents using K = 2 (show the reallocation of the documents to clusters after each iteration). Use the cosine similarity measure in the clustering algorithm.

    2. Perform clustering of documents using the one-pass (single-pass) technique (show your work). Use dot product as your similarity measure, and use a pre-specified similarity threshold of 10. Allow for overlapping clusters if an item satisfies the similarity threshold for more than one cluster. The algorithm for this techniques was given lecture notes on clustering. For an example of how this technique is applied see single-pass.html.

     



  3. Read the paper Hyperlink Analysis for the Web by Monika Henzinger, Research Director at Google, and answer the following questions (in each case, briefly discuss your answers or provide short explanations).
    1. What are the basic assumptions behind the hyperlink analysis algorithms?
    2. What is meant by connectivity-based ranking, and how does it relate to co-citation analysis?
    3. What are some of the primary applications of hyperlink analysis in Web information retrieval?
    4. Briefly describe the PageRank algorithm used by the Google search engine.
    5. Briefly describe the HITS algorithms. What are some of the problems associated with this algorithm, and how might one resolve these problems?
    6. How can one spam hyperlink-based search engines? Are they subject to spamming attacks as keyword based approaches? If so, can you think of some approaches to reduce or counter such attacks?



  4. Hyperlink Analysis.
     
    1. Consider the following Web pages and their linkage structure:

           Page A has links to pages B and C
           Page B has a link to Page C
           Page C has a link to Page A

      Run the PageRank algorithm on this subgraph of pages (see slides 51 and 52 of Lecture 8 for the algorithm and an example). Assume that a = 0.15. Simulate the algorithm for three iterations and show PageRank scores twice for each iteration, both before and after normalization.
       
    2. Consider the following Web pages and their linkage structure:

           Page A has links to pages C, D, and E
           Page B has a link to Page D and E

      Run the HITS (Hubs and Authorities) algorithm on this subgraph of pages (see slides 34 and 35 of Lecture 8 for the algorithm and an example). Simulate the algorithm for three iterations and show the hub and authority scores (represented as two vectors over A, B, C, D, E) twice for each iteration, both before and after normalization.



    Back to Assignments

Copyright 2018-2019, Bamshad Mobasher, DePaul University.