CS 313: Data Structures in Java

Autumn 2002

Individual Assignment 7

Trees and Web site structure

Due Wednesday November 13 before 11:30

For this assignment, you will write a program that accesses Web pages from a site and builds an internal data-structure that represents the site. The internal data-structure is a tree consisting of PageNode objects, which can be constructed and displayed using the algorithms discussed in class. Here is a link to the PageNode class.

To simplify accessing Web pages, you will make use of a pre-existing package that allows you to easily access pages and extract their links. This package is used for an Information Retrieval course at the University of Texas (Instructor: Ray Mooney). To best understand how the classes in this package work, review this demonstation program, which accesses one Web page, lists its title, extracts its links and lists the URLs of its links.

To use the classes in the information retrieval package, you will need to follow steps similar to what you did for the first assignment:

  1. Download the jar file containing the information retrieval package.
  2. Do one of the following:
  3. Insert the statement import ir.webutils.*; at the top your program's file.

Assignment requirements

Your program should read in a URL as a string and then construct a tree of PageNode objects with the root being the given URL's page. The program should then perform a preorder traversal, writing out the titles of each page.

Here are the specific requirements and guidelines:

Example output

Here is an example run (user input is in bold):

Starting URL:
http://facweb.cs.depaul.edu/cmiller/csc313/syllabus.html
Site prefix:
http://facweb.cs.depaul.edu/cmiller
Number of levels:
2

Traversing web site structure...

Syllabus for CSC 313
   CSC 313 Individual Assignment 1
      : Class  SetOfCountables
   CSC 313 Individual Assignment 2
   CSC 313 Individual Assignment 3
   CSC 313 Individual Assignment 4
   CSC 313 Individual Assignment 5
   CSC 313 Individual Assignment 6


Actual output on current web site

Starting URL:
http://facweb.cs.depaul.edu/cmiller/csc313/syllabus.html
Site prefix:
http://facweb.cs.depaul.edu/cmiller
Number of levels:
2

Traversing the tree...

Syllabus for CSC 313
   CSC 313 Individual Assignment 1
      : Class  SetOfCountables
      : Class  CountablesIterator
   CSC 313 Individual Assignment 2
      : Class  Card
   CSC 313 Individual Assignment 3
   CSC 313 Individual Assignment 3
   CS 313 Individual Assignment 5
   CS 313 Individual Assignment 6
   CSC 313 Individual Assignment 7
   CSC 313 Individual Assignment 8
   : Class  MarbleJar
      : Class Hierarchy
      : Deprecated List
      : Index
      : API Help
      Generated Documentation (Untitled)
      : Class  MarbleJar
      : Class  MarbleJar
      : Class  MarbleJar
      : Class  MarbleJar
      : Class  MarbleJar
      : Class  MarbleJar
      : Class  MarbleJar
      : Class Hierarchy
      : Deprecated List
      : Index
      : API Help
      Generated Documentation (Untitled)
      : Class  MarbleJar
   : Class  Card
      : Class Hierarchy
      : Deprecated List
      : Index
      : API Help
      Generated Documentation (Untitled)
      : Class  Card
      : Class  Card
      : Class  Card
      : Class  Card
      : Class  Card
      : Class  Card
      : Class  Card
      : Class Hierarchy
      : Deprecated List
      : Index
      : API Help
      Generated Documentation (Untitled)
      : Class  Card
   Program Grading Guide

Notes from class

Here are some notes from class that will help you construct this program.

Pseudo-code for building the tree structure (without level limits)

The method createWebTree creates a Web tree starting at given URL and returns the reference to the PageNode for this URL. You will need additional parameter(s) to keep track of the level.

    PageNode createWebTree(String url)

            page = get page from URL
            String text = page.getText();
            String title = extractTitle(text);

            // create the node for the current Web page
            current = new PageNode(title, url);

            links = get links from page

            itr = get iterator through links
            while (itr.hasNext())
                linkedURL = get url from itr.next()

                if linkedURL is valid
                    PageNode linkedNode = createWebTree(linkedURL)
                    current.addLink(linkedNode);

            return current

Preorder traversal (without level limits)

    public static void traverse(PageNode current)
    {
        System.out.println(current.getTitle());
        List links = current.getLinks();
        Iterator itr = links.iterator();
        while (itr.hasNext()) {
            PageNode nextChild = (PageNode) itr.next();
            traverse(nextChild);
        }
    }

Submission

Before the due date and time, you should submit your WebCrawler.java file and any supporting classes (e.g. PageNode.java and keyboard class) through the submission Web page.

The Web submission page should provide you with a confirmation that the file was received. If you do not receive a confirmation page, either try again with a different browser or email me (cmiller@cs.depaul.ed) your file as an attachment.