For this assignment, you will write a program that accesses Web pages from a site and builds an internal data-structure that represents the site. The internal data-structure is a tree consisting of PageNode objects, which can be constructed and displayed using the algorithms discussed in class. Here is a link to the PageNode class.
To simplify accessing Web pages, you will make use of a pre-existing package that allows you to easily access pages and extract their links. This package is used for an Information Retrieval course at the University of Texas (Instructor: Ray Mooney). To best understand how the classes in this package work, review this demonstation program, which accesses one Web page, lists its title, extracts its links and lists the URLs of its links.
To use the classes in the information retrieval package, you will need to follow steps similar to what you did for the first assignment:
import ir.webutils.*;
at the top your program's file.
Your program should read in a URL as a string and then construct a tree of PageNode objects with the root being the given URL's page. The program should then perform a preorder traversal, writing out the titles of each page.
Here are the specific requirements and guidelines:
Here is an example run (user input is in bold):
Starting URL:
http://facweb.cs.depaul.edu/cmiller/csc313/syllabus.html
Site prefix:
http://facweb.cs.depaul.edu/cmiller
Number of levels:
2
Traversing web site structure...
Syllabus for CSC 313
CSC 313 Individual Assignment 1
: Class SetOfCountables
CSC 313 Individual Assignment 2
CSC 313 Individual Assignment 3
CSC 313 Individual Assignment 4
CSC 313 Individual Assignment 5
CSC 313 Individual Assignment 6
Starting URL:
http://facweb.cs.depaul.edu/cmiller/csc313/syllabus.html
Site prefix:
http://facweb.cs.depaul.edu/cmiller
Number of levels:
2
Traversing the tree...
Syllabus for CSC 313
CSC 313 Individual Assignment 1
: Class SetOfCountables
: Class CountablesIterator
CSC 313 Individual Assignment 2
: Class Card
CSC 313 Individual Assignment 3
CSC 313 Individual Assignment 3
CS 313 Individual Assignment 5
CS 313 Individual Assignment 6
CSC 313 Individual Assignment 7
CSC 313 Individual Assignment 8
: Class MarbleJar
: Class Hierarchy
: Deprecated List
: Index
: API Help
Generated Documentation (Untitled)
: Class MarbleJar
: Class MarbleJar
: Class MarbleJar
: Class MarbleJar
: Class MarbleJar
: Class MarbleJar
: Class MarbleJar
: Class Hierarchy
: Deprecated List
: Index
: API Help
Generated Documentation (Untitled)
: Class MarbleJar
: Class Card
: Class Hierarchy
: Deprecated List
: Index
: API Help
Generated Documentation (Untitled)
: Class Card
: Class Card
: Class Card
: Class Card
: Class Card
: Class Card
: Class Card
: Class Hierarchy
: Deprecated List
: Index
: API Help
Generated Documentation (Untitled)
: Class Card
Program Grading Guide
Here are some notes from class that will help you construct this program.
The method createWebTree creates a Web tree starting at given URL
and returns the reference to the PageNode for this URL. You will need
additional parameter(s) to keep track of the level.
PageNode createWebTree(String url)
page = get page from URL
String text = page.getText();
String title = extractTitle(text);
// create the node for the current Web page
current = new PageNode(title, url);
links = get links from page
itr = get iterator through links
while (itr.hasNext())
linkedURL = get url from itr.next()
if linkedURL is valid
PageNode linkedNode = createWebTree(linkedURL)
current.addLink(linkedNode);
return current
public static void traverse(PageNode current)
{
System.out.println(current.getTitle());
List links = current.getLinks();
Iterator itr = links.iterator();
while (itr.hasNext()) {
PageNode nextChild = (PageNode) itr.next();
traverse(nextChild);
}
}
Before the due date and time, you should submit your WebCrawler.java file and any supporting classes (e.g. PageNode.java and keyboard class) through the submission Web page.
The Web submission page should provide you with a confirmation that the file was received. If you do not receive a confirmation page, either try again with a different browser or email me (cmiller@cs.depaul.ed) your file as an attachment.