The program "extract.pl" does the following:

1. performs rudimentary parsing of an HTML file to extract the text (note
that the file extension of the input file must be ".html" or ".htm");

2. removes stop words - the list of stop words is built in as a hash (a Perl
associative array);

3. stems the extracted words using Porter's stemming algorithm;

4. outputs the stems along with their text frequency in the input document,
one per line (sorted in descending order of text frequency).
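
The four steps above can be sketched roughly as follows. This is only a
minimal illustration, not the code of extract.pl itself: the stop-word list
is a short hypothetical sample, toy_stem is a placeholder that strips a
plural "s" and is NOT Porter's algorithm, and the "stem count" output layout
and all sub names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated stop-word list (the real list lives in extract.pl).
my %stop = map { $_ => 1 } qw(a an and are in is of on the to);

# Step 1: crude tag stripping with regexes; this is not a real HTML parser.
sub extract_text {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;   # drop script/style bodies
    $html =~ s/<[^>]*>/ /g;                            # drop remaining tags
    return $html;
}

# Step 3 placeholder: strips a trailing plural "s" only. NOT Porter's algorithm.
sub toy_stem {
    my ($word) = @_;
    $word =~ s/(?<=[^s])s$// if length($word) > 3;
    return $word;
}

# Steps 1-3 combined: returns a hash reference mapping stem => frequency.
sub stem_frequencies {
    my ($html) = @_;
    my $text = extract_text($html);
    my %freq;
    for my $token ($text =~ /[A-Za-z]+/g) {
        my $word = lc $token;
        next if $stop{$word};          # step 2: skip stop words
        $freq{ toy_stem($word) }++;    # step 3: count by stem
    }
    return \%freq;
}

# Step 4: print one "stem count" pair per line, most frequent first
# (ties are broken alphabetically so the order is deterministic).
my $demo = '<html><body><p>The cats and the cat sat on mats.</p></body></html>';
my $freq = stem_frequencies($demo);
for my $stem (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    print "$stem $freq->{$stem}\n";
}
```

For the sample string this prints "cat 2", then "mat 1", then "sat 1". A real
run would use the built-in stop-word hash and a full Porter stemmer in place
of the placeholders.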

The program can be run as follows:

	perl extract.pl <input_file> > <output_file>
	
For example:

	perl extract.pl test.html > test.out
	
will parse the file test.html and write the results to the file test.out.