Online Resources & Reference Material
Introduction to Information Retrieval - The primary textbook for the course; by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008.
- Porter's Stemmer Online - Try doing some stemming with this online implementation of Porter's algorithm. You can enter a set of words, a sentence, or a paragraph and get the stemming results.
- General Resources
- Tools and Software
- Google API - use Google's indexing & search services to build your own application.
- Tools for Preprocessing - includes stemming and stop word removal, and a program to extract text and text frequency from HTML files.
- Lucene - a full-featured text search engine library in Java.
- Nutch - Apache's open-source web crawler, written in Java.
- BeautifulSoup - a general parsing library for Python, particularly useful for parsing HTML and XML.
- Natural Language Toolkit (NLTK) for Python - includes tools for text preprocessing, tokenization, and vectorization (you may also be interested in an online book that shows how NLTK is used).
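
To give a feel for what the preprocessing tools listed above do, here is a minimal Python sketch that extracts the visible text from an HTML string, removes stop words, and counts how often each remaining term occurs. It is not the course's own program: it uses only the Python standard library, the names `TextExtractor`, `term_frequencies`, and `STOP_WORDS` are made up for this illustration, the stop list is deliberately tiny, and stemming is left out.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def term_frequencies(html):
    """Return a Counter of lowercase alphabetic terms, minus stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)  # crude tokenizer: letters only
    return Counter(t for t in tokens if t not in STOP_WORDS)


if __name__ == "__main__":
    page = "<p>The index of the index</p><p>Search and index</p>"
    print(term_frequencies(page).most_common())
```

For real pages, a parser such as BeautifulSoup and a stemmer such as the Porter implementation in NLTK would be the more robust choices; this sketch only shows the overall shape of the pipeline.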
- Other Related and Useful Links