Online Resources & Reference Material
Introduction to Information Retrieval - The
primary textbook for the course, by
Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schütze,
Cambridge University Press, 2008.
- General Resources
- Tools and Software
- Google API - use
Google's indexing & search services to build your own application.
- Lucene - a full-featured text search engine library in Java
- Nutch - Apache's open-source
web crawler based on Java.
- BeautifulSoup: a general parsing library for Python, particularly
useful for parsing HTML and XML.
- Natural Language Toolkit (NLTK) for Python, including tools for text
preprocessing, tokenization, and vectorization (you may also be
interested in an online book that
shows how NLTK is used).
- Apache OpenNLP: a Java-based library for processing natural
language text, with components such as a sentence detector, tokenizer, name
finder, document categorizer, and part-of-speech tagger.
- Mallet: the MAchine Learning for LanguagE Toolkit, a Java-based
package for statistical natural language processing, document
classification, clustering, topic modeling, information extraction, and
other machine learning applications to text.
- Stanford CoreNLP: a one-stop shop for natural language
processing in Java. CoreNLP enables users to derive linguistic
annotations for text, including token and sentence boundaries, parts of
speech, named entities, numeric and time values, dependency and
constituency parses, coreference, sentiment, quote attributions, and
relations. CoreNLP currently supports 8 languages: Arabic, Chinese,
English, French, German, Hungarian, Italian, and Spanish.
- Other Related and Useful Links