Final Project Checklist - Information about
what you need to submit for the final project.
- Each group or individual must submit a specific project proposal to
be approved no later than February 13. Written research projects must be done individually.
Implementation projects can be done in groups of up to 3, depending on the
size and complexity of the project. Note that the size and makeup of each group must also be approved
along with the project proposal.
- Due date for final project: Monday, March 13.
The following is a list of ideas for the class project. You may
choose any of these ideas or their variations. You may also choose
to combine parts of these projects, or come up with your own idea. In
all cases, however, your project idea is subject to approval based on a project
proposal that specifies a set of
project requirements and deliverables.
Written projects involve doing an in-depth study, survey, or evaluation of one or more topics
related to information retrieval and filtering. The project can take the form of an
examination of the use of a specific technique or model in various IR systems, or it can be a detailed
case study involving two or more existing IR systems. In either case, the paper should contain a
summary and a technical evaluation of the state-of-the-art related to the particular topic studied.
If the paper involves a case study, then a thorough comparative evaluation with other similar
systems must be provided. A research
paper should present a new idea or provide a detailed survey of methods to solve a specific IR-related problem. The approach
presented should be, at least in part, a novel and original contribution, and should ideally be
evaluated experimentally. A research paper could be a good start for a Master's or Ph.D. research
project. The maximum length for the written projects is 20 single-spaced pages (12 point font),
including figures and references. The evaluation of the papers will be based on clarity,
thoroughness, and soundness of ideas and concepts presented, as well as the overall organization
of the paper.
Note: A written project should not simply be a summary of some of the material covered
directly in the lectures, but rather should go beyond this material in one or more specific
areas related to that material. The following is a non-exhaustive list of ideas for a written
project (very broadly stated):
- Personalized Search: A study of various techniques and approaches used to
create personalized search applications on the Web. The study should include a
survey of techniques for re-ranking or filtering search results based on user
profiles, as well as intelligent agents that take into account user
characteristics or profiles to assist users in search.
- Exploring various techniques for Web IR based on hyperlink analysis.
The study should include an examination of techniques based on linkage as a measure of
authority of the information source (e.g., the HITS or PageRank algorithms), as well as other techniques
that use ratings or popularity as measures of quality or authority.
- A comparative study of implementation techniques for scalable information
retrieval on large-scale search engines or Web-based information systems (such
as Google, Facebook, etc.). This study must include an analysis of challenges in
managing and leveraging large data repositories and various proposed and
implemented solutions (such as Bigtable, MapReduce, and other approaches based
on "cloud computing"). The study can also focus on implementation platforms that
enable scalable retrieval (e.g., Hadoop).
- A study of social network analysis (SNA) and its use in information
retrieval. This study should include a detailed summary of various techniques
from SNA and their use in providing relevant information to users in online
social networks and/or traditional search engines.
- Web Content Mining: a study of various techniques to mine information and patterns
from semi-structured data on the Web. Examples include the use of agents designed to extract
specific types of information (e.g., shopping agents), the use of XML to integrate available
"meta-data" into current search technologies, Web data warehousing, etc.
- Web Usage Mining: a study of the feasibility and effectiveness of techniques to incorporate
Web usage data (e.g., clickthrough data, search query logs, and user behavior
data) into search and retrieval, and how this can be used to develop more
effective search engines.
- Collaborative Filtering and Recommender Systems: A comparative study of various collaborative
filtering techniques and their applications in several recommender systems. The study should include
a technical summary of the various techniques and an evaluation of existing methods in use today on the Web.
Implementation projects involve the development and evaluation of an original
application using information retrieval, text mining, and/or machine learning
techniques. The application must be tested and evaluated using appropriate test
data sets. The application must also involve the use of one or more of the
modeling techniques relevant to the course topics. Your application may also
include a significant extension of existing applications or techniques
discussed in class materials or other sources (in this case, the application
must be extended to include additional or more sophisticated types of features).
The deliverables for the project must include fully documented code;
distribution files, including any third-party sources; installation/deployment
documents (including examples, screenshots of test runs, etc.); the data used for
the evaluation of the application; and a detailed project report providing a
description of the components of the application and the results of the evaluation.
Many different types of applications are possible, but some examples of such
applications include (but are not limited to):
- Build your own search/retrieval system:
- Should include implementations for the basic components including
separate crawler, indexer, and query processing components (including a
reasonable query interface)
- Should work on a local document corpus in a
directory structure or as a Web search engine (applied to a limited set of
Web sites or for a specific domain)
- The indexing component should parse and index documents using an inverted file format
with relevant term frequency information
- Should make use of stemming and stop lists (you can use
existing tools for this part).
- The system should use TF-IDF weights (and possibly additional weighting
schemes) for index terms
- The base implementation should use the
vector-space model with cosine similarity for matching
queries against indexed documents. Optionally, you can implement other retrieval
models such as probabilistic models or models based on link analysis.
- It should be possible to save the index to offline storage and reload it for
subsequent retrieval sessions (during a retrieval session, the search engine
should run in the background as a server process and handle incoming
queries).
- Optional components or functionality can be added depending on the
desired features or complexity of the project, including: additional
weighting schemes, part-of-speech tagging, phrase indexing, n-gram indexing,
proximity operators, personalized search, and relevance feedback.
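As a rough illustration of how the indexing and matching requirements above fit together, the following minimal sketch builds an inverted index with term frequencies, weights terms with TF-IDF, ranks documents by cosine similarity, and saves and reloads the index. Everything named here is an assumption made for illustration: the function names, the tiny stop list, the log(N/df) form of IDF, and the use of a JSON file for storage. It omits stemming, crawling, and the server process, and it is not a reference solution.

```python
import json
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (an assumption; a real project would use an existing tool).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Build (postings, idf, doc_norms) from a dict of doc_id -> text."""
    postings = defaultdict(dict)  # term -> {doc_id: raw term frequency}
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(plist)) for term, plist in postings.items()}
    squared = defaultdict(float)  # doc_id -> sum of squared TF-IDF weights
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    doc_norms = {doc_id: math.sqrt(v) for doc_id, v in squared.items()}
    return dict(postings), idf, doc_norms


def search(query, postings, idf, doc_norms, top_k=5):
    """Rank documents by cosine similarity between TF-IDF query and document vectors."""
    q_weights = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []
    scores = defaultdict(float)  # doc_id -> dot product with the query vector
    for term, q_w in q_weights.items():
        for doc_id, tf in postings[term].items():
            scores[doc_id] += q_w * tf * idf[term]
    ranked = [(d, s / (q_norm * doc_norms[d])) for d, s in scores.items() if s > 0]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]


def save_index(path, index):
    """Write the index tuple to a JSON file (document ids must be strings)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Reload an index previously written by save_index."""
    with open(path, encoding="utf-8") as f:
        postings, idf, doc_norms = json.load(f)
    return postings, idf, doc_norms


if __name__ == "__main__":
    sample = {"d1": "information retrieval and filtering",
              "d2": "collaborative filtering for recommender systems"}
    print(search("retrieval", *build_index(sample)))
```

Keeping the document norms in the index means a query only touches the postings lists of its own terms, which is the main reason an inverted file is used instead of scanning every document.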
- Implement a personalized information filtering system:
- Your system should provide the capability for selective
dissemination of information based on a user profile.
- The system should obtain and subsequently update a user's profile
represented as a set of topics (e.g., using a vector-space representation).
- Based on the user profile, the system (in the background) should search for
items of interest to the user. Depending on the type of target domain, these
items could be interesting Web pages, news stories, blog posts, tweets or
posts on other social networking sites, or even objects of interest (movies,
books, consumer items, etc.). The application can be a general information
filtering agent, or an agent designed to work in a specific target domain
(e.g., a personalized shopping agent, a news filtering agent, etc.).
- The user's profile should be updated when the user provides feedback on one or
more of the recommended items (e.g., using relevance feedback).
- The system should minimally include components to create and maintain an
index of items/documents selected, a component to maintain and update a user
profile, and a component to search selected Web sites in the background for
items of interest.
- Optionally, the system can include additional features
such as clustering and categorization of items selected for the user, the
ability to search for items similar to a recommended item selected by
the user (e.g., a "more like this" capability), etc.
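One possible way to realize the profile update described above is a Rocchio-style relevance feedback step over a vector-space profile. The sketch below is only an illustration under stated assumptions: profiles and items are plain dictionaries of term weights, the function names `update_profile` and `score_item` are hypothetical, and the coefficients `alpha`, `beta`, and `gamma` are illustrative defaults that a real project would tune.

```python
import math
from collections import defaultdict


def update_profile(profile, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update of a term-weight profile.

    The profile moves toward the centroid of items the user marked relevant
    and away from the centroid of items marked non-relevant.
    """
    updated = defaultdict(float)
    for term, weight in profile.items():
        updated[term] += alpha * weight
    for items, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not items:
            continue  # no feedback of this kind, so no centroid to add
        for item in items:
            for term, weight in item.items():
                updated[term] += coeff * weight / len(items)
    # Terms pushed to zero or below are dropped (a common simplification).
    return {term: w for term, w in updated.items() if w > 0}


def score_item(profile, item):
    """Cosine similarity between the profile and a candidate item vector."""
    dot = sum(w * item.get(term, 0.0) for term, w in profile.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_i = math.sqrt(sum(w * w for w in item.values()))
    if norm_p == 0 or norm_i == 0:
        return 0.0
    return dot / (norm_p * norm_i)


if __name__ == "__main__":
    profile = {"python": 1.0}
    liked = [{"python": 1.0, "search": 2.0}]
    disliked = [{"sports": 1.0}]
    print(update_profile(profile, liked, disliked))
```

In a filtering agent, `score_item` could be applied to each newly fetched item, with only items above some threshold shown to the user; the threshold itself is a design choice left to the project.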
- Design an enhanced user interface for a retrieval system:
- Your interface should help guide the user in formulating a
query. You can explore options such as the use of a
classification hierarchy (such as Yahoo's category labels), providing the
capability for natural language queries (possibly through the use
of WordNet and basic natural language processing tools such as
part-of-speech tagging), adding context-awareness by maintaining a user
profile (based on past searches or other types of preference elicitation) in
order to reduce ambiguity in queries, etc.
- Your system should also provide an enhanced interface for
the user to browse the retrieved documents and provide mechanisms
such as relevance feedback and query by example.
- Finally, the system should have the ability to cluster the retrieved
documents (preferably using hierarchical clustering) and present the
clusters to users for easier navigation and browsing.
- For this project you don't have to implement your own indexing
and matching algorithms. However, you may need to modify an
existing system (with source code) to incorporate the additional
functionality. You may also need to do post-processing of documents retrieved as a result
of a search.
- Build a simple recommender system:
- Allow multiple users to access a server and rate items based on their
preferences (e.g., movies, music, Web pages, etc.).
- Use collaborative filtering technology (or other profiling techniques
such as clustering) to find similar groups of users.
- Based on ratings of other similar users create dynamic recommendations
for the current user of the system.
- Alternatively or in conjunction with collaborative filtering, you may use
content-based filtering approaches that compare items in a user's profile
with other similar items as a way to generate recommendations.
- Many different variations of this idea are possible.
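The collaborative filtering step in the recommender project can be sketched as follows. This is a minimal user-based variant under assumptions chosen for illustration: ratings are held in memory as a dictionary of user -> {item: rating}, similarity is cosine similarity with unrated items treated as zero, and predictions are similarity-weighted averages over the `k` nearest neighbors. The function names are hypothetical, and other similarity measures (such as Pearson correlation) or item-based approaches are equally valid choices.

```python
import math
from collections import defaultdict


def cosine_sim(a, b):
    """Cosine similarity between two rating dicts (missing ratings count as 0)."""
    dot = sum(rating * b[item] for item, rating in a.items() if item in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, k=2, top_n=3):
    """Recommend unseen items to `user` from the k most similar users.

    Returns a list of (item, predicted_rating) pairs, best first.
    """
    own = ratings[user]
    sims = [(cosine_sim(own, other_ratings), other)
            for other, other_ratings in ratings.items() if other != user]
    # Keep only neighbors with positive similarity, most similar first.
    neighbors = sorted((s for s in sims if s[0] > 0), reverse=True)[:k]
    totals = defaultdict(float)   # item -> sum of similarity * rating
    weights = defaultdict(float)  # item -> sum of similarities
    for sim, other in neighbors:
        for item, rating in ratings[other].items():
            if item not in own:
                totals[item] += sim * rating
                weights[item] += sim
    predictions = [(item, totals[item] / weights[item]) for item in totals]
    predictions.sort(key=lambda pair: (-pair[1], pair[0]))
    return predictions[:top_n]


if __name__ == "__main__":
    demo = {
        "alice": {"A": 5, "B": 4},
        "bob": {"A": 5, "B": 4, "C": 5},
        "carol": {"A": 1, "B": 2, "D": 3},
    }
    print(recommend("alice", demo))
```

A user with no overlap with anyone else gets an empty list here; a fuller system might fall back to popular items or to content-based filtering in that case.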