Final Project Checklist - Information about
what you need to submit for the final project.
- Projects may be done in groups of up to 3, depending on the size and
complexity of the project. Each group or individual must submit a specific project proposal to
be approved by February 11. Note that the size and makeup of the groups must also be approved
along with the project proposal.
- Due date for final project: Wednesday, March 15.
The following is a list of ideas for the class project. You may
choose any of these ideas or their variations. You may also choose
to combine parts of these projects, or come up with your own idea. In
all cases, however, your project idea is subject to approval based on a project
proposal that specifies a set of
project requirements and deliverables.
Implementation Projects
Implementation projects involve the development and evaluation of an original
application using information retrieval, text mining, and/or machine learning
techniques. The application must be tested and evaluated using appropriate test
data sets. The application must also involve the use of one or more of the
modeling techniques relevant to the course topics. Your application may also
be a significant extension of an existing application or technique
discussed in class materials or other sources (in this case, the application
must be extended to include additional or more sophisticated types of features).
The deliverable for the project must include the fully documented code,
distribution files, including any third party sources, installation/deployment
documents (including examples, screen shots of test runs, etc.), data used for
the evaluation of the application, and a detailed project report providing a
description of the components of the application and the results of evaluation.
Many different types of applications are possible, but some examples of such
applications include (but are not limited to) the following.
- Build your own search/retrieval system:
- Should include implementations for the basic components including
separate crawler, indexer, and query processing components (including a
reasonable query interface)
- Should work on a local document corpus in a
directory structure or as a Web search engine (applied to a limited set of
Web sites or for a specific domain)
- The indexing component should parse and index documents using an inverted file format
with relevant term frequency information
- Should make use of stemming and stop lists (you can
use existing tools for this part).
- The system should use TF-IDF weights (and possibly additional weighting
schemes) for index terms
- The base implementation should use the
vector-space model with cosine similarity for matching
queries against indexed documents. Optionally, you can implement other retrieval
models such as probabilistic models or models based on link analysis.
- It should be possible to save the index to offline storage and reload it for
subsequent retrieval sessions (during a retrieval session, the search engine
should run in the background as a server process and handle incoming
queries).
- Optional components or functionality can be added depending on the
desired features or complexity of the project, including: additional
weighting schemes, part-of-speech tagging, phrase indexing, n-gram indexing,
proximity operators, personalized search, and relevance feedback.
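As a rough illustration of the indexing and matching requirements above, the following Python sketch builds an in-memory inverted index with term frequencies, weights terms by TF-IDF, and ranks documents by cosine similarity. It is only a starting point under simplifying assumptions: the function names, the tiny stop list, and the sample documents are all hypothetical, there is no stemming, and a real project would add a crawler, persistent index storage, and a server process.

```python
import math
from collections import Counter, defaultdict

# Illustrative stop list only; a real system would use an existing tool's list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words (no stemming)."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: raw term frequency}.

    Also returns IDF values and the TF-IDF vector norm of each document.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(postings)) for term, postings in index.items()}
    squared = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    norms = {doc_id: math.sqrt(total) for doc_id, total in squared.items()}
    return index, idf, norms


def search(query, index, idf, norms):
    """Return (doc_id, cosine score) pairs, best match first."""
    q_weights = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        # Only documents in the postings list of a query term can score above zero.
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_w * tf * idf[term]
    ranked = []
    for doc_id, dot in dots.items():
        denom = q_norm * norms.get(doc_id, 0.0)
        if denom > 0:
            ranked.append((doc_id, dot / denom))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document corpus.
sample_docs = {
    "d1": "The cat sat on the mat",
    "d2": "The dog chased the cat",
    "d3": "Dogs and cats are popular pets",
}
sample_index, sample_idf, sample_norms = build_index(sample_docs)
for doc_id, score in search("cat mat", sample_index, sample_idf, sample_norms):
    print(doc_id, round(score, 3))
```

Because there is no stemming in this sketch, "cats" in d3 does not match the query term "cat"; that is the kind of gap the stemming requirement is meant to close.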
- Implement a personalized information filtering system:
- Your system should provide the capability for selective
dissemination of information based on a user profile.
- The system should obtain and subsequently update a user's profile
represented as a set of topics (e.g., using a vector-space representation).
- Based on the user profile, the system (in the background) should search for
items of interest to the user. Depending on the type of target domain, these
items could be interesting Web pages, news stories, blog posts, tweets or
posts on other social networking sites, or even objects of interest (movies,
books, consumer items, etc.). The application can be a general information
filtering agent, or an agent designed to work in a specific target domain
(e.g., a personalized shopping agent, a news filtering agent, etc.).
- The user's profile should be updated when the user provides feedback on one or
more of the recommended items (e.g., using relevance feedback).
- The system should minimally include components to create and maintain an
index of items/documents selected, a component to maintain and update a user
profile, and a component to search selected Web sites in the background for
items of interest.
- Optionally, the system can include additional features
such as clustering and categorization of items selected for the user; the
ability to search for items similar to a recommended item selected by
the user (e.g., a "more like this" capability), etc.
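One common way to update a vector-space user profile from relevance feedback is a Rocchio-style update, sketched below in Python. This is one possible approach, not a required one: the function name, the sparse dictionary representation, and the coefficient values (alpha, beta, gamma) are illustrative assumptions that a project would need to tune.

```python
def rocchio_update(profile, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated profile vector (dict mapping term -> weight).

    profile:      current profile vector
    relevant:     list of item vectors the user marked as relevant
    non_relevant: list of item vectors the user marked as not relevant
    The coefficient defaults are illustrative, not prescribed values.
    """
    updated = {term: alpha * weight for term, weight in profile.items()}
    for items, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not items:
            continue
        for item in items:
            for term, weight in item.items():
                # Add (or subtract) this item's share of the centroid of its group.
                updated[term] = updated.get(term, 0.0) + coeff * weight / len(items)
    # Terms pushed to zero or below are dropped from the profile.
    return {term: weight for term, weight in updated.items() if weight > 0}


# Hypothetical feedback round: one liked item, one disliked item.
current_profile = {"python": 1.0}
liked = [{"python": 1.0, "search": 2.0}]
disliked = [{"sports": 1.0, "search": 1.0}]
new_profile = rocchio_update(current_profile, liked, disliked)
print(sorted(new_profile))
```

The updated profile moves toward the terms of the liked item and away from those of the disliked item, so later background searches can be scored against the new vector.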
- Design an enhanced user interface for a retrieval system:
- Your interface should help guide the user in formulating a
query. You can explore options such as the use of a
classification hierarchy (such as Yahoo's category labels), providing the
capability for natural language queries (possibly through the use
of WordNet and basic natural language processing tools such as
part-of-speech tagging), adding context-awareness by maintaining a user
profile (based on past searches or other types of preference elicitation) in
order to reduce ambiguity in queries, etc.
- Your system should also provide an enhanced interface for
the user to browse the retrieved documents and provide mechanisms
such as relevance feedback and query by example.
- Finally, the system should have the ability to cluster the retrieved
documents (preferably using hierarchical clustering) and present the
clusters to users for easier navigation and browsing.
- For this project you don't have to implement your own indexing
and matching algorithms; however, you may need to modify an
existing system (with source code) to incorporate the additional
functionality. You may also need to do post-processing of documents retrieved as a result
of a search.
- Build and evaluate a recommender system:
- Allow multiple users to access a server and rate items based on their
preferences (e.g., movies, music, Web pages, etc.);
- Use different methods such as collaborative and content-based filtering (or other profiling techniques
such as clustering).
- Based on ratings of other similar users create dynamic recommendations
for the current user of the system.
- Alternatively, or in conjunction with collaborative filtering, you may use
content-based filtering approaches that compare items in a user's profile
with other similar items as a way to generate recommendations.
- Many different variations of this idea are possible.
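For the recommender system idea, a minimal user-based collaborative filtering sketch in Python might look like the following. It assumes ratings are kept in a nested dictionary (user -> item -> rating), measures user similarity with Pearson correlation over co-rated items, and predicts a rating as a similarity-weighted average over the most similar users. The function names, the neighborhood size k, and the sample ratings are hypothetical; many other similarity measures and prediction formulas are possible.

```python
import math


def pearson(a, b):
    """Pearson correlation between two users' ratings over their co-rated items."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    dev_a = [a[i] - mean_a for i in common]
    dev_b = [b[i] - mean_b for i in common]
    num = sum(x * y for x, y in zip(dev_a, dev_b))
    den = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return num / den if den else 0.0


def predict(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item.

    Only positively correlated neighbors are used; returns None if there are none.
    """
    candidates = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim > 0:
            candidates.append((sim, other_ratings[item]))
    top = sorted(candidates, reverse=True)[:k]
    if not top:
        return None
    return sum(sim * rating for sim, rating in top) / sum(sim for sim, _ in top)


# Hypothetical ratings on a 1-5 scale.
sample_ratings = {
    "alice": {"m1": 5, "m2": 4},
    "bob": {"m1": 5, "m2": 4, "m3": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 1},
}
print(predict(sample_ratings, "alice", "m3"))
```

In this toy data only bob's ratings correlate positively with alice's, so the prediction for m3 follows bob's rating. An evaluation would compare such predictions against held-out ratings.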
Research projects involve doing an in-depth study, survey, or evaluation of one or more topics
related to information retrieval and filtering. The project can take the form of
examining the use of a specific technique or model in various IR systems, or it can be a detailed
case study involving two or more existing IR systems. In either case, the paper should contain a
summary and a technical evaluation of the state-of-the-art related to the particular topic studied.
If the paper involves a case study, then a thorough comparative evaluation with other similar
systems must be provided. A research
paper should present a new idea or provide a detailed survey of methods to solve a specific IR-related problem. The approach
presented should be, at least in part, a novel and original contribution, and should be
evaluated experimentally. A research paper could be a good start for a Master's or Ph.D. research
project. The maximum length for the written projects is 20 single-spaced pages (12-point font),
including figures and references. The evaluation of the papers will be based on clarity,
thoroughness, soundness, originality, and evaluation of ideas and concepts presented, as well as the overall organization
of the paper.
Note: Research projects should not simply be a summary of some of the
material covered directly in the lectures, but rather should go beyond this
material in one or more specific focus areas and attempt to survey and
synthesize some of the recent research ideas and methods in that focus area. A
typical research paper should also include implementations of one or more such
techniques and their evaluation against some baselines using at least one data
set.
A list of potential general areas is as follows.
- Personalization in Search: A study of various techniques and approaches used to
create personalized search applications on the Web. The study should include a
survey and a comparative evaluation of techniques for re-ranking or filtering search results based on user
profiles, as well as intelligent agents that take into account user
characteristics or profiles to assist users in search.
- Topic/event prediction and tracking: using pattern extraction from
unstructured data (such as news stories, social media posts, tweets, etc.)
possibly in conjunction with the underlying graph structures inherent in social
networks to identify and track topics, or to predict events.
- A comparative study of implementation techniques for scalable information
retrieval on large-scale search engines or Web-based information systems (such
as Google, Facebook, etc.). This study must include an analysis of challenges in
managing and leveraging large data repositories and various proposed and
implemented solutions (such as Bigtable, MapReduce, and other approaches based
on "cloud computing"). The study can also focus on implementation platforms that
enable scalable retrieval (e.g., Hadoop).
- Study of the use of social network analysis in information
retrieval. This study should include a detailed summary of various techniques
from SNA and their use in providing relevant information to users in online
social networks and/or traditional search engines. The study should also explore
the use of network and graph structures in social networks to identify or
predict trends, patterns, and relationships.
- Integration of semantic knowledge in search: a study of various techniques
to mine information and patterns from semi-structured data on the Web for more
intelligent search. The study may include the use of agents designed to extract
specific types of information, the integration of
semantic knowledge such as ontologies and knowledge graphs into current search
technologies, etc., as well as the use of natural language processing and other
relevant techniques to extract meaningful semantic information from
- Recommender Systems: A comparative study of various recommender system
techniques including different approaches to collaborative
and content-based filtering and their applications in several recommender systems. The study should include
a technical summary of various techniques, and an evaluation of existing methods in use today on the Web.