CSC 575
Winter 2023

Intelligent Information Retrieval

Class Project

Notes:

  • Final Project Checklist - Information about what you need to submit for the final project.
  • Projects may be done in groups of up to 3, depending on the size and complexity of the project. Each group or individual must submit a specific project proposal to be approved by February 11. Note that the size and makeup of each group must also be approved along with the project proposal.
  • Due date for final project: Wednesday, March 15.

The following is a list of ideas for the class project. You may choose any of these ideas or their variations. You may also combine parts of these projects, or come up with your own idea. In all cases, however, your project idea is subject to approval based on a project proposal that specifies a set of project requirements and deliverables.

Implementation Projects

Implementation projects involve the development and evaluation of an original application using information retrieval, text mining, and/or machine learning techniques. The application must be tested and evaluated using appropriate test data sets. The application must also involve the use of one or more of the modeling techniques relevant to the course topics. Your application may also be a significant extension of an existing application or technique discussed in class materials or other sources (in this case, the application must be extended to include additional or more sophisticated types of features). The deliverables for the project must include the fully documented code; distribution files, including any third-party sources; installation/deployment documents (including examples, screenshots of test runs, etc.); the data used for the evaluation of the application; and a detailed project report describing the components of the application and the results of the evaluation. Many different types of applications are possible; some examples include (but are not limited to) the following.
  1. Build your own search/retrieval system:
    • Should include implementations for the basic components including separate crawler, indexer, and query processing components (including a reasonable query interface)
    • Should work on a local document corpus in a directory structure or as a Web search engine (applied to a limited set of Web sites or for a specific domain)
    • The indexing component should parse and index documents using inverted file format with relevant term frequency information
    • Should make use of stemming and stop lists (you can use existing tools for this part).
    • The system should use TF-IDF weights (and possibly additional weighting schemes) for index terms
    • The base implementation should use the vector-space model with cosine similarity for matching queries against indexed documents. Optionally, you can implement other retrieval models such as probabilistic models or models based on link analysis.
    • It should be possible to save the index to an offline storage and reload it for subsequent retrieval sessions (during a retrieval session, the search engine should run in the background as a server process and handle incoming queries).
    • Optional components or functionality can be added depending on the desired features or complexity of the project, including: additional weighting schemes, part-of-speech tagging, phrase indexing, n-gram indexing, proximity operators, personalized search, and relevance feedback.
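
As a rough illustration of how the indexing and matching requirements above fit together, the following Python sketch builds an in-memory inverted file with term frequencies, derives TF-IDF weights, and ranks documents by cosine similarity. It is only a minimal sketch under several assumptions: the function names build_index and search are hypothetical, tokens are assumed to be already stemmed and stop-word filtered, and the weighting uses raw tf times log(N/df), which is one common variant among many. Crawling, persistence, and the server process are not shown.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index from docs: {doc_id: [token, ...]}.

    Tokens are assumed to be stemmed and stop-word filtered already.
    Returns (postings, idf, norms).
    """
    postings = defaultdict(dict)  # term -> {doc_id: raw term frequency}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf

    n_docs = len(docs)
    # Assumed IDF variant: log(N / df). Terms in every document get weight 0.
    idf = {term: math.log(n_docs / len(plist)) for term, plist in postings.items()}

    # Precompute the Euclidean length of each document's TF-IDF vector.
    squared = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    norms = {doc_id: math.sqrt(value) for doc_id, value in squared.items()}
    return postings, idf, norms


def search(query_tokens, postings, idf, norms):
    """Rank documents by cosine similarity to the query (best first)."""
    q_weights = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []

    # Accumulate dot products by walking only the postings of query terms.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in postings[term].items():
            dots[doc_id] += q_w * tf * idf[term]

    ranked = [(doc_id, dot / (norms[doc_id] * q_norm))
              for doc_id, dot in dots.items() if norms[doc_id] > 0]
    return sorted(ranked, key=lambda pair: -pair[1])


if __name__ == "__main__":
    corpus = {
        "d1": ["cat", "sat", "mat"],
        "d2": ["dog", "sat", "log"],
        "d3": ["cat", "cat", "dog"],
    }
    index = build_index(corpus)
    for doc_id, score in search(["cat"], *index):
        print(doc_id, round(score, 3))
```

Because postings, idf, and norms are plain dictionaries, they could be written to disk (for example with the standard pickle or json modules) and reloaded in a later session, which is one simple way to meet the offline storage requirement.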

  2. Implement a personalized information filtering system:
    • Your system should provide the capability for selective dissemination of information based on a user profile.
    • The system should obtain and subsequently update a user's profile represented as a set of topics (e.g., using a vector-space representation)
    • Based on the user profile, the system (in the background) should search for items of interest to the user. Depending on the type of target domain, these items could be interesting Web pages, news stories, blog posts, tweets or posts on other social networking sites, or even objects of interest (movies, books, consumer items, etc.). The applications can be a general information filtering agent, or an agent designed to work in a specific target domain (e.g., a personalized shopping agent, a news filtering agent, etc.).
    • The user's profile should be updated when the user provides feedback on one or more of recommended items (e.g., using relevance feedback).
    • The system should minimally include components to create and maintain an index of items/documents selected, a component to maintain and update a user profile, and a component to search selected Web sites in the background for items of interest.
    • Optionally, the system can include additional features such as clustering and categorization of items selected for the user, the ability to search for items similar to a recommended item selected by the user (e.g., a "more like this" capability), etc.
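
One possible way to update a vector-space user profile from feedback is a Rocchio-style update, sketched below in Python. This is an illustration rather than a required design: the function name update_profile is hypothetical, profiles and items are assumed to be sparse term-weight dictionaries, and the alpha, beta, and gamma values are conventional defaults that would need tuning for a real system.

```python
def update_profile(profile, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style profile update.

    profile:       {term: weight} for the current user profile
    relevant:      list of {term: weight} vectors the user liked
    non_relevant:  list of {term: weight} vectors the user rejected
    Returns a new profile; terms whose weight drops to 0 or below are removed.
    """
    terms = set(profile)
    for vector in relevant + non_relevant:
        terms.update(vector)

    updated = {}
    for term in terms:
        # Centroid weight of the term in each feedback set (0 if the set is empty).
        pos = sum(v.get(term, 0.0) for v in relevant) / len(relevant) if relevant else 0.0
        neg = sum(v.get(term, 0.0) for v in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * profile.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:
            updated[term] = weight
    return updated


if __name__ == "__main__":
    profile = {"python": 1.0}
    liked = [{"python": 1.0, "search": 2.0}]
    disliked = [{"snake": 1.0}]
    print(update_profile(profile, liked, disliked))
```

The updated profile can then be used as a standing query against newly indexed items, so the same cosine matching used for ad hoc retrieval also drives the filtering step.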

  3. Design an enhanced user interface for a retrieval system:
    • Your interface should help guide the user in formulating a query. You can explore options such as the use of a classification hierarchy (such as Yahoo's category labels), providing the capability for natural language queries (possibly through the use of WordNet and basic natural language processing tools such as part-of-speech tagging), adding context-awareness by maintaining a user profile (based on past searches or other types of preference elicitation) in order to reduce ambiguity in queries, etc.
    • Your system should also provide an enhanced interface for the user to browse the retrieved documents and provide mechanisms such as relevance feedback and query by example.
    • Finally, the system should have the ability to cluster the retrieved documents (preferably using hierarchical clustering) and present the clusters to users for easier navigation and browsing.
    • For this project you don't have to implement your own indexing and matching algorithms; however, you may need to modify an existing system (with source code) to incorporate the additional capabilities. You may also need to post-process the documents retrieved as the result of a search.
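
To make the clustering requirement more concrete, here is a small Python sketch of agglomerative (bottom-up hierarchical) clustering over retrieved documents, using group-average cosine similarity. It is a deliberately naive illustration: the names cosine and agglomerative are hypothetical, documents are assumed to be sparse term-weight dictionaries, and the stopping threshold of 0.2 is an arbitrary assumption. A real system would likely use an optimized library implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, threshold=0.2):
    """Group-average agglomerative clustering.

    vectors: {doc_id: {term: weight}}
    Repeatedly merges the two most similar clusters until the best
    similarity falls below `threshold`. Returns (clusters, merges), where
    merges records each merge as (left, right, similarity) and can be used
    to draw the hierarchy.
    """
    clusters = [[doc_id] for doc_id in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                average = sum(sims) / len(sims)
                if best is None or average > best[0]:
                    best = (average, i, j)
        average, i, j = best
        if average < threshold:
            break
        merges.append((clusters[i], clusters[j], average))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


if __name__ == "__main__":
    hits = {
        "a": {"jaguar": 1.0, "car": 1.0},
        "b": {"jaguar": 1.0, "car": 0.8, "engine": 0.3},
        "c": {"jaguar": 1.0, "cat": 1.0, "jungle": 0.5},
    }
    final_clusters, history = agglomerative(hits, threshold=0.5)
    print(final_clusters)
```

The merge history gives the tree structure that an interface could present as expandable folders, while the final clusters give a flat grouping at the chosen threshold.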

  4. Build and evaluate a recommender system:
    • Allow multiple users to access a server and rate items based on their preferences (e.g., movies, music, Web pages, etc.).
    • Use different methods such as collaborative and content-based filtering (or other profiling techniques such as clustering).
    • Based on the ratings of other similar users, create dynamic recommendations for the current user of the system.
    • Alternatively, or in conjunction with collaborative filtering, you may use content-based filtering approaches that compare items in a user's profile with other similar items as a way to generate recommendations.
    • Many different variations of this idea are possible.
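
The idea of recommending from the ratings of similar users can be sketched as user-based collaborative filtering with Pearson correlation and mean-centered prediction, as in the Python example below. This is one standard formulation, not a prescribed one: the function names pearson and predict are hypothetical, ratings are assumed to be held in a nested dictionary, and the neighborhood size k=2 and the choice to ignore negatively correlated users are illustrative assumptions.

```python
import math


def pearson(r_u, r_v):
    """Pearson correlation between two users over their co-rated items."""
    common = set(r_u) & set(r_v)
    if len(common) < 2:
        return 0.0
    mean_u = sum(r_u[i] for i in common) / len(common)
    mean_v = sum(r_v[i] for i in common) / len(common)
    num = sum((r_u[i] - mean_u) * (r_v[i] - mean_v) for i in common)
    den = (math.sqrt(sum((r_u[i] - mean_u) ** 2 for i in common))
           * math.sqrt(sum((r_v[i] - mean_v) ** 2 for i in common)))
    return num / den if den else 0.0


def predict(ratings, user, item, k=2):
    """Predict user's rating for item from the k most similar raters.

    ratings: {user: {item: rating}}
    Falls back to the user's own mean rating when no positively
    correlated neighbor has rated the item.
    """
    own = ratings[user]
    mean_user = sum(own.values()) / len(own)

    neighbors = []
    for other, other_ratings in ratings.items():
        if other != user and item in other_ratings:
            sim = pearson(own, other_ratings)
            if sim > 0:
                neighbors.append((sim, other))
    neighbors = sorted(neighbors, reverse=True)[:k]
    if not neighbors:
        return mean_user

    # Weighted average of each neighbor's deviation from their own mean.
    num = 0.0
    for sim, other in neighbors:
        other_ratings = ratings[other]
        mean_other = sum(other_ratings.values()) / len(other_ratings)
        num += sim * (other_ratings[item] - mean_other)
    den = sum(sim for sim, _ in neighbors)
    return mean_user + num / den


if __name__ == "__main__":
    data = {
        "ann": {"m1": 5, "m2": 3, "m3": 4},
        "bob": {"m1": 5, "m2": 3, "m3": 4, "m4": 5},
        "cat": {"m1": 1, "m2": 5, "m4": 1},
    }
    print(round(predict(data, "ann", "m4"), 2))
```

For the evaluation part of the project, predictions like these can be compared against held-out ratings using an error measure such as mean absolute error.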

Research Papers

Research projects involve doing an in-depth study, survey, or evaluation of one or more topics related to information retrieval and filtering. The project can take the form of a research paper examining the use of a specific technique or model in various IR systems, or it can be a detailed case study involving two or more existing IR systems. In either case, the paper should contain a summary and a technical evaluation of the state of the art related to the particular topic studied. If the paper involves a case study, then a thorough comparative evaluation with other similar systems must be provided. A research paper should present a new idea or provide a detailed survey of methods to solve a specific IR-related problem. The approach presented should be, at least in part, a novel and original contribution, and should be evaluated experimentally. A research paper could be a good start for a Master's or Ph.D. research project. The maximum length for the written projects is 20 single-spaced pages (12-point font), including figures and references. The evaluation of the papers will be based on clarity, thoroughness, soundness, originality, and evaluation of the ideas and concepts presented, as well as the overall organization of the paper.

Note: Research projects should not simply be a summary of some of the material covered directly in the lectures, but rather should go beyond this material in one or more specific focus areas and attempt to survey and synthesize some of the recent research ideas and methods in that focus area. A typical research paper should also include implementations of one or more such techniques and their evaluation against some baselines using at least one data set.

A list of potential general areas is as follows.

  • Personalization in Search: A study of various techniques and approaches used to create personalized search applications on the Web. The study should include a survey and a comparative evaluation of techniques for re-ranking or filtering search results based on user profiles, as well as intelligent agents that take into account user characteristics or profiles to assist users in search.

  • Topic/event prediction and tracking: using pattern extraction from unstructured data (such as news stories, social media posts, tweets, etc.) possibly in conjunction with the underlying graph structures inherent in social networks to identify and track topics, or to predict events.

  • A comparative study of implementation techniques for scalable information retrieval on large-scale search engines or Web-based information systems (such as Google, Facebook, etc.). This study must include an analysis of challenges in managing and leveraging large data repositories and various proposed and implemented solutions (such as Bigtable, MapReduce, and other approaches based on "cloud computing"). The study can also focus on implementation platforms that enable scalable retrieval (e.g., Hadoop).

  • Study of the use of social network analysis (SNA) in information retrieval. This study should include a detailed summary of various techniques from SNA and their use in providing relevant information to users in online social networks and/or traditional search engines. The study should also explore the use of network and graph structures in social networks to identify or predict trends, patterns, and relationships.

  • Integration of semantic knowledge in search: a study of various techniques to mine information and patterns from semi-structured data on the Web for more intelligent search. The study may include the use of agents designed to extract specific types of information, the integration of semantic knowledge such as ontologies and knowledge graphs into current search technologies, etc., as well as the use of natural language processing and other relevant techniques to extract meaningful semantic information from unstructured data.

  • Recommender Systems: A comparative study of various recommender systems techniques, including different approaches to collaborative and content-based filtering and their applications in several recommender systems. The study should include a technical summary of various techniques and an evaluation of existing methods in use today on the Web.



Copyright ©, Bamshad Mobasher, DePaul University.