Final Project Checklist - Information about
what you need to submit for the final project.
In addition to assignments that will be completed throughout the
quarter, you will be required to complete a final project in the class by the
end of the quarter. There are two important due dates to keep in mind:
The project proposals are due no later than
Sunday, May 3. The project due date is Monday, June 8.
You may choose one of three types of projects: an implementation project, a
research paper, or a data analysis project. A project proposal must be submitted and approved prior to
proceeding with the project. The project proposal (1-2 pages) must include partner names,
project type (one of the three categories below), project description, proposed methodology, techniques, approaches, implementation choices,
and a tentative schedule. In the case of research papers, the proposal must contain an abstract,
a short (but detailed) outline, and a list of reference sources to be used in the research.
In the case of a data analysis project, the proposal must include a detailed
description (and samples) of the data to be analyzed, the data mining problems
to be solved, the techniques to be employed to solve the problems, and the tools
to be used.
Note: Research papers must be done individually. Implementation or data
analysis projects can be
done individually or in groups of up to 3 students (depending on the scope and complexity
of the project, and with prior approval).
More details on and examples of different project types are provided below.
Research Papers
Research papers involve doing an in-depth study, survey, or evaluation of
one or more topics related to Web data mining. A research paper may examine the use of specific data mining or Web mining techniques
in one or more application areas. A research paper must relate to one or more of the topics
discussed in class, but must not be simply a summary of the material covered in class or the
readings. The goal of such a "research project" is to go beyond the class material and
examine one of the topics in a much more in-depth manner. The evaluation of the papers will
be based on thoroughness (including adequate coverage of relevant
issues/techniques as well as references to related work), soundness (including justification for any claims made, illustrative examples, and correct and
adequate analysis of connections or relationships among concepts or techniques),
clarity/organization, and significance (defined by the degree to which the paper
covers new material, and the extent of the original work by the author in
drawing conclusions and synthesizing scholarly work in this area). Note that a
research paper must not simply be a concatenation of material from several other
papers, but must include some original analysis of that work in the context of
the paper topic. The suggested length for a research paper is 15-20 single-spaced pages.
Some examples of topics for research papers include
but are not limited to the following:
Recommender Systems and
Personalization: a comparative study of various techniques in
data mining and machine learning to learn user profiles and predict future user
behavior. The study should examine a variety of techniques and approaches used in the
design of recommender systems (including collaborative filtering, content-based filtering,
model-based approaches which use data mining, and hybrid recommender systems).
Web content mining/Text mining: a study of various techniques to mine information and
patterns from semi-structured data on the web such as text, Web documents, news,
reviews, etc. The paper may examine in detail some of the related
topics such as the applications of text mining on the Web, information extraction and
mining of data records from the Web, concept discovery from document collections,
event detection and topic tracking, sentiment analysis and opinion mining, etc.
Web structure mining: a study of various techniques to mine knowledge from the linkage
structure of the Web. Among the topics that can be explored in this study are applications
of structure mining in information retrieval (such as Google's PageRank algorithm),
algorithms based on the notions of Hubs and Authorities, and the automatic discovery
of Web communities. The paper may also examine approaches based on graph
analysis and mining used in social or information networks.
Data Mining on the Social Web:
Social Web (Web 2.0) technologies allow users to connect, share resources, and
actively generate content on Web sites. How can the rich data in social
networking sites such as Facebook and Twitter, or in resource sharing
sites such as Flickr, Last.fm, and Delicious, be mined to help users interact
with these sites more effectively? Among the topics that can be explored are the
use of Social Network Analysis algorithms to discover interesting relationships,
the use of data mining and machine learning algorithms to predict user behavior or to
recommend resources and users (friends), and applications to tag suggestion/recommendation
in social tagging Web sites.
A study of platforms and
approaches to mine and analyze data at large scale. This study must include an analysis of the challenges in
managing and leveraging large data repositories and of the various proposed and
implemented solutions (such as Bigtable, MapReduce, and other approaches based
on "cloud computing"). The study can also focus on implementation platforms that
enable scalable data mining (e.g., Hadoop), and should ideally provide illustrative
examples or experiments with one or more of these platforms.
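To make the Web structure mining topic above more concrete, the following is a minimal sketch of link-based ranking in the spirit of PageRank, computed by power iteration. It is an illustration only, not the algorithm as deployed by any search engine: the function name pagerank, the damping value of 0.85, the uniform handling of pages without out-links, and the toy graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Rank pages by power iteration.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Each page passes an equal share of its rank along every out-link.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)

        # Stop once the total change between iterations is negligible.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy link graph, invented for illustration.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
toy_ranks = pagerank(toy_graph)
for page in sorted(toy_ranks, key=toy_ranks.get, reverse=True):
    print(page, round(toy_ranks[page], 4))
```

A research paper on this topic would go well beyond such a sketch, for example by comparing it with Hubs and Authorities or by discussing convergence and scalability on real Web graphs.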
Data Analysis Projects
Data Analysis projects involve the application of data mining
and Web mining techniques discussed in class or in readings to one or more
specific data sets. The goal of DA projects is to go through the full data
mining cycle with respect to a particular data set (including the specification
of the business problem to be solved, the specification of the data mining tasks
to be performed, selection, preprocessing, integration, and transformation of
the data, application of several DM tasks and the discovery of patterns,
evaluation of patterns, and recommending specific actions with respect to
relevant findings). A DA project may involve the application of data mining in
a particular domain and data set with which you are familiar such as your work,
the Web, e-commerce, etc. The final report should include a detailed analysis of
the complete scenario for the application of the KDD process, including
specification of the DM problem (based on application objectives), data
collection, data preparation, pattern discovery using a variety of data mining
and statistical techniques, interpretation of results, and conclusions.
Some examples of data analysis projects include:
Performing data mining on Web
usage (or e-commerce) data from a particular Web site in order to analyze the
behavior of users, including various site metrics, user metrics, user segments,
associations, and opportunities for personalization. The project plan must
include all aspects of Web usage preprocessing discussed in class. Note: in lieu of a real data set from another source,
you can use the CTI Web usage data available from the Online Resources section.
Performing the full KDD cycle on a real data
set (other than Web usage or e-commerce data). Examples of such data may include
(but are not limited to) user or customer data from a real business or
organization, census or demographic data, sensor data for device diagnostics,
network traffic data, music playlist data obtained from music sharing Web sites,
social networking data (such as social tags obtained from sites such as
Examination of one or more specific commercial or freely available data mining packages, other than those
used in class. In this option you must be able to install and experiment with the system. The package must
be compared in detail with other comparable products. The final report must include
a technical evaluation and provide a critical analysis of the results of applying
various KDD capabilities provided by the software on at least two realistic data sets,
such as test data sets available from the
UCI KDD Archive.
Implementation Projects
Implementation projects may involve implementing and experimentally evaluating
one or more techniques discussed in the course (e.g., clustering, association rules,
classification, etc.), or combining various data mining techniques (possibly using available
tools) into a Web data mining solution for a specific problem.
Some specific examples of implementation projects include:
Implementing or extending one of the data mining techniques discussed in class
(or related techniques) and testing the implementation on various test data sets,
such as Web usage or e-commerce data or other test data sets available from the
UCI KDD Archive.
Implementing or extending the techniques discussed in class for preprocessing of
Web usage data, including user/session identification, path completion, automatic
discovery and filtering of robot navigation, and pageview identification.
Ideally, this should be implemented as an extension of the WEKA data mining
package, so that the results of preprocessing can be directly used as input for
Designing and implementing a data warehouse for integration and management of
Web usage, structure, content, and e-commerce data, and analyzing this data by
performing OLAP queries against the data warehouse, and using the results as
input to data mining algorithms.
Implementing a system to analyze the effectiveness of a Web site by comparing the
site structure to the navigational behavior of users, analyzing site and user
e-metrics, and predicting user behavior for individual users or segments of users.
Implementing a recommender system based on usage mining, content filtering,
or collaborative filtering techniques discussed in class.
Designing an automated classification tool that uses machine
learning to automatically identify and classify robot navigation sessions from
Web usage log files.
Designing a query language for querying interesting rules or patterns resulting
from Web usage or Web content data.
Developing your own mail, news, or Web information filtering
agent (e.g., an agent that extracts information about a particular topic/product
from specific sites). The agent design must include one or more machine learning
and data mining techniques such as classification, clustering, association rule
mining, Markov models, etc.
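As an illustration of the Web usage preprocessing option in the list above, the following is a minimal sketch of session identification using an inactivity timeout. It is not the method presented in class: the function name sessionize, the (user, timestamp, URL) record format, and the 30-minute timeout are assumptions made for this example, and a real project would also need to handle user identification, path completion, and robot filtering.

```python
from datetime import datetime, timedelta


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group page requests into sessions.

    records: iterable of (user_id, timestamp, url) tuples (hypothetical format).
    A new session starts when a user has been inactive longer than `timeout`.
    """
    sessions = []
    last_seen = {}  # user_id -> (time of last request, index of open session)

    # Process each user's requests in chronological order.
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > timeout:
            # First request by this user, or the gap exceeded the timeout.
            sessions.append({"user": user, "pageviews": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["pageviews"].append((ts, url))
        last_seen[user] = (ts, index)
    return sessions


# Toy log records, invented for illustration.
toy_log = [
    ("u1", datetime(2020, 5, 1, 10, 0), "/home"),
    ("u2", datetime(2020, 5, 1, 10, 5), "/home"),
    ("u1", datetime(2020, 5, 1, 10, 10), "/courses"),
    ("u1", datetime(2020, 5, 1, 11, 0), "/home"),
]
for session in sessionize(toy_log):
    print(session["user"], [url for _, url in session["pageviews"]])
```

The timeout is a tunable heuristic; part of an implementation project could be to compare it against other session identification strategies on real log data.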