Online Resources & Reference Material
Data Mining Resources & Reference Material
Data Sets and Sources of Data
Preprocessed DePaul CTI Web Usage Data -
Cleaned, filtered, and sessionized data on visits to the main CTI site during a 2-week period
in April 2002. The data also includes basic statistics on users and
Cleaned DePaul CTI Web Usage Data - The full cleaned CTI
Web usage data for April 2002. This data set has been cleaned
(including spider removal) and converted into tab-delimited format.
However, no user identification, sessionization, or other data
preparation steps have been performed.
Non-Preprocessed DePaul CTI Web Usage Data -
The full CTI Web usage data for April 2002. The only cleaning step
performed on this data was the removal of references to auxiliary
files (e.g., image files). No other cleaning or preprocessing has been
performed. The data is in the original log format used by Microsoft
- UCI Machine Learning
Repository - A repository of more than 200 data sets for machine
learning and data mining
- Movie Ratings Data - Real movie ratings data
from the www.movielens.org Web
site. Contains ratings on 1600+ movies by 1000 users.
- Kaggle.com Competition Data Sets - Data sets from a
variety of competitions. Also a good source for class project ideas.
- Stanford Large Network
Dataset Collection - A variety of network data sets, including
data from social networks, product reviews, online communities, etc.
- Yelp Data Set
Challenge - Reviews and check-in data on thousands of
- Million Song Dataset - Freely available collection of audio features and
metadata for a million contemporary popular music tracks.
- Public Data sets
on Amazon Web Services - Large public data sets (including data
sets for US Census, Wikipedia, Freebase, human genome project),
ready for big data analytics on the cloud.
- Data.gov -
Publicly available data sets from Federal, State, and local
government, including economic, geological, demographic and many
other types of data sources. This site also includes a list of other
Open Data Sites with
similar publicly available data sources from various cities, states,
- KDnuggets' list of
data sets for data mining
- Infochimps Data
Market - Thousands of data sets, including data from various
social networks and collaborative tagging sites such as Twitter,
Delicious, Last.fm, MusicBrainz, as well as data sets from many