Online Resources & Reference Material
General Python Resources
Important Tools and Libraries
IPython: A REPL for
easy interactive Python development. Extremely useful for testing ideas
out one line of code at a time. We will use IPython Notebook (a Web-based
interactive shell for Python) extensively in this class.
- Jupyter Notebook
(formerly IPython Notebook)
Jupyter Notebook Tutorial - Nice tutorial video by Corey
matplotlib: A very nice plotting library, capable of generating production-level
visualizations programmatically. MATLAB-like syntax makes plotting very
NumPy: The fundamental package for scientific computing with Python.
SciPy: The open source library for mathematics, science and engineering.
scikit-learn: a robust machine learning library
built on top of NumPy, SciPy and matplotlib. Includes a wide
variety of modeling techniques.
Pandas (Python Data
Analysis Library): data structures and tools for common
data analysis tasks, including an efficient data frame implementation
(similar to data frames in R).
BeautifulSoup: A general parsing library particularly
useful for parsing HTML and XML.
NLTK: Natural Language Toolkit for Python, including tools for text
preprocessing, tokenization, and vectorization (you may also be
interested in an online book that
shows how NLTK is used).
NetworkX: Python language library for the creation, manipulation, and analysis of
graphs and networks.
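As a rough illustration of how two of the libraries above fit together, the following minimal sketch builds a small pandas data frame and summarizes it with NumPy and pandas. The data and the column names (`item`, `rating`) are made up for this example and do not come from any of the data sets listed on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: made-up ratings for three made-up items.
ratings = pd.DataFrame({
    "item": ["a", "b", "a", "c", "b", "a"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0, 3.0],
})

# NumPy: vectorized arithmetic on the underlying array
# (subtract the overall mean from every rating at once).
values = ratings["rating"].to_numpy()
centered = values - np.mean(values)

# pandas: group rows by item and compute the mean rating per item.
mean_by_item = ratings.groupby("item")["rating"].mean()

print(centered)
print(mean_by_item)
```

The same pattern (load data into a data frame, then group and aggregate) is a common first step in exploratory data analysis.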
Installation of Python and Scientific Libraries
References for Data Analysis in Python
Other Relevant Tools & Resources
- UCI Machine Learning
Repository - A repository of more than 200 data sets for machine
learning and data mining
Kaggle.com Competition Data Sets - Data sets from a
variety of competitions. Also a good source for class project ideas.
- Yelp Data Set
Challenge - Reviews and check-in data on thousands of
Amazon Product Data - Dataset includes reviews (ratings, text,
helpfulness votes), product metadata (descriptions, category
information, price, brand, and image features), and links (also
viewed/also bought graphs). In addition, this version provides the
MovieLens Ratings and Tag Data - Real movie ratings data
from the www.movielens.org Web
site. Often used for testing recommender systems.
Recommender Systems Data Sets - A collection of recommender
systems datasets used in a variety of research projects.
Million Song Dataset - Freely available collection of audio features and
metadata for a million contemporary popular music tracks.
- Stanford Large Network
Dataset Collection - A variety of network data sets, including
data from social networks, product reviews, online communities, etc.
Online Grocery Shopping Data from Instacart.
- All the
News - 143,000 articles from 15 American publications
- Public Data sets
on Amazon Web Services - Large public data sets (including data
sets for the US Census, Wikipedia, Freebase, and the Human Genome Project),
ready for big data analytics on the cloud.
- Data.gov -
Publicly available data sets from Federal, State, and local
government, including economic, geological, demographic and many
other types of data sources. This site also includes a list of other
Open Data Sites with
similar publicly available data sources from various cities, states,
- KDnuggets' list of
data sets for data mining