Online Resources & Reference Material
General Python Resources
Important Tools and Libraries
- IPython: A REPL for
easy interactive Python development. Extremely useful for testing ideas
out one line of code at a time. We will use IPython Notebook (a web-based
interactive shell for Python) extensively in this class.
- Jupyter Notebook
(formerly IPython Notebook)
Jupyter Notebook Tutorial - Nice tutorial video by Corey
- matplotlib: A very nice plotting library, capable of generating production-level
visualizations programmatically. Matlab-like syntax makes plotting very
- NumPy: The fundamental package for scientific computing with Python.
- SciPy: The open source library for mathematics, science and engineering.
- scikit-learn: A robust machine learning library
built on top of NumPy, SciPy and matplotlib. Includes a wide
variety of modeling techniques.
- Pandas (Python data
analysis library): Data structures and tools for common
data analysis tasks, including an efficient data frame implementation
(similar to data frames in R).
- BeautifulSoup: A general parsing library particularly
useful for parsing HTML and XML.
- NLTK: Natural Language Toolkit for Python, including tools for text
preprocessing, tokenization, and vectorization (you may also be
interested in an online book that
shows how NLTK is used).
- NetworkX: Python language library for the creation, manipulation, and analysis of
graphs and networks.
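
To give a feel for how two of the libraries above fit together, here is a minimal sketch that assumes NumPy and pandas are installed. The ratings, user labels, and movie labels are made up for illustration and do not come from any of the data sets listed on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ratings given by three users to two movies.
ratings = np.array([[5.0, 3.0],
                    [4.0, 2.0],
                    [3.0, 4.0]])

# NumPy: vectorized mean over each column (one value per movie), no explicit loop.
movie_means = ratings.mean(axis=0)

# pandas: wrap the same array in a labeled data frame.
df = pd.DataFrame(ratings,
                  columns=["movie_a", "movie_b"],
                  index=["user_1", "user_2", "user_3"])

# Boolean indexing: keep only the users who rated movie_a at least 4.
fans = df[df["movie_a"] >= 4.0]

print(movie_means)
print(fans)
```

Typing these lines one at a time into IPython or a Jupyter Notebook cell is a good way to see what each intermediate object looks like.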
Installation of Python and Scientific Libraries
References for Data Analysis in Python
Other Relevant Tools & Resources
- UCI Machine Learning
Repository - A repository of more than 200 data sets for machine
learning and data mining
- mldata.org - A large machine
learning data repository
- Movie Ratings Data - Real movie ratings data
from the www.movielens.org website.
Contains ratings of 1600+ movies by 1000 users.
Kaggle.com Competition Data Sets - Data sets from a
variety of competitions. Also a good source for class project ideas.
- Stanford Large Network
Dataset Collection - A variety of network data sets, including
data from social networks, product reviews, online communities, etc.
- Yelp Data Set
Challenge - Reviews and check-in data on thousands of
Online Grocery Shopping Data from Instacart.
- Million Song Dataset - Freely available collection of audio features and
metadata for a million contemporary popular music tracks.
- All the
News - 143,000 articles from 15 American publications
- Public Data sets
on Amazon Web Services - Large public data sets (including data
sets for US Census, Wikipedia, Freebase, human genome project),
ready for big data analytics on the cloud.
- Data.gov -
Publicly available data sets from Federal, State, and local
government, including economic, geological, demographic and many
other types of data sources. This site also includes a list of other
Open Data Sites with
similar publicly available data sources from various cities, states,
- KDnuggets' list of
data sets for data mining
- Infochimps Data
Market - Thousands of data sets, including data from various
social networks and collaborative tagging sites such as Twitter,
Delicious, Last.fm, MusicBrainz, as well as data sets from many