DSC 478
Fall 2022
Online Resources & Reference Material
General Python Resources
Important Tools and Libraries
-
IPython: A REPL for
easy interactive Python development. Extremely useful for testing ideas
out one line of code at a time. We will use IPython Notebook (a
Web-based interactive shell for Python) extensively in this class.
- Jupyter Notebook
(formerly IPython Notebook)
-
Jupyter Notebook Tutorial - Nice tutorial video by Corey
Schafer.
-
matplotlib:
A plotting library capable of generating production-quality
visualizations programmatically. Its MATLAB-like syntax makes plotting
very easy.
-
NumPy:
The fundamental package for scientific computing with Python.
-
SciPy:
An open source library for mathematics, science, and engineering.
-
scikit-learn: A robust machine learning library
built on top of NumPy, SciPy, and matplotlib. Includes a wide
variety of modeling techniques.
-
Pandas (Python data
analysis library): Data structures and tools for common
data analysis tasks, including an efficient data frame implementation
(similar to data frames in R).
-
BeautifulSoup: A general parsing library particularly
useful for parsing HTML and XML.
- NLTK:
Natural Language Toolkit for Python, including tools for text
preprocessing, tokenization, and vectorization (you may also be
interested in an online book that
shows how NLTK is used).
- NetworkX:
Python language library for the creation, manipulation, and analysis of
graphs and networks.
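As a rough illustration of how several of the libraries above fit together, the sketch below generates toy data with NumPy, wraps it in a pandas data frame, and clusters it with scikit-learn. The data, the column names (`x`, `y`, `cluster`), and the choice of k-means are assumptions made for this example only; they are not taken from the course material.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 2))
points = np.vstack([group_a, group_b])

# Wrap the NumPy array in a pandas DataFrame to get labeled columns.
df = pd.DataFrame(points, columns=["x", "y"])

# Fit k-means with two clusters (an illustrative choice of model).
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(df[["x", "y"]])

# Summarize each cluster by its mean coordinates.
summary = df.groupby("cluster")[["x", "y"]].mean()
print(summary)
```

Typing these statements one at a time in IPython or a Jupyter Notebook cell is a convenient way to inspect `df` and `summary` after each step.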
Installation of Python and Scientific Libraries
References for Data Analysis in Python
Other Relevant Tools & Resources
Data Sets
- UCI Machine Learning
Repository - A repository of more than 200 data sets for machine
learning and data mining
-
Kaggle.com Competition Data Sets - Data sets from a
variety of competitions. Also a good source for class project ideas.
- Yelp Data Set
Challenge - Reviews and check-in data on thousands of
businesses.
- Amazon
Product Data - Dataset includes reviews (ratings, text,
helpfulness votes), product metadata (descriptions, category
information, price, brand, and image features), and links (also
viewed/also bought graphs). In addition, this version provides the
following features.
-
Movie Ratings
and Tag Data - Real movie ratings data
from the www.movielens.org
website. Often used for testing recommender system
algorithms.
-
Recommender Systems Data Sets - A collection of recommender
systems datasets used in a variety of research projects.
- Million
Song Dataset - Freely available collection of audio features and
metadata for a million contemporary popular music tracks.
- Stanford Large Network
Dataset Collection - A variety of network data sets, including
data from social networks, product reviews, online communities, etc.
-
Online Grocery Shopping Data from Instacart.
- All the
News - 143,000 articles from 15 American publications
- Public Data sets
on Amazon Web Services - Large public data sets (including data
sets for US Census, Wikipedia, Freebase, human genome project),
ready for big data analytics on the cloud.
- Data.gov -
Publicly available data sets from federal, state, and local
government, including economic, geological, demographic, and many
other types of data sources. This site also includes a list of other
Open Data Sites with
similar publicly available data sources from various cities, states,
and countries.
- KDnuggets' list of
data sets for data mining