CSC 478 - Programming Data Mining Applications

DSC 478
Fall 2022

Course Syllabus

INSTRUCTOR

Bamshad Mobasher
Email: mobasher@cs.depaul.edu
Office: Loop Campus, CDM Building, Room 833
Phone: (312) 362-5174
Office Hours: Tue, Thu 4:00-5:00 PM (held online or by phone; appointments required)

COURSE DESCRIPTION

The course focuses on implementing data mining and machine learning techniques and applying them in a range of domains. The primary tools used in the class are the Python programming language and several associated libraries. Additional open source machine learning and data mining tools may also be used as part of the class material and assignments. Students will gain hands-on experience developing supervised and unsupervised machine learning algorithms and will learn how to employ these techniques in the context of popular applications such as automatic classification, recommender systems, searching and ranking, text mining, group and community discovery, and social media analytics.

PREREQUISITES

[DSC 441 and (DSC 430 or CSC 403)] or CSC 480

TEXTBOOKS & COURSE MATERIAL

We will use numerous online resources and documents throughout the course. The required and recommended textbooks are listed below. The resources directly relevant to topics covered in the course are listed in the Course Material section. Additional resources can be found in the Resources section.

Required Text
	Machine Learning in Action, by Peter Harrington, Manning Publications, 2012.
Recommended Texts
	Python Data Science Essentials, Third Edition, by Alberto Boschetti and Luca Massaron, Packt Publishing, 2018.
	Python Data Science Handbook: Essential Tools for Working with Data, by Jake VanderPlas, 2017.
	Python for Data Analysis, Second Edition, by Wes McKinney, O'Reilly, 2017.

GRADING & COURSE REQUIREMENTS

Grading in the class will center on four assignments and a final project. The assignments will involve Python implementations of selected machine learning techniques and their applications in various domains. They will also involve the use of various Python libraries to perform preprocessing, data exploration/visualization, and data analysis using different data sets. These assignments must be done individually, unless otherwise specified. You may discuss the assignments with others in the class, but you must develop your own solutions to the problems in each assignment. Late assignments will be penalized 10% per day (with weekends counting as one day).

The final project will involve either developing and evaluating an application that uses one or more machine learning algorithms to perform a specific task, or performing a complete analysis of a complex data set using Python tools. The goal of the project is to integrate several concepts covered during the quarter to achieve a more complex task than those explored in the assignments. Students can propose their own project idea (to be approved) and may complete the project either individually or in groups of up to three members. More details on the final project are available in the Project section.

The final grade will be determined (tentatively) based on the following components:

Assignments = 65%
Final Project = 35%

The general grading scheme will be based on a curve. At the end of the quarter, some adjustments may be made based on overall class performance as well as signs of individual effort. Plusses and minuses will be given at the high/low ends of each grade range.
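
As a rough illustration of how the tentative weights above combine, the following sketch computes a raw weighted score before any curve or end-of-quarter adjustment. The function name, the 0-100 scale, and the equal weighting of the individual assignments are assumptions made for the example; the syllabus does not specify them.

```python
def raw_course_score(assignment_scores, project_score,
                     assignment_weight=0.65, project_weight=0.35):
    """Return a weighted raw score on a 0-100 scale, before any curve."""
    if not assignment_scores:
        raise ValueError("at least one assignment score is required")
    # Assumption: every assignment counts equally toward the 65% component.
    assignment_avg = sum(assignment_scores) / len(assignment_scores)
    return assignment_weight * assignment_avg + project_weight * project_score


# Hypothetical scores for four assignments and the final project.
score = raw_course_score([90, 85, 100, 80], 92)
print(round(score, 2))  # 89.89
```

Because the final letter grade is curved, this number only indicates relative standing and does not map directly to a letter.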

TENTATIVE LIST OF TOPICS

The following issues and topics will be covered throughout the course. Many of these topics will be revisited several times during the course in a variety of contexts.

Data Mining and Knowledge Discovery
- The KDD process and methodology
- Data preparation for knowledge discovery and machine learning
- Exploratory data analysis
- Overview of machine learning tasks
- Review of Python and overview of Python tools for data analysis and visualization
Supervised Learning
- Classification and prediction using K-Nearest-Neighbor
- Probabilistic models: Naïve Bayes
- Building decision trees
- Linear regression models and regularization
- Gradient descent optimization
- Support vector machines
- Evaluating predictive models
- Hyper-parameter tuning
- Feature selection
- Ensemble models: bagging and boosting, including Random Forest, AdaBoost, etc.
Unsupervised Learning
- Clustering: K-Means, DBSCAN, etc.
- Hierarchical clustering algorithms
- Principal Component Analysis and dimensionality reduction
- Singular Value Decomposition
- Matrix factorization
Possible Applications (covered throughout the course)
- Recommender systems and personalization
- Document categorization
- Concept discovery from text
- Finding groups using social or behavioral data
- Building predictive models for target marketing
- Customer or user segmentation
- Image segmentation