CMU 15-388/688, Fall 2016
Practical Data Science
Course Information

Course Overview

Data science is the study and practice of how we can extract insight and knowledge from large amounts of data. It is a burgeoning field, currently attracting substantial demand from both academia and industry.

This course provides a practical introduction to the "full stack" of data science analysis, including data collection and processing, data visualization and presentation, statistical model building using machine learning, and big data techniques for scaling these methods. Topics covered include: collecting and processing data using relational methods, time series approaches, graph and network models, free text analysis, and spatial geographic methods; analyzing the data using a variety of statistical and machine learning methods, including linear and non-linear regression and classification, unsupervised learning and anomaly detection, plus advanced machine learning methods like kernel approaches, boosting, or deep learning; visualizing and presenting data, particularly focusing on the case of high-dimensional data; and applying these methods to big data settings, where multiple machines and distributed computation are needed to fully leverage the data.

As the course name suggests, this course will focus on the practical aspects of data science, with an emphasis on implementing and making use of the above techniques. Students will complete weekly programming homework that emphasizes practical understanding of the methods described in the course. In addition, students will develop a tutorial on an advanced topic, and will complete a group project that applies these data science techniques to a practical application chosen by the team; these two longer assignments will be done in lieu of a midterm or final.

Data collection and processing

Ingest data from unstructured and structured sources, and use relational models, time series algorithms, graph and network processing, natural language processing, and geographic information system processes to store and manage the data.

Statistical modeling

Apply basic statistical techniques and analyses to understand properties of the data and to design experimental setups for testing hypotheses or collecting new data.

Advanced ML techniques

Apply advanced machine learning algorithms such as kernel methods, boosting, deep learning, anomaly detection, factorization models, and probabilistic modeling to analyze and extract insights from data.

Data visualization

Visualize the data and results from analysis, particularly focusing on visualizing and understanding high-dimensional structured data and the results of statistical and machine learning analysis.

Big data

Scale the methods to big data regimes, where distributed storage and computation are needed to fully realize the capabilities of data analysis techniques.

Key information

Course number: 15-388 (undergraduate) / 15-688 (masters)
Both courses will have the same lectures, but on each assignment there will be additional advanced problems for the 600 level course.

Course location and time: Doherty Hall A302, MW 12:00-1:20

Units: 9 (15-388), 12 (15-688)

Prerequisites: Programming experience is necessary for the course (assignments are in Python). For CS undergraduates, either 15-112 or 15-122 is required; undergraduates from other departments who have programming background but have not taken one of these courses require instructor approval to enroll. Experience with linear algebra, probability, and statistics is recommended, but not strictly required (courses like 21-240/241/242 for linear algebra or 36-201 for probability/statistics are more than sufficient). Students concerned about whether they have a proper background should contact the course instructors to discuss.

Grading: 55% homeworks, 15% tutorial, 25% final project, 5% class participation.

Schedule

This schedule is tentative and subject to change, and precise dates will be added closer to the course start date. All course material, including slides, lecture videos, and assignments, will be publicly available.

Date Topic Lecture Assignments
8/29 Introduction video
Data collection and management
8/31 Data collection and scraping video HW1 Out (pdf) (notebooks)
9/7 Jupyter notebook lab (notebook and data files) video
9/12 Relational Data video
9/14 Visualization and data exploration (notebook) video HW1 Due, HW2 Out (notebooks)
9/19 Vector, matrices, and linear algebra (notebook) video Tutorial Out (instructions)
9/21 Graph and network processing (notebook) video
9/26 Free text and natural language processing video
9/28 Free text, continued video HW3 Out (notebooks)
Statistical modeling and machine learning
10/3 Linear regression (notebook + data) video HW2 Due
10/5 Linear classification video
10/10 Nonlinear modeling, cross-validation, regularization video Geospatial Analysis Tutorial, Final Project Out
10/12 Model regularization and evaluation video HW3 Due, HW4 Out (notebooks) (data)
10/17 Basic probability and statistics: basics of probability video Detailed tutorial instructions, Length checker script
10/19 Maximum likelihood estimation, naive Bayes video Tutorial Check-in Due
10/21 Recitation (Numpy, Scipy.sparse, Scipy.stats) video (notebook)
10/24 Hypothesis testing and experimental design video
Advanced modeling techniques
10/26 Decision trees and boosting video HW4 Due
10/31 Unsupervised learning: clustering and dimensionality reduction video
11/2 Anomaly detection and mixture of Gaussians video Tutorial Due, HW5 Out (notebooks)
10/21 Recitation, HW5 video (notebook)
11/7 Recommender systems video
11/9 Deep learning video Student Tutorial Evaluation Due
11/11 Midterm Report
11/14 Guest lecture (Jen Mankoff, HCI): information visualization video
Additional topics
11/16 Probabilistic modeling video HW5 Due, HW6 Out (notebooks) (data)
11/21 Big data and MapReduce methods video
11/22 HW6 Recitation video (notebook)
11/28 Debugging data science (working notes) video
11/30 A data science walkthrough (notebook) video
12/5 Data scientist positions video
12/6 HW6 Due
12/7 Future of data science
12/9 Final project report due

Assignments

The course will consist of three main types of assignments. First, there will be biweekly homeworks that work through the material presented in class, and require students to implement or evaluate relevant algorithms. These workbooks will be distributed as Jupyter notebooks, and will be submitted for the course via Autolab. In addition to these assignments, students will themselves develop a new tutorial workbook to teach an advanced topic, and will also work through the content developed by at least two other students. Finally, students will complete a final class project (in groups), which will be a chance to apply these data science techniques to a problem of the group's choosing.

There is no midterm or final in the course. All assignments will be posted to this page as they are available.

Instructors

Zico Kolter

Assistant Professor

Office lunches: See Piazza

Eric Wong

Teaching Assistant

Office hours: Tuesday and Thursday 3-4pm, Gates 8th Floor kitchen area

Dhivya Eswaran

Teaching Assistant

Office hours: Mondays 3:30pm-4:30pm and Fridays 9am-10am, GHC 6008

FAQ

Q: What is the situation with all the different course numbers and sections (15-388, 15-688 A/B)?
A: The demand for this course has been very high: as of the start of classes there are more than 250 people registered or on the waitlist for the course. We're thrilled about the level of interest, but unfortunately the only available classroom during this time fits at most 136 students.
To accommodate as many people as possible, we created a DNM (Does Not Meet) section of the 600 level course, which is the B section of the 15-688 course. This version is identical to the A section, except that students are expected to watch the lectures online (all lectures will be available online within a few hours after the end of class). We expect that attendance will shake out significantly during the first few weeks of the course, and our strong suspicion is that after the first month, there will be space in the lecture hall for anyone (from any section) to attend lectures, but we ask that until we make this clear, students in the B section not regularly attend lecture.

Q: I really want to take the in-person 15-688 Section A. Will I be able to get off the waitlist?
A: See above. We want to accommodate absolutely as many people as possible, but are ultimately limited by the classroom size (and following university policy, the undergraduate lectures have priority for in-class attendance). However, we really want to emphasize that the courses are exactly the same except for the in-class lecture and DNM section (same credit, same homeworks, same office hours, same access to TAs/professor, same tutorial and final project assignments, etc.), and there is a good chance that Section B students will end up being able to attend lectures by the middle of the semester. So please consider enrolling in Section B if you are in this position.

Q: I'm on the waitlist for the 15-388 version. Will I be able to get off the waitlist?
A: Yes, almost definitely. There is a relatively small waitlist for the 300 version currently, and we absolutely expect all students who stick around a few weeks to get off the waitlist.

Q: How does the 5% class participation grade work for students in the DNM section?
A: Given the size of this course, for all students (including 15-388/15-688A students) the class participation grade will be based upon participating in the course forums on Piazza, not upon speaking up in class.

Q: Is there a pointer to materials from a previous year?
A: No, Fall 2016 is the first time this course is being offered, so we don't have past materials to look over. Feel free to contact the instructors if you have questions about the content of the course which aren't answered here.

Q: Does this course count toward the MSCS AI requirement?
A: (Corrected version). We still need to confirm whether this will be the case or not. We'll update the website as soon as we know, and please email us if this situation affects you.

Q: Will the course be offered during the spring semester?
A: No, the soonest the course will be offered again is during Fall 2017.

Q: Will this course focus mainly on applying techniques from existing libraries to practical data science problems, or writing the underlying algorithms from scratch?
A: Both, to a certain extent. There will be plenty of focus on applying algorithms (often best used through existing libraries) to practical problems, but these libraries can be used more effectively when you understand the underlying algorithms well enough to implement them yourself. So, at least for the more straightforward algorithms that we cover, you will be implementing these yourselves. The 688 level course assignments will include a bit more of this underlying implementation than those of the 388 level course.

Q: I have a question that wasn't asked here.
A: Come to office hours, or ask on the course Piazza.