Data science is the study and practice of how we can extract insight and knowledge from large amounts of data. It is a burgeoning field, currently attracting substantial demand from both academia and industry.
This course provides a practical introduction to the "full stack" of data science analysis, including data collection and processing, data visualization and presentation, statistical model building using machine learning, and big data techniques for scaling these methods. Topics covered include: collecting and processing data using relational methods, time series approaches, graph and network models, free text analysis, and spatial geographic methods; analyzing the data using a variety of statistical and machine learning methods, including linear and non-linear regression and classification, unsupervised learning and anomaly detection, plus advanced machine learning methods like kernel approaches, boosting, or deep learning; visualizing and presenting data, particularly focusing on the case of high-dimensional data; and applying these methods to big data settings, where multiple machines and distributed computation are needed to fully leverage the data.
As the course name suggests, this course will focus on the practical aspects of data science, with an emphasis on implementing and making use of the above techniques. Students will complete weekly programming homework assignments that emphasize practical understanding of the methods described in the course. In addition, students will develop a tutorial on an advanced topic, and will complete a group project that applies these data science techniques to a practical application chosen by the team; these two longer assignments will be done in lieu of a midterm or final.
Ingest data from unstructured and structured sources, and use relational models, time series algorithms, graph and network processing, natural language processing, and geographic information system processes to store and manage the data.
Apply basic statistical techniques and analyses to understand properties of the data and to design experimental setups for testing hypotheses or collecting new data.
Apply advanced machine learning algorithms such as kernel methods, boosting, deep learning, anomaly detection, factorization models, and probabilistic modeling to analyze and extract insights from data.
Visualize the data and results from analysis, particularly focusing on visualizing and understanding high-dimensional structured data and the results of statistical and machine learning analysis.
Scale the methods to big data regimes, where distributed storage and computation are needed to fully realize the capabilities of data analysis techniques.
Course number: 15-388 (undergraduate) / 15-688 (masters)
Both courses will have the same lectures, but each assignment will include additional advanced problems for the 600-level course.
Course location and time: Doherty Hall A302, MW 12:00-1:20
Units: 9 (15-388), 12 (15-688)
Prerequisites: Programming experience is necessary for the course (assignments are in Python). For CS undergraduates, either 15-112 or 15-122 is required; undergraduates from other departments who have a programming background but have not taken either course require instructor approval to enroll. Experience with linear algebra, probability, and statistics is recommended, but not strictly required (courses like 21-240/241/242 for linear algebra or 36-201 for probability/statistics are more than sufficient). Students concerned about whether they have a proper background should contact the course instructors to discuss.
Grading: 55% homeworks, 15% tutorial, 25% final project, 5% class participation.
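As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the components are combined linearly; the page does not state either detail, and the function name and the example scores are hypothetical.

```python
# Weights taken from the grading breakdown above (fractions of the final grade).
WEIGHTS = {
    "homeworks": 0.55,
    "tutorial": 0.15,
    "final_project": 0.25,
    "participation": 0.05,
}


def overall_score(scores):
    """Combine per-component scores (assumed 0-100 each) into a weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "homeworks": 90,
    "tutorial": 80,
    "final_project": 85,
    "participation": 100,
}
print(round(overall_score(example), 2))  # 87.75
```

This only shows the arithmetic of the weighting; how the weighted total maps to a letter grade is not specified here.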
This schedule is tentative and subject to change, and precise dates will be added closer to the course start date. All course material, including slides, lecture videos, and assignments, will be publicly available.
|Data collection and management|
|8/31||Data collection and scraping||video||HW1 Out (pdf) (notebooks)|
|9/7||Jupyter notebook lab (notebook and data files)||video|
|9/14||Visualization and data exploration (notebook)||video||HW1 Due, HW2 Out (notebooks)|
|9/19||Vector, matrices, and linear algebra (notebook)||video||Tutorial Out (instructions)|
|9/21||Graph and network processing (notebook)||video|
|9/26||Free text and natural language processing||video|
|9/28||Free text, continued||video||HW3 Out (notebooks)|
|Statistical modeling and machine learning|
|10/3||Linear regression (notebook + data)||video||HW2 Due|
|10/10||Nonlinear modeling, cross-validation, regularization||video||Geospatial Analysis Tutorial, Final Project Out|
|10/12||Model regularization and evaluation||video||HW3 Due, HW4 Out (notebooks) (data)|
|10/17||Basic probability and statistics: basics of probability||video||Detailed tutorial instructions, Length checker script|
|10/19||Maximum likelihood estimation, naive Bayes||video||Tutorial Check-in Due|
|10/21||Recitation (Numpy, Scipy.sparse, Scipy.stats)||video||(notebook)|
|10/24||Hypothesis testing and experimental design||video|
|Advanced modeling techniques|
|10/26||Decision trees and boosting||video||HW4 Due|
|10/31||Unsupervised learning: clustering and dimensionality reduction||video|
|11/2||Anomaly detection and mixture of Gaussians||video||Tutorial Due, HW5 Out (notebooks)|
|11/9||Deep learning||video||Student Tutorial Evaluation Due|
|11/14||Guest lecture (Jen Mankoff, HCI): information visualization||video|
|11/16||Probabilistic modeling||video||HW5 Due, HW6 Out (notebooks) (data)|
|11/21||Big data and MapReduce methods||video|
|11/22||HW 6 Recitation||video||(notebook)|
|11/28||Debugging data science (working notes)||video|
|11/30||A data science walkthrough (notebook)||video|
|12/5||Data scientist positions||video|
|12/7||Future of data science|
|12/9||Final project report due|
The course will consist of three main types of assignments. First, there will be biweekly homework assignments that work through the material presented in class, and require students to implement or evaluate relevant algorithms. These workbooks will be distributed as Jupyter notebooks, and will be submitted for the course via Autolab. In addition to these assignments, students will themselves develop a new tutorial workbook to teach an advanced topic, and will also work through the content developed by at least two other students. Finally, students will complete a final class project (in groups), which will be a chance to apply these data science techniques to a problem of the group's choosing.
There is no midterm or final in the course. All assignments will be posted to this page as they are available.
Q: What is the situation with all the different course numbers and sections (15-388, 15-688 A/B)?
A: The demand for this course has been very high: as of the start of classes there are more than 250 people registered or on the waitlist for the course. We're thrilled about the level of interest, but unfortunately the only available classroom during this time fits at most 136 students.
To accommodate as many people as possible, we created a DNM (Does Not Meet) section of the 600-level course, which is the B section of the 15-688 course. This version is identical to the A section, except that
Q: I really want to take the in-person 15-688 Section A. Will I be able to get off the waitlist?
A: See above. We want to accommodate as many people as possible, but are ultimately limited by the classroom size (and following university policy, the undergraduate lectures have priority for in-class attendance). However, we really want to emphasize that the courses are
Q: I'm on the waitlist for the 15-388 version. Will I be able to get off the waitlist?
A: Yes, almost definitely. There is currently a relatively small waitlist for the 300-level version, and we fully expect all students who stick around for a few weeks to get off the waitlist.
Q: How does the 5% class participation grade work for students in the DNM section?
A: Given the size of this course, for all students (including 15-388/15-688A students) the class participation grade will be based upon participating in the course forums on Piazza, not upon speaking up in class.
Q: Is there a pointer to materials from a previous year?
A: No, Fall 2016 is the first time this course is being offered, so we don't have past materials to look over. Feel free to contact the instructors if you have questions about the content of the course which aren't answered here.
Q: Does this course count toward the MSCS AI requirement?
A: (Corrected version). We still need to confirm whether this will be the case or not. We'll update the website as soon as we know, and please email us if this situation affects you.
Q: Will the course be offered during the spring semester?
A: No, the soonest the course will be offered again is during Fall 2017.
Q: Will this course focus mainly on applying techniques from existing libraries to practical data science problems, or writing the underlying algorithms from scratch?
A: Both, to a certain extent. There will be plenty of focus on applying algorithms (often best used through existing libraries) to practical problems, but these libraries can be used more effectively when you understand the underlying algorithms well enough to implement them yourself. So, at least for the more straightforward algorithms that we cover, you will be implementing these yourselves. The 688-level course assignments will do a bit more of this underlying implementation than the 388-level course.
Q: I have a question that wasn't asked here.
A: Come to office hours, or ask on the course Piazza.