A major component of 15-388/688 is a final group project in which you investigate some applied data science problem and gain experience with the full data science pipeline. This is an open-ended project, where you find some data and select a method of analysis, apply it, and draw conclusions about the data.

Component           Due date (midnight ET)
Proposal / Groups   April 22
Video               May 14
Report              May 17


If you have any questions about the requirements, please talk to the course staff or post on the course forums.

Topic & Data

  1. The focus of the project should be on analyzing a data set to ask some underlying question about the data or the process generating the data.
  2. While you can use advanced algorithms to help you answer this question, the focus of the final project should not be solely on the algorithms themselves, but on some practical question you want to understand from the data itself.
  3. The project must analyze a real data set, not a synthetic data set (but you may collect your data from a computational process).
  4. You cannot use a pre-curated data set, such as one from Kaggle or a similar source, unless you substantially build on the data.
  5. You can pick topics that overlap with your existing areas of research.


  1. The final project should be done in groups of 2-3 students.
  2. One student in each group will submit the project proposal and the Andrew IDs of the other students in the group.

We will not approve any groups of four or more students. In very rare cases, we may approve students who want to work on a project alone, but there must be a well-founded reason for this beyond not being able to find a group to work on a particular topic of interest: for example, one student was approved to work alone on data covered by federal privacy regulations.

Students taking the course as 15-388 and 15-688 can be in the same group.

Students auditing the course (who are not required to do this project) may not be in the same group as students not auditing the course.


You should submit your proposal and your group via Mugrade by midnight ET on April 22. Only one student in each group needs to submit the proposal, and it will be visible to the rest of the group.

The proposal should be no more than 1000 characters. In it, you should briefly cover:

  1. what underlying question you are trying to answer,
  2. how your data will be collected, and
  3. what analysis you will be performing.

Project elements

To complete this project you must communicate your findings through a video and a report.

Video presentations

You must submit a 90-second video explaining your work. This can be:

  • a recording of a slideshow with voiceover,
  • a screen-capture of the system you’ve built,
  • an animation explaining your project, or
  • anything else, as long as it is a video file and it lasts no more than 90 seconds.

You must submit a link to this video hosted on YouTube. If you’re unable to access YouTube due to firewall restrictions (but only for this reason), you may submit a link to the video file and the TAs will upload it for you.

Final Report

The final report must be submitted as a Colab notebook or as an nbviewer link to a Jupyter notebook.

The instructors will be grading your notebook using a static rendering of the notebook. The notebook should still be written as a traditional report, just one that makes reproducing your report much easier for anyone who is interested in diving more deeply into your work. It should be readable as a narrative explaining your project without requiring the reader to run any code.

The final report is limited to 2000 words of prose, with no limit on the amount of code. However, if you develop very large code blocks as part of the project, you should include them in a separate Python file. Your notebook should be executable in cell order.
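As a rough self-check before submitting, the two constraints above (the prose word limit and execution in cell order) can be estimated from the notebook's JSON. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, prose is assumed to mean the text of Markdown cells split on whitespace (so Markdown symbols such as `#` count as words), and the cell-order check only inspects saved execution counts. The course staff may count words differently.

```python
import json


def load_notebook(path):
    """Load a saved .ipynb file (which is plain JSON) into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _cell_text(cell):
    """Return a cell's source as one string (it may be stored as a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def count_prose_words(notebook):
    """Count whitespace-separated tokens across all Markdown cells.

    Assumption: "prose" means Markdown cell text; code cells are ignored.
    """
    return sum(
        len(_cell_text(cell).split())
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "markdown"
    )


def ran_in_cell_order(notebook):
    """Heuristic: every non-empty code cell has been run, and the saved
    execution counts strictly increase from top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and _cell_text(cell).strip()
    ]
    if any(c is None for c in counts):
        return False
    return all(a < b for a, b in zip(counts, counts[1:]))


# Tiny in-memory notebook with made-up content, standing in for a real file.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some prose here."]},
        {"cell_type": "code", "source": "x = 1", "execution_count": 1},
        {"cell_type": "markdown", "source": "Two words"},
        {"cell_type": "code", "source": ["print(x)"], "execution_count": 2},
    ]
}

print(count_prose_words(example))   # 7
print(ran_in_cell_order(example))   # True
```

For a real report, `load_notebook` can be pointed at your own saved `.ipynb` file and the result passed to both functions. Passing the order check does not guarantee that the notebook reproduces from a clean state; restarting the kernel and running all cells remains the more reliable test.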


We will grade your report and video based on this rubric:

  • Does the project clearly identify the problem?
  • Does the project clearly describe the relevant data and its collection?
  • Does the project clearly explain how the data can be used to draw conclusions about the underlying system?
  • Does the report clearly explain the work that was done?
  • Is the project innovative or novel?
  • Does the project use techniques presented in the course (or clearly related to topics covered in the course) to understand and analyze the data for this problem?
  • Does the report explain how this work fits in with related work in this subject area?
  • Does the report provide directions for further investigation?


Plagiarism, the unattributed copying of ideas, prose, pictures, program code, or any product of human effort, is forbidden. As with the tutorial, anything that you include as part of the written code, prose, or figures must be your own work and cannot be copied from elsewhere, even with citation. However, you are permitted to use datasets from other sources (with the caveat above that you cannot rely entirely on a pre-curated data set), use outside resources to learn about your topic, etc.


The score breakdown is:

Part                           Weight
Proposal                       5%
Video                          20%
Report (Instructor Feedback)   75%