A major component of 15-388/688 is a final group project in which you investigate some applied data science problem and gain experience with the full data science pipeline. This is an open-ended project, where you find some data and select a method of analysis, apply it, and draw conclusions about the data.

Component            Due date (11:59 pm)
Proposal / Groups    Apr 15
Video                May 4
Report               May 9


If you have any questions about the requirements, please meet with the instructor or post on the course forums.

Topic & Data

  1. The focus of the project should be on analyzing a data set to answer some underlying question about the data or the process generating the data.
  2. While you can use advanced algorithms to help you answer this question, the focus of the final project should not be solely on the algorithms themselves, but on some practical question you want to understand from the data itself.
  3. The project must analyze a real data set, not a synthetic data set (but you may collect your data from a computational process).
  4. You cannot use a pre-curated data set, such as data sets from Kaggle or something similar (unless you substantially build on the data).
  5. You can pick topics that overlap with your existing areas of research.


  1. The final project must be done in groups of 2-3 students.
  2. One student in each group will submit the project proposal along with the Andrew IDs of the other students in the group.

Students taking the course as 15-388 and 15-688 can be in the same group.


You should submit your proposal and your group via Google Form (TBD) by the due date above. Only one student in each group needs to submit the proposal.

The proposal should be no more than 1000 characters. In it, you should briefly cover:

  1. what underlying question you are trying to answer,
  2. how your data will be collected, and
  3. what analysis you will be performing.

Project elements

To complete this project you must communicate your findings through a video and a report.

Video presentations

You must submit a 120-second video explaining your work. This can be:

  • a recording of a slideshow with voiceover,
  • a screen-capture of the system you’ve built,
  • an animation explaining your project, or
  • anything as long as it is a video file and it lasts no more than 120 seconds.

You must submit a link to this video hosted on YouTube. Let us know if you have any issues uploading to YouTube.

Final Report

The final report must be submitted as a Colab or nbviewer link to a Jupyter notebook.

The instructors will be grading your notebook using a static rendering of the notebook. The notebook should still be written as a traditional report, just one that makes reproducing your report much easier for anyone who is interested in diving more deeply into your work. It should be readable as a narrative explaining your project without requiring the reader to run any code.

The final report is limited to 2000 words of prose, with no limit on the amount of code. However, if you develop very large code blocks as part of the project, you should include them in a separate Python file. Your notebook should be executable in cell order.
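To check a draft against the prose limit, something like the following may help. It is only a rough sketch: it assumes "prose" means the text of the notebook's markdown cells, and the word-matching rule (runs of letters and digits, allowing inner apostrophes and hyphens) is our own approximation, not an official counting method. The function names and the file name in the usage note are hypothetical.

```python
import json
import re

WORD_LIMIT = 2000  # prose limit stated in the report requirements

# Assumption: a "word" is a run of letters/digits, optionally joined by
# apostrophes or hyphens. Markdown symbols such as '#' or '*' are not counted.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def count_prose_words(notebook: dict) -> int:
    """Count words in the markdown cells of a parsed .ipynb document."""
    total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # code cells do not count toward the prose limit
        source = cell.get("source", "")
        if isinstance(source, list):  # .ipynb stores source as a list of lines
            source = "".join(source)
        total += len(WORD_PATTERN.findall(source))
    return total


def count_prose_words_in_file(path: str) -> int:
    """Load a notebook file (JSON) and count its prose words."""
    with open(path, encoding="utf-8") as f:
        return count_prose_words(json.load(f))
```

For example, `count_prose_words_in_file("report.ipynb") <= WORD_LIMIT` would indicate that a notebook named `report.ipynb` is within the limit under these assumptions. Inline code, URLs, and LaTeX inside markdown cells are counted as ordinary text, so the result may differ somewhat from a manual count.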


We will grade your report and video based on this rubric:

  • Does the project clearly identify the problem?
  • Does the project clearly describe the relevant data and its collection?
  • Does the project clearly explain how the data can be used to draw conclusions about the underlying system?
  • Does the report clearly explain the work that was done?
  • Is the project innovative or novel?
  • Does the project use techniques presented in the course (or clearly related to topics covered in the course) to understand and analyze the data for this problem?
  • Does the report explain how this work fits around related work in this subject area?
  • Does the report provide directions for further investigation?


Plagiarism, the unattributed copying of ideas, prose, pictures, program code, or any product of human effort, is forbidden. As with the tutorial, anything that you include as part of the written code, prose, or figures must be your own work, and cannot be copied from elsewhere, even with citation. However, you are permitted to use datasets from other sources (with the caveat above that you cannot rely entirely on a pre-curated data set), use outside resources to learn about your topic, etc.


The score breakdown is:

Part                            Weight
Proposal                        5%
Video                           20%
Report (Instructor Feedback)    75%
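
As an illustration of how these weights combine, here is a minimal sketch. It assumes each part is scored as a fraction between 0 and 1, which is our assumption for the example and not a statement of how the instructors record grades; the function name and the sample scores are hypothetical.

```python
# Weights taken from the score breakdown above.
WEIGHTS = {"proposal": 0.05, "video": 0.20, "report": 0.75}


def project_score(scores: dict) -> float:
    """Weighted project score, given a fraction in [0, 1] for each part."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)


# Hypothetical example: full marks on the proposal, 90% on the video,
# 80% on the report gives 0.05*1.0 + 0.20*0.9 + 0.75*0.8.
example = project_score({"proposal": 1.0, "video": 0.9, "report": 0.8})
```

Because the report carries three quarters of the weight, a change in the report score moves the total far more than the same change in the proposal or video score.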