A major component of 15-388/688 is a final group project in which you investigate some applied data science problem and gain experience with the full data science pipeline. This is an open-ended project, where you find some data and select a method of analysis, apply it, and draw conclusions about the data.
|Milestone||Release date||Due date|
|Proposal||Oct 17||Nov 1|
|Video Screening||Dec 6|
|Video Feedback||Dec 6||Dec 8|
There will be a mandatory video screening session for all students on the Pittsburgh campus, on Dec 6 from 4:30 PM to 7:30 PM in Rashid Auditorium (GHC 4401). Food will be provided, with vegetarian options.
If you have any questions about the requirements, please talk to the course staff.
Topic & Data
- The focus of the project should be on analyzing a data set to answer some underlying question about the data or the process generating the data.
- While you can use advanced algorithms to help you answer this question, the focus of the final project should not be solely on the algorithms themselves, but on some practical question you want to understand from the data itself.
- The project must analyze a real data set, not a synthetic data set. You may collect your data from a computational process.
- You cannot use a pre-curated data set, such as data sets from Kaggle or something similar (unless you substantially build on the data).
- You can pick topics that overlap with your existing areas of research.
- The final project should be done in groups of 2-3 students.
- One student in each group will submit the project proposal and the Andrew IDs of other students in the group.
We will not approve any groups of four or more students. In very rare cases, we may approve students who want to work on a project alone, but there must be a well-founded reason for this beyond not being able to find a group to work on a particular topic of interest: for example, one student was approved to work alone on data covered by federal privacy regulations.
Students taking the course as 15-388 and 15-688 can be in the same group.
Students auditing the course (who are not required to do this project) may not be in the same group as students not auditing the course.
You should submit your proposal and your group here by midnight, Nov 1. Only one student in each group needs to submit the form.
The proposal should be no more than 1200 characters. In it, you should briefly cover:
- what underlying question you are trying to answer,
- how your data will be collected, and
- what analysis you will be performing.
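Before submitting, you may want to check your draft against the 1200-character limit. The sketch below is one way to do that; the function name `proposal_length_report` and the placeholder draft are hypothetical, and it assumes that every character (including spaces and punctuation) counts toward the limit once leading and trailing whitespace is removed. The submission form may count differently.

```python
from typing import Tuple

MAX_PROPOSAL_CHARS = 1200  # limit stated in the proposal requirements


def proposal_length_report(text: str, limit: int = MAX_PROPOSAL_CHARS) -> Tuple[int, bool]:
    """Return (character count, whether the text fits within the limit).

    Assumption: all characters count, after stripping leading and
    trailing whitespace. The actual form may apply a different rule.
    """
    count = len(text.strip())
    return count, count <= limit


# Placeholder draft for illustration only; replace with your own text.
draft = (
    "Question: <the underlying question>. "
    "Data: <how the data will be collected>. "
    "Analysis: <the analysis you will perform>."
)
count, within_limit = proposal_length_report(draft)
print(f"{count} characters; within limit: {within_limit}")
```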
To complete this project you must communicate your findings through a video and a report.
You must submit a 90-second video explaining your work. This can be:
- a recording of a slideshow with voiceover,
- a screen-capture of the system you’ve built,
- an animation explaining your project, or
- anything as long as it is a video file and it lasts no more than 90 seconds.
You must submit a link to this video hosted on YouTube for screening during class.
You are required to provide simple feedback on 5 other student presentations. We will randomly assign the groups whose presentations you review. The provided feedback will not affect your grade, but will be anonymously provided to the group.
The final report must be submitted as an nbviewer link to a Jupyter notebook. We recommend you work in a public GitHub (or similar) repository and simply link to the final report, but you can also upload your final draft to a server.
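If your notebook lives in a public GitHub repository, the nbviewer link can be built from the repository coordinates. The sketch below assumes the `nbviewer.org/github/<user>/<repo>/blob/<branch>/<path>` URL pattern that nbviewer uses for GitHub-hosted notebooks; the helper name `nbviewer_url` and the example user, repository, and file names are hypothetical. Open the resulting link in a browser to confirm that it renders before you submit it.

```python
from urllib.parse import quote


def nbviewer_url(user: str, repo: str, path: str, branch: str = "main") -> str:
    """Build an nbviewer link for a notebook in a public GitHub repository.

    Assumption: the repository is public and `path` is the notebook's
    path relative to the repository root on the given branch.
    """
    # Percent-encode spaces and other unsafe characters, but keep "/" separators.
    clean_path = quote(path.lstrip("/"))
    return f"https://nbviewer.org/github/{user}/{repo}/blob/{branch}/{clean_path}"


# Hypothetical repository coordinates, for illustration only.
print(nbviewer_url("example-user", "final-project", "report.ipynb"))
```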
The instructors will be grading your notebook using a static rendering of the notebook. The notebook should still be written as a traditional report, just one that makes reproducing your report much easier for anyone who is interested in diving more deeply into your work. It should be readable as a narrative explaining your project without requiring the reader to run any code. We (or your peer reviewers) may request your report, data, additional code, and requirements file.
The final report is limited to 2000 words of prose, with no limit on the amount of code. However, if you develop very large code blocks as part of the project, you should include them in a separate Python file. Your notebook should be executable in cell order.
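A notebook file is plain JSON, so both constraints can be checked with a short script. The sketch below is a rough self-check, not the method the course staff use: it assumes that "prose" means the text of markdown cells, counts whitespace-separated tokens that contain at least one letter or digit, and treats a notebook as executed in cell order when its non-empty code cells carry execution counts 1, 2, 3, and so on. The function names and the small in-memory notebook are hypothetical.

```python
import json  # used when loading a real notebook file (see the comment at the end)


def _cell_text(cell: dict) -> str:
    """Return a cell's source as one string (it may be stored as a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def prose_word_count(notebook: dict) -> int:
    """Count words in markdown cells (assumption: these are the 'prose')."""
    total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "markdown":
            tokens = _cell_text(cell).split()
            # Skip pure markup tokens such as "#" or "---".
            total += sum(1 for tok in tokens if any(ch.isalnum() for ch in tok))
    return total


def executed_in_order(notebook: dict) -> bool:
    """True if non-empty code cells have execution counts 1..n in cell order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and _cell_text(cell).strip()
    ]
    return counts == list(range(1, len(counts) + 1))


# Tiny in-memory notebook for illustration only.
demo = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Introduction\n", "We study three questions."]},
        {"cell_type": "code", "source": ["x = 1"], "execution_count": 1, "outputs": []},
        {"cell_type": "code", "source": ["x + 1"], "execution_count": 2, "outputs": []},
    ]
}
print(prose_word_count(demo), executed_in_order(demo))

# For a real report (hypothetical file name):
# with open("report.ipynb", encoding="utf-8") as f:
#     notebook = json.load(f)
```

Restarting the kernel and running all cells before you save is the simplest way to get consecutive execution counts.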
We will grade your report and video based on this rubric:
- Does the project clearly identify the problem?
- Does the project clearly describe the relevant data and its collection?
- Does the project clearly explain how the data can be used to draw conclusions about the underlying system?
- Does the report clearly explain the work that was done?
- Is the project innovative or novel?
- Does the project use techniques presented in the course (or clearly related to topics covered in the course) to understand and analyze the data for this problem?
- Does the report explain how this work fits in with related work in this subject area?
- Does the report provide directions for further investigation?
Plagiarism, the unattributed copying of ideas, prose, pictures, program code, or any product of human effort, is forbidden. If you use anything that you did not produce yourself, you are expected to cite it. This includes (but is not limited to) stock images, graphics, functions, data cleaning/preprocessing steps, and animations.
For your report, you must cite everything (1) where it is used and (2) in a section at the bottom. In your video, you should include an attribution when the external product is visible or audible, and additionally in your video description.
Students who are caught plagiarizing will receive a zero for the assignment and will be referred to the Office of the Vice President for Student Affairs for Academic Disciplinary Actions.
The score breakdown is:
|Video feedback given||5%|
|Report (Instructor Feedback)||70%|
Note that the video feedback score is based on the feedback you give other groups, not the feedback you receive from them.
- Exploring Fake News
- Academic Performance of Universities Based on Alumni Data
- Predicting Patient Outcomes of Artificial Heart Implant
- Building Reputation on StackOverflow