Data science is the study and practice of how we can extract insight and knowledge from large amounts of data. It is a burgeoning field, currently attracting substantial demand from both academia and industry.

This course provides a practical introduction to the “full stack” of data science analysis, including data collection and processing, data visualization and presentation, statistical model building using machine learning, and big data techniques for scaling these methods. Topics covered include: collecting and processing data using relational methods, time series approaches, graph and network models, free text analysis, and spatial geographic methods; analyzing the data using a variety of statistical and machine learning methods, including linear and non-linear regression and classification, unsupervised learning and anomaly detection, plus advanced machine learning methods like kernel approaches, boosting, and deep learning; visualizing and presenting data, particularly focusing on the case of high-dimensional data; and applying these methods to big data settings, where multiple machines and distributed computation are needed to fully leverage the data.

As the course name suggests, this course will highlight the practical aspects of data science, with a focus on implementing and making use of the above techniques. Students will complete programming homework that emphasizes practical understanding of the methods described in the course. In addition, students will develop a tutorial on an advanced topic, and will complete a group project that applies these data science techniques to a practical application chosen by the team; these two longer assignments will be done in lieu of a midterm or final.

Data collection and processing

Ingest data from unstructured and structured sources, and use relational models, time series algorithms, graph and network processing, natural language processing, and geographic information system processes to store and manage the data.
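
As one illustration of the relational part of this objective, the following is a minimal sketch rather than course material. It assumes Python with the standard-library sqlite3 module; the table names (users, visits) and the function name load_and_join are hypothetical, chosen only for this example. It stores two small tables in an in-memory database and counts visits per user with a join.

```python
import sqlite3


def load_and_join(rows_users, rows_visits):
    """Store two small tables in an in-memory SQLite database
    and return the number of visits per user.

    The schema (users, visits) is a hypothetical example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE visits (user_id INTEGER, page TEXT)")

    # Parameterized inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows_users)
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows_visits)

    # LEFT JOIN keeps users with no visits; COUNT(v.page) ignores NULLs,
    # so those users get a count of 0.
    result = conn.execute(
        "SELECT u.name, COUNT(v.page) "
        "FROM users u LEFT JOIN visits v ON v.user_id = u.id "
        "GROUP BY u.id ORDER BY u.id"
    ).fetchall()
    conn.close()
    return result
```

Calling load_and_join with a list of (id, name) tuples and a list of (user_id, page) tuples returns one (name, visit_count) row per user. The left join is a deliberate choice here: an inner join would silently drop users with no recorded visits.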

Statistical modeling

Apply basic statistical techniques and analyses to understand properties of the data and to design experimental setups for testing hypotheses or collecting new data.
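
As one illustration of testing a hypothesis, the following is a minimal sketch of a two-sided permutation test for a difference in group means. It assumes Python with only the standard library; the function name permutation_test and its parameters are hypothetical, and the permutation test is just one of many techniques that could serve this objective.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b.

    Under the null hypothesis the group labels are exchangeable, so we
    shuffle the pooled values and count how often the shuffled difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed makes the estimate reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value suggests the observed difference would be unusual if both groups came from the same distribution. It does not by itself establish the size or practical importance of the effect.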

Advanced ML techniques

Apply advanced machine learning algorithms such as kernel methods, boosting, deep learning, anomaly detection, factorization models, and probabilistic modeling to analyze and extract insights from data.

Data visualization

Visualize the data and results from analysis, particularly focusing on visualizing and understanding high-dimensional structured data and the results of statistical and machine learning analysis.

Big data

Scale the methods to big data regimes, where distributed storage and computation are needed to fully realize the capabilities of data analysis techniques.

Data science debugging

Learn to diagnose problems with data science pipelines, finding problems in data collection, problem setup, machine learning models, and conclusions.