Advanced Data Analysis from an Elementary Point of View

by Cosma Rohilla Shalizi

This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, generally welcome.

The book is under contract to Cambridge University Press; after a series of earlier target dates, from the end of 2013 through the end of 2018, passed, it should be turned over to the press in 2019, inshallah. A copy of the next-to-final version will remain freely accessible here permanently.

What you're probably looking for

Complete draft in PDF

Directory of chapter-by-chapter R files for examples

Directory of data sets used in examples

Table of contents



I. Regression and Its Generalizations

Regression Basics

The Truth about Linear Regression

Model Evaluation

Smoothing in Regression

Simulation

The Bootstrap

Splines

Additive Models

Testing Regression Specifications

Weighting and Variance

Logistic Regression

Generalized Linear Models and Generalized Additive Models

Classification and Regression Trees

II. Distributions and Latent Structure

Density Estimation

Principal Components Analysis

Factor Models

Mixture Models

Graphical Models

III. Causal Inference

Graphical Causal Models

Identifying Causal Effects

Estimating Causal Effects

Discovering Causal Structure

IV. Dependent Data

Time Series

Simulation-Based Inference

Online-only Appendices

Big O and Little o Notation

Taylor Expansions

Propagation of Error, and Standard Errors for Derived Quantities

Optimization

Relative Distributions and Smooth Tests of Goodness of Fit

Nonlinear Dimensionality Reduction

Rudimentary Graph Theory

Missing Data

Writing R Functions

Data-Analysis Assignments

Planned changes

Remove redundant versions of the data-analysis assignments; provide solutions as a separate document through publisher

Unified treatment of information theory as an appendix

Improved treatment of nonparametric instrumental variables

Trim time-series chapter so it's less of a catalog of everything that might be useful

Break out stuff on heuristic essential asymptotics as a separate appendix

Make sure notation is consistent throughout: insist that vectors are always matrices, or use more geometric notation?

Figure out how to cut at least 50 pages

(Text last updated 8 September 2019; this page last updated 9 September 2019)