36-402, Undergraduate Advanced Data Analysis

Spring 2011

This page has information about the 2011 version of the class. The 2012 version is over here.

Tuesdays and Thursdays, 10:30--11:50, Porter Hall 100

The goal of this class is to train students in using statistical models to analyze data: as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

Graduate students from other departments wishing to take this course should register for it under the number "36-608".

Prerequisites

36-401, or, in unusual circumstances, an equivalent course approved by the instructor.

Instructors

Professor: Cosma Shalizi, cshalizi [at] cmu.edu, 229 C Baker Hall, 268-7826

Teaching assistants:

Gaia Bellone, gbellone [at] stat.cmu.edu

Zachary Kurtz, zkurtz [at] stat.cmu.edu

Shuhei Okumura, sokumura [at] stat.cmu.edu

Topics, Notes, Readings

Model evaluation: statistical inference, prediction, and scientific inference; in-sample and out-of-sample errors, generalization and over-fitting, cross-validation; evaluating by simulating; bootstrap; penalized fitting; information criteria; mis-specification checks; model averaging

Yet More Linear Regression: what is regression, really?; review of ordinary linear regression and its limits; extensions

Smoothing: kernel smoothing, including local polynomial regression; splines; additive models; classification and regression trees; kernel density estimation

GAMs: logistic regression; generalized linear models; generalized additive models

Latent variables and structured data: principal components; factor analysis and latent variables; graphical models in general; latent cluster/mixture models; hierarchical models and partial pooling

Causality: estimating causal effects; discovering causal structure

Time series: Markov models for time series without latent variables; hidden Markov models for time series with latent variables

Course Mechanics

Homework

There will be eleven homework assignments, nearly one every week; they will all count equally, and together make up 60% of your grade. The homework will give you practice in using the techniques you are learning to analyze data, and to interpret the analyses. Communicating your results to others is as important as getting good results in the first place. Raw computer output and R code are not acceptable in the body of your write-up, but should be put in an appendix to each assignment. Homework will be due, in hard copy, at the beginning of class on Tuesdays. The lowest three homework grades will be dropped; consequently, no late homework will be accepted.

Exams

There will be two take-home mid-term exams (10% each), due at 5 pm on March 1st and April 12th. (Please let me know as soon as possible if you have a conflict with either date.) You will have one week to work on each midterm. There will be no homework in those weeks, and lecture on the day they are due will be replaced with special office hours. There will also be a take-home final exam (20%), due at 10 am on May 9, which you will have two weeks to do.

Office Hours

Blackboard

Textbook

Extending the Linear Model with R

R in a Nutshell

Statistical Learning From a Regression Perspective

Modern Applied Statistics with S

Collaboration, Cheating and Plagiarism

Physically Disabled and Learning Disabled Students

R

R is a free, open-source software package/programming language for statistical computing. You should have begun to learn it in 36-401 (if not before), and this class presumes that you have. Many of the problems will be easier with R, and some of them will require R. You should have no expectations of assistance from the instructors with programming in any other language. If you are not able to use R, or do not have ready, reliable access to a computer on which you can do so, let me know at once.

Here are some resources for learning R:

The official intro, "An Introduction to R", available online in HTML and PDF

John Verzani, "simpleR", in PDF

Quick-R. This is primarily aimed at those who already know a commercial statistics package like SAS, SPSS or Stata, but it's very clear and well-organized, and others may find it useful as well.

Patrick Burns, The R Inferno. "If you are using R and you think you're in hell, this is a map for you."

Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)

There are now many books about R. Adler's R in a Nutshell, and Venables and Ripley, will be available at the campus bookstore. John M. Chambers, Software for Data Analysis: Programming with R (Springer, 2008, ISBN 978-0-387-75935-7) is the best book on writing programs in R, but we will not have to do much actual programming.

Schedule