1 Introduction
This is a course in regression analysis. That term is broad: regression is about finding relationships between variables, and there are many ways to do this. This course, then, is a transition between the traditional and modern ways to do regression.
In the traditional mode, we use
- parametric models (such as linear or polynomial models),
- with an emphasis on special cases that are simple to compute (such as orthogonal design matrices, shortcuts like generalized cross-validation, and approximations to distributions and tests),
- mainly to conduct hypothesis tests,
- with motivations and notation adapted from traditional experimental design, as a great deal of linear model theory was developed for designed experiments.
In the modern mode, we
- insist we’re doing “machine learning” and not “statistics”,
- use nonparametric and algorithmic regression methods (like trees and neural networks),
- freely use computational methods to do everything, because computation is free and mathematical theory is hard,
- mainly examine prediction error and accuracy instead of conducting point-null hypothesis tests,
- focus primarily on large-scale observational data with numerous variables outside the control of any experimenter.
We will hit the highlights of the traditional methods (i.e. we will cover them quickly), but I will assume you have seen the basics of regression before. We will cover, at a somewhat higher mathematical level, topics like multiple regression, variable selection, penalized regression, and generalized linear models, focusing on the computation, testing, diagnostic, and model selection tools necessary to put these methods to good use.
I will also assume you have a solid linear algebra background and can program in R. If you’re a little rusty, the syllabus refers to good books on both topics that can be used as references.
After covering regression, we’ll cover some more advanced topics, including additional nonparametric and additive models, missing data, and hierarchical models. If there is time, we might also discuss the very basics of survival analysis and experimental design.
But overall, our focus will not be on the derivation of theoretical results about regression estimators. Our focus will instead be on the application of regression to answer substantive questions with real data. Often the most challenging part of any data analysis is figuring out what question is being asked, how that question can be translated into a statistical question, and whether that statistical question is even answerable from the data, so we will spend plenty of time practicing these skills. If there is one thing you should learn from this course, it is that a careful, thorough, and useful data analysis is a rare thing indeed, and any statistician handling real data well will find no end of interesting substantive, statistical, and even theoretical problems to work on.