Preface
These lecture notes are an attempt to provide a Ph.D.-level introduction to applied regression analysis. It is traditional, in the preface to every textbook, to explain that there are of course many other textbooks covering similar topics, but none of those books do it in the right way. So it goes without saying that there are plenty of regression textbooks but none of them do it right. All that remains is for me to give excuses for my delusions of authorial grandeur.
First, this is an applied regression course. Unlike, say, Seber and Lee (2003) or Christensen (2011), which focus on mathematical theory to the exclusion of almost any data, I seek to motivate each topic through applications to real data. Working code is included and homework exercises expect students to apply the methods to answer questions about data.
Second, I view statistics as applied epistemology. That is: Statistics is the study of methods to learn about the real world from data. It tells us what we can and can’t infer from data, and suggests data we could collect to answer questions. Regression problems are a common way to learn about the world. To that end:
- Many questions are causal, so we need ways to reason about causal relationships. Chapter 2 (Causality) hence introduces counterfactuals and causal diagrams, and we use causal diagrams throughout the text to understand our regression models and decide which models answer the right questions.
- Real data examples are accompanied by real questions. In some textbooks, real data is introduced largely as an excuse—here’s some data, now fit a model to it. Whether that model would be useful for answering questions the researchers had in mind while collecting the data is irrelevant, since the data is just an illustration. But in this text, every dataset is presented with a motivating research question, so we can see how the statistical method helps answer the question.
- Many of the exercises and activities included in these notes ask students to answer real research questions or determine what data or methods would be necessary to do so.
Third, many classic techniques in regression exist because computation was difficult in the 1970s, not because they are the best techniques now. There are many classic inference results for special cases where computation is easier, but the special cases are no longer necessary. \(R^2\) is not necessary to summarize a model fit when we have tools to estimate prediction error. We do not need to manually explore transforming each covariate when splines and additive models are easily available. ANOVA tables are not necessary when we need not calculate tests by hand—and indeed, giving up the manipulation of sums of squares allows us to view regression geometrically instead, which is much simpler.
Fourth, a data analyst’s job is not complete until they communicate their findings. Students must be able to write data analysis reports that not only explain the analysis they conducted but interpret the results to answer the scientific questions. Writing is integrated into this course, and beyond the treatment of writing itself (e.g. in Chapter 25, Genre Conventions), each chapter introducing models discusses how they can be presented and interpreted in writing. Models in the examples appear as they would in a full report, with formatted tables of coefficients, fully labeled graphs, descriptive captions for figures, confidence intervals for relevant quantities, and so on, so each example both illustrates a statistical method and demonstrates how to write about it.
Finally, there are some new approaches that I feel are useful for using and interpreting regression models:
- I distinguish between predictors, the measured variables of interest, and regressors, the quantities entered into the design matrix. In a simple linear model, the regressors are the predictors; but in a model with factors, interactions, polynomials, and so on, a single predictor might produce multiple regressors. The distinction is useful conceptually—we are interested in the relationships between predictor and response, not regressors and response—and helps illustrate certain points in variable selection and model-building, as the first code sketch after this list shows.
- I use predictor effect plots to visualize predictor relationships when they are represented with complex regressors, such as interactions, splines, or nonparametric smoothers.
- I use partial residuals as a diagnostic tool for regression. Partial residuals can be computed for each predictor (not regressor!) and show both the modeled relationship between predictor and response and deviations from it, making it much easier to detect and correct model misspecification.
- For generalized linear models, I introduce randomized quantile residuals (Dunn and Smyth 1996), which are perhaps the only actually useful residuals for regressions with discrete responses; the second sketch below demonstrates them.
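To make the predictor/regressor distinction concrete, here is a minimal sketch in R with made-up data (the data and variable names are mine, purely for illustration). Two predictors, a factor and a continuous variable entered as a polynomial, expand into four regressor columns in the design matrix:

```r
# Made-up data: one factor predictor and one continuous predictor.
set.seed(17)
d <- data.frame(
  group = factor(sample(c("a", "b", "c"), 20, replace = TRUE)),
  x = runif(20),
  y = rnorm(20)
)

fit <- lm(y ~ group + poly(x, 2), data = d)

# Two predictors (group and x), but four regressors besides the
# intercept: two dummy variables for group and two polynomial
# terms for x.
head(model.matrix(fit))
```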
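Similarly, here is a minimal sketch of randomized quantile residuals for a Poisson regression, again with simulated data of my own invention rather than anything from the text. Following Dunn and Smyth (1996), for a discrete response we draw u uniformly between the fitted CDF evaluated just below and at the observed value; if the model is correct, qnorm(u) is standard normal:

```r
# Simulated Poisson data where the model is correctly specified.
set.seed(42)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.5 * x))

fit <- glm(y ~ x, family = poisson)
mu <- fitted(fit)

# Randomized quantile residuals: draw u ~ Uniform(F(y - 1), F(y))
# under the fitted model, then map to the normal scale.
# (ppois(-1, mu) is 0, so y = 0 needs no special handling.)
u <- runif(n, ppois(y - 1, mu), ppois(y, mu))
rq <- qnorm(u)

# If the model fits, these should look like draws from N(0, 1).
qqnorm(rq); qqline(rq)
```

(Dunn and Smyth’s statmod package provides an implementation, so in practice you need not roll your own.)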
I did not invent any of these new approaches, but no text presents them all as part of a coherent data analysis practice.
In short, this text exists for the only possible reason: I am opinionated about regression.
Prerequisite knowledge
These notes assume the reader is familiar with:
- Matrix and vector operations: matrix multiplication, inverses, projection, and eigenvectors
- Multivariable calculus: derivatives, integrals, and gradients
- Statistical inference: sampling distributions, point estimation, hypothesis tests, and confidence intervals, at the level of an advanced undergraduate or introductory graduate course (e.g. Casella and Berger (2002))
- Basic linear regression: least squares, interpreting slopes and intercepts, making predictions
- Basic R programming: loading data, making plots, manipulating data frames
Acknowledgments
These lecture notes owe a great deal to many people: to Valérie Ventura, who taught this course to me so that I could teach it to others; to Ann Lee, with whom I co-taught an advanced undergraduate modeling course for three years and from whom I learned many things; to the Ph.D. students who put up with early versions of this course; to the History of Statistics reading group (Peter Elliott, Lee Richardson, Taylor Pospisil, Kevin Lin, and others), whose discussions informed my thinking about the goals of statistical modeling; and more broadly to my instructors, colleagues, and fellow Ph.D. students, who somehow turned me from a physicist into a statistician in just five years.