Preface

These lecture notes are an attempt to provide a Ph.D.-level introduction to applied regression analysis. It is traditional, in the preface to every textbook, to explain that there are of course many other textbooks covering similar topics, but none of those books do it in the right way. So it goes without saying that there are plenty of regression textbooks but none of them do it right. All that remains is for me to give excuses for my delusions of authorial grandeur.

First, this is an applied regression course. Unlike, say, Seber and Lee (2003) or Christensen (2011), which focus on mathematical theory to the exclusion of almost any data, I seek to motivate each topic through applications to real data. Working code is included and homework exercises expect students to apply the methods to answer questions about data.

Second, I view statistics as applied epistemology. That is: Statistics is the study of methods to learn about the real world from data. It tells us what we can and can’t infer from data, and suggests data we could collect to answer questions. Regression problems are a common class of problems that help us learn about the world. To that end:

Third, many classic techniques in regression exist because computation was difficult in the 1970s, not because they are the best techniques now. There are many classic inference results for special cases where computation is easier, but restricting ourselves to those special cases is no longer necessary. \(R^2\) is not necessary to summarize a model fit when we have tools to estimate prediction error. We do not need to manually explore transforming each covariate when splines and additive models are easily available. ANOVA tables are not necessary when we need not calculate tests by hand; indeed, giving up the manipulation of sums of squares allows us to view regression geometrically instead, which is much simpler.

Fourth, a data analyst’s job is not complete until they communicate their findings. Students must be able to write data analysis reports that not only explain the analysis they conducted but also interpret the results to answer the scientific questions. Writing is integrated into this course: beyond the dedicated treatment of writing (e.g. in Chapter 25, Genre Conventions), each chapter introducing models discusses how they can be presented and interpreted in writing. Models in examples are shown as they would appear in a full report, with formatted tables of coefficients, fully labeled graphs, descriptive captions for figures, confidence intervals for relevant quantities, and so on, so each example both illustrates a statistical method and demonstrates how to write about it.

Finally, there are some new approaches that I feel are useful to using and interpreting regression models:

I did not invent any of these new approaches, but no text presents them all as part of a coherent data analysis practice.

In short, this text exists for the only possible reason: I am opinionated about regression.

Prerequisite knowledge

These notes assume the reader is familiar with:

  • Matrix and vector operations: matrix multiplication, inverses, projection, and eigenvectors
  • Multivariable calculus: derivatives, integrals, and gradients
  • Statistical inference: sampling distributions, point estimation, hypothesis tests, and confidence intervals, at the level of an advanced undergraduate or introductory graduate course (e.g. Casella and Berger (2002))
  • Basic linear regression: least squares, interpreting slopes and intercepts, making predictions
  • Basic R programming: loading data, making plots, manipulating data frames

Acknowledgments

These lecture notes owe a great deal to many people: to Valérie Ventura, who taught this course to me so I can teach it to others; to Ann Lee, with whom I co-taught an advanced undergraduate modeling course for three years and from whom I learned many things; to the Ph.D. students who put up with early versions of this course; to the History of Statistics reading group (Peter Elliott, Lee Richardson, Taylor Pospisil, Kevin Lin, and others), whose discussions informed my thinking about the goals of statistical modeling; and more broadly to my instructors, colleagues, and fellow Ph.D. students, who somehow turned me from a physicist to a statistician in just five years.