Much of what I know about regression is codified in my lecture notes for 36-707 Regression Analysis.
Berk, R., Buja, A., Brown, L., George, E., Kuchibhotla, A. K., Su, W., & Zhao, L. (2019). Assumption lean regression. The American Statistician. doi:10.1080/00031305.2019.1592781
and
Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L., & Zhang, K. (2019). Models as approximations I: Consequences illustrated with linear regression. Statistical Science. https://arxiv.org/abs/1404.1578
If nothing is truly linear, what exactly are we estimating with a linear model? These papers discuss “assumption-lean” linear regression, in which we assume no particular parametric form for the joint distribution of the regressors and the response. We can still define what we are estimating, and it still has useful properties, but we must be more careful about interpreting the model and estimating uncertainty.
This is a useful framework: it provides a single explanation for many questions about misspecified models. The American Statistician article is a more accessible version of the lengthier and more mathematically detailed Statistical Science article.
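For practical inference under this view, the papers point to model-robust uncertainty estimates (sandwich standard errors or the pairs bootstrap) rather than the usual model-based standard errors. A minimal sketch with the sandwich and lmtest packages, using built-in data purely for illustration:

```r
# Model-robust (sandwich) standard errors for a linear fit treated as an
# approximation to an unknown conditional mean; mtcars is just a placeholder.
library(sandwich)
library(lmtest)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Classical SEs from summary(fit) assume the linear model is correct;
# sandwich SEs remain valid when it is only an approximation.
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))
```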
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151–169. doi:10.1146/annurev.publhealth.23.100901.140546
How much does the assumption of normally distributed residuals matter in regression? A simulation study, with a clear answer: not much, provided you have a decent sample size.
Also known as the marginality principle, this is the “principle” that one must include main effects in a linear model when including their interactions.
Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. The American Statistician, 44(1), 26–30. doi:10.1080/00031305.1990.10475687
Formalizes a motivation for the hierarchy principle. Consider a matrix of predictors X and let Z(X) be the model matrix formed by adding polynomial terms, interactions, and so on. Let W be a linear transformation of X of the form W = XA + \mathbf{1} b^T, where A is diagonal and \mathbf{1} denotes a vector of ones. W hence represents a rescaling (given by A) and shift (given by b) of each column of X. A model that obeys the hierarchy principle is one where the column space of Z(X) is identical to the column space of Z(W), for any such W. In other words, models that do not obey the hierarchy principle change their predictions under rescaling of the covariates, so changing the units or origins of predictors in X could give different fits. Note this argument also implies one should include interaction terms when adding quadratic terms.
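A quick simulated illustration of the invariance argument (my own toy example, not from the paper): shifting a predictor's origin changes the fitted values of a model that violates the hierarchy principle, but not those of a well-formulated one.

```r
# Shift the origin of x1 and compare fitted values before and after.
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + d$x1 * d$x2 + rnorm(50)
d_shift <- transform(d, x1 = x1 + 10)

# Interaction without main effects: the fits depend on the origin of x1
bad  <- lm(y ~ x1:x2, data = d)
bad2 <- lm(y ~ x1:x2, data = d_shift)
max(abs(fitted(bad) - fitted(bad2)))   # noticeably nonzero

# Hierarchical model: fitted values are invariant to the shift
good  <- lm(y ~ x1 * x2, data = d)
good2 <- lm(y ~ x1 * x2, data = d_shift)
max(abs(fitted(good) - fitted(good2))) # zero up to numerical error
```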
Nelder, J. A. (1998). The selection of terms in response-surface models—how strong is the weak-heredity principle? The American Statistician, 52(4), 315–318. doi:10.1080/00031305.1998.10480588
An alternate formulation of the principle, pointing out that omitting main effects (or other terms demanded by the hierarchy principle) is equivalent to asserting in the model that zero values of specific predictors have special meaning, which is only reasonable when a priori knowledge justifies it. For instance, “suppose that x_1 is the dose of a drug, and that x_2 is the amount of a harmless synergist added to enhance its effect; suppose also that the response to x_1 is linear for each x_2. Then if the dose is zero, so that the synergist has nothing to act on, the response will be the same (not necessarily zero) and independent of x_2.” Hence omitting the main effect for x_2 would be reasonable.
See also Experimental design.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295–318. doi:10.1214/12-aoas583
Freedman showed that in a randomized experiment, adjusting for pre-treatment covariates with linear regression can harm your treatment effect estimates. Lin shows the problem is not as bad as Freedman thought, and reviews the behavior of regression in designed experiments. Asymptotically, adjusting for covariates improves precision even under misspecification, though I’m unsure what practical implications this has, as typical experimental design settings have small sample sizes.
Ding, P. (2019). Two paradoxical results in linear models: The variance inflation factor and the analysis of covariance. arXiv. https://arxiv.org/abs/1903.03883
In linear model texts, the Variance Inflation Factor (VIF) measures the multiplicative increase in the estimation variance of a regression coefficient caused by adding new variables to the model, and it is always at least 1. But in an experiment, ANCOVA increases the efficiency of the treatment effect estimate (otherwise, why bother with covariates?). What’s the deal? Essentially, the VIF applies in settings where all covariates are considered fixed, including treatment assignment; but if the treatment assignment is random, as is the case in an experiment, asymptotically the adjustment for covariates reduces estimation variance.
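A toy simulation (mine, not from the paper) showing both sides at once: the in-sample VIF of the treatment indicator is above 1, yet adjusting for a prognostic covariate shrinks the standard error of the treatment effect.

```r
# Randomized treatment, covariate strongly prognostic for the outcome.
set.seed(1)
n <- 200
x <- rnorm(n)                          # pre-treatment covariate
treat <- rbinom(n, 1, 0.5)             # randomized treatment assignment
y <- 1 + 2 * treat + 3 * x + rnorm(n)

unadj <- lm(y ~ treat)
adj   <- lm(y ~ treat + x)

summary(unadj)$coefficients["treat", "Std. Error"]  # larger
summary(adj)$coefficients["treat", "Std. Error"]    # smaller: ANCOVA helps

# The VIF of treat still exceeds 1, because treat and x are slightly
# correlated in any finite sample:
1 / (1 - summary(lm(treat ~ x))$r.squared)
```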
Cook, R. D. (1993). Exploring partial residual plots. Technometrics, 35(4), 351–362. doi:10.1080/00401706.1993.10485350
Cook, R. D., & Croos-Dabrera, R. (1998). Partial residual plots in generalized linear models. Journal of the American Statistical Association, 93(442), 730–739. doi:10.1080/01621459.1998.10473725
Partial residual plots give a more interpretable set of regression diagnostic plots for linear models and GLMs, mainly because one can read off the shape of any curvature/nonlinearity directly from the plot. They can also be easily placed on top of effects plots (below).
Incidentally, Gelman & Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models advocates for binned residual plots for GLMs, but I think these can be subsumed by smoothed partial residual plots: the partial residuals allow plotting against each covariate, and smoothing handles the problem of having discrete outcomes and hence limited residual distributions.
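Base R’s termplot() will draw partial residual plots directly, and car::crPlots() is a more polished implementation of the same idea. A minimal sketch on built-in data:

```r
# Partial residual (component + residual) plots for each term, with a
# smoother overlaid so curvature can be read off directly.
fit <- lm(mpg ~ wt + hp, data = mtcars)
termplot(fit, partial.resid = TRUE, smooth = panel.smooth)
```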
Fox, J., & Weisberg, S. (2018). Visualizing fit and lack of fit in complex regression models with predictor effect plots and partial residuals. Journal of Statistical Software, 87(9). doi:10.18637/jss.v087.i09
Effects plots for ordinary linear regression are trivial, but they become more useful when you have interactions and transformations – or GLMs where the link function makes reasoning about outcomes more difficult. Effects plots make it easy to see how the effect of one variable varies based on its interaction with another, by producing plots conditioning on varying values of interacting variables and marginalizing over the rest.
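These are implemented in the authors’ effects package; a minimal sketch, with the model and data chosen only for illustration:

```r
# Predictor effect plots with partial residuals for a model with an
# interaction; panels condition on values of the interacting variable.
library(effects)
fit <- lm(mpg ~ wt * hp, data = mtcars)
plot(predictorEffects(fit, residuals = TRUE))
```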
Dunn, P. K., & Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5(3), 236–244. doi:10.2307/1390802
Provides a way to produce residuals for GLMs that are still easy to interpret, without the usual problems when outcomes are discrete and small (e.g. the mostly useless residuals of logistic regression). A GLM models the conditional distribution of Y \mid X = x, so use it: let F(y_i; x_i, \beta) be the cdf of the conditional distribution. Then evaluate F(y_i; x_i, \hat \beta) for all observations i.
When F is continuous, the result is uniform. (You can convert it back to normal if you want normally distributed residuals to match what you’re used to.) When F is discrete, there’s an extra randomization step that essentially jitters the values based on the steps of the cdf, producing a residual that is still uniform. You can then use all the usual residual diagnostic plots.
Implemented in R in statmod::qresiduals(). See also Dunn and Smyth (2018), Generalized Linear Models With Examples in R, Springer.
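A minimal sketch for a Poisson GLM, on simulated data:

```r
# Randomized quantile residuals: approximately N(0, 1) when the model is
# well specified, so a normal QQ plot is the natural diagnostic.
library(statmod)
set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson)
qres <- qresiduals(fit)
qqnorm(qres); abline(0, 1)
```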
Bates, S., Hastie, T., & Tibshirani, R. (2024). Cross-validation: What does it estimate and how well does it do it? Journal of the American Statistical Association, 119(546), 1434–1445. doi:10.1080/01621459.2023.2197686
Cross-validation is often used to estimate prediction error in regression. We implicitly assume it estimates the prediction error conditional on our training set, so we know how well our model fit to this training data will perform. However, it does not: it is a better estimate of the average prediction error, where the average is over all training sets drawn from the population. That is surprising (surely the average over unseen training sets is harder to estimate?) and affects our interpretation of CV for model selection.
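A toy simulation in the spirit of the paper (my own construction, not theirs): with standard normal predictors and no intercept, the true conditional error of an OLS fit is \sigma^2 + \| \hat\beta - \beta \|^2, so we can compare it directly to the cross-validation estimate across many training sets.

```r
# Compare 10-fold CV estimates to the true conditional prediction error.
set.seed(1)
n <- 50; p <- 5; sigma <- 1
beta <- rep(1, p)
reps <- 200
cv_est <- cond_err <- numeric(reps)
for (r in 1:reps) {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta) + sigma * rnorm(n)
  folds <- sample(rep(1:10, length.out = n))
  sqerr <- numeric(n)
  for (k in 1:10) {
    test <- folds == k
    bhat_k <- lm.fit(X[!test, , drop = FALSE], y[!test])$coefficients
    sqerr[test] <- (y[test] - X[test, , drop = FALSE] %*% bhat_k)^2
  }
  cv_est[r] <- mean(sqerr)
  bhat <- lm.fit(X, y)$coefficients
  cond_err[r] <- sigma^2 + sum((bhat - beta)^2)  # true conditional error
}
mean(cv_est); mean(cond_err)  # similar: CV tracks the average over training sets
cor(cv_est, cond_err)         # low: it barely tracks this particular fit's error
```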