Much of what I know about regression is codified in my lecture notes for 36-707 Regression Analysis.
Berk, R., Buja, A., Brown, L., George, E., Kuchibhotla, A. K., Su, W., & Zhao, L. (2019). Assumption lean regression. The American Statistician. doi:10.1080/00031305.2019.1592781
and
Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L., & Zhang, K. (2019). Models as approximations I: Consequences illustrated with linear regression. Statistical Science. https://arxiv.org/abs/1404.1578
If nothing is truly linear, what exactly are we estimating with a linear model? These papers discuss “assumption-lean” linear regression, in which we assume no particular parametric form for the joint distribution of the regressors and the response. We can still define what we are estimating, and it still has useful properties, but we must be more careful about interpreting the model and estimating uncertainty.
This is a useful framework: it provides a single explanation for many questions about misspecified models. The American Statistician article is a more accessible version of the lengthier and more mathematically detailed Statistical Science article.
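For practical inference under this view, the papers point to model-robust uncertainty estimates (sandwich standard errors or the pairs bootstrap) rather than the usual model-based standard errors. A minimal sketch with the sandwich and lmtest packages, using built-in data purely for illustration:

```r
# Model-robust (sandwich) standard errors for a linear fit treated as an
# approximation to an unknown conditional mean; mtcars is just a placeholder.
library(sandwich)
library(lmtest)

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Classical SEs from summary(fit) assume the linear model is correct;
# sandwich SEs remain valid when it is only an approximation.
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))
```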
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151–169. doi:10.1146/annurev.publhealth.23.100901.140546
How much does the assumption of normally distributed residuals matter in regression? A simulation study, with a clear answer: not much, provided you have a decent sample size.
Also known as the marginality principle, this is the “principle” that one must include main effects in a linear model when including their interactions.
Peixoto, J. L. (1990). A property of well-formulated polynomial regression models. The American Statistician, 44(1), 26–30. doi:10.1080/00031305.1990.10475687
Formalizes a motivation for the hierarchy principle. Consider a matrix of predictors X and let Z(X) be the model matrix formed by adding polynomial terms, interactions, and so on. Let W be a linear transformation of X of the form W = XA + \mathbf{1} b^T, where A is diagonal and \mathbf{1} denotes a vector of ones. W hence represents a rescaling (given by A) and shift (given by b) of each column of X. A model that obeys the hierarchy principle is one where the column space of Z(X) is identical to the column space of Z(W), for any such W. In other words, models that do not obey the hierarchy principle change their predictions under rescaling of the covariates, so changing the units or origins of predictors in X could give different fits. Note this argument also implies one should include interaction terms when adding quadratic terms.
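A quick simulated illustration of the invariance argument (my own toy example, not from the paper): shifting a predictor's origin changes the fitted values of a model that violates the hierarchy principle, but not those of a well-formulated one.

```r
# Shift the origin of x1 and compare fitted values before and after.
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + d$x1 * d$x2 + rnorm(50)
d_shift <- transform(d, x1 = x1 + 10)

# Interaction without main effects: the fits depend on the origin of x1
bad  <- lm(y ~ x1:x2, data = d)
bad2 <- lm(y ~ x1:x2, data = d_shift)
max(abs(fitted(bad) - fitted(bad2)))   # noticeably nonzero

# Hierarchical model: fitted values are invariant to the shift
good  <- lm(y ~ x1 * x2, data = d)
good2 <- lm(y ~ x1 * x2, data = d_shift)
max(abs(fitted(good) - fitted(good2))) # zero up to numerical error
```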
Nelder, J. A. (1998). The selection of terms in response-surface models—how strong is the weak-heredity principle? The American Statistician, 52(4), 315–318. doi:10.1080/00031305.1998.10480588
An alternate formulation of the principle, pointing out that omitting main effects (or other terms demanded by the hierarchy principle) is equivalent to asserting in the model that zero values of specific predictors have special meaning, which is only reasonable when a priori knowledge justifies it. For instance, “suppose that x_1 is the dose of a drug, and that x_2 is the amount of a harmless synergist added to enhance its effect; suppose also that the response to x_1 is linear for each x_2. Then if the dose is zero, so that the synergist has nothing to act on, the response will be the same (not necessarily zero) and independent of x_2.” Hence omitting the main effect for x_2 would be reasonable.
See also Experimental design.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295–318. doi:10.1214/12-aoas583
Freedman showed that in a randomized experiment, adjusting for pre-treatment covariates with linear regression can harm your treatment effect estimates. Lin shows the problem is not as bad as Freedman thought, and reviews the behavior of regression in designed experiments. Asymptotically, adjusting for covariates improves precision even under misspecification, though I’m unsure what practical implications this has, as typical experimental design settings have small sample sizes.
Ding, P. (2019). Two paradoxical results in linear models: The variance inflation factor and the analysis of covariance. arXiv. https://arxiv.org/abs/1903.03883
In linear model texts, the Variance Inflation Factor (VIF) measures the multiplicative increase in the estimation variance of a regression coefficient caused by adding new variables to the model, and it is always at least 1. But in an experiment, ANCOVA increases the efficiency of the treatment effect estimate (otherwise, why bother with covariates?). What’s the deal? Essentially, the VIF applies in settings where all covariates are considered fixed, including treatment assignment; but if the treatment assignment is random, as is the case in an experiment, asymptotically the adjustment for covariates reduces estimation variance.
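A toy simulation (mine, not from the paper) showing both sides at once: the in-sample VIF of the treatment indicator is above 1, yet adjusting for a prognostic covariate shrinks the standard error of the treatment effect.

```r
# Randomized treatment, covariate strongly prognostic for the outcome.
set.seed(1)
n <- 200
x <- rnorm(n)                          # pre-treatment covariate
treat <- rbinom(n, 1, 0.5)             # randomized treatment assignment
y <- 1 + 2 * treat + 3 * x + rnorm(n)

unadj <- lm(y ~ treat)
adj   <- lm(y ~ treat + x)

summary(unadj)$coefficients["treat", "Std. Error"]  # larger
summary(adj)$coefficients["treat", "Std. Error"]    # smaller: ANCOVA helps

# The VIF of treat still exceeds 1, because treat and x are slightly
# correlated in any finite sample:
1 / (1 - summary(lm(treat ~ x))$r.squared)
```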
Cook, R. D. (1993). Exploring partial residual plots. Technometrics, 35(4), 351–362. doi:10.1080/00401706.1993.10485350
Cook, R. D., & Croos-Dabrera, R. (1998). Partial residual plots in generalized linear models. Journal of the American Statistical Association, 93(442), 730–739. doi:10.1080/01621459.1998.10473725
Partial residual plots give a more interpretable set of regression diagnostic plots for linear models and GLMs, mainly because one can read off the shape of any curvature/nonlinearity directly from the plot. They can also be easily placed on top of effects plots (below).
Incidentally, Gelman & Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models advocates for binned residual plots for GLMs, but I think these can be subsumed by smoothed partial residual plots: the partial residuals allow plotting against each covariate, and smoothing handles the problem of having discrete outcomes and hence limited residual distributions.
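Base R’s termplot() will draw partial residual plots directly, and car::crPlots() is a more polished implementation of the same idea. A minimal sketch on built-in data:

```r
# Partial residual (component + residual) plots for each term, with a
# smoother overlaid so curvature can be read off directly.
fit <- lm(mpg ~ wt + hp, data = mtcars)
termplot(fit, partial.resid = TRUE, smooth = panel.smooth)
```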
Fox, J., & Weisberg, S. (2018). Visualizing fit and lack of fit in complex regression models with predictor effect plots and partial residuals. Journal of Statistical Software, 87(9). doi:10.18637/jss.v087.i09
Effects plots for ordinary linear regression are trivial, but they become more useful when you have interactions and transformations – or GLMs where the link function makes reasoning about outcomes more difficult. Effects plots make it easy to see how the effect of one variable varies based on its interaction with another, by producing plots conditioning on varying values of interacting variables and marginalizing over the rest.
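These are implemented in the authors’ effects package; a minimal sketch, with the model and data chosen only for illustration:

```r
# Predictor effect plots with partial residuals for a model with an
# interaction; panels condition on values of the interacting variable.
library(effects)
fit <- lm(mpg ~ wt * hp, data = mtcars)
plot(predictorEffects(fit, residuals = TRUE))
```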
Dunn, P. K., & Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5(3), 236–244. doi:10.2307/1390802
Provides a way to produce residuals for GLMs that are still easy to interpret, without the usual problems when outcomes are discrete and small (e.g. the mostly useless residuals of logistic regression). A GLM models the conditional distribution of Y \mid X = x, so use it: let F(y_i; x_i, \beta) be the cdf of the conditional distribution. Then evaluate F(y_i; x_i, \hat \beta) for all observations i.
When F is continuous, the result is uniform. (You can convert it back to normal if you want normally distributed residuals to match what you’re used to.) When F is discrete, there’s an extra randomization step that essentially jitters the values based on the steps of the cdf, producing a residual that is still uniform. You can then use all the usual residual diagnostic plots.
Implemented in R in statmod::qresiduals(). See also Dunn and Smyth (2018), Generalized Linear Models With Examples in R, Springer.
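A minimal sketch for a Poisson GLM, on simulated data:

```r
# Randomized quantile residuals: approximately N(0, 1) when the model is
# well specified, so a normal QQ plot is the natural diagnostic.
library(statmod)
set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson)
qres <- qresiduals(fit)
qqnorm(qres); abline(0, 1)
```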
Bates, S., Hastie, T., & Tibshirani, R. (2024). Cross-validation: What does it estimate and how well does it do it? Journal of the American Statistical Association, 119(546), 1434–1445. doi:10.1080/01621459.2023.2197686
Cross-validation is often used to estimate prediction error in regression. We implicitly assume it estimates the prediction error conditional on our training set, so we know how well our model fit to this training data will perform. However, it does not: it is a better estimate of the average prediction error, where the average is over all training sets drawn from the population. That is surprising (surely the average over unseen training sets is harder to estimate?) and affects our interpretation of CV for model selection.
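A toy simulation in the spirit of the paper (my own construction, not theirs): with standard normal predictors and no intercept, the true conditional error of an OLS fit is \sigma^2 + \| \hat\beta - \beta \|^2, so we can compare it directly to the cross-validation estimate across many training sets.

```r
# Compare 10-fold CV estimates to the true conditional prediction error.
set.seed(1)
n <- 50; p <- 5; sigma <- 1
beta <- rep(1, p)
reps <- 200
cv_est <- cond_err <- numeric(reps)
for (r in 1:reps) {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta) + sigma * rnorm(n)
  folds <- sample(rep(1:10, length.out = n))
  sqerr <- numeric(n)
  for (k in 1:10) {
    test <- folds == k
    bhat_k <- lm.fit(X[!test, , drop = FALSE], y[!test])$coefficients
    sqerr[test] <- (y[test] - X[test, , drop = FALSE] %*% bhat_k)^2
  }
  cv_est[r] <- mean(sqerr)
  bhat <- lm.fit(X, y)$coefficients
  cond_err[r] <- sigma^2 + sum((bhat - beta)^2)  # true conditional error
}
mean(cv_est); mean(cond_err)  # similar: CV tracks the average over training sets
cor(cv_est, cond_err)         # low: it barely tracks this particular fit's error
```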