Construct a data frame by binning the fitted values or predictors of a model into discrete bins of equal width, and calculating the average value of the residuals within each bin.
Arguments
- fit
The model to obtain residuals for. This can be a model fit with lm() or glm(), or any model that has residuals() and fitted() methods.
- predictors
Predictors to calculate binned residuals for. Defaults to all predictors, skipping factors. Predictors can be specified using tidyselect syntax; see help("language", package = "tidyselect") and the examples below. Specify predictors = .fitted to obtain binned residuals versus fitted values.
- breaks
Number of bins to create. If NULL, a default number of breaks is chosen based on the number of rows in the data.
- ...
Additional arguments passed on to residuals(). The most useful additional argument is typically type, to select the type of residuals to produce (such as standardized residuals or deviance residuals).
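For instance, a minimal sketch of selecting predictors with tidyselect and forwarding a residual type through ... (the binomial fit to the built-in mtcars data is only an illustration, and assumes the package providing binned_residuals() is attached):
fit_bin <- glm(am ~ hp + wt, data = mtcars, family = binomial)
# Select both predictors with c(); type = "pearson" is forwarded to residuals()
binned_residuals(fit_bin, c(hp, wt), breaks = 4, type = "pearson")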
Value
Data frame (tibble) with one row per bin per selected predictor, and the following columns:
- .bin
Bin number.
- n
Number of observations in this bin.
- predictor_name
Name of the predictor that has been binned.
- predictor_min, predictor_max, predictor_mean, predictor_sd
Minimum, maximum, mean, and standard deviation of the predictor (or fitted values).
- resid_mean
Mean residual in this bin.
- resid_sd
Standard deviation of residuals in this bin.
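The bin summaries can be plotted directly from these columns. For example, a minimal sketch using ggplot2 (assumed to be installed, with the package providing binned_residuals() attached):
library(ggplot2)
fit <- lm(mpg ~ disp + hp, data = mtcars)
binned <- binned_residuals(fit, breaks = 5)
# One panel per binned predictor: mean residual versus mean predictor value
ggplot(binned, aes(x = predictor_mean, y = resid_mean)) +
  geom_point() +
  facet_wrap(vars(predictor_name), scales = "free_x")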
Details
In many generalized linear models, the residual plots (Pearson or deviance) are not useful because the response variable takes on very few possible values, causing strange patterns in the residuals. For instance, in logistic regression, plotting the residuals versus covariates usually produces two curved lines.
If we first bin the data, i.e. divide up the observations into breaks bins
based on their fitted values, we can calculate the average residual within
each bin. This can be more informative: if a region has 20 observations and
its average residual value is large, this suggests those observations are
collectively poorly fit. We can also bin each predictor and calculate
averages within those bins, allowing the detection of misspecification for
specific model terms.
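As a brief sketch of the logistic-regression case described above (using the built-in mtcars data purely for illustration, and assuming the package providing binned_residuals() is attached):
fit_logit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
# Raw response residuals versus fitted values trace two curves, one per
# observed outcome, and are hard to read directly
plot(fitted(fit_logit), residuals(fit_logit, type = "response"))
# Averaging residuals within bins of the fitted values is easier to interpret
binned_residuals(fit_logit, predictors = .fitted, breaks = 5)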
Limitations
Factor predictors (as factors, logical, or character vectors) are detected
automatically and omitted. However, if a numeric variable is converted to
factor in the model formula, such as with y ~ factor(x), the function
cannot determine the appropriate type and will raise an error. Create factors
as needed in the source data frame before fitting the model to avoid this
issue.
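For example, a sketch of this workaround, creating the factor in the data before fitting rather than inside the formula:
cars <- mtcars
cars$cyl <- factor(cars$cyl)
fit_f <- lm(mpg ~ disp + cyl, data = cars)
# cyl is detected as a factor and skipped automatically; only disp is binned
binned_residuals(fit_f)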
References
Gelman, A., Hill, J., and Vehtari, A. (2021). Regression and Other Stories. Section 14.5. Cambridge University Press.
See also
partial_residuals() for the related partial residuals;
vignette("logistic-regression-diagnostics") and
vignette("other-glm-diagnostics") for examples of use and interpretation
of binned residuals in logistic regression and GLMs; bin_by_interval()
and bin_by_quantile() to bin data and calculate other values in each bin.
Examples
fit <- lm(mpg ~ disp + hp, data = mtcars)
# Automatically bins both predictors:
binned_residuals(fit, breaks = 5)
#> # A tibble: 10 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 7 disp 71.1 120. 89.7
#> 2 2 7 disp 120. 160 142.
#> 3 3 7 disp 168. 276. 235.
#> 4 4 4 disp 301 350 318.
#> 5 5 7 disp 351 472 406.
#> 6 1 7 hp 52 93 70.7
#> 7 2 7 hp 95 110 105.
#> 8 3 5 hp 113 150 132.
#> 9 4 6 hp 175 180 178.
#> 10 5 7 hp 205 335 248.
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>
# Just bin one predictor, selected with tidyselect syntax. Multiple could be
# selected with c().
binned_residuals(fit, disp, breaks = 5)
#> # A tibble: 5 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 7 disp 71.1 120. 89.7
#> 2 2 7 disp 120. 160 142.
#> 3 3 7 disp 168. 276. 235.
#> 4 4 4 disp 301 350 318.
#> 5 5 7 disp 351 472 406.
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>
# Bin the fitted values:
binned_residuals(fit, predictors = .fitted)
#> # A tibble: 10 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 4 .fitted 11.3 13.3 11.9
#> 2 2 3 .fitted 13.5 14.0 13.8
#> 3 3 3 .fitted 14.3 17.4 15.7
#> 4 4 4 .fitted 17.8 17.9 17.9
#> 5 5 2 .fitted 20.2 21.3 20.7
#> 6 6 3 .fitted 22.0 22.6 22.4
#> 7 7 3 .fitted 23.1 24.1 23.5
#> 8 8 3 .fitted 24.4 24.7 24.6
#> 9 9 3 .fitted 24.8 25.1 25.0
#> 10 10 4 .fitted 26.7 27.1 26.9
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>
# Bins are made using the predictor, not regressors derived from it, so here
# disp is binned, not its polynomial
fit2 <- lm(mpg ~ poly(disp, 2), data = mtcars)
binned_residuals(fit2)
#> # A tibble: 10 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 4 disp 71.1 79 76.1
#> 2 2 3 disp 95.1 120. 108.
#> 3 3 3 disp 120. 141. 127.
#> 4 4 4 disp 145 160 153.
#> 5 5 2 disp 168. 168. 168.
#> 6 6 5 disp 225 276. 262.
#> 7 7 1 disp 301 301 301
#> 8 8 3 disp 304 350 324
#> 9 9 3 disp 351 360 357
#> 10 10 4 disp 400 472 443
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>