Construct a data frame by binning the fitted values or predictors of a model into discrete bins of equal width, and calculating the average value of the residuals within each bin.
Arguments
- fit
The model to obtain residuals for. This can be a model fit with lm() or glm(), or any model that has residuals() and fitted() methods.
- predictors
Predictors to calculate binned residuals for. Defaults to all predictors, skipping factors. Predictors can be specified using tidyselect syntax; see help("language", package = "tidyselect") and the examples below. Specify predictors = .fitted to obtain binned residuals versus fitted values.
- breaks
Number of bins to create. If NULL, a default number of breaks is chosen based on the number of rows in the data.
- ...
Additional arguments passed on to residuals(). The most useful additional argument is typically type, to select the type of residuals to produce (such as standardized residuals or deviance residuals).
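For instance, a minimal sketch of selecting predictors with tidyselect and forwarding a residual type through ... (the binomial fit to the built-in mtcars data is only an illustration, and assumes the package providing binned_residuals() is attached):
fit_bin <- glm(am ~ hp + wt, data = mtcars, family = binomial)
# Select both predictors with c(); type = "pearson" is forwarded to residuals()
binned_residuals(fit_bin, c(hp, wt), breaks = 4, type = "pearson")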
Value
Data frame (tibble) with one row per bin per selected predictor, and the following columns:
- .bin
Bin number.
- n
Number of observations in this bin.
- predictor_name
Name of the predictor that has been binned.
- predictor_min, predictor_max, predictor_mean, predictor_sd
Minimum, maximum, mean, and standard deviation of the predictor (or fitted values).
- resid_mean
Mean residual in this bin.
- resid_sd
Standard deviation of residuals in this bin.
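The bin summaries can be plotted directly from these columns. For example, a minimal sketch using ggplot2 (assumed to be installed, with the package providing binned_residuals() attached):
library(ggplot2)
fit <- lm(mpg ~ disp + hp, data = mtcars)
binned <- binned_residuals(fit, breaks = 5)
# One panel per binned predictor: mean residual versus mean predictor value
ggplot(binned, aes(x = predictor_mean, y = resid_mean)) +
  geom_point() +
  facet_wrap(vars(predictor_name), scales = "free_x")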
Details
In many generalized linear models, the residual plots (Pearson or deviance) are not useful because the response variable takes on very few possible values, causing strange patterns in the residuals. For instance, in logistic regression, plotting the residuals versus covariates usually produces two curved lines.
If we first bin the data, i.e. divide up the observations into breaks bins
based on their fitted values, we can calculate the average residual within
each bin. This can be more informative: if a region has 20 observations and
its average residual value is large, this suggests those observations are
collectively poorly fit. We can also bin each predictor and calculate
averages within those bins, allowing the detection of misspecification for
specific model terms.
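As a brief sketch of the logistic-regression case described above (using the built-in mtcars data purely for illustration, and assuming the package providing binned_residuals() is attached):
fit_logit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
# Raw response residuals versus fitted values trace two curves, one per
# observed outcome, and are hard to read directly
plot(fitted(fit_logit), residuals(fit_logit, type = "response"))
# Averaging residuals within bins of the fitted values is easier to interpret
binned_residuals(fit_logit, predictors = .fitted, breaks = 5)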
Limitations
Factor predictors (as factors, logical, or character vectors) are detected
automatically and omitted. However, if a numeric variable is converted to
factor in the model formula, such as with y ~ factor(x), the function
cannot determine the appropriate type and will raise an error. Create factors
as needed in the source data frame before fitting the model to avoid this
issue.
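For example, a sketch of this workaround, creating the factor in the data before fitting rather than inside the formula:
cars <- mtcars
cars$cyl <- factor(cars$cyl)
fit_f <- lm(mpg ~ disp + cyl, data = cars)
# cyl is detected as a factor and skipped automatically; only disp is binned
binned_residuals(fit_f)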
References
Gelman, A., Hill, J., and Vehtari, A. (2021). Regression and Other Stories. Section 14.5. Cambridge University Press.
See also
partial_residuals() for the related partial residuals;
vignette("logistic-regression-diagnostics") and
vignette("other-glm-diagnostics") for examples of use and interpretation
of binned residuals in logistic regression and GLMs; bin_by_interval()
and bin_by_quantile() to bin data and calculate other values in each bin.
Examples
fit <- lm(mpg ~ disp + hp, data = mtcars)
# Automatically bins both predictors:
binned_residuals(fit, breaks = 5)
#> # A tibble: 10 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 7 disp 71.1 120. 89.7
#> 2 2 7 disp 120. 160 142.
#> 3 3 7 disp 168. 276. 235.
#> 4 4 4 disp 301 350 318.
#> 5 5 7 disp 351 472 406.
#> 6 1 7 hp 52 93 70.7
#> 7 2 7 hp 95 110 105.
#> 8 3 5 hp 113 150 132.
#> 9 4 6 hp 175 180 178.
#> 10 5 7 hp 205 335 248.
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>
# Just bin one predictor, selected with tidyselect syntax. Multiple could be
# selected with c().
binned_residuals(fit, disp, breaks = 5)
#> # A tibble: 5 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 7 disp 71.1 120. 89.7
#> 2 2 7 disp 120. 160 142.
#> 3 3 7 disp 168. 276. 235.
#> 4 4 4 disp 301 350 318.
#> 5 5 7 disp 351 472 406.
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>
# Bin the fitted values:
binned_residuals(fit, predictors = .fitted)
#> # A tibble: 10 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 4 .fitted 11.3 13.3 11.9
#> 2 2 3 .fitted 13.5 14.0 13.8
#> 3 3 3 .fitted 14.3 17.4 15.7
#> 4 4 4 .fitted 17.8 17.9 17.9
#> 5 5 2 .fitted 20.2 21.3 20.7
#> 6 6 3 .fitted 22.0 22.6 22.4
#> 7 7 3 .fitted 23.1 24.1 23.5
#> 8 8 3 .fitted 24.4 24.7 24.6
#> 9 9 3 .fitted 24.8 25.1 25.0
#> 10 10 4 .fitted 26.7 27.1 26.9
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>
# Bins are made using the predictor, not regressors derived from it, so here
# disp is binned, not its polynomial
fit2 <- lm(mpg ~ poly(disp, 2), data = mtcars)
binned_residuals(fit2)
#> # A tibble: 10 × 9
#> .bin n predictor_name predictor_min predictor_max predictor_mean
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 4 disp 71.1 79 76.1
#> 2 2 3 disp 95.1 120. 108.
#> 3 3 3 disp 120. 141. 127.
#> 4 4 4 disp 145 160 153.
#> 5 5 2 disp 168. 168. 168.
#> 6 6 5 disp 225 276. 262.
#> 7 7 1 disp 301 301 301
#> 8 8 3 disp 304 350 324
#> 9 9 3 disp 351 360 357
#> 10 10 4 disp 400 472 443
#> # ℹ 3 more variables: predictor_sd <dbl>, resid_mean <dbl>, resid_sd <dbl>