A lineup hides diagnostics among "null" diagnostics, i.e. the same
diagnostics calculated using models fit to data where all model assumptions
are correct. For each null diagnostic, model_lineup() simulates new
responses from the model using the fitted covariate values and the model's
error distribution, link function, and so on. Hence the new response values
are generated under ideal conditions: the fitted model is true and all
assumptions hold. decrypt() reveals which diagnostics are the true
diagnostics.
Arguments
- fit
A fitted model, such as one produced by lm() or glm().
- fn
A diagnostic function. The function's first argument should be the fitted model, and it must return a data frame. Defaults to broom::augment(), which produces a data frame containing the original data and additional columns .fitted, .resid, and so on. To see a list of model types supported by broom::augment(), and to find documentation on the columns reported for each type of model, load the broom package and use methods(augment).
- nsim
Number of total diagnostics. For example, if nsim = 20, the diagnostics for fit are hidden among 19 null diagnostics.
- ...
Additional arguments passed to fn each time it is called.
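As a hedged illustration of how extra arguments reach fn, the sketch below defines a hypothetical diagnostic function, typed_resids() (not part of any package), whose type parameter is supplied through ... in the model_lineup() call. It assumes the regressinator package, which provides model_lineup(), is installed.

```r
library(regressinator)  # assumed source of model_lineup()

fit <- lm(dist ~ speed, data = cars)

# Hypothetical diagnostic function: the first argument is the model fit;
# `type` is an extra argument forwarded by model_lineup() through `...`
typed_resids <- function(f, type) {
  data.frame(resid = residuals(f, type = type),
             fitted = fitted(f))
}

# `type = "pearson"` is not an argument of model_lineup(), so it is
# passed on to typed_resids() each time it is called
lineup <- model_lineup(fit, fn = typed_resids, nsim = 5, type = "pearson")
```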
Value
A data frame (tibble) with columns corresponding to the columns
returned by fn. The additional column .sample indicates which set of
diagnostics each row is from. For instance, if the true data is in position
5, selecting rows with .sample == 5 will retrieve the diagnostics from
the original model fit.
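The row selection described above might look like the following sketch. The value of true_pos is a placeholder chosen for illustration; in practice it would be the position reported by decrypt(). The sketch assumes the regressinator package is installed.

```r
library(regressinator)  # assumed source of model_lineup()

fit <- lm(dist ~ speed, data = cars)
lineup <- model_lineup(fit, nsim = 5)

# Placeholder position: replace with the one reported by decrypt()
true_pos <- 5

# Keep only the rows belonging to that set of diagnostics
selected <- lineup[lineup$.sample == true_pos, ]
```

Each set of diagnostics in this lineup has one row per observation in cars, so selected has 50 rows.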
Details
To generate different kinds of diagnostics, the user can provide a custom
fn. The fn should take a model fit as its argument and return a data
frame. For instance, the data frame might contain one row per observation and
include the residuals and fitted values for each observation; or it might be
a single row containing a summary statistic or test statistic.
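As a sketch of the single-row case, the hypothetical function fit_summary() below (a name invented for this illustration) returns one row holding the overall F statistic and R-squared of a linear model. It uses only base R and could be supplied as fn = fit_summary.

```r
fit <- lm(dist ~ speed, data = cars)

# Hypothetical single-row diagnostic: takes a model fit, returns a
# one-row data frame of summary statistics
fit_summary <- function(f) {
  s <- summary(f)
  data.frame(f_statistic = unname(s$fstatistic["value"]),
             r_squared = s$r.squared)
}

fit_summary(fit)
```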
fn will be called on the original fit provided. Then
parametric_boot_distribution() will be used to simulate data from the model
fit nsim - 1 times, refit the model to each simulated dataset, and run fn
on each refit model. The null distribution is conditional on X, i.e. the
covariates used will be identical, and only the response values will be
simulated. The data frames are concatenated with an additional .sample
column identifying which fit each row came from.
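The procedure can be approximated in base R as follows. This is a simplified sketch for an lm() fit, not the package's implementation: diag_fn() is a hypothetical diagnostic function, and assigning positions with sample() is an assumption about how the true diagnostics are hidden.

```r
fit <- lm(dist ~ speed, data = cars)
nsim <- 5

# Hypothetical diagnostic function returning one row per observation
diag_fn <- function(f) {
  data.frame(resid = residuals(f), fitted = fitted(f))
}

set.seed(1)
# One column of simulated responses per draw; covariates are untouched
sims <- simulate(fit, nsim = nsim - 1)

# Refit the model to each simulated response and compute diagnostics
null_diags <- lapply(seq_len(nsim - 1), function(i) {
  sim_data <- cars
  sim_data$dist <- sims[[i]]
  diag_fn(update(fit, data = sim_data))
})

# Combine true and null diagnostics, placing each set at a random position
all_diags <- c(list(diag_fn(fit)), null_diags)
positions <- sample(nsim)
lineup <- do.call(rbind, Map(function(d, pos) {
  d$.sample <- pos
  d
}, all_diags, positions))
lineup <- lineup[order(lineup$.sample), ]
```

Here positions[1] records where the true diagnostics ended up, which is the information model_lineup() conceals behind decrypt().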
When called, this function will print a message such as
decrypt("sD0f gCdC En JP2EdEPn ZY"). This is how to get the location of the
true diagnostics among the null diagnostics: evaluating this in the R console
will produce a string such as "True data in position 5".
Model limitations
Because this function uses S3 generic methods such as model.frame(),
simulate(), and update(), it can be used with any model fit for which
methods are provided. In base R, this includes lm() and glm().
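To illustrate, the three generics named above can be called directly on a glm() fit. The logistic regression on mtcars is only an example chosen for this sketch.

```r
glm_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Data frame of the variables used in the fit
mf <- model.frame(glm_fit)

# Draw one set of new 0/1 responses from the fitted model
set.seed(42)
sim <- simulate(glm_fit, nsim = 1)

# Refit the same model to the simulated responses
sim_data <- mtcars
sim_data$am <- sim[[1]]
refit <- update(glm_fit, data = sim_data)
```

A model class that lacks any of these methods could not be simulated from and refit in this way.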
The model provided as fit must be fit using the data argument to provide
a data frame. For example:
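A minimal sketch of such a fit, using the built-in cars data as in the Examples section (the choice of dataset is an assumption, since the original snippet is not shown here):

```r
# Variables in the formula are looked up in the data frame passed as `data`
fit <- lm(dist ~ speed, data = cars)
```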
When simulating new data, this function provides the simulated data as the
data argument and re-fits the model. If you instead refer directly to local
variables in the model formula, this will not work. For example, if you fit a
model this way:
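A sketch of the problematic pattern, assuming the variables are pulled from cars with $ (inferred from the mention of cars in the surrounding text, not copied from the original snippet). The second half shows, using only base R, that supplying new data to such a fit changes nothing:

```r
# Will not work with model_lineup(): the formula names cars directly
bad_fit <- lm(cars$dist ~ cars$speed)

# Build a dataset with different responses and try to refit on it
new_data <- cars
new_data$dist <- rev(cars$dist)
refit <- update(bad_fit, data = new_data)

# The formula still reads cars$dist and cars$speed from the environment,
# so new_data is ignored and the coefficients are unchanged
all.equal(coef(refit), coef(bad_fit))
```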
It will not be possible to refit the model using simulated datasets, as that
would require modifying your environment to edit cars.
References
Buja et al. (2009). Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A, 367 (1906), pp. 4361-4383. doi:10.1098/rsta.2009.0120
Wickham et al. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics, 16 (6), pp. 973-979. doi:10.1109/TVCG.2010.161
See also
parametric_boot_distribution() to simulate draws by using the
fitted model to draw new response values; sampling_distribution() to
simulate draws from the population distribution, rather than from the model.
Examples
fit <- lm(dist ~ speed, data = cars)
model_lineup(fit, nsim = 5)
#> decrypt("gOh4 kDUD pB 9RtpUpRB v5")
#> # A tibble: 250 × 9
#>      dist speed .fitted .resid   .hat .sigma  .cooksd .std.resid .sample
#>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>   <dbl>
#>  1   15.8     4   -5.22   21.0  0.115   15.3    0.135       1.44       1
#>  2  -29.9     4   -5.22  -24.6  0.115   15.2    0.185      -1.69       1
#>  3   6.14     7    7.36  -1.21 0.0715   15.7 0.000253    -0.0811       1
#>  4   6.19     7    7.36  -1.16 0.0715   15.7 0.000233    -0.0778       1
#>  5   9.53     8    11.5  -2.02 0.0600   15.7 0.000573     -0.134       1
#>  6   9.30     9    15.7  -6.44 0.0499   15.7  0.00476     -0.426       1
#>  7   31.4    10    19.9   11.5 0.0413   15.6   0.0123      0.756       1
#>  8   53.5    10    19.9   33.6 0.0413   14.9    0.105       2.21       1
#>  9  -3.34    10    19.9  -23.3 0.0413   15.3   0.0505      -1.53       1
#> 10   33.6    11    24.1   9.43 0.0341   15.6  0.00675      0.618       1
#> # ℹ 240 more rows
resids_vs_speed <- function(f) {
  data.frame(resid = residuals(f),
             speed = model.frame(f)$speed)
}
model_lineup(fit, fn = resids_vs_speed, nsim = 5)
#> decrypt("gOh4 kDUD pB 9RtpUpRB vv")
#> # A tibble: 250 × 3
#>     resid speed .sample
#>     <dbl> <dbl>   <dbl>
#>  1   3.85     4       1
#>  2   11.8     4       1
#>  3  -5.95     7       1
#>  4   12.1     7       1
#>  5   2.12     8       1
#>  6  -7.81     9       1
#>  7  -3.74    10       1
#>  8   4.26    10       1
#>  9   12.3    10       1
#> 10  -8.68    11       1
#> # ℹ 240 more rows