Simulate the sampling distribution of estimates from a population
Source:R/diagnose.R
sampling_distribution.Rd
Repeatedly refits the model to new samples from the population, calculates estimates for each fit, and compiles a data frame of the results.
Arguments
- fit
A model fit to data, such as by
lm()
orglm()
, to refit to each sample from the population.- data
Data drawn from a
population()
, usingsample_x()
and possiblysample_y()
. Thepopulation()
specification is used to draw the samples.- fn
Function to call on each new model fit to produce a data frame of estimates. Defaults to
broom::tidy()
, which produces a tidy data frame of coefficients, estimates, standard errors, and hypothesis tests.- nsim
Number of simulations to run.
- fixed_x
If
TRUE
, the default, the predictor variables are held fixed and only the response variables are redrawn from the population. IfFALSE
, the predictor and response variables are drawn jointly.- ...
Additional arguments passed to
fn
each time it is called.
Value
Data frame (tibble) of nsim + 1
simulation results, formed by
concatenating together the data frames returned by fn
. The .sample
column identifies which simulated sample each row came from. Rows with
.sample == 0
come from the original fit
.
Details
To generate sampling distributions of different quantities, the user can
provide a custom fn
. The fn
should take a model fit as its argument and
return a data frame. For instance, the data frame might contain one row per
estimated coefficient and include the coefficient and its standard error; or
it might contain only one row of model summary statistics.
Model limitations
Because this function uses S3 generic methods such as model.frame()
,
simulate()
, and update()
, it can be used with any model fit for which
methods are provided. In base R, this includes lm()
and glm()
.
The model provided as fit
must be fit using the data
argument to provide
a data frame. For example:
When simulating new data, this function provides the simulated data as the
data
argument and re-fits the model. If you instead refer directly to local
variables in the model formula, this will not work. For example, if you fit a
model this way:
It will not be possible to refit the model using simulated datasets, as that
would require modifying your environment to edit cars
.
See also
parametric_boot_distribution()
to simulate draws from a fitted
model, rather than from the population
Examples
pop <- population(
x1 = predictor(rnorm, mean = 4, sd = 10),
x2 = predictor(runif, min = 0, max = 10),
y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 4.0)
)
d <- sample_x(pop, n = 20) |>
sample_y()
fit <- lm(y ~ x1 + x2, data = d)
# using the default fn = broom::tidy(). conf.int argument is passed to
# broom::tidy()
samples <- sampling_distribution(fit, d, conf.int = TRUE)
samples
#> # A tibble: 303 × 8
#> term estimate std.error statistic p.value conf.low conf.high .sample
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.489 1.46 0.336 7.41e- 1 -2.58 3.56 0
#> 2 x1 2.44 0.0814 29.9 3.80e-16 2.27 2.61 0
#> 3 x2 -0.207 0.264 -0.785 4.43e- 1 -0.765 0.350 0
#> 4 (Intercept) -0.367 1.59 -0.231 8.20e- 1 -3.72 2.98 1
#> 5 x1 2.34 0.0888 26.4 3.09e-15 2.16 2.53 1
#> 6 x2 -0.167 0.288 -0.578 5.71e- 1 -0.775 0.441 1
#> 7 (Intercept) 1.02 1.75 0.580 5.70e- 1 -2.68 4.71 2
#> 8 x1 2.07 0.0979 21.1 1.23e-13 1.86 2.27 2
#> 9 x2 0.0115 0.318 0.0362 9.72e- 1 -0.659 0.682 2
#> 10 (Intercept) 0.0155 1.40 0.0111 9.91e- 1 -2.93 2.96 3
#> # ℹ 293 more rows
suppressMessages(library(dplyr))
# the model is correctly specified, so the estimates are unbiased:
samples |>
group_by(term) |>
summarize(mean = mean(estimate),
sd = sd(estimate))
#> # A tibble: 3 × 3
#> term mean sd
#> <chr> <dbl> <dbl>
#> 1 (Intercept) 0.951 1.52
#> 2 x1 2.20 0.0949
#> 3 x2 -0.249 0.291
# instead of coefficients, get the sampling distribution of R^2
rsquared <- function(fit) {
data.frame(r2 = summary(fit)$r.squared)
}
sampling_distribution(fit, d, rsquared, nsim = 10)
#> # A tibble: 11 × 2
#> r2 .sample
#> <dbl> <dbl>
#> 1 0.982 0
#> 2 0.959 1
#> 3 0.973 2
#> 4 0.975 3
#> 5 0.984 4
#> 6 0.967 5
#> 7 0.982 6
#> 8 0.983 7
#> 9 0.981 8
#> 10 0.965 9
#> 11 0.959 10